Dataset - B2FIND

NCSE v2.0: A Dataset of OCR-Processed 19th Century English Newspapers

NCSE v2.0 Dataset Repository. This repository contains the NCSE v2.0 dataset and associated supporting data used in the paper "Reading the unreadable: Creating a dataset of 19th...

Dataset for color terms, 2012

This dataset comprises adjective-noun phrases with color terms.

AMR parse quality prediction [Source Code]

Accuracy prediction for AMR parsing predicts 33 accuracy metrics for a given sentence and its (automatic) AMR parse. Abstract (Opitz and Frank, 2019): Semantic proto-role...

NLP in Diagnostic Texts from Nephropathology [Research Data]

This data set contains all annotated topic word tables from the work "NLP in Diagnostic Texts from Nephropathology", as well as all pre-processed and tf-idf-vectorized text...

Movie Title Puns

Context: The data is based on the following paper on pun generation: Hämäläinen, M., & Alnajjar, K. (2019). Modelling the Socialization of Creative Agents in a...

WebStylo

Web-based, open stylometry system based on Multilevel Text Analysis. Runs cluto and stylo (R system) clustering methods. Based on Natural Language Processing Workflow...

Cinderella - tool for Clustering and Classifications of Texts in Polish

System for clustering and classification of texts in Polish. Source code.

Chunker WS

Chunker-WS provides shallow parsing of Polish. The parser may be run against plain text (input format: text, then it runs WCRFT for tagging) or already tagged input (other input...

ChunkRel WS

ChunkRel-WS is a prototype service for recognition of three syntactic relations between chunks. The service may be run against plain text (input format: text), then the...

OpenLegalData (2022 - Corpus)

OpenLegalData is a free and open platform that makes legal documents and information available to the public. The aim of this platform is to improve the transparency of...

CorpusExplorer

Software for corpus linguists and text/data mining enthusiasts. The CorpusExplorer combines over 45 interactive visualizations under a user-friendly interface. Routine tasks...

MSTperl parser

MSTperl is a Perl reimplementation of the MST parser of Ryan McDonald (http://www.seas.upenn.edu/~strctlrn/MSTParser/MSTParser.html). MST parser (Maximum Spanning Tree parser)...

MSTperl parser (2015-05-19)

MSTperl is a Perl reimplementation of the MST parser of Ryan McDonald (http://www.seas.upenn.edu/~strctlrn/MSTParser/MSTParser.html). MST parser (Maximum Spanning Tree parser)...

DZ Interset

DZ Interset is a means of converting among various tag sets in natural language processing. The core idea is similar to interlingua-based machine translation. DZ Interset...

Transcribed newspaper articles from the NCSE collection

CLOCR-C: Transcribed newspaper articles from the NCSE collection. This dataset contains 91 pairs of newspaper articles from the Nineteenth Century Serials Edition (NCSE). The...

Scrambled text: training Language Models to correct OCR errors using syntheti...

This data repository contains the key datasets required to reproduce the paper "Scrambled text: training Language Models to correct OCR errors using synthetic data". In addition...

Combining text and vision in compound semantics: Towards a cognitively plausi...

In the current state-of-the-art distributional semantics model of the meaning of noun-noun compounds (such as chainsaw, butterfly, home phone), CAOSS (Marelli...

Propositional Claim Detection (NLP Dataset)

This is a natural language processing (NLP) training dataset. Models trained on this data are intended to classify claims that...

Evidence - Computer-assisted Interactive Extraction of Dictionary Examples fr...

Anonymized models from the expert and lay-user studies conducted in the project Evidence. Each model was trained for 50-60 iterations on a specific word class (adjective, verb,...

Re3: A Holistic Framework and Dataset for Modeling Collaborative Document Rev...

A dataset of aligned scientific paper revisions manually labeled according to their action and intent, and supplemented with the respective peer reviews and human-written edit...

43 datasets found