Dataset - B2FIND

The Online conversation threads repository

This repository contains datasets with online conversation threads collected and analyzed by different researchers. Currently, you can find datasets from different news...

Evolution of Wikipedia Categories

Knowledge Space Lab: Design versus Emergence. Comparison between the structure and evolution of categories in Wikipedia and in the Universal Decimal Classification. 2009-2011....

Wikipedia Discussion Corpora

Various annotated Wikipedia resources

Wikipedia Edit Category Corpus

For the corpus itself, please refer to/cite: Johannes Daxenberger and Iryna Gurevych (2012). "A Corpus-Based Study of Edit Categories in Featured and Non-Featured Wikipedia...

Wikipedia Edit-Turn-Pairs

Corresponding and Non-Corresponding Edit-Turn-Pairs from the English Wikipedia. The ETP-gold corpus is based on article edits and discussion page turns from the English...

Comparable corpora of South-Slavic Wikipedias CLASSLA-Wikipedia 1.0

This comparable corpus collection consists of Wikipedia dumps of the Bosnian, Croatian, Macedonian, Montenegrin, Serbian, Serbo-Croatian and Slovenian Wikipedia, harvested on...

Wikipedia talk corpus Janes-Wiki 1.0

Janes-Wiki is an annotated corpus of discussion pages from the Slovene Wikipedia from the period 2003-08 to 2017-06. The corpus contains page and user talks and is structured...

Slovene corpus for general relation extraction SloREL 1.0

The SloREL corpus contains annotations for training relation extraction models on Slovene documents. It contains documents from Slovene Wikipedia with annotated entities and...

Slovene corpus for general relation extraction SloREL 1.1

The SloREL corpus contains annotations for training relation extraction models on Slovene documents. It contains documents from Slovene Wikipedia with annotated entities and...

Slovenian Definition Extraction training dataset DF_NDF_wiki_slo 1.0

The Slovenian definition extraction training dataset DF_NDF_wiki_slo contains 38613 sentences extracted from the Slovenian Wikipedia. The first sentence of a term's description...

python-g419wikitools-1.0

A set of Python scripts for generating a dictionary of phrase inflections based on Wikipedia's internal links. Analyzing a Wikipedia dump produces a set of files containing:...

CorpusExplorer

Software for corpus linguists and text/data mining enthusiasts. The CorpusExplorer combines over 45 interactive visualizations under a user-friendly interface. Routine tasks...

English-Czech Corpus from Wikipedia

Sentence-parallel corpus built from the English and Czech Wikipedias, based on articles translated from English into Czech. The work is described in the paper: ŠTROMAJEROVÁ,...

Plaintext Wikipedia dump 2018

Wikipedia plain text data obtained from Wikipedia dumps with WikiExtractor in February 2018. The data come from all Wikipedias for which dumps could be downloaded at...

14 datasets found