2,683 datasets found

  • Prague Dependency Treebank of Spoken Czech 2.0 (PDTSC 2.0)

    The Prague Dependency Treebank of Spoken Czech 2.0 (PDTSC 2.0) is a corpus of spoken language, consisting of 742,316 tokens and 73,835 sentences, representing 7,324 minutes...
  • Randomized extraction of the New Norwegian corpus

    Randomized extraction of the New Norwegian Corpus (Nynorskkorpuset). Contains sentences in New Norwegian (Nynorsk) from the year 2000 and after. Tab-separated, one word pr....
  • Morpho-syntactically annotated corpora provided for the PARSEME Shared Task o...

    This multilingual resource contains corpora for 14 languages, gathered on the occasion of the 1.2 edition of the PARSEME Shared Task on semi-supervised Identification of Verbal...
  • Prague Dependency Treebank - Consolidated 1.0 (PDT-C 1.0)

    A richly annotated and genre-diversified language resource, the Prague Dependency Treebank – Consolidated 1.0 (PDT-C 1.0, hereafter PDT-C) is a consolidated...
  • Terminology identification dataset KAS-term 1.0

    The dataset contains 22,950 term candidates extracted from 15 Slovenian PhD theses. The term candidates are of length 1 to 4, extracted via morphosyntactic patterns and the...
  • PDT-Vallex: Czech Valency lexicon linked to treebanks 4.0 (PDT-Vallex 4.0)

    The valency lexicon PDT-Vallex 4.0 has been built in close connection with the annotation of the Prague Dependency Treebank project (PDT) and its successors (mainly the Prague...
  • MERLIN Written Learner Corpus for Czech, German, Italian 1.1

    The MERLIN corpus is a written learner corpus for Czech, German, and Italian that has been designed to illustrate the Common European Framework of Reference for Languages (CEFR)...
  • Deep Sequoia corpus - PARSEME-FR corpus - FrSemCor

    The Sequoia corpus is a set of 3,099 linguistically-annotated French sentences, originating from four sources (Europarl, European Agency Reports, French regional journal L'Est...
  • ParCzech PS7 2.0

    The ParCzech PS7 2.0 corpus is the second version of ParCzech PS7 consisting of stenographic protocols that record the Chamber of Deputies' meetings held in the 7th term between...
  • ParCzech PS7 1.0

    The ParCzech PS7 1.0 corpus is the very first member of the corpus family of data coming from the Parliament of the Czech Republic. ParCzech PS7 1.0 consists of stenographic...
  • Prague Dependency Treebank 3.5

    The Prague Dependency Treebank 3.5 is the 2018 edition of the core Prague Dependency Treebank (PDT). It contains all PDT annotation made at the Institute of Formal and Applied...
  • Korpus 2

    Korpus 2
  • MorfFlex CZ 2.0

    MorfFlex CZ 2.0 is the Czech morphological dictionary developed originally by Jan Hajič as a spelling checker and lemmatization dictionary. MorfFlex is a flat list of...
  • Prague DaTabase of Spoken Czech 1.0

    PDTSC 1.0 is a multi-purpose corpus of spoken language. 768,888 tokens, 73,374 sentences and 7,324 minutes of spontaneous dialog speech have been recorded, transcribed and...
  • MorfFlex CZ 161115

    Czech morphological dictionary developed originally by Jan Hajič as a spelling checker and lemmatization dictionary. Currently it contains full morphological information for...
  • Spoken Torlak dialect corpus 1.0 (transcription)

    Torlak corpus represents a spoken variety of the endangered Torlak dialect from the Timok area in Southeast Serbia. It comprises transcripts of interviews with the local...
  • Tigrinya Web Corpus

    Tigrinya web corpus. Crawled by SpiderLing in January 2016. Encoded in UTF-8, cleaned, deduplicated.
  • LitLat BERT

    Trilingual BERT-like (Bidirectional Encoder Representations from Transformers) model, trained on Lithuanian, Latvian, and English data. State-of-the-art tool representing...
  • Lithuanian Word embeddings

    GloVe-type word vectors (embeddings) for Lithuanian. Delfi.lt corpus (~70 million words) and StanfordNLP were used for training. The training consisted of several stages: 1)...
  • Big Data language model - subword - SYLLABED - RAW

    Big data language model based on syllables in RAW format