MATAS corpus (version 3.0)
DESCRIPTION
Updated, manually checked, morphologically annotated corpus MATAS
LANGUAGE
Lithuanian
PREVIOUS VERSIONS
1. MATAS v0.2 (http://hdl.handle.net/20.500.11821/9)
2. MATAS v1.0 (http://hdl.handle.net/20.500.11821/33)
FORMATS, STANDARDS
1. CoNLL-U (https://universaldependencies.org/format.html)
2. JABLONSKIS tagset v2 (https://sitti.vdu.lt/jablonskis-en/)
3. MULTEXT-East tagset (http://nl.ijs.si/ME/V4/msd/html/index.html)
4. UTF-8
SIZE
Tokens (incl. punctuation): 2,137,287
Words: 1,694,819
Sentences: 144,047
Documents: 1,234
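
EXAMPLE (ILLUSTRATIVE)
Counts like those above can be approximated from any CoNLL-U file with a short script. The sketch below is not the tool used to produce the published figures. It assumes that punctuation is marked as PUNCT in the fourth (UPOS) column, which may not match how this corpus stores its JABLONSKIS or MULTEXT-East tags, and the function name count_conllu is hypothetical.

```python
from typing import Dict, Iterable


def count_conllu(lines: Iterable[str]) -> Dict[str, int]:
    """Count sentences, tokens, and non-punctuation words in CoNLL-U text."""
    counts = {"sentences": 0, "tokens": 0, "words": 0}
    in_sentence = False
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            # A blank line closes the current sentence.
            if in_sentence:
                counts["sentences"] += 1
                in_sentence = False
            continue
        if line.startswith("#"):
            continue  # sentence-level comment or metadata
        fields = line.split("\t")
        token_id = fields[0]
        if "-" in token_id or "." in token_id:
            continue  # skip multiword-token ranges (1-2) and empty nodes (1.1)
        in_sentence = True
        counts["tokens"] += 1
        # Assumption: punctuation carries PUNCT in the UPOS column (4th field).
        upos = fields[3] if len(fields) > 3 else "_"
        if upos != "PUNCT":
            counts["words"] += 1
    if in_sentence:
        counts["sentences"] += 1  # input did not end with a blank line
    return counts
```

To run it on a file, pass an open file handle, for example open(path, encoding="utf-8"), where path points to a local CoNLL-U file. Document counts are not covered here, since how document boundaries are marked depends on the corpus.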
GENRES
Contains 5 genres: Documents (14%), Fiction (19%), Periodicals (36%), Scientific texts (24%), Transcripts (7%)
PUBLISHER
Institute of Digital Resources and Interdisciplinary Research (SITTI), Vytautas Magnus University