Lithuanian morphologically annotated corpus - MATAS v1.0

MATAS corpus (version 1.0)

DESCRIPTION Manually checked, morphologically annotated corpus MATAS

FORMATS
1. CoNLL-U (CONLLU, conllu)
2. SketchEngine tab-delimited, one word per line (TAB-WPL, txt)

SIZE
Wordform count: 1,693,410
Sentence count: 144,047

GENRES Contains 5 genres: Documents (14%), Fiction (19%), Periodicals (36%), Scientific texts (24%), Transcripts (7%)

TAGSETS Morphological annotation is presented with 3 different tagsets:
- Universal Dependencies (POS in column 4, morphological categories in column 6), see universaldependencies.org;
- Jablonskis (column 5), see Documentation folder;
- Multext-East (column 10), see Documentation folder.
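
The column layout above can be read with a few lines of Python. This is a minimal sketch rather than an official reader for the corpus: it assumes the CoNLL-U files follow the standard 10-column, tab-separated layout, and the function name `read_conllu_tokens`, the dictionary keys, and the sample line (including the placeholder values `JABLONSKIS_TAG` and `MTE_TAG`) are hypothetical and not taken from the corpus.

```python
from typing import Dict, Iterable, Iterator


def read_conllu_tokens(lines: Iterable[str]) -> Iterator[Dict[str, object]]:
    """Yield one dict per token line, using the columns named in TAGSETS."""
    for raw in lines:
        line = raw.rstrip("\n")
        # Skip sentence-separating blank lines and '#' comment lines.
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 10:
            continue  # not a standard CoNLL-U token line
        # Column 6 holds 'Name=Value' pairs joined by '|', or '_' if empty.
        feats: Dict[str, str] = {}
        if cols[5] != "_":
            for pair in cols[5].split("|"):
                key, _, value = pair.partition("=")
                feats[key] = value
        yield {
            "form": cols[1],
            "lemma": cols[2],
            "upos": cols[3],          # column 4: Universal Dependencies POS
            "jablonskis": cols[4],    # column 5: Jablonskis tag
            "feats": feats,           # column 6: UD morphological categories
            "multext_east": cols[9],  # column 10: Multext-East tag
        }


# Hypothetical sample; the tag values in columns 5 and 10 are placeholders.
sample = [
    "# sent_id = example-1",
    "1\tnamas\tnamas\tNOUN\tJABLONSKIS_TAG\tCase=Nom|Gender=Masc|Number=Sing\t_\t_\t_\tMTE_TAG",
    "",
]
tokens = list(read_conllu_tokens(sample))
print(tokens[0]["upos"], tokens[0]["feats"])
```

To process a real file, an open file handle (opened with `encoding="utf-8"`) can be passed in place of `sample`, since the function only iterates over lines.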

JABLONSKIS AND MULTEXT-EAST TAGSETS
Jablonskis -> Lithuanian tagset -> human-readable
Multext-East -> English tagset -> machine-readable

Please use the following text to cite this item: Rimkutė E., Daudaravičius V., Utka A. 2007: Morphological Annotation of the Lithuanian Corpus. Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics; Workshop Balto-Slavonic Natural Language Processing 2007, Prague, 94–99.

Identifier
PID http://hdl.handle.net/20.500.11821/33
Metadata Access https://clarin.vdu.lt/oai/request?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:clarin.vdu.lt:20.500.11821/33
Provenance
Creator Rimkutė, Erika; Bielinskienė, Agnė; Dadurkevičius, Virginijus; Kovalevskaitė, Jolanta; Utka, Andrius; Boizou, Loïc
Publisher Vytautas Magnus University
Publication Year 2019
Rights PUB_CLARIN-LT_End-User-Licence-Agreement_EN-LT; https://clarin.vdu.lt/licenses/eula/PUB_CLARIN-LT_End-User-Licence-Agreement_EN-LT.htm; PUB
OpenAccess true
Contact info(at)clarin.vdu.lt
Representation
Language Lithuanian
Resource Type corpus
Format text/plain; application/zip; text/plain; charset=utf-8; downloadable_files_count: 3
Discipline Linguistics