Lithuanian 2-gram dataset


Dataset of 2-grams with frequencies, extracted from the Delfi.lt corpus (~70 million words, period: March 2014 to November 2016). First, the corpus was split into sentences; then symbol analysis was performed, together with analysis of intended structures made of symbols. A dictionary of abbreviations was also used so that abbreviations were preserved. Finally, 2-grams were generated, yielding 67 million entries in total. The frequency of each entry is included in the dataset.
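The processing steps described above (sentence splitting, abbreviation preservation, 2-gram generation, frequency counting) can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the abbreviation list, the punctuation handling, and the function names `split_sentences`, `clean`, and `bigram_counts` are assumptions made for illustration only, and the symbol analysis step is reduced to simple punctuation stripping.

```python
from collections import Counter

# Hypothetical abbreviation list; the dictionary actually used for the dataset
# is not reproduced in this record.
ABBREVIATIONS = {"pvz.", "t.y.", "prof.", "dr."}


def split_sentences(text, abbreviations=ABBREVIATIONS):
    """Split text into sentences (lists of tokens).

    A token ending in '.', '!' or '?' closes a sentence unless it is a
    known abbreviation.
    """
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        if token[-1] in ".!?" and token.lower() not in abbreviations:
            sentences.append(current)
            current = []
    if current:
        sentences.append(current)
    return sentences


def clean(token, abbreviations=ABBREVIATIONS):
    """Lowercase a token and strip surrounding punctuation, keeping abbreviations intact."""
    lowered = token.lower()
    if lowered in abbreviations:
        return lowered
    return lowered.strip(".,;:!?\"'()„“")


def bigram_counts(text):
    """Count 2-grams within each sentence; pairs never span a sentence boundary."""
    counts = Counter()
    for sentence in split_sentences(text):
        words = [w for w in (clean(t) for t in sentence) if w]
        counts.update(zip(words, words[1:]))
    return counts


sample = "Prof. Jonas skaito knygą. Jonas skaito laikraštį."
print(bigram_counts(sample).most_common(1))
# → [(('jonas', 'skaito'), 2)]
```

In this sketch "Prof." does not end a sentence because it appears in the assumed abbreviation list, and no 2-gram is formed across the boundary between the two sentences.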

Identifier
PID http://hdl.handle.net/20.500.11821/25
Related Identifier http://mwe.lt/
Metadata Access https://clarin.vdu.lt/oai/request?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:clarin.vdu.lt:20.500.11821/25
Provenance
Creator Bielinskienė, Agnė; Boizou, Loïc; Bumbulienė, Ieva; Kovalevskaitė, Jolanta; Krilavičius, Tomas; Mandravickaitė, Justina; Rimkutė, Erika; Vilkaitė-Lozdienė, Laura
Publisher Baltic Institute of Advanced Technology; Vytautas Magnus University
Publication Year 2019
Rights PUB_CLARIN-LT_End-User-Licence-Agreement_EN-LT; https://clarin.vdu.lt/licenses/eula/PUB_CLARIN-LT_End-User-Licence-Agreement_EN-LT.htm; PUB
OpenAccess true
Contact info(at)clarin.vdu.lt
Representation
Language Lithuanian
Resource Type lexicalConceptualResource
Format application/zip; text/plain; charset=utf-8; downloadable_files_count: 1
Discipline Linguistics