Morphological lexicon Sloleks 3.0

Dataset

Sloleks is a reference morphological lexicon of Slovene developed for use in various NLP applications and language manuals. It contains Slovene lemmas, their inflected or derivative word forms, and the corresponding grammatical descriptions. In addition to the approx. 100,000 entries already available in Sloleks 2.0 (http://hdl.handle.net/11356/1230), Sloleks 3.0 contains approx. 265,000 newly generated entries for the most frequent lemmas in Gigafida 2.0 (http://hdl.handle.net/11356/1320) that were not included in previous versions of Sloleks. For verbs, adjectives, adverbs, and common nouns, the lemmas were checked manually by three annotators and included in Sloleks only if at least one annotator confirmed them as legitimate. Proper nouns were not checked manually.

Lemmatization rules, part-of-speech categorization, and the set of feature-value pairs follow the MULTEXT-East morphosyntactic specifications for Slovenian (https://nl.ijs.si/ME/V6/msd/html/msd-sl.html). Besides grammatical information, each word form is annotated with its absolute corpus frequency and its compliance with the reference language standard. Most entries also contain information on their morphological patterns (see http://hdl.handle.net/11356/1411 for more on morphological patterns).
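
To illustrate how a MULTEXT-East style morphosyntactic description relates to feature-value pairs, the following is a minimal sketch that decodes a positional noun tag into named features. The position table reflects the Slovenian noun layout as commonly documented in the specification linked above and should be checked against it; the function name `decode_noun_msd` and the table `NOUN_POSITIONS` are hypothetical, and nothing here is taken from the Sloleks 3.0 files themselves.

```python
# Assumed positional layout for Slovene noun tags (verify against the
# MULTEXT-East specification; this table is an illustration only).
NOUN_POSITIONS = [
    ("Type", {"c": "common", "p": "proper"}),
    ("Gender", {"m": "masculine", "f": "feminine", "n": "neuter"}),
    ("Number", {"s": "singular", "d": "dual", "p": "plural"}),
    ("Case", {"n": "nominative", "g": "genitive", "d": "dative",
              "a": "accusative", "l": "locative", "i": "instrumental"}),
    ("Animate", {"n": "no", "y": "yes"}),
]


def decode_noun_msd(msd):
    """Turn a positional noun tag such as 'Ncfsg' into feature-value pairs."""
    if not msd.startswith("N"):
        raise ValueError(f"not a noun tag: {msd!r}")
    features = {"Category": "Noun"}
    # zip stops at the shorter sequence, so trailing positions may be omitted.
    for (name, values), code in zip(NOUN_POSITIONS, msd[1:]):
        if code == "-":  # '-' marks an attribute that is not applicable
            continue
        features[name] = values[code]
    return features


features = decode_noun_msd("Ncfsg")
print(features)
```

Under these assumptions, the tag `Ncfsg` decodes to a common feminine noun in the genitive singular. Other parts of speech have their own position tables and are not covered by this sketch.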

The lexicon also includes accentuated word forms automatically generated through neural networks (Krsnik 2017). For the 100,000 entries from Sloleks 2.0, the accentuated forms were manually corrected, whereas the accentuated forms for the other 265,000 entries are fully automatic. IPA and SAMPA phonetic transcriptions were generated automatically using an improved G2P system for Slovene developed within the RSDO project (see https://github.com/clarinsi/slovene_g2p).

Version 3.0 is encoded in XML, but unlike 2.0, which used the LMF format, the new version uses a custom XML format developed for the morphological lexicon by the Centre for Language Resources and Technologies of the University of Ljubljana (see the included .xsd files and "00README.txt" for details).
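
Because the element names of the custom format are defined in the included .xsd files and are not reproduced here, a first look at the data can be done without assuming any schema. The following is a minimal sketch that streams an XML document and counts how often each element tag occurs. The function name `count_tags` is hypothetical, and the in-memory sample uses made-up placeholder tags that are not the Sloleks schema.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter


def count_tags(source):
    """Stream an XML file (path or file object) and count each element tag."""
    counts = Counter()
    for _event, elem in ET.iterparse(source, events=("end",)):
        # Drop a namespace prefix of the form "{uri}tag" if one is present.
        tag = elem.tag.rsplit("}", 1)[-1]
        counts[tag] += 1
        elem.clear()  # discard the finished element's content to limit memory use
    return counts


# Made-up sample; these tag names are placeholders, NOT the Sloleks schema.
SAMPLE = b"<lexicon><entry><form/><form/></entry><entry><form/></entry></lexicon>"

counts = count_tags(io.BytesIO(SAMPLE))
print(counts.most_common())
```

For the real data, a path to one of the extracted XML files would be passed instead of the in-memory sample. Streaming is used here because a lexicon of several hundred thousand entries may be too large to load comfortably as a single tree.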

Reference: Krsnik, Luka. Napovedovanje naglasa slovenskih besed z metodami strojnega učenja: magistrsko delo: magistrski program druge stopnje Računalništvo in informatika. Ljubljana: [L. Krsnik], 2017. http://eprints.fri.uni-lj.si/3978/

Identifier
PID	http://hdl.handle.net/11356/1745
Related Identifier	http://hdl.handle.net/11356/1230
Related Identifier	https://rsdo.slovenscina.eu/en/language-resources
Metadata Access	http://www.clarin.si/repository/oai/request?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:www.clarin.si:11356/1745

Provenance
Creator	Čibej, Jaka; Gantar, Kaja; Dobrovoljc, Kaja; Krek, Simon; Holozan, Peter; Erjavec, Tomaž; Romih, Miro; Arhar Holdt, Špela; Krsnik, Luka; Robnik-Šikonja, Marko
Publisher	Centre for Language Resources and Technologies, University of Ljubljana
Publication Year	2022
Rights	Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0); https://creativecommons.org/licenses/by-sa/4.0/; PUB
OpenAccess	true
Contact	info(at)clarin.si

Representation
Language	Slovenian; Slovene
Resource Type	lexicalConceptualResource
Format	text/plain; charset=utf-8; application/zip; downloadable_files_count: 1
Discipline	Linguistics