Annotated sample of the Slovenian Biographical Lexicon SBL-51abbr 1.0


This dataset consists of 51 randomly selected entries from the Slovenian Biographical Lexicon (1925–1991). The text of each entry has been manually tokenised, segmented into sentences, lemmatised, and annotated with named entities. It has also been automatically annotated with PoS tags (MULTEXT-East morphosyntactic descriptions) and with Universal Dependencies PoS tags, morphological features and dependency parses.

Crucially for the envisaged use of the corpus, its 2,041 abbreviations have been manually expanded, with each expansion given in the inflected form that its context requires.

The corpus is available in the canonical TEI encoding and as derived plain-text and CoNLL-U files. The plain-text file has abbreviations and their expansions marked up with [...]. There are two CoNLL-U files: one contains the text with abbreviations, the other the text with expansions. Note that only the file with expansions has syntactic parses. Both CoNLL-U files mark the expansions / abbreviations and the named entities in IOB format in the last column.
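As an illustration of how the IOB annotation in the last CoNLL-U column might be consumed, the following minimal Python sketch reads CoNLL-U text and groups IOB tags into spans. It is a sketch under stated assumptions, not the corpus's own tooling: the function names read_conllu and iob_spans are hypothetical, the sample sentence and its PER label are invented for the example, and the sample assumes that the last column holds a bare tag such as B-PER. The actual files may encode the tags differently (for instance as key=value pairs), in which case the tag extraction step would need adjusting.

```python
from typing import Iterable, Iterator, List, Tuple

# Invented sample, NOT taken from the corpus. It assumes the 10th column
# holds a bare IOB tag; the real files may use another encoding.
SAMPLE = (
    "# text = dr. Janez Novak\n"
    "1\tdr.\tdr.\tNOUN\t_\t_\t_\t_\t_\tO\n"
    "2\tJanez\tJanez\tPROPN\t_\t_\t_\t_\t_\tB-PER\n"
    "3\tNovak\tNovak\tPROPN\t_\t_\t_\t_\t_\tI-PER\n"
    "\n"
)


def read_conllu(lines: Iterable[str]) -> Iterator[List[List[str]]]:
    """Yield each sentence as a list of 10-field token rows."""
    sentence: List[List[str]] = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:                      # a blank line ends a sentence
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):          # sentence-level comment
            continue
        fields = line.split("\t")
        if len(fields) != 10:             # not a well-formed token row
            continue
        if "-" in fields[0] or "." in fields[0]:
            continue                      # skip multiword ranges and empty nodes
        sentence.append(fields)
    if sentence:
        yield sentence


def iob_spans(tags: List[str]) -> List[Tuple[str, int, int]]:
    """Group IOB tags into (label, start, end) spans; end is exclusive."""
    spans: List[Tuple[str, int, int]] = []
    label, start = None, 0
    for i, tag in enumerate(tags):
        continues = tag.startswith("I-") and tag[2:] == label
        if continues:
            continue
        if label is not None:             # close the span that was open
            spans.append((label, start, i))
            label = None
        if tag.startswith(("B-", "I-")):  # open a new span
            label, start = tag[2:], i
    if label is not None:
        spans.append((label, start, len(tags)))
    return spans


sentences = list(read_conllu(SAMPLE.splitlines()))
tags = [row[9] for row in sentences[0]]
spans = iob_spans(tags)
names = [" ".join(row[1] for row in sentences[0][s:e]) for _, s, e in spans]
```

For the invented sample, spans holds one PER span covering the second and third tokens, and names holds the corresponding surface string. To process a real file, the lines of the opened file can be passed to read_conllu in place of SAMPLE.splitlines().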

Creator Erjavec, Tomaž; Vide Ogrin, Petra; Lenardič, Jakob; Mlinar Strgar, Mojca; Frankl, Simona
Publisher Slovenian Academy of Sciences and Arts
Publication Year 2022
Funding Reference info:eu-repo/grantAgreement/EC/H2020/101004825
Rights Creative Commons - Attribution 4.0 International (CC BY 4.0); PUB
OpenAccess true
Contact info(at)
Language Slovenian; Slovene
Resource Type corpus
Format text/plain; charset=utf-8; application/zip; downloadable_files_count: 3
Discipline Linguistics