This dataset consists of 51 randomly selected entries from the Slovenian Biographical Lexicon (1925–1991). The text of each entry has been manually tokenised and sentence segmented, marked with named entities and the words lemmatised. It has also been automatically annotated with PoS tags (MULTEXT-East morphosyntactic descriptions) and Universal Dependencies PoS tags, morphological features and dependency parses.
Crucially for the envisaged use of the corpus, the abbreviations in the corpus (of which there are 2,041) have been manually expanded so that the expanded abbreviations are also in the correct inflected form, given their context.
The corpus is available in the canonical TEI encoding, and derived plain text and CoNLL-U files. The plain-text file has abbreviations and their expansions marked up with [...]. There are two CoNLL-U files, one with the text stream with abbreviations, and one with the text stream with expansions. Note that only the one with expansions has syntactic parses. Both CoNLL-U files have the expansions / abbreviations and named entities marked up in IOB format in the last column.