Annotated sample of the Slovenian Biographical Lexicon SBL-51abbr 1.0

Dataset

This dataset consists of 51 randomly selected entries from the Slovenian Biographical Lexicon (1925–1991). The text of each entry has been manually tokenised, segmented into sentences, annotated with named entities, and lemmatised. It has also been automatically annotated with MULTEXT-East morphosyntactic descriptions (PoS tags) and with Universal Dependencies PoS tags, morphological features and dependency parses.

Crucially for the envisaged use of the corpus, its 2,041 abbreviations have been manually expanded, with each expansion given in the inflected form that its context requires.

The corpus is available in its canonical TEI encoding, together with derived plain-text and CoNLL-U files. The plain-text file has abbreviations and their expansions marked up with [...]. There are two CoNLL-U files: one in which the text stream contains the abbreviations, and one in which it contains their expansions. Only the file with expansions has syntactic parses. In both CoNLL-U files, the expansions / abbreviations and the named entities are marked up in IOB format in the last column.
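As a rough illustration of how IOB markup in the last CoNLL-U column might be read, the following Python sketch parses CoNLL-U text and groups IOB tags into spans. It makes several assumptions that the description above does not confirm: that the tags are stored in the last (MISC) column as Key=Value pairs, that the key is called NER, and that untagged tokens carry no such key. The sample sentence is invented for the example and is not taken from the corpus, so the key name and tag inventory should be checked against the released files before use.

```python
from typing import Iterable, Iterator, List, Tuple


def read_conllu(lines: Iterable[str]) -> Iterator[List[List[str]]]:
    """Yield sentences as lists of 10-column token rows.

    Comment lines are skipped, as are multiword-token ranges (ID "1-2")
    and empty nodes (ID "1.1"), so each row is one syntactic word.
    """
    sentence: List[List[str]] = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 10 or "-" in cols[0] or "." in cols[0]:
            continue
        sentence.append(cols)
    if sentence:
        yield sentence


def misc_value(misc: str, key: str) -> str:
    """Return the value of `key` in a MISC field, or "O" if it is absent."""
    if misc == "_":
        return "O"
    for item in misc.split("|"):
        name, sep, value = item.partition("=")
        if sep and name == key:
            return value
    return "O"


def iob_spans(tags: List[str]) -> List[Tuple[int, int, str]]:
    """Group IOB tags into (start, end, label) spans; `end` is exclusive."""
    spans: List[Tuple[int, int, str]] = []
    start, label = None, None
    for i, tag in enumerate(tags):
        prefix, _, kind = tag.partition("-")
        if prefix == "I" and start is not None and kind == label:
            continue  # the current span continues
        if start is not None:
            spans.append((start, i, label))
            start, label = None, None
        if prefix in ("B", "I"):
            # "I" without a matching open span is treated as a span start.
            start, label = i, kind
    if start is not None:
        spans.append((start, len(tags), label))
    return spans


# Invented sample; the MISC key "NER" and the tag "loc" are assumptions.
SAMPLE = (
    "# text = Rojen v Novem mestu .\n"
    "1\tRojen\t_\t_\t_\t_\t_\t_\t_\t_\n"
    "2\tv\t_\t_\t_\t_\t_\t_\t_\t_\n"
    "3\tNovem\t_\t_\t_\t_\t_\t_\t_\tNER=B-loc\n"
    "4\tmestu\t_\t_\t_\t_\t_\t_\t_\tNER=I-loc\n"
    "5\t.\t_\t_\t_\t_\t_\t_\t_\t_\n"
)

for rows in read_conllu(SAMPLE.splitlines()):
    tags = [misc_value(row[9], "NER") for row in rows]
    for begin, end, kind in iob_spans(tags):
        print(kind, " ".join(row[1] for row in rows[begin:end]))
```

The same span-grouping function could be reused for the abbreviation / expansion markup by passing a different key to misc_value, again assuming the files store it as a Key=Value pair.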

Identifier
PID	http://hdl.handle.net/11356/1588
Related Identifier	https://aclanthology.org/2022.emnlp-main.596/
Metadata Access	http://www.clarin.si/repository/oai/request?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:www.clarin.si:11356/1588

Provenance
Creator	Erjavec, Tomaž; Vide Ogrin, Petra; Lenardič, Jakob; Mlinar Strgar, Mojca; Frankl, Simona
Publisher	Slovenian Academy of Sciences and Arts
Publication Year	2022
Funding Reference	info:eu-repo/grantAgreement/EC/H2020/101004825
Rights	Creative Commons - Attribution 4.0 International (CC BY 4.0); https://creativecommons.org/licenses/by/4.0/; PUB
OpenAccess	true
Contact	info(at)clarin.si

Representation
Language	Slovenian; Slovene
Resource Type	corpus
Format	text/plain; charset=utf-8; application/zip; downloadable_files_count: 3
Discipline	Linguistics