Serbian linguistic training corpus SETimes.SR 2.0

The SETimes.SR training corpus contains around 100,000 tokens manually annotated at the levels of tokenisation, sentence segmentation, morphosyntactic tagging, lemmatisation, syntactic dependencies, and named entities.

The annotation formalisms followed in the SETimes.SR corpus are (1) the MULTEXT-East V6 morphosyntactic specifications for the Serbo-Croatian macro-language, http://nl.ijs.si/ME/V6/msd/, (2) the UDv2 Guidelines, http://universaldependencies.org/guidelines.html, and (3) the Janes annotation guidelines for named entities, http://nl.ijs.si/janes/wp-content/uploads/2017/09/SlovenianNER-eng-v1.1.pdf.

The differences from the previous version of the corpus are (1) the extension of the corpus with 502 sentences from various news sources and (2) improvements to the corpus annotations.

The continuous improvement of this dataset is led by the CLASSLA knowledge centre for South Slavic languages (https://www.clarin.si/info/k-centre/) and the ReLDI Centre Belgrade.

Identifier
PID http://hdl.handle.net/11356/1843
Related Identifier http://www.aclweb.org/anthology/W17-1407
Related Identifier http://hdl.handle.net/11356/1200
Related Identifier https://github.com/reldi-data/SETimes.SRPlus
Metadata Access http://www.clarin.si/repository/oai/request?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:www.clarin.si:11356/1843
Provenance
Creator Batanović, Vuk; Ljubešić, Nikola; Samardžić, Tanja; Erjavec, Tomaž
Publisher Regional Linguistic Data Initiative Centre ReLDI; Jožef Stefan Institute
Publication Year 2023
Rights Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0); PUB; https://creativecommons.org/licenses/by-sa/4.0/
OpenAccess true
Contact info(at)clarin.si
Representation
Language Serbian
Resource Type corpus
Format text/plain; charset=utf-8; application/octet-stream; application/gzip; downloadable_files_count: 4
Discipline Linguistics
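
The Metadata Access URL listed in this record has the shape of an OAI-PMH GetRecord request. As a minimal sketch, the same URL can be rebuilt from the handle suffix in the PID. The function name `oai_get_record_url` and its parameters are hypothetical, and the endpoint and identifier scheme are taken only from the URL shown in this record, so they may not hold for other repositories.

```python
from urllib.parse import urlencode

# Endpoint copied from the Metadata Access URL in this record (assumption:
# it stays stable; this sketch does not verify it against the live service).
OAI_ENDPOINT = "http://www.clarin.si/repository/oai/request"


def oai_get_record_url(handle_suffix: str, metadata_prefix: str = "oai_dc") -> str:
    """Build a GetRecord URL for a handle suffix such as '11356/1843'.

    Hypothetical helper for illustration; the identifier pattern
    'oai:www.clarin.si:<suffix>' is inferred from this record only.
    """
    params = {
        "verb": "GetRecord",
        "metadataPrefix": metadata_prefix,
        "identifier": f"oai:www.clarin.si:{handle_suffix}",
    }
    # safe=":/" leaves ':' and '/' unescaped so the result matches the
    # form of the URL given in the record.
    return f"{OAI_ENDPOINT}?{urlencode(params, safe=':/')}"


url = oai_get_record_url("11356/1843")
print(url)
```

The sketch only constructs the URL and makes no network request. Fetching and parsing the returned XML is left out because the response layout is not described in this record.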