Macedonian linguistic training corpus SETimes.MK 0.1

Dataset

The SETimes.MK corpus is a sample of 570 sentences from setimes.com, a now unavailable website that published news articles on South-Eastern Europe. Sentence splitting and tokenisation were corrected manually. The morphosyntactic labels (following the MULTEXT-East standard for Macedonian, https://nl.ijs.si/ME/V6/msd/html/msd-mk.html) and lemmas were annotated automatically, using two iterations of preliminary models for Macedonian in the CLASSLA-Stanza tool (https://pypi.org/project/classla/), and were then corrected manually. The UPOS+UFEATS morphosyntactic description was assigned with the mapper available at https://github.com/clarinsi/macedonian-tagset-mapping.

The included sentences have parallel counterparts in the Croatian hr500k dataset (http://hdl.handle.net/11356/1792) and the Serbian SETimes.SR dataset (http://hdl.handle.net/11356/1843); the sentence identifiers can be used to match corresponding sentences.
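
Matching by sentence identifier could be sketched as follows. This is a minimal illustration, not part of the dataset: it assumes the files use CoNLL-U style `# sent_id = ...` comment lines and that parallel sentences carry identical identifier strings, neither of which is stated in this record. The function names, sample identifiers and sample tokens are hypothetical.

```python
from typing import Dict, Iterable, List


def read_sentences(lines: Iterable[str]) -> Dict[str, List[str]]:
    """Group token lines by the preceding `# sent_id = ...` comment.

    Assumes a CoNLL-U style layout (an assumption, not confirmed by the record).
    """
    sentences: Dict[str, List[str]] = {}
    current_id = None
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith("# sent_id") and "=" in line:
            # Comment lines look like "# sent_id = <identifier>"
            current_id = line.split("=", 1)[1].strip()
            sentences[current_id] = []
        elif not line.strip():
            current_id = None  # a blank line ends the sentence
        elif not line.startswith("#") and current_id is not None:
            sentences[current_id].append(line)
    return sentences


def match_parallel(a: Dict[str, List[str]], b: Dict[str, List[str]]) -> List[str]:
    """Return the identifiers present in both corpora, sorted."""
    return sorted(a.keys() & b.keys())


# Tiny made-up samples; identifiers and tokens are illustrative only.
mk_sample = """# sent_id = example.1
1\tЗдраво\t_
2\t.\t_

# sent_id = example.2
1\tДа\t_
""".splitlines()

sr_sample = """# sent_id = example.2
1\tDa\t_

# sent_id = example.3
1\tNe\t_
""".splitlines()

mk = read_sentences(mk_sample)
sr = read_sentences(sr_sample)
shared = match_parallel(mk, sr)
print(shared)
```

If the identifiers in the actual files differ by a corpus-specific prefix or suffix, they would need to be normalised before the intersection is taken.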

Please note that the dataset does not fully follow the Universal Dependencies specifications for Macedonian (https://universaldependencies.org/mk/index.html): the UPOS+FEATS features in the dataset are based on the MULTEXT-East specifications, which differ in certain respects from the Universal Dependencies specifications for Macedonian.

Identifier
PID	http://hdl.handle.net/11356/1886
Related Identifier	https://www.clarin.si/info/k-centre/
Metadata Access	http://www.clarin.si/repository/oai/request?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:www.clarin.si:11356/1886

Provenance
Creator	Ljubešić, Nikola; Stojanovska, Biljana
Publisher	Jožef Stefan Institute
Publication Year	2023
Rights	Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0); https://creativecommons.org/licenses/by-sa/4.0/; PUB
OpenAccess	true
Contact	info(at)clarin.si

Representation
Language	Macedonian
Resource Type	corpus
Format	text/plain; charset=utf-8; application/octet-stream; downloadable_files_count: 1
Discipline	Linguistics