The SETimes.MK corpus is a sample of 570 sentences taken from news articles on South-Eastern European topics, originally published on the now-unavailable setimes.com website. Sentence splitting and tokenisation were manually corrected. The morphosyntactic labels (following the MULTEXT-East standard for Macedonian, https://nl.ijs.si/ME/V6/msd/html/msd-mk.html) and lemmas were first annotated automatically, using two iterations of preliminary models for Macedonian in the CLASSLA-Stanza tool (https://pypi.org/project/classla/), and were then manually corrected. The UPOS+FEATS morphosyntactic description was assigned with the mapper available at https://github.com/clarinsi/macedonian-tagset-mapping.
The sentences have parallel counterparts in the Croatian hr500k dataset (http://hdl.handle.net/11356/1792) and the Serbian SETimes.SR dataset (http://hdl.handle.net/11356/1843), and the sentence identifiers can be used to match corresponding sentences across the three datasets.
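As a rough illustration of matching by sentence identifier, the sketch below indexes two corpora by identifier and keeps the sentences present in both. It rests on several assumptions not stated in this description: that the data is distributed in a CoNLL-U-like format with `# sent_id = ...` comment lines, that corresponding sentences carry identical identifier strings (if the real identifiers carry dataset-specific prefixes, they would need normalising first), and that the sample identifiers and tokens shown are purely hypothetical. The helper names `read_sentences` and `align` are illustrative and do not belong to any existing tool.

```python
import io

def read_sentences(stream):
    """Map each `# sent_id = ...` value to the token rows that follow it.

    Assumes a CoNLL-U-like layout: comment lines start with '#',
    token lines are tab-separated, sentences end with a blank line.
    """
    sentences = {}
    sent_id, tokens = None, []
    for raw in stream:
        line = raw.rstrip("\n")
        if line.startswith("# sent_id"):
            sent_id = line.split("=", 1)[1].strip()
        elif line and not line.startswith("#"):
            tokens.append(line.split("\t"))
        elif not line:
            # Blank line closes the current sentence.
            if sent_id is not None:
                sentences[sent_id] = tokens
            sent_id, tokens = None, []
    # Handle a final sentence that is not followed by a blank line.
    if sent_id is not None and tokens:
        sentences[sent_id] = tokens
    return sentences

def align(left, right):
    """Pair up the sentences whose identifiers occur in both corpora."""
    return {sid: (left[sid], right[sid]) for sid in left if sid in right}

# Hypothetical miniature samples; the real identifiers and columns may differ.
MK_SAMPLE = "# sent_id = s1\n1\tЗдраво\tздраво\n\n# sent_id = s2\n1\tСвет\tсвет\n"
HR_SAMPLE = "# sent_id = s2\n1\tSvijet\tsvijet\n\n# sent_id = s3\n1\tBok\tbok\n"

pairs = align(read_sentences(io.StringIO(MK_SAMPLE)),
              read_sentences(io.StringIO(HR_SAMPLE)))
for sid, (mk_tokens, hr_tokens) in pairs.items():
    print(sid, [t[1] for t in mk_tokens], [t[1] for t in hr_tokens])
```

With real files, the `io.StringIO` objects would be replaced by file handles opened with `encoding="utf-8"`; the same `align` call can be repeated against the third dataset.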
Please note that the dataset does not completely follow the Universal Dependencies specifications for Macedonian (https://universaldependencies.org/mk/index.html), because the UPOS+FEATS annotations in the dataset are based on the MULTEXT-East specifications, which differ in certain respects from the Universal Dependencies specifications for Macedonian.