EvaLatin 2020: data

Dataset

Training and gold test data released in EvaLatin 2020, the evaluation campaign of NLP tools for Latin. The two shared tasks proposed in EvaLatin 2020, i.e. Lemmatization and Part-of-Speech tagging, aimed to foster research on language technologies for Classical languages. The shared dataset consists of texts taken from the Perseus Digital Library, processed with UDPipe models and then manually corrected by Latin experts. The training set includes only prose texts by Classical authors. The test set includes prose texts by the same authors represented in the training set, together with data from poetry and from the Medieval period.

Identifier
PID	http://hdl.handle.net/20.500.11752/OPEN-526
Related Identifier	https://www.aclweb.org/anthology/2020.lt4hala-1.16.pdf
Related Identifier	https://github.com/CIRCSE/LT4HALA/tree/master/data_and_doc
Metadata Access	http://dspace-clarin-it.ilc.cnr.it/repository/oai/request?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:dspace-clarin-it.ilc.cnr.it:20.500.11752/OPEN-526
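
The Metadata Access value above is an OAI-PMH GetRecord request for the record in Dublin Core (oai_dc). The following is a minimal sketch of how such a request could be assembled and its response read, assuming the endpoint is still reachable and returns a standard oai_dc payload encoded as UTF-8. The function names (build_getrecord_url, fetch_record, parse_dublin_core) are hypothetical helpers introduced for illustration; only the base URL and identifier come from this record.

```python
from urllib.parse import urlencode
from urllib.request import urlopen
import xml.etree.ElementTree as ET

# Base URL and identifier are copied from the Metadata Access field of this record.
OAI_BASE = "http://dspace-clarin-it.ilc.cnr.it/repository/oai/request"
OAI_IDENTIFIER = "oai:dspace-clarin-it.ilc.cnr.it:20.500.11752/OPEN-526"

# Namespace of the Dublin Core element set used by oai_dc records.
DC_NS = "http://purl.org/dc/elements/1.1/"


def build_getrecord_url(base=OAI_BASE, identifier=OAI_IDENTIFIER, prefix="oai_dc"):
    """Build an OAI-PMH GetRecord request URL (hypothetical helper)."""
    params = {"verb": "GetRecord", "metadataPrefix": prefix, "identifier": identifier}
    # Keep ':' and '/' unescaped so the URL matches the form shown in the record.
    return base + "?" + urlencode(params, safe=":/")


def fetch_record(url, timeout=30):
    """Download the raw XML response; assumes the server answers in UTF-8."""
    with urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


def parse_dublin_core(xml_text):
    """Collect all Dublin Core elements of a response into a dict of lists."""
    fields = {}
    prefix = "{" + DC_NS + "}"
    for elem in ET.fromstring(xml_text).iter():
        if elem.tag.startswith(prefix) and elem.text and elem.text.strip():
            name = elem.tag[len(prefix):]
            fields.setdefault(name, []).append(elem.text.strip())
    return fields
```

Under these assumptions, parse_dublin_core(fetch_record(build_getrecord_url())) would return the record's fields keyed by Dublin Core element name. The network call is kept in its own function so that URL construction and parsing can be checked offline.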

Provenance
Creator	Sprugnoli, Rachele; Pellegrini, Matteo; Cecchini, Flavio Massimiliano; Passarotti, Marco
Publisher	CIRCSE Research Centre, Università Cattolica del Sacro Cuore
Publication Year	2020
Funding Reference	info:eu-repo/grantAgreement/EC/H2020/769994
Rights	Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0); http://creativecommons.org/licenses/by-nc-sa/4.0/; PUB
OpenAccess	true
Contact	dspace-clarin-it-ilc-help(at)ilc.cnr.it

Representation
Language	Latin
Resource Type	corpus
Format	application/zip; text/plain; charset=utf-8; downloadable_files_count: 1
Discipline	Linguistics