Croatian-English parallel corpus hrenWaC 2.0

PID

The hrenWaC corpus version 2.0 consists of parallel Croatian-English texts crawled from the .hr top-level domain for Croatia. The corpus was built with Spidextor (https://github.com/abumatran/spidextor), a tool that glues together the output of SpiderLing used for crawling and Bitextor used for bitext extraction. The accuracy of the extracted bitext on the segment level is around 80% and on the word level around 84%.

Identifier
PID http://hdl.handle.net/11356/1058
Related Identifier http://nlp.ffzg.hr/resources/corpora/hrenwac/
Metadata Access http://www.clarin.si/repository/oai/request?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:www.clarin.si:11356/1058
Provenance
Creator Ljubešić, Nikola; Esplà-Gomis, Miquel; Ortiz Rojas, Sergio; Klubička, Filip; Toral, Antonio
Publisher Jožef Stefan Institute
Publication Year 2016
Funding Reference info:eu-repo/grantAgreement/EC/FP7/324414
Rights CLARIN.SI User Licence for Internet Corpora; https://www.clarin.si/info/wp-content/uploads/2016/01/CLARIN.SI-WAC-2016-01.pdf; ACA
OpenAccess true
Contact info(at)clarin.si
Representation
Language Croatian; English
Resource Type corpus
Format text/plain; charset=utf-8; application/octet-stream; downloadable_files_count: 1
Discipline Linguistics