Icelandic web corpus MaCoCu-is 1.0


The Icelandic web corpus MaCoCu-is 1.0 was built by crawling the ".is" internet top-level domain in 2021, extending the crawl dynamically to other domains as well. The crawler is available at https://github.com/macocu/MaCoCu-crawler.

Considerable effort was devoted to cleaning the extracted text to provide a high-quality web corpus. This was achieved by removing boilerplate (https://corpus.tools/wiki/Justext) and near-duplicate paragraphs (https://corpus.tools/wiki/Onion), and by discarding very short texts as well as texts that are not in the target language. The dataset comes with extensive metadata that allows filtering based on text quality and other criteria (https://github.com/bitextor/monotextor), which makes the corpus useful for corpus linguistics studies as well as for training language models and other language technologies.

In the XML format, each document is accompanied by the following metadata: title, crawl date, URL, domain, file type of the original document, distribution of languages inside the document, and a fluency score based on a language model. The text of each document is divided into paragraphs, each annotated with whether it is a heading, its quality and fluency, the automatically identified language of its text, and whether it contains sensitive information (identified via the Biroamer tool - https://github.com/bitextor/biroamer).

The TSV format delivers sentence-level data and contains the following metadata: sentence URL, paragraph and sentence ID within the document, a simhash and a quality score that together allow filtering out near-duplicate sentences (all sentences with the same simhash can be deleted except the one with the highest quality score), the language of the sentence, information on sentence fluency, and whether the sentence contains personal or sensitive information (identified via the Biroamer sensitive data and named entity recognizer).
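
The near-duplicate filtering rule described for the TSV format can be sketched in a few lines of Python. This is a minimal illustration, not the tooling used by the corpus authors. The column names `text`, `simhash` and `quality_score`, the header row, and the sample rows are all assumptions made for the example; check the actual column layout of the released TSV file before adapting it.

```python
import csv
import io


def read_tsv(text):
    """Parse TSV text with a header row into a list of dicts."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE)
    return list(reader)


def dedupe_by_simhash(rows, hash_key="simhash", score_key="quality_score"):
    """For each simhash value, keep only the row with the highest quality score.

    The key names are hypothetical; on ties the first row seen is kept.
    """
    best = {}
    for row in rows:
        key = row[hash_key]
        score = float(row[score_key])
        if key not in best or score > float(best[key][score_key]):
            best[key] = row
    return list(best.values())


# Made-up sample data, only to show the shape of the input.
SAMPLE = (
    "text\tsimhash\tquality_score\n"
    "Góðan daginn.\ta1\t0.91\n"
    "Góðan daginn!\ta1\t0.55\n"
    "Takk fyrir.\tb2\t0.80\n"
)

kept = dedupe_by_simhash(read_tsv(SAMPLE))
kept_texts = [row["text"] for row in kept]
```

For a file too large to hold in memory, the same rule could be applied in two passes: first record the best score per simhash, then write out only the matching rows.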

Notice and take down: Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please: (1) Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted. (2) Clearly identify the copyrighted work claimed to be infringed. (3) Clearly identify the material that is claimed to be infringing and information reasonably sufficient in order to allow us to locate the material. (4) Please write to the contact person for this resource whose email is available in the full item record. We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

This action has received funding from the European Union's Connecting Europe Facility 2014-2020 - CEF Telecom, under Grant Agreement No. INEA/CEF/ICT/A2020/2278341. This communication reflects only the author’s view. The Agency is not responsible for any use that may be made of the information it contains.

Identifier
PID http://hdl.handle.net/11356/1518
Related Identifier http://hdl.handle.net/11356/1805
Related Identifier https://macocu.eu/
Metadata Access http://www.clarin.si/repository/oai/request?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:www.clarin.si:11356/1518
Provenance
Creator Bañón, Marta; Esplà-Gomis, Miquel; Forcada, Mikel L.; García-Romero, Cristian; Kuzman, Taja; Ljubešić, Nikola; van Noord, Rik; Pla Sempere, Leopoldo; Ramírez-Sánchez, Gema; Rupnik, Peter; Suchomel, Vít; Toral, Antonio; van der Werff, Tobias; Zaragoza, Jaume
Publisher Jožef Stefan Institute; Prompsit; Rijksuniversiteit Groningen; Universitat d'Alacant
Publication Year 2022
Rights CC0-No Rights Reserved; https://creativecommons.org/publicdomain/zero/1.0/; PUB
OpenAccess true
Contact info(at)clarin.si
Representation
Language Icelandic
Resource Type corpus
Format text/plain; charset=utf-8; application/gzip; application/octet-stream; downloadable_files_count: 3
Discipline Linguistics