Slovene Web genre identification corpus GINCO 1.0

The Slovene Web genre identification corpus GINCO 1.0 contains web texts manually annotated with genre. The texts come from two Slovene web corpora: the slWaC 2.0 corpus, crawled in 2014, and a web corpus crawled in 2021 within the scope of the MaCoCu project.

The corpus allows for automated genre identification and genre analyses as well as other web corpora research, and comprises two parts:

  • subcorpus of suitable texts, containing 1002 texts (478,969 words), manually annotated with 24 genre categories (News/Reporting, Announcement, Research Article, Instruction, Recipe, Call (such as a Call for Papers), Legal/Regulation, Information/Explanation, Opinionated News, Review, Opinion/Argumentation, Promotion of a Product, Promotion of Services, Invitation, Promotion, Interview, Forum, Correspondence, Script/Drama, Prose, Lyrical, FAQ (Frequently Asked Questions), List of Summaries/Excerpts, and Other)

  • subcorpus of unsuitable texts, containing 123 texts (173,778 words), discarded as unsuitable for genre annotation for reasons encoded by the following labels: Machine Translation, Generated Text, Not Slovene, Encoding Issues, HTML Source Code, Boilerplate, Too Short/Incoherent, Too Long (longer than 5,000 words), Non-Textual (no full sentences, e.g. tables, lists), and Multiple texts.

The texts in the suitable subset are annotated with up to three genre categories: the primary label marks the most prevalent genre, while the secondary and tertiary labels denote the presence of additional genre(s). The labels are encoded at three levels of detail, allowing experiments with the full set (24 labels), a set of 21 labels (labels with fewer than 5 instances are merged with the label Other), and a set of 12 labels (similar labels are merged). Additionally, the corpus contains metadata about each text (e.g. url, domain, year) and its paragraphs (e.g. near-duplicates and their usefulness for genre identification).
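The reduction from the full label set to the 21-label set can be illustrated with a minimal Python sketch. It assumes the primary labels are available as a plain list of strings; the function name, the toy data, and the way labels are loaded are hypothetical and are not taken from the corpus files, which already ship with the reduced label sets encoded.

```python
from collections import Counter


def merge_rare_labels(labels, min_count=5, fallback="Other"):
    """Replace every label with fewer than `min_count` instances by `fallback`.

    Hypothetical helper for illustration; not part of the GINCO distribution.
    """
    counts = Counter(labels)
    return [
        label if counts[label] >= min_count or label == fallback else fallback
        for label in labels
    ]


# Toy data (invented for the example, not real corpus counts).
primary = (
    ["News/Reporting"] * 6
    + ["Recipe"] * 5
    + ["Script/Drama"] * 2
    + ["Other"]
)

merged = merge_rare_labels(primary)
print(Counter(merged))
```

In this toy example, Script/Drama has only two instances and is therefore folded into Other, while labels with five or more instances are kept unchanged. The 12-label set is built differently (similar labels are merged), so it would need an explicit mapping rather than a frequency threshold.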

Identifier
PID http://hdl.handle.net/11356/1467
Related Identifier https://macocu.eu/
Metadata Access http://www.clarin.si/repository/oai/request?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:www.clarin.si:11356/1467
Provenance
Creator Kuzman, Taja; Brglez, Mojca; Rupnik, Peter; Ljubešić, Nikola
Publisher Jožef Stefan Institute
Publication Year 2021
Rights Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0); https://creativecommons.org/licenses/by-sa/4.0/; PUB
OpenAccess true
Contact info(at)clarin.si
Representation
Language Slovenian; Slovene
Resource Type corpus
Format text/plain; charset=utf-8; application/zip; downloadable_files_count: 2
Discipline Linguistics