Slovene Web genre identification corpus GINCO 1.0

Dataset

The Slovene Web genre identification corpus GINCO 1.0 contains web texts manually annotated with genre. The texts come from two Slovene web corpora: the slWaC 2.0 corpus, crawled in 2014, and a web corpus crawled in 2021 within the MaCoCu project.

The corpus supports automated genre identification, genre analysis, and other research on web corpora. It comprises two parts:

- a subcorpus of suitable texts, containing 1002 texts (478,969 words), manually annotated with 24 genre categories (News/Reporting, Announcement, Research Article, Instruction, Recipe, Call (such as a Call for Papers), Legal/Regulation, Information/Explanation, Opinionated News, Review, Opinion/Argumentation, Promotion of a Product, Promotion of Services, Invitation, Promotion, Interview, Forum, Correspondence, Script/Drama, Prose, Lyrical, FAQ (Frequently Asked Questions), List of Summaries/Excerpts, and Other)
- a subcorpus of unsuitable texts, containing 123 texts (173,778 words), discarded as unsuitable for genre annotation for reasons encoded by the following labels: Machine Translation, Generated Text, Not Slovene, Encoding Issues, HTML Source Code, Boilerplate, Too Short/Incoherent, Too Long (longer than 5,000 words), Non-Textual (no full sentences, e.g. tables, lists), and Multiple texts.

The texts in the suitable subset are annotated with up to three genre categories: the primary label marks the most prevalent genre, while the secondary and tertiary labels denote the presence of additional genres. The labels are encoded at three levels of detail, allowing experiments with the full set (24 labels), a set of 21 labels (labels with fewer than 5 instances are merged into the label Other), and a set of 12 labels (similar labels are merged). Additionally, the corpus contains some metadata about each text (e.g. URL, domain, year) and its paragraphs (e.g. near-duplicates and their usefulness for genre identification).
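The reduction from the full label set to the smaller one can be illustrated with a minimal sketch. It assumes only the rule stated above (labels with fewer than 5 instances are merged into Other); the function name `merge_rare_labels` and the toy label list are hypothetical and do not reflect the corpus's actual file format or label counts.

```python
from collections import Counter


def merge_rare_labels(labels, min_count=5, fallback="Other"):
    """Replace every label that occurs fewer than `min_count` times with `fallback`.

    Hypothetical helper: it mirrors the merging rule described in the record,
    not the corpus authors' own code.
    """
    counts = Counter(labels)
    return [
        label if label == fallback or counts[label] >= min_count else fallback
        for label in labels
    ]


# Toy primary labels (invented for illustration, not taken from GINCO).
toy_labels = ["News/Reporting"] * 6 + ["Forum"] * 5 + ["Recipe"] * 2 + ["Other"]
merged = merge_rare_labels(toy_labels)
merged_counts = Counter(merged)
```

In this toy input, only Recipe falls below the threshold, so its two instances are relabeled as Other while News/Reporting and Forum are kept.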

Identifier
PID	http://hdl.handle.net/11356/1467
Related Identifier	https://aclanthology.org/2022.lrec-1.170.pdf
Related Identifier	https://macocu.eu/
Metadata Access	http://www.clarin.si/repository/oai/request?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:www.clarin.si:11356/1467
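
The Metadata Access URL above is an OAI-PMH GetRecord request. As a rough sketch, it can be assembled from its parts as follows. The base URL, the `verb`, `metadataPrefix`, and `identifier` parameters, and the `oai:www.clarin.si:` prefix are all read off the URL in this record; the function name `build_getrecord_url` is hypothetical, and the sketch only builds the URL without sending a request.

```python
from urllib.parse import urlencode

# Base endpoint as it appears in the Metadata Access field of this record.
OAI_BASE = "http://www.clarin.si/repository/oai/request"


def build_getrecord_url(handle, metadata_prefix="oai_dc"):
    """Build a GetRecord URL for a handle such as '11356/1467' (hypothetical helper)."""
    params = {
        "verb": "GetRecord",
        "metadataPrefix": metadata_prefix,
        "identifier": f"oai:www.clarin.si:{handle}",
    }
    # Keep ':' and '/' unescaped so the result matches the form used in the record.
    return f"{OAI_BASE}?{urlencode(params, safe=':/')}"


url = build_getrecord_url("11356/1467")
```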

Provenance
Creator	Kuzman, Taja; Brglez, Mojca; Rupnik, Peter; Ljubešić, Nikola
Publisher	Jožef Stefan Institute
Publication Year	2021
Rights	Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0); https://creativecommons.org/licenses/by-sa/4.0/; PUB
OpenAccess	true
Contact	info(at)clarin.si

Representation
Language	Slovenian; Slovene
Resource Type	corpus
Format	text/plain; charset=utf-8; application/zip; downloadable_files_count: 2
Discipline	Linguistics