Annotated collocation candidates for three common syntactic structures in Slovene

This resource contains 713,310 collocation candidates that were automatically extracted from the Gigafida 2.0 corpus (http://hdl.handle.net/11356/1320) and annotated as to whether they are legitimate collocations. The collocation candidates belong to three syntactic structures that are among the most common and semantically most informative collocational structures in the Slovenian language:

- Verb + Noun in accusative (Structure_ID = 23; Structure_name = gg-s4;#1#-1_2_dve). Contains 163,229 annotated collocation candidates.
- Adjective + Noun (Structure_ID = 34; Structure_name = p0-s0;2_1_dol-#2#). Contains 342,714 annotated collocation candidates.
- Noun + Noun in genitive (Structure_ID = 53; Structure_name = s0-s2;#1#-1_2_dol). Contains 207,367 annotated collocation candidates.

Structure IDs and structure names are given as used in the Digital Dictionary Database at the Centre for Language Resources and Technologies at the University of Ljubljana (https://www.cjvt.si/en/).
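As a quick consistency check, the three structures and their counts can be written down as a small lookup table. This is only an illustrative sketch: the `STRUCTURES` dictionary and its layout are assumptions made for the example and do not reflect the format of the distributed file, which is not described here.

```python
# Hypothetical lookup table built from the figures quoted in the description.
# Key: Structure_ID; value: (description, Structure_name, number of candidates).
STRUCTURES = {
    23: ("Verb + Noun in accusative", "gg-s4;#1#-1_2_dve", 163_229),
    34: ("Adjective + Noun", "p0-s0;2_1_dol-#2#", 342_714),
    53: ("Noun + Noun in genitive", "s0-s2;#1#-1_2_dol", 207_367),
}

# The per-structure counts should add up to the stated total of 713,310.
total = sum(count for _, _, count in STRUCTURES.values())
print(total)  # 713310
```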

In the annotation, three types of decision were possible:

a) YES. The collocation candidate is a legitimate collocation, i.e., it is statistically relevant, represents the right syntactic structure, and is a meaningful but semantically transparent word combination.
b) EXTENDED. The collocation candidate may be considered a collocation, but in most cases or always requires a third element.
c) NO. The collocation candidate is not a collocation. This can be, for example, because of a problem in lemmatisation or morphosyntactic annotation, or because the candidate is a compound, phrase, or some other multiword unit.

It should be noted that the annotation did not consider the criterion of collocation relevance, e.g., which collocations would make it into a dictionary or a related resource. We consider this the next step in using the data. However, relevance was partly accounted for by the selection method: the collocation candidates were selected using noun, adjective and verb headwords from the Collocation Dictionary of Modern Slovene 1.0 (http://hdl.handle.net/11356/1250), taking up to the top 30 collocations with a minimum frequency of 4 for each headword in each syntactic structure.
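The selection rule described above (a minimum frequency of 4, then up to the top 30 collocations for each headword in each syntactic structure) can be sketched as follows. This is a minimal illustration and not the authors' actual pipeline: the description does not say how collocations were ranked, so the `score` field is a hypothetical ranking value, and the `Candidate` record and `select_candidates` function are names invented for this example.

```python
from collections import defaultdict
from typing import Iterable, List, NamedTuple


class Candidate(NamedTuple):
    """Hypothetical record for one collocation candidate."""
    headword: str
    structure_id: int   # e.g. 23, 34 or 53
    collocate: str
    frequency: int      # corpus frequency of the word combination
    score: float        # ASSUMPTION: ranking value; the source does not name one


def select_candidates(
    candidates: Iterable[Candidate],
    min_frequency: int = 4,
    top_n: int = 30,
) -> List[Candidate]:
    """Keep up to `top_n` best-scoring candidates per (headword, structure)."""
    groups = defaultdict(list)
    for cand in candidates:
        # Drop candidates below the frequency threshold before ranking.
        if cand.frequency >= min_frequency:
            groups[(cand.headword, cand.structure_id)].append(cand)

    selected: List[Candidate] = []
    for key in sorted(groups):
        # Rank within each headword/structure pair, highest score first.
        ranked = sorted(groups[key], key=lambda c: c.score, reverse=True)
        selected.extend(ranked[:top_n])
    return selected
```

A headword with fewer than 30 qualifying collocations in a structure simply contributes all of them, which matches the "up to" wording in the description.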

Identifier
PID http://hdl.handle.net/11356/1903
Related Identifier https://www.clarin.si/info/services/projects/#CLARINSI_project_reports_2023
Metadata Access http://www.clarin.si/repository/oai/request?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:www.clarin.si:11356/1903
Provenance
Creator Kosem, Iztok; Gantar, Polona; Roblek, Rebeka; Zgaga, Karolina
Publisher Jožef Stefan Institute
Publication Year 2023
Rights Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0); https://creativecommons.org/licenses/by-sa/4.0/; PUB
OpenAccess true
Contact info(at)clarin.si
Representation
Language Slovenian; Slovene
Resource Type lexicalConceptualResource
Format application/zip; text/plain; charset=utf-8; downloadable_files_count: 1
Discipline Linguistics