This resource contains 713,310 collocation candidates that were automatically extracted from the Gigafida 2.0 corpus (http://hdl.handle.net/11356/1320) and annotated as to whether or not they are legitimate collocations. The collocation candidates belong to three syntactic structures that are among the most common and semantically most informative collocational structures in the Slovenian language:
- Verb + Noun in accusative (Structure_ID = 23; Structure_name = gg-s4;#1#-1_2_dve). Contains 163,229 annotated collocation candidates.
- Adjective + Noun (Structure_ID = 34; Structure_name = p0-s0;2_1_dol-#2#). Contains 342,714 annotated collocation candidates.
- Noun + Noun in genitive (Structure_ID = 53; Structure_name = s0-s2;#1#-1_2_dol). Contains 207,367 annotated collocation candidates.
Structure IDs and structure names are provided as used in the Digital Dictionary Database at the Centre for Language Resources and Technologies at the University of Ljubljana (https://www.cjvt.si/en/).
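The three structures listed above can be kept in a small lookup table, which also makes it easy to confirm that the per-structure counts add up to the stated total. The following is only a minimal sketch: the class and variable names (Structure, STRUCTURES, BY_ID, TOTAL) are illustrative assumptions and are not part of the resource itself; the IDs, names and counts are copied from the list above.

```python
from typing import NamedTuple


class Structure(NamedTuple):
    """One syntactic structure as described in this resource (illustrative type)."""
    structure_id: int
    name: str
    description: str
    candidates: int


# Values copied from the description; the container itself is hypothetical.
STRUCTURES = [
    Structure(23, "gg-s4;#1#-1_2_dve", "Verb + Noun in accusative", 163_229),
    Structure(34, "p0-s0;2_1_dol-#2#", "Adjective + Noun", 342_714),
    Structure(53, "s0-s2;#1#-1_2_dol", "Noun + Noun in genitive", 207_367),
]

# Lookup by Structure_ID, e.g. BY_ID[34].description
BY_ID = {s.structure_id: s for s in STRUCTURES}

# Sum of the per-structure counts; should match the total stated for the resource.
TOTAL = sum(s.candidates for s in STRUCTURES)
```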
In the annotation, three types of decision were possible:
a) YES. The collocation candidate is a legitimate collocation, i.e., it is statistically relevant, represents the right syntactic structure, and is a meaningful yet semantically transparent word combination.
b) EXTENDED. The collocation candidate may be considered a collocation, but it requires a third element in most or all cases.
c) NO. The collocation candidate is not a collocation. Possible reasons include an error in lemmatisation or morphosyntactic annotation, or the candidate being some other kind of multiword unit, such as a compound or a phrase.
Note that the annotation did not consider the criterion of collocation relevance, e.g., which collocations would make it into a dictionary or a related resource. We consider this a next step in using this data. However, relevance was partly built into the selection method: the collocation candidates were selected using noun, adjective and verb headwords from the Collocation Dictionary of Modern Slovene 1.0 (http://hdl.handle.net/11356/1250), taking up to the top 30 collocations with a minimum frequency of 4 for each headword per syntactic structure.
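The selection rule described above (up to 30 candidates per headword and syntactic structure, with a minimum frequency of 4) can be sketched as follows. This is an illustration under stated assumptions, not the procedure actually used to build the resource: the field names (headword, structure_id, collocate, frequency, score) are hypothetical, and ranking by a "score" field is an assumption, since the description does not say which measure determines the "top" candidates.

```python
from collections import defaultdict


def select_candidates(rows, min_freq=4, top_n=30):
    """Keep up to `top_n` candidates per (headword, structure) with frequency >= `min_freq`.

    `rows` is an iterable of dicts with the hypothetical keys
    'headword', 'structure_id', 'collocate', 'frequency' and 'score'.
    """
    groups = defaultdict(list)
    for row in rows:
        # Apply the minimum-frequency threshold first.
        if row["frequency"] >= min_freq:
            groups[(row["headword"], row["structure_id"])].append(row)

    selected = []
    for key in sorted(groups):
        # Assumption: candidates are ranked by 'score', highest first.
        ranked = sorted(groups[key], key=lambda r: r["score"], reverse=True)
        selected.extend(ranked[:top_n])
    return selected
```

A headword with fewer than 30 candidates above the frequency threshold simply contributes all of them, which corresponds to the "up to" in the description.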