Collection of Slovenian paremiological units Pregovori 1.1

This corpus collects and annotates an extensive and highly valuable diachronic collection of 37,390 Slovenian proverbs, compiled over more than 50 years at the ZRC SAZU Institute of Slovenian Ethnology. Each proverb is linked to its source; the sources comprise 2,630 bibliographical items (1578-2010): printed books, journals, calendars, collecting campaigns in various journals, folklore fieldwork collections, personal notes, etc.

Each proverb is represented in two ways: as a diplomatic transcription faithful to its source (the transcription of older texts is inconsistent, owing to technical difficulties faced by the transcribers and to human error), and as a critical transcription that modernises the alphabet used.

The words of the critical transcriptions have also been automatically modernised to contemporary spelling using cSMTiser (https://github.com/clarinsi/csmtiser) trained on the goo300k corpus of historical Slovenian (http://hdl.handle.net/11356/1025). The modernised words have then been annotated with lemmas, MULTEXT-East morphosyntactic descriptions (https://nl.ijs.si/ME/V6/msd/html/msd-sl.html) and Universal Dependencies (https://universaldependencies.org/) using the CLASSLA toolchain (https://github.com/clarinsi/classla).

The canonical encoding of the corpus is TEI, but the corpus is also distributed in two derived encodings. One consists of the proverbs and the bibliography as two TSV files; the other is a vertical file with the proverbs, as used by CQP-type concordancers such as Sketch Engine.
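
For readers unfamiliar with the vertical format, the following is a minimal Python sketch of how such a file can be read. It assumes the common convention of one token per line with tab-separated attributes and structural tags (such as `<s>` and `</s>`) on lines of their own. The attribute columns (word, lemma), the structure names, and the sample text are assumptions made for illustration and are not taken from the corpus documentation; `read_vertical` is a hypothetical helper.

```python
import io

# Invented sample in a generic vertical layout: structural tags on their own
# lines, one token per line, attributes separated by tabs.
# Column order (word, lemma) is an assumption, not the corpus specification.
SAMPLE = (
    "<text>\n"
    "<s>\n"
    "Rana\tran\n"
    "ura\tura\n"
    ",\t,\n"
    "zlata\tzlat\n"
    "ura\tura\n"
    ".\t.\n"
    "</s>\n"
    "</text>\n"
)


def read_vertical(lines):
    """Yield each sentence as a list of attribute tuples, one tuple per token."""
    sentence = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line:
            continue
        # Lines of the form <...> are treated as structural markup, not tokens.
        if line.startswith("<") and line.endswith(">"):
            if line == "</s>" and sentence:
                yield sentence
                sentence = []
            continue
        sentence.append(tuple(line.split("\t")))
    if sentence:
        # Emit trailing tokens if the final sentence was never closed.
        yield sentence


sentences = list(read_vertical(io.StringIO(SAMPLE)))
words = [token[0] for token in sentences[0]]
```

With a real file, the same function could be called on an open file handle, for example `read_vertical(open(path, encoding="utf-8"))`, after checking the actual column layout in the corpus documentation.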

Compared with the previous version 1.0, this version adds 1,183 proverbs and 115 bibliographical items and corrects some errors.

Identifier
PID http://hdl.handle.net/11356/1853
Related Identifier http://hdl.handle.net/11356/1455
Related Identifier https://isn2.zrc-sazu.si/en/programi-in-projekti/traditional-paremiological-units-in-dialogue-with-contemporary-use
Metadata Access http://www.clarin.si/repository/oai/request?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:www.clarin.si:11356/1853
Provenance
Creator Babič, Saša; Peče, Miha; Erjavec, Tomaž; Ivančič Kutin, Barbara; Šrimpf Vendramin, Katarina; Kropej Telban, Monika; Jakop, Nataša; Stanonik, Marija
Publisher ZRC SAZU; Jožef Stefan Institute
Publication Year 2023
Rights Creative Commons - Attribution 4.0 International (CC BY 4.0); https://creativecommons.org/licenses/by/4.0/; PUB
OpenAccess true
Contact info(at)clarin.si
Representation
Language Slovenian; Slovene
Resource Type corpus
Format text/plain; charset=utf-8; application/zip; downloadable_files_count: 3
Discipline Linguistics