Slovenian parliamentary corpus (1990-2018) siParl 2.0

Dataset

The siParl corpus contains minutes of the Assembly of the Republic of Slovenia for the 11th legislative period 1990-1992, minutes of the National Assembly of the Republic of Slovenia from the 1st to the 7th legislative period 1992-2018, minutes of the working bodies of the National Assembly of the Republic of Slovenia from the 2nd to the 7th legislative period 1996-2018, and minutes of the Council of the President of the National Assembly from the 2nd to the 7th legislative period 1996-2018. The corpus comprises over 10 thousand sessions, one million speeches or 200 million words. It contains meta-data about the speakers, a typology of sessions, etc., as well as structural, editorial and linguistic annotations. The corpus is encoded according to the Parla-CLARIN schema (https://github.com/clarin-eric/parla-clarin). Each mandate is in one directory, and each session is in one file.
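
The one-directory-per-mandate, one-file-per-session layout can be walked with a few lines of Python. This is a minimal sketch, not a tool shipped with the corpus: the function name sessions_by_mandate is hypothetical, and the ".xml" extension is an assumption based on the corpus being TEI-encoded; adjust it to the actual file names in the downloaded archive.

```python
from collections import defaultdict
from pathlib import Path


def sessions_by_mandate(corpus_root, pattern="*.xml"):
    """Map each mandate directory name to the sorted session file names in it.

    Assumes one sub-directory per mandate directly under corpus_root and
    one file per session inside it; the "*.xml" pattern is an assumption.
    """
    grouped = defaultdict(list)
    root = Path(corpus_root)
    # Only directories count as mandates; stray files in the root are skipped.
    for mandate_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        for session_file in sorted(mandate_dir.glob(pattern)):
            grouped[mandate_dir.name].append(session_file.name)
    return dict(grouped)
```

Calling sessions_by_mandate on the unpacked corpus directory returns a dictionary keyed by mandate directory, which makes it easy to count sessions per legislative period before parsing any file.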

This item comprises the following datasets: 1. source DARIAH-SI Parla-CLARIN encoded corpus; 2. linguistically annotated Parla-CLARIN encoded corpus: tokenisation, MSD tagging, lemmatisation, Universal Dependencies features and syntactic parses, named entities; 3. linguistically annotated corpus in vertical format used by the CWB and Sketch Engine concordancers; this format is simpler and smaller but does not contain all the information from the source TEI; 4. linguistically annotated corpus in CoNLL-U format as used by Universal Dependencies; 5. plain text of the corpus.
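
As an illustration of the CoNLL-U dataset, the following minimal sketch reads CoNLL-U lines into sentences using only the Python standard library. The ten column names follow the CoNLL-U format used by Universal Dependencies; the function name read_conllu is hypothetical, and the sample sentence is invented for illustration and is not taken from siParl.

```python
# The ten tab-separated columns defined by the CoNLL-U format.
CONLLU_FIELDS = (
    "id", "form", "lemma", "upos", "xpos",
    "feats", "head", "deprel", "deps", "misc",
)


def read_conllu(lines):
    """Yield sentences as lists of token dicts from an iterable of lines."""
    sentence = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            # A blank line ends the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):
            # Comment lines carry sentence-level metadata; skipped here.
            continue
        columns = line.split("\t")
        if len(columns) != len(CONLLU_FIELDS):
            raise ValueError(f"expected 10 columns, got {len(columns)}: {line!r}")
        sentence.append(dict(zip(CONLLU_FIELDS, columns)))
    if sentence:
        yield sentence


# Invented sample, not from the corpus.
SAMPLE = (
    "# text = Dober dan.\n"
    "1\tDober\tdober\tADJ\t_\t_\t2\tamod\t_\t_\n"
    "2\tdan\tdan\tNOUN\t_\t_\t0\troot\t_\tSpaceAfter=No\n"
    "3\t.\t.\tPUNCT\t_\t_\t2\tpunct\t_\t_\n"
)

sentences = list(read_conllu(SAMPLE.splitlines()))
lemmas = [token["lemma"] for token in sentences[0]]
```

For real files, passing an open file handle (opened with encoding="utf-8") to read_conllu streams one sentence at a time, which matters for a corpus of this size.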

Note that each dataset also includes TSV meta-data files on sessions (files) and speakers.
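
The TSV meta-data files can be loaded with Python's standard csv module. This is a minimal sketch assuming each file has a header row; the column names in the example (speaker_id, name) are hypothetical placeholders, since the actual headers are not listed in this description.

```python
import csv
import io


def read_tsv(handle):
    """Read a tab-separated file with a header row into a list of dicts.

    QUOTE_NONE keeps any quote characters in the values as literal text.
    """
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return list(reader)


# Hypothetical content; the real column names may differ.
example = io.StringIO('speaker_id\tname\nsp1\tExample "Speaker"\n')
rows = read_tsv(example)
```

With a real file, open it with encoding="utf-8" and newline="" and pass the handle to read_tsv; each row is then a dictionary keyed by the file's own header names.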

Compared with the previous version 1.0, this version corrects many errors, has substantially better meta-data, and its linguistic processing covers more levels and contains fewer errors.

Identifier
PID	http://hdl.handle.net/11356/1300
Related Identifier	http://hdl.handle.net/11356/1236
Related Identifier	http://hdl.handle.net/11356/1748
Related Identifier	https://github.com/DARIAH-SI/siParl/
Metadata Access	http://www.clarin.si/repository/oai/request?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:www.clarin.si:11356/1300

Provenance
Creator	Pančur, Andrej; Erjavec, Tomaž; Ojsteršek, Mihael; Šorn, Mojca; Blaj Hribar, Neja
Publisher	Institute of Contemporary History
Publication Year	2020
Rights	Creative Commons - Attribution 4.0 International (CC BY 4.0); https://creativecommons.org/licenses/by/4.0/; PUB
OpenAccess	true
Contact	info(at)clarin.si

Representation
Language	Slovenian; Slovene
Resource Type	corpus
Format	application/zip; text/plain; charset=utf-8; downloadable_files_count: 6
Discipline	Linguistics