The siParl corpus contains minutes of the Assembly of the Republic of Slovenia for the 11th legislative period 1990-1992, minutes of the National Assembly of the Republic of Slovenia from the 1st to the 7th legislative period 1992-2018, minutes of the working bodies of the National Assembly of the Republic of Slovenia from the 2nd to the 7th legislative period 1996-2018, and minutes of the Council of the President of the National Assembly from the 2nd to the 7th legislative period 1996-2018. The corpus comprises over 10 thousand sessions and one million speeches, or 200 million words. It contains meta-data about the speakers, a typology of sessions, etc., as well as structural, editorial and linguistic annotations. The corpus is encoded according to the Parla-CLARIN schema (https://github.com/clarin-eric/parla-clarin). Each mandate is in one directory, and each session is in one file.
This item comprises the following datasets:
1. source DARIAH-SI Parla-CLARIN encoded corpus;
2. linguistically annotated Parla-CLARIN encoded corpus: tokenisation, MSD tagging, lemmatisation, Universal Dependencies features and syntactic parses, named entities;
3. linguistically annotated corpus in vertical format used by CWB and Sketch Engine concordancers; this format is simpler and smaller but does not contain all the information from the source TEI;
4. linguistically annotated corpus in CoNLL-U format as used by Universal Dependencies;
5. plain text of the corpus.
Note that each dataset also includes TSV meta-data files on sessions (files) and speakers.
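As an illustration of working with the CoNLL-U dataset, the following is a minimal Python sketch of a reader for that format, using only the standard library. It assumes the standard ten tab-separated CoNLL-U columns; the function name read_conllu and the sample sentence are invented for this example and are not taken from the corpus or its tooling.

```python
import io

# The ten tab-separated columns defined by the CoNLL-U format.
CONLLU_FIELDS = ["id", "form", "lemma", "upos", "xpos",
                 "feats", "head", "deprel", "deps", "misc"]


def read_conllu(stream):
    """Yield each sentence as a list of token dicts keyed by CONLLU_FIELDS."""
    sentence = []
    for line in stream:
        line = line.rstrip("\n")
        if not line:
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
        elif line.startswith("#"):
            # Comment lines (sentence ids, raw text, ...) are skipped here.
            continue
        else:
            sentence.append(dict(zip(CONLLU_FIELDS, line.split("\t"))))
    if sentence:
        # Handle a final sentence that is not followed by a blank line.
        yield sentence


# Invented sample (not from the corpus): one sentence with three tokens.
SAMPLE = "\n".join([
    "# text = Hvala lepa.",
    "\t".join(["1", "Hvala", "hvala", "NOUN", "_", "_", "0", "root", "_", "_"]),
    "\t".join(["2", "lepa", "lep", "ADJ", "_", "_", "1", "amod", "_", "SpaceAfter=No"]),
    "\t".join(["3", ".", ".", "PUNCT", "_", "_", "1", "punct", "_", "_"]),
    "",
    "",
])

sentences = list(read_conllu(io.StringIO(SAMPLE)))
lemmas = [token["lemma"] for token in sentences[0]]
```

To process a session file, an open file handle can be passed to read_conllu in place of the in-memory sample; the actual file names and directory layout should be taken from the downloaded dataset.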
Compared to the previous version 1.0, this version corrects many errors, has substantially better meta-data, and its linguistic processing covers more levels and contains fewer errors.