ASR training dataset for Serbian JuzneVesti-SR v1.0

Dataset

The JuzneVesti-SR dataset consists of audio recordings and manual transcripts from the Južne Vesti website and its interview show '15 minuta' (https://www.juznevesti.com/Tagovi/Intervju-15-minuta.sr.html). The audio was processed and aligned to the manual transcripts following the pipeline of the ParlaSpeech-HR dataset (http://hdl.handle.net/11356/1494) as closely as possible. Segments in this dataset range from 2 to 30 seconds in length. The data is split into train, dev and test sets in an 80:10:10 ratio. As with the ParlaSpeech-HR dataset, two transcriptions are provided: one in raw form (with punctuation, capital letters and numerals) and one normalised with the same rule-based normaliser that was used to create ParlaSpeech-HR, which lowercases the text, removes punctuation and replaces numerals with words. The speaker_info attribute is less detailed than in the parliamentary corpora because less data is available in this domain; it covers only the guest name, guest description, host name, and speaker breakdown (whether the host or the guest is speaking).

Original transcripts were collected with the help of the ReLDI Centre Belgrade (https://reldi.spur.uzh.ch).
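
The segment-length bounds, the 80:10:10 split and the normalisation described above can be illustrated with a minimal Python sketch. It is not the ParlaSpeech-HR pipeline: the function names are hypothetical, the normaliser is a simplified stand-in that only lowercases and strips punctuation (numeral expansion is left out), and the record does not say whether the split was made over segments or over whole recordings, so the example simply splits a list of items.

```python
import random
import string

# Bounds taken from the description: segments range from 2 to 30 seconds.
MIN_SECONDS, MAX_SECONDS = 2.0, 30.0

# ASCII punctuation plus a few typographic quotes, dashes and the ellipsis.
PUNCTUATION = string.punctuation + "\u201e\u201c\u201d\u2018\u2019\u2013\u2014\u2026"


def normalise(text):
    """Simplified stand-in for the rule-based normaliser.

    Lowercases the text and removes punctuation. Replacing numerals
    with words, which the real normaliser does, is NOT implemented here.
    """
    table = str.maketrans("", "", PUNCTUATION)
    return " ".join(text.lower().translate(table).split())


def in_length_range(duration_seconds):
    """Return True if a segment duration lies within the 2 to 30 second range."""
    return MIN_SECONDS <= duration_seconds <= MAX_SECONDS


def split_80_10_10(items, seed=0):
    """Shuffle items reproducibly and split them into train, dev and test lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train = len(shuffled) * 8 // 10
    n_dev = len(shuffled) // 10
    train = shuffled[:n_train]
    dev = shuffled[n_train:n_train + n_dev]
    test = shuffled[n_train + n_dev:]
    return train, dev, test
```

For example, `normalise("Dobar dan, Niš!")` yields `dobar dan niš`, and `split_80_10_10(range(100))` returns lists of 80, 10 and 10 items.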

Identifier
PID	http://hdl.handle.net/11356/1679
Related Identifier	https://github.com/clarinsi/parlaspeech/tree/main/juzne_vesti
Metadata Access	http://www.clarin.si/repository/oai/request?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:www.clarin.si:11356/1679
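
The Metadata Access link above is an OAI-PMH GetRecord request. As a small sketch, the same URL can be assembled from the handle suffix in Python; the helper name `oai_get_record_url` is hypothetical, and the base URL and identifier prefix are simply copied from the link above.

```python
from urllib.parse import urlencode

# Base endpoint copied from the Metadata Access link in this record.
BASE_URL = "http://www.clarin.si/repository/oai/request"


def oai_get_record_url(handle, metadata_prefix="oai_dc"):
    """Build an OAI-PMH GetRecord URL for a handle such as '11356/1679'."""
    params = {
        "verb": "GetRecord",
        "metadataPrefix": metadata_prefix,
        "identifier": f"oai:www.clarin.si:{handle}",
    }
    # Keep ':' and '/' unescaped so the result matches the link as published.
    return BASE_URL + "?" + urlencode(params, safe=":/")


print(oai_get_record_url("11356/1679"))
```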

Provenance
Creator	Rupnik, Peter; Ljubešić, Nikola
Publisher	Jožef Stefan Institute
Publication Year	2022
Rights	Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0); https://creativecommons.org/licenses/by-sa/4.0/; PUB
OpenAccess	true
Contact	info(at)clarin.si

Representation
Language	Serbian
Resource Type	corpus
Format	text/plain; charset=utf-8; application/octet-stream; downloadable_files_count: 7
Discipline	Linguistics