The Sarajevo Corpus of SMS Messages in Bosnian 1.0

Dataset

This corpus is specialized, static (i.e., no future growth is planned) and diachronic, covering the period from 2002 to 2022.

The SMS messages included in this corpus were obtained from voluntary donors (informants). Both the senders and the recipients of the messages are Bosnian speakers who differ in age, education, occupation, place of origin and country of long-term residence.

The Sarajevo Corpus of SMS Messages in Bosnian was originally published by the University of Sarajevo – Faculty of Philosophy as an electronic book. The second phase of the work involved compiling the SMS messages into a corpus and annotating them linguistically. The annotation was done with the CLASSLA package (https://github.com/clarinsi/classla), version 2.1, with language = Serbian and type = nonstandard, and covers tokenization, lemmatization and morpho-syntactic tagging (both MULTEXT-East and Universal Dependencies).
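The annotation settings described above could be reproduced roughly as follows. This is a minimal sketch, not the authors' actual script: the processor list, the helper name `annotate` and the sample sentence are assumptions made for illustration, and the CLASSLA models for Serbian have to be downloaded before the pipeline can be built.

```python
# Settings named in the corpus description: language = Serbian, type = nonstandard.
# The processor list is an assumption chosen to cover tokenization,
# morpho-syntactic tagging and lemmatization.
PIPELINE_SETTINGS = {
    "lang": "sr",            # CLASSLA language code for Serbian
    "type": "nonstandard",   # models for non-standard text
    "processors": "tokenize,pos,lemma",
}


def annotate(text: str) -> str:
    """Annotate one message and return the result in CoNLL-U format.

    Hypothetical helper for illustration; requires the classla package
    and previously downloaded models.
    """
    import classla  # imported lazily so this module loads without CLASSLA

    nlp = classla.Pipeline(**PIPELINE_SETTINGS)
    doc = nlp(text)
    return doc.to_conll()


# Typical use (not executed here because it downloads models):
#   import classla
#   classla.download("sr", type="nonstandard")
#   print(annotate("Vidimo se sutra."))   # invented sentence, not from the corpus
```

Building the pipeline once and reusing it across messages would be preferable for a whole corpus; the helper above rebuilds it on every call only to keep the sketch short.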

Identifier
PID	http://hdl.handle.net/11356/1913
Related Identifier	http://hdl.handle.net/11356/1956
Related Identifier	https://www.ff.unsa.ba/index.php/bs/projekti-centra-za-b-h-s-jezik/18335-sarajevski-korpus-sms-poruka-na-bosanskom-jeziku
Metadata Access	http://www.clarin.si/repository/oai/request?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:www.clarin.si:11356/1913
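The Metadata Access link above is an OAI-PMH GetRecord request. As a minimal sketch, the same URL can be assembled from the handle suffix; the function name `get_record_url` and the constant `OAI_BASE` are hypothetical names introduced here, and the identifier pattern is simply read off the URL in this record.

```python
from urllib.parse import urlencode

# Base endpoint taken from the Metadata Access URL in this record.
OAI_BASE = "http://www.clarin.si/repository/oai/request"


def get_record_url(handle: str, metadata_prefix: str = "oai_dc") -> str:
    """Build an OAI-PMH GetRecord URL for a handle such as '11356/1913'."""
    params = {
        "verb": "GetRecord",
        "metadataPrefix": metadata_prefix,
        "identifier": f"oai:www.clarin.si:{handle}",
    }
    # Keep ':' and '/' unescaped so the result matches the URL as published.
    return f"{OAI_BASE}?{urlencode(params, safe=':/')}"


print(get_record_url("11356/1913"))
```

Fetching the resulting URL should return the Dublin Core record as XML, which can then be parsed with any XML library.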

Provenance
Creator	Wasserscheidt, Philipp; Bulić, Halid; Durmišević, Elma; Hodžić-Čavkić, Azra; Bajraktarević, Enisa; Ahmetspahić-Peljto, Azra; Šabić, Belmin
Publisher	University of Sarajevo – Faculty of Philosophy
Publication Year	2024
Rights	Creative Commons - Attribution 4.0 International (CC BY 4.0); https://creativecommons.org/licenses/by/4.0/; PUB
OpenAccess	true
Contact	info(at)clarin.si

Representation
Language	Bosnian
Resource Type	corpus
Format	text/plain; charset=utf-8; application/zip; downloadable_files_count: 1
Discipline	Linguistics