Slovene text simplification dataset SloTS

Dataset

To make easy-to-read material in Slovenian more accessible and more varied, and to support a prototype system that automatically simplifies Slovenian texts, we prepared a dataset of aligned complex and simplified Slovenian sentences. The dataset can be used to further develop text simplification models for Slovenian.

The dataset is a .json file that usually contains one complex sentence ("kompleksni") and one simplified sentence ("enostavni") per row. However, when a complex sentence carries a lot of information, we translated it into more than one simplified sentence. Conversely, several complex sentences can correspond to a single simplified sentence when information spread across them was summarised into one.
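
The row layout described above can be read with a few lines of Python. This is a minimal sketch, not the authors' own loader: the field names "kompleksni" and "enostavni" come from the description, but the function name load_pairs is hypothetical, and the exact file layout is an assumption. Because the record does not say whether the file is a single JSON array or holds one JSON object per line, the sketch accepts both. It also assumes each field holds plain text; if a row stores several sentences as a list, the values are passed through unchanged.

```python
import json
from pathlib import Path


def load_pairs(path):
    """Return (complex, simplified) pairs from a SloTS-style JSON file.

    Assumption: every row is an object with the keys "kompleksni"
    (complex side) and "enostavni" (simplified side).
    """
    text = Path(path).read_text(encoding="utf-8").strip()
    try:
        # Case 1: the whole file is one JSON document (array or single object).
        rows = json.loads(text)
    except json.JSONDecodeError:
        # Case 2: one JSON object per line (JSON Lines layout).
        rows = [json.loads(line) for line in text.splitlines() if line.strip()]
    if isinstance(rows, dict):
        # A file holding a single row object is treated as a one-row dataset.
        rows = [rows]
    return [(row["kompleksni"], row["enostavni"]) for row in rows]
```

Calling load_pairs on the downloaded file returns a list of tuples, one per row, which can be passed directly to a training or evaluation script. Rows in which one side contains several sentences are kept as a single pair, so the one-to-many and many-to-one alignments described above are preserved.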

Identifier
PID	http://hdl.handle.net/11356/1682
Related Identifier	https://github.com/sabina-skubic/text-simplification-slovene/tree/main/master-thesis
Related Identifier	https://github.com/sabina-skubic/text-simplification-slovene
Metadata Access	http://www.clarin.si/repository/oai/request?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:www.clarin.si:11356/1682

Provenance
Creator	Gorenc, Sabina; Robnik-Šikonja, Marko
Publisher	Faculty of Computer and Information Science, University of Ljubljana
Publication Year	2022
Rights	Creative Commons - Attribution 4.0 International (CC BY 4.0); https://creativecommons.org/licenses/by/4.0/; PUB
OpenAccess	true
Contact	info(at)clarin.si

Representation
Language	Slovenian; Slovene
Resource Type	corpus
Format	text/plain; charset=utf-8; application/octet-stream; downloadable_files_count: 1
Discipline	Linguistics