Slovene text simplification dataset SloTS

PID

To increase the accessibility and diversity of easy reading in Slovenian and to create a prototype system that automatically simplifies texts in Slovenian, we prepared a dataset for the Slovenian language that contains aligned simple and complex sentences, which can be used for further development of models for simplifying texts in Slovenian.

Dataset is a .json file that usually contains one complex ("kompleksni") and one simplified sentence ("enostavni") per row. However, if a complex sentence contains a lot of information we translated this sentence into more than one simplified sentences. Vice versa, more complex sentences can be translated into one simplified sentence if some information is given through more than one complex sentences but we summarised them into one simplified one.

Identifier
PID http://hdl.handle.net/11356/1682
Related Identifier https://github.com/sabina-skubic/text-simplification-slovene/tree/main/master-thesis
Related Identifier https://github.com/sabina-skubic/text-simplification-slovene
Metadata Access http://www.clarin.si/repository/oai/request?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:www.clarin.si:11356/1682
Provenance
Creator Gorenc, Sabina; Robnik-Šikonja, Marko
Publisher Faculty of Computer and Information Science, University of Ljubljana
Publication Year 2022
Rights Creative Commons - Attribution 4.0 International (CC BY 4.0); https://creativecommons.org/licenses/by/4.0/; PUB
OpenAccess true
Contact info(at)clarin.si
Representation
Language Slovenian; Slovene
Resource Type corpus
Format text/plain; charset=utf-8; application/octet-stream; downloadable_files_count: 1
Discipline Linguistics