Abstracts from the KAS corpus KAS-Abs 1.0

The KAS-Abs corpus contains 108,254 automatically identified Slovenian and/or English abstracts (30 million words) from 62,000 BSc/BA, MSc/MA, and PhD theses included in the KAS Corpus of Academic Slovene. This corpus is made available because the public version of KAS (http://hdl.handle.net/11356/1244) does not contain the front matter, and hence lacks the abstracts. The abstracts were identified on a per-page basis, and are either in Slovenian (*-abs-sl.txt, 47,273 files), in English (*-abs-en.txt, 49,261 files) or, for cases where the abstracts in both languages were on the same page, in both languages (*-abs-slen.txt, 11,720 files). The files contain the plain text of the abstracts, one paragraph per line. Note that because the cleaning of the source PDF files and the identification of the abstracts were done automatically, this corpus contains various types of errors. The files are stored in the same manner as for the complete KAS corpus, i.e. in 1,000 directories with the same filename prefix as in KAS. The file with the metadata for the corpus texts is also included. The abstracts can be useful for research in e.g. machine translation and terminology extraction and, in combination with the full texts from the KAS corpus, for studies in automatic summarisation.
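
The file layout described above can be explored with a short script. The following is a minimal sketch in Python, not part of the corpus distribution: it assumes the archive has been unpacked into a local directory (the name passed to count_by_language is hypothetical) and that file names end in the three suffixes given in the description. The function names are illustrative only.

```python
from collections import Counter
from pathlib import Path

# Suffix-to-language mapping taken from the corpus description.
SUFFIXES = {
    "-abs-sl.txt": "sl",
    "-abs-en.txt": "en",
    "-abs-slen.txt": "sl+en",
}


def abstract_language(path):
    """Return the language label for an abstract file, or None if the name does not match."""
    name = Path(path).name
    for suffix, label in SUFFIXES.items():
        if name.endswith(suffix):
            return label
    return None


def read_paragraphs(path):
    """Read one abstract file; each non-empty line is treated as one paragraph."""
    text = Path(path).read_text(encoding="utf-8")
    return [line.strip() for line in text.splitlines() if line.strip()]


def count_by_language(root):
    """Walk all subdirectories of root and count abstract files per language label."""
    counts = Counter()
    for path in Path(root).rglob("*.txt"):
        label = abstract_language(path)
        if label is not None:
            counts[label] += 1
    return counts
```

Calling count_by_language on the unpacked corpus directory gives a quick check of the download against the file counts stated in the description. Files whose names match none of the three suffixes are skipped rather than guessed at.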

Identifier
PID http://hdl.handle.net/11356/1420
Related Identifier https://rdcu.be/b7GrB
Related Identifier http://hdl.handle.net/11356/1449
Related Identifier http://nl.ijs.si/kas/
Metadata Access http://www.clarin.si/repository/oai/request?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:www.clarin.si:11356/1420
Provenance
Creator Erjavec, Tomaž; Fišer, Darja; Ljubešić, Nikola; Ferme, Marko; Borovič, Mladen; Boškovič, Borko; Ojsteršek, Milan; Hrovat, Goran
Publisher Jožef Stefan Institute; Faculty of Electrical Engineering and Computer Science, University of Maribor
Publication Year 2021
Rights CLARIN.SI Licence ACA ID-BY-NC-INF-NORED 1.0; https://clarin.si/repository/xmlui/page/licence-aca-id-by-nc-inf-nored-1.0; ACA
OpenAccess true
Contact info(at)clarin.si
Representation
Language Slovenian; Slovene; English
Resource Type corpus
Format application/octet-stream; text/plain; charset=utf-8; downloadable_files_count: 1
Discipline Linguistics