NoReC: The Norwegian Review Corpus

Dataset

NoReC was created primarily for training and evaluating models for document-level sentiment analysis, though it can serve other purposes as well. The corpus comprises more than 35,000 full-text reviews extracted from eight major Norwegian news sources: Dagbladet, VG, Aftenposten, Bergens Tidende, Fædrelandsvennen, Stavanger Aftenblad, DinSide.no and P3.no. The reviews cover a range of domains, including literature, movies, video games, restaurants, music and theater, in addition to product reviews across a range of categories. Each review is labeled with a score from 1 to 6, the rating manually assigned by the original reviewer. The texts have been pre-processed using UDPipe and are distributed in the CoNLL-U format; HTML files with the raw texts are provided as well. Documentation and an accompanying Python package are available through the following git repository: https://github.com/ltgoslo/norec
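As an illustration of the CoNLL-U distribution format mentioned above, the following is a minimal sketch of a reader that relies only on the Python standard library. It does not use the accompanying norec package, whose API is not described in this record. The function name `read_conllu` and the sample sentence are assumptions made for this example; the sample is not taken from the corpus. The ten column names follow the general CoNLL-U specification.

```python
from typing import Dict, Iterator, List

# The ten tab-separated columns defined by the CoNLL-U format.
CONLLU_FIELDS = [
    "id", "form", "lemma", "upos", "xpos",
    "feats", "head", "deprel", "deps", "misc",
]


def read_conllu(text: str) -> Iterator[List[Dict[str, str]]]:
    """Yield one sentence at a time as a list of token dictionaries.

    Hypothetical helper for illustration; not part of the norec package.
    """
    sentence: List[Dict[str, str]] = []
    for line in text.splitlines():
        if not line.strip():
            # A blank line ends the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):
            # Comment lines carry sentence-level metadata; skipped here.
            continue
        columns = line.split("\t")
        if len(columns) != len(CONLLU_FIELDS):
            raise ValueError(f"expected 10 columns, got {len(columns)}: {line!r}")
        sentence.append(dict(zip(CONLLU_FIELDS, columns)))
    if sentence:
        # Handle input that lacks a trailing blank line.
        yield sentence


# Made-up sample for illustration only (not an excerpt from NoReC).
SAMPLE = (
    "# text = En god film.\n"
    "1\tEn\ten\tDET\t_\t_\t3\tdet\t_\t_\n"
    "2\tgod\tgod\tADJ\t_\t_\t3\tamod\t_\t_\n"
    "3\tfilm\tfilm\tNOUN\t_\t_\t0\troot\t_\t_\n"
    "4\t.\t.\tPUNCT\t_\t_\t3\tpunct\t_\t_\n"
    "\n"
)

sentences = list(read_conllu(SAMPLE))
forms = [token["form"] for token in sentences[0]]
print(forms)
```

For document-level sentiment analysis, the token forms or lemmas of all sentences in a review would typically be concatenated and paired with that review's 1 to 6 score; how scores are stored alongside the files is documented in the git repository rather than in this record.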

Identifier
PID	http://hdl.handle.net/11509/124
Related Identifier	https://github.com/ltgoslo/norec
Metadata Access	https://repo.clarino.uib.no/oai/request?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:repo.clarino.uib.no:11509/124
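The Metadata Access URL above has the shape of an OAI-PMH GetRecord request. As a minimal sketch, it can be assembled from its parts with the Python standard library; the helper name `getrecord_url` is an assumption made for this example, and no network request is performed.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Base endpoint taken from the Metadata Access field of this record.
BASE_URL = "https://repo.clarino.uib.no/oai/request"


def getrecord_url(identifier: str, metadata_prefix: str = "oai_dc") -> str:
    """Build a GetRecord URL (hypothetical helper for illustration)."""
    query = urlencode(
        {
            "verb": "GetRecord",
            "metadataPrefix": metadata_prefix,
            "identifier": identifier,
        },
        safe=":/",  # keep ':' and '/' unescaped, as in the record's URL
    )
    return f"{BASE_URL}?{query}"


url = getrecord_url("oai:repo.clarino.uib.no:11509/124")

# Split the URL back into its query parameters to inspect them.
params = {key: values[0] for key, values in parse_qs(urlsplit(url).query).items()}
print(url)
print(params)
```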

Provenance
Creator	Velldal, Erik; Øvrelid, Lilja; Bergem, Eivind Alexander; Stadsnes, Cathrine; Touileb, Samia; Jørgensen, Fredrik
Publisher	Department of Informatics, University of Oslo
Publication Year	2017
Rights	Attribution-NonCommercial 3.0 Unported (CC BY-NC 3.0); http://creativecommons.org/licenses/by-nc/3.0/; CC
OpenAccess	true
Contact	clarin(at)uib.no

Representation
Language	Norwegian Bokmål; Norwegian Nynorsk; Norwegian
Resource Type	corpus
Format	application/gzip; text/plain; charset=utf-8; downloadable_files_count: 1
Discipline	Linguistics