Finnish web corpus fiWaC 1.0


The Finnish web corpus fiWaC was built by crawling the .fi top-level domain in 2015 for both Finnish and English documents. The corpus was naively tokenised (split on spaces), near-deduplicated at the paragraph level, and paragraph-shuffled. Each paragraph carries metadata giving its source URL and its language identification. The Finnish (~1.7B tokens) and English (~2B tokens) parts of the corpus are organised in separate files.
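As a rough illustration of what naive, space-based tokenisation implies for token counts, the sketch below splits paragraphs on whitespace. It is not the corpus builders' actual pipeline: the function names `tokenise` and `count_tokens` and the sample paragraphs are hypothetical, and splitting on any run of whitespace (rather than the space character only) is an assumption.

```python
def tokenise(paragraph: str) -> list[str]:
    """Split a paragraph on runs of whitespace (assumed behaviour).

    Punctuation stays attached to the neighbouring word, which is
    what makes this kind of tokenisation "naive".
    """
    return paragraph.split()


def count_tokens(paragraphs: list[str]) -> int:
    """Total number of whitespace-delimited tokens over all paragraphs."""
    return sum(len(tokenise(p)) for p in paragraphs)


# Hypothetical sample paragraphs, one Finnish and one English.
sample = [
    "Hyvää huomenta, Suomi!",
    "This is an English paragraph.",
]

if __name__ == "__main__":
    for paragraph in sample:
        print(tokenise(paragraph))
    print(count_tokens(sample))
```

Because punctuation is not separated from words, token counts obtained this way will generally be lower than those from a linguistically informed tokeniser.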

Identifier
PID http://hdl.handle.net/11356/1074
Related Identifier https://cordis.europa.eu/project/id/324414
Metadata Access http://www.clarin.si/repository/oai/request?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:www.clarin.si:11356/1074
Provenance
Creator Ljubešić, Nikola; Pirinen, Tommi; Toral, Antonio
Publisher Jožef Stefan Institute
Publication Year 2016
Funding Reference info:eu-repo/grantAgreement/EC/FP7/324414
Rights Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0); https://creativecommons.org/licenses/by-sa/4.0/; PUB
OpenAccess true
Contact info(at)clarin.si
Representation
Language Finnish; English
Resource Type corpus
Format application/gzip; text/plain; charset=utf-8; downloadable_files_count: 38
Discipline Linguistics
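
The Metadata Access URL listed above is an OAI-PMH GetRecord request. The following minimal sketch shows how that URL decomposes into its query parameters and how it can be rebuilt from them. It only manipulates strings and does not contact the repository; the helper names `build_request_url` and `parse_request_url` are hypothetical, while the base URL and parameter values are copied from the record.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Base URL and parameters copied from the Metadata Access field.
BASE_URL = "http://www.clarin.si/repository/oai/request"
PARAMS = {
    "verb": "GetRecord",
    "metadataPrefix": "oai_dc",
    "identifier": "oai:www.clarin.si:11356/1074",
}


def build_request_url(base: str, params: dict[str, str]) -> str:
    """Join the base URL and query parameters into one request URL.

    ':' and '/' are left unescaped so the result matches the form
    shown in the record.
    """
    return base + "?" + urlencode(params, safe=":/")


def parse_request_url(url: str) -> dict[str, str]:
    """Return the query parameters of a request URL as a flat dict."""
    query = parse_qs(urlsplit(url).query)
    return {key: values[0] for key, values in query.items()}


if __name__ == "__main__":
    url = build_request_url(BASE_URL, PARAMS)
    print(url)
    print(parse_request_url(url))
```

Swapping the `metadataPrefix` value would request a different metadata format, but which formats this repository supports is not stated in the record.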