C4Corpus (publicdomain part)

A large web corpus (over 10 billion tokens) in 50+ languages, licensed under the Creative Commons license family and extracted from CommonCrawl, the largest publicly available general web crawl to date, with about 2 billion crawled URLs.

Identifier
PID http://hdl.handle.net/11372/LRT-2209
Related Identifier http://www.lrec-conf.org/proceedings/lrec2016/pdf/388_Paper.pdf
Related Identifier https://dkpro.github.io/dkpro-c4corpus/
Metadata Access http://lindat.mff.cuni.cz/repository/oai/request?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:lindat.mff.cuni.cz:11372/LRT-2209
Provenance
Creator Gurevych, Iryna; Habernal, Ivan; Zayed, Omnia
Publisher Technische Universität Darmstadt
Publication Year 2016
Rights Public Domain Mark (PD); http://creativecommons.org/publicdomain/mark/1.0/; PUB
OpenAccess true
Contact lindat-help(at)ufal.mff.cuni.cz
Representation
Language Afrikaans; Arabic; Bulgarian; Czech; Danish; German; Greek, Modern (1453-); Greek; English; Estonian; Persian; Farsi; Finnish; French; Croatian; Hungarian; Indonesian; Italian; Japanese; Korean; Latvian; Lithuanian; Dutch; Flemish; Norwegian; Polish; Portuguese; Russian; Slovenian; Slovene; Somali; Spanish; Castilian; Swahili; Swedish; Tagalog; Thai; Turkish; Ukrainian; Undetermined; Vietnamese
Resource Type corpus
Format text/plain; application/x-gzip; downloadable_files_count: 36
Discipline Linguistics
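
The Metadata Access URL listed under Identifier is an OAI-PMH GetRecord request with the oai_dc metadata prefix. The sketch below shows one way such a request could be built and its Dublin Core fields read out using only the Python standard library. The function names (build_get_record_url, fetch_record, parse_dublin_core) are hypothetical helpers introduced for illustration, and the parser assumes the response uses the standard Dublin Core element namespace; the actual response layout of this repository has not been verified here.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

# Endpoint and record identifier taken from the Metadata Access line of this record.
OAI_ENDPOINT = "http://lindat.mff.cuni.cz/repository/oai/request"
RECORD_ID = "oai:lindat.mff.cuni.cz:11372/LRT-2209"

# Standard Dublin Core element namespace (assumed to be used in the oai_dc payload).
DC_NS = "http://purl.org/dc/elements/1.1/"


def build_get_record_url(endpoint=OAI_ENDPOINT, identifier=RECORD_ID):
    """Build an OAI-PMH GetRecord URL requesting Dublin Core metadata."""
    query = urllib.parse.urlencode(
        {"verb": "GetRecord", "metadataPrefix": "oai_dc", "identifier": identifier},
        safe=":/",  # keep the identifier readable, as in the record's own URL
    )
    return f"{endpoint}?{query}"


def fetch_record(url, timeout=30):
    """Download the raw XML response (hypothetical helper; needs network access)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


def parse_dublin_core(xml_text):
    """Collect every Dublin Core element in the response as {name: [values]}."""
    root = ET.fromstring(xml_text)
    prefix = "{" + DC_NS + "}"
    fields = {}
    for element in root.iter():
        if element.tag.startswith(prefix):
            name = element.tag[len(prefix):]
            text = (element.text or "").strip()
            if text:
                fields.setdefault(name, []).append(text)
    return fields
```

Under these assumptions, parse_dublin_core(fetch_record(build_get_record_url())) would return a dictionary keyed by Dublin Core element name (for example creator, publisher, language), with repeated elements such as the language list collected into one Python list per key.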