Large-Scale Colloquial Persian 0.5

Dataset

The "Large Scale Colloquial Persian Dataset" (LSCP) is hierarchically organized in a semantic taxonomy that treats multi-task informal Persian language understanding as a comprehensive problem. LSCP includes 120M sentences from 27M casual Persian tweets, annotated with syntactic dependency relations, part-of-speech tags, and sentiment polarity, together with automatic translations of the original Persian sentences into five languages (EN, CS, DE, IT, HI).

Identifier
PID	http://hdl.handle.net/11234/1-3195
Related Identifier	https://arxiv.org/abs/2003.06499
Related Identifier	https://iasbs.ac.ir/~ansari/lscp/
Metadata Access	http://lindat.mff.cuni.cz/repository/oai/request?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:lindat.mff.cuni.cz:11234/1-3195
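
The Metadata Access link above is an OAI-PMH GetRecord request for the record's Dublin Core (oai_dc) metadata. The sketch below shows one way such a request URL might be assembled and the response parsed, using only the Python standard library. The function names (build_get_record_url, fetch_record, dublin_core_fields) are hypothetical helpers introduced for illustration, and the parsing assumes the response uses the standard Dublin Core element namespace; the actual response layout has not been verified here.

```python
from urllib.parse import urlencode
from urllib.request import urlopen
import xml.etree.ElementTree as ET

# Endpoint and identifier copied from the Metadata Access field of this record.
OAI_ENDPOINT = "http://lindat.mff.cuni.cz/repository/oai/request"
RECORD_ID = "oai:lindat.mff.cuni.cz:11234/1-3195"

# Standard Dublin Core element namespace (assumed to be used in the response).
DC_NS = "{http://purl.org/dc/elements/1.1/}"


def build_get_record_url(endpoint=OAI_ENDPOINT, identifier=RECORD_ID,
                         metadata_prefix="oai_dc"):
    """Assemble an OAI-PMH GetRecord URL from its three query arguments."""
    query = urlencode(
        {
            "verb": "GetRecord",
            "metadataPrefix": metadata_prefix,
            "identifier": identifier,
        },
        safe=":/",  # keep ':' and '/' readable, as in the link above
    )
    return f"{endpoint}?{query}"


def fetch_record(url, timeout=30):
    """Download the raw XML response (requires network access)."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


def dublin_core_fields(xml_text):
    """Collect every Dublin Core element in the XML into {name: [values]}."""
    fields = {}
    for element in ET.fromstring(xml_text).iter():
        if element.tag.startswith(DC_NS) and element.text:
            name = element.tag[len(DC_NS):]
            fields.setdefault(name, []).append(element.text.strip())
    return fields
```

Calling dublin_core_fields(fetch_record(build_get_record_url())) would be expected to return fields such as title, creator, and language, provided the endpoint is reachable and responds in oai_dc format.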

Provenance
Creator	Abdi Khojasteh, Hadi; Ansari, Ebrahim; Bohlouli, Mahdi
Publisher	Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL); Institute for Advanced Studies in Basic Sciences (IASBS)
Publication Year	2020
Rights	Creative Commons - Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0); http://creativecommons.org/licenses/by-nc-nd/4.0/; PUB
OpenAccess	true
Contact	lindat-help(at)ufal.mff.cuni.cz

Representation
Language	Persian; Farsi; English; German; Czech; Italian; Hindi
Resource Type	corpus
Format	text/plain; charset=utf-8; application/octet-stream; downloadable_files_count: 9
Discipline	Linguistics