EnTam: An English-Tamil Parallel Corpus (EnTam v2.0)

Dataset

EnTam is a sentence-aligned English-Tamil bilingual corpus collected from publicly available websites for NLP research involving Tamil. The raw web data went through a standard set of processing steps before being released as a parallel corpus suitable for various NLP tasks. The corpus includes texts from the Bible, cinema, and news domains.

Identifier
PID	http://hdl.handle.net/11234/1-1454
Related Identifier	http://ufal.mff.cuni.cz/~ramasamy/parallel/html/
Metadata Access	http://lindat.mff.cuni.cz/repository/oai/request?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:lindat.mff.cuni.cz:11234/1-1454
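
The Metadata Access URL above is an OAI-PMH GetRecord request for Dublin Core (oai_dc) metadata. As a rough sketch of how such a request could be assembled and its response read, the Python below rebuilds the URL from its parts and collects Dublin Core elements from a response string. The function names build_get_record_url and extract_dc_fields are hypothetical, and fetching the response over the network is deliberately left out.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Endpoint taken from the Metadata Access field of this record.
OAI_ENDPOINT = "http://lindat.mff.cuni.cz/repository/oai/request"
# Standard Dublin Core element namespace used by the oai_dc format.
DC_NAMESPACE = "http://purl.org/dc/elements/1.1/"


def build_get_record_url(identifier, metadata_prefix="oai_dc"):
    """Assemble an OAI-PMH GetRecord URL (hypothetical helper)."""
    query = urlencode(
        {
            "verb": "GetRecord",
            "metadataPrefix": metadata_prefix,
            "identifier": identifier,
        },
        safe=":/",  # keep ':' and '/' readable, as in the record's URL
    )
    return f"{OAI_ENDPOINT}?{query}"


def extract_dc_fields(xml_text):
    """Collect Dublin Core elements from a response into {name: [values]}."""
    fields = {}
    prefix = "{" + DC_NAMESPACE + "}"
    for element in ET.fromstring(xml_text).iter():
        if element.tag.startswith(prefix) and element.text:
            name = element.tag[len(prefix):]
            fields.setdefault(name, []).append(element.text.strip())
    return fields


url = build_get_record_url("oai:lindat.mff.cuni.cz:11234/1-1454")
```

The response body could then be retrieved with any HTTP client and passed to extract_dc_fields; which Dublin Core elements the repository actually returns is not stated in this record.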

Provenance
Creator	Ramasamy, Loganathan; Bojar, Ondřej; Žabokrtský, Zdeněk
Publisher	Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Publication Year	2014
Rights	Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0); http://creativecommons.org/licenses/by-nc-sa/3.0/; PUB
OpenAccess	true
Contact	lindat-help(at)ufal.mff.cuni.cz

Representation
Language	English; Tamil
Resource Type	corpus
Format	application/x-gzip; text/plain; charset=utf-8; downloadable_files_count: 1
Discipline	Linguistics
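
Since the corpus is described as sentence aligned and is distributed as gzip-compressed UTF-8 text, the following is a minimal Python sketch of how aligned sentence pairs might be read. It assumes one file per language with one sentence per line, where line i of the English file corresponds to line i of the Tamil file. The internal layout of the archive is not documented in this record, so the file paths passed in, and the helper names read_lines and iter_sentence_pairs, are hypothetical.

```python
import gzip
from itertools import zip_longest


def read_lines(path):
    """Yield lines from a UTF-8 text file, plain or gzip-compressed."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            yield line.rstrip("\n")


def iter_sentence_pairs(en_path, ta_path):
    """Yield (english, tamil) pairs, assuming line-by-line alignment.

    Raises ValueError if the two files have different line counts,
    since that would mean the alignment assumption does not hold.
    """
    missing = object()  # marks the shorter file running out of lines
    pairs = zip_longest(read_lines(en_path), read_lines(ta_path), fillvalue=missing)
    for number, (english, tamil) in enumerate(pairs, start=1):
        if english is missing or tamil is missing:
            raise ValueError(f"files differ in length at line {number}")
        yield english, tamil
```

Failing on a length mismatch, rather than silently truncating as plain zip would, makes a broken alignment visible before the pairs reach a downstream NLP task.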