X-SRL Dataset and mBERT Word Aligner

This code provides a method for automatically aligning the words of parallel sentences using pre-trained multilingual BERT (mBERT) embeddings. The alignments can be used to project source-side annotations (for example, labeled English sentences) onto the target side (for example, a German translation of the sentence) by copying each label to its best-aligned target word. The newly labeled data can then be used to train multilingual state-of-the-art models and improve their performance, especially for lower-resource languages.
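The projection idea described above can be illustrated with a minimal sketch. It is not the released implementation: the function names (`cosine`, `align_words`, `project_labels`), the greedy best-match strategy, and the hand-made 4-dimensional vectors are all assumptions made for illustration. In practice the vectors would be contextual word embeddings taken from mBERT.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def align_words(src_vecs, tgt_vecs):
    """For each source word, return the index of the most similar target word."""
    alignment = []
    for u in src_vecs:
        scores = [cosine(u, v) for v in tgt_vecs]
        alignment.append(max(range(len(scores)), key=scores.__getitem__))
    return alignment


def project_labels(src_labels, alignment, tgt_len, empty="O"):
    """Copy every non-empty source label to its best-aligned target word."""
    tgt_labels = [empty] * tgt_len
    for src_idx, tgt_idx in enumerate(alignment):
        if src_labels[src_idx] != empty:
            tgt_labels[tgt_idx] = src_labels[src_idx]
    return tgt_labels


# Toy data (hypothetical): hand-made vectors stand in for mBERT embeddings.
src_words = ["that", "she", "bought", "books"]
src_labels = ["O", "A0", "V", "A1"]
tgt_words = ["dass", "sie", "Bücher", "kaufte"]

src_vecs = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
tgt_vecs = [[0.9, 0.1, 0, 0], [0.1, 0.9, 0, 0], [0, 0, 0.1, 0.9], [0, 0.1, 0.9, 0]]

alignment = align_words(src_vecs, tgt_vecs)
tgt_labels = project_labels(src_labels, alignment, len(tgt_words))

for word, label in zip(tgt_words, tgt_labels):
    print(word, label)
```

In this toy case the verb label follows "bought" to "kaufte" even though German places the verb at the end of the clause. A greedy best match like this one can map several source words to the same target word, so a real pipeline would likely need an additional filtering or tie-breaking step.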

Identifier
DOI: https://doi.org/10.11588/data/HVXXIJ
Metadata Access: https://heidata.uni-heidelberg.de/oai?verb=GetRecord&metadataPrefix=oai_datacite&identifier=doi:10.11588/data/HVXXIJ
Provenance
Creator: Daza, Angel
Publisher: heiDATA
Contributor: Daza, Angel
Publication Year: 2021
Rights: info:eu-repo/semantics/openAccess
OpenAccess: true
Contact: Daza, Angel (Leibniz Institute for the German Language / Department of Computational Linguistics, Heidelberg University)
Representation
Resource Type: program source code; Dataset
Format: text/markdown; application/zip
Size: 6131; 38643
Version: 1.0
Discipline: Humanities
Spatial Coverage: Leibniz Institute for the German Language