LitLat BERT


A trilingual BERT-like (Bidirectional Encoder Representations from Transformers) model trained on Lithuanian, Latvian, and English data. It is a state-of-the-art tool that represents words/tokens as contextually dependent word embeddings and is used for various NLP classification tasks by fine-tuning the model end-to-end. LitLat BERT is distributed as neural network weights and configuration files in PyTorch format (i.e. to be used with the PyTorch library).
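
As an illustration, the following sketch (using PyTorch and the transformers library) shows how contextual token embeddings could be obtained from such a model; the path "path/to/litlat-bert" and the example sentence are placeholders, not part of the distribution.

import torch
from transformers import AutoTokenizer, AutoModel

model_path = "path/to/litlat-bert"  # placeholder: local directory with the downloaded PyTorch files
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModel.from_pretrained(model_path)

sentence = "Vilnius yra Lietuvos sostinė."
inputs = tokenizer(sentence, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# One contextually dependent embedding per input token
embeddings = outputs.last_hidden_state  # shape: (1, num_tokens, hidden_size)
print(embeddings.shape)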

The corpora used for training the model contain 4.07 billion tokens in total, of which 2.32 billion are English, 1.21 billion are Lithuanian, and 0.53 billion are Latvian.

LitLat BERT is based on the XLM-RoBERTa model and comes in two versions: one for use with the transformers library (https://github.com/huggingface/transformers) and one for use with the fairseq library (https://github.com/pytorch/fairseq). More information is available in readme.txt.
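
For end-to-end fine-tuning on a classification task with the transformers version, a minimal sketch might look as follows; the model path, the toy sentences, and the labels are illustrative assumptions only (consult readme.txt for the actual file layout).

import torch
from torch.optim import AdamW
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_path = "path/to/litlat-bert"  # placeholder for the downloaded model directory
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForSequenceClassification.from_pretrained(model_path, num_labels=2)

texts = ["Šis filmas buvo puikus.", "Šī filma bija briesmīga."]  # toy Lithuanian/Latvian examples
labels = torch.tensor([1, 0])

batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
optimizer = AdamW(model.parameters(), lr=2e-5)

model.train()
outputs = model(**batch, labels=labels)  # loss is computed against the newly added classification head
outputs.loss.backward()
optimizer.step()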

Identifier
PID http://hdl.handle.net/20.500.11821/42
Related Identifier https://embeddia.eu
Metadata Access https://clarin.vdu.lt/oai/request?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:clarin.vdu.lt:20.500.11821/42
Provenance
Creator Ulčar, Matej; Robnik-Šikonja, Marko
Publisher University of Ljubljana, Faculty of Computer and Information Science
Publication Year 2020
Rights PUB_CLARIN-LT_End-User-Licence-Agreement_EN-LT; https://clarin.vdu.lt/licenses/eula/PUB_CLARIN-LT_End-User-Licence-Agreement_EN-LT.htm; PUB
OpenAccess true
Contact info(at)clarin.vdu.lt
Representation
Language Lithuanian; Latvian; English
Resource Type toolService
Format text/plain; application/gzip; text/plain; charset=utf-8; downloadable_files_count: 3
Discipline Linguistics