New Zhiyuan Report
Google researchers have proposed a multilingual BERT embedding model called LaBSE, which can generate language-agnostic cross-lingual sentence embeddings for 109 languages and outperforms LASER on cross-lingual text retrieval. The paper, titled "Language-agnostic BERT Sentence Embedding", has been published on arXiv.
Paper address: https://arxiv.org/pdf/2007.01852.pdf

Research background
Multilingual embedding models are powerful tools that encode text from different languages into a shared embedding space, enabling a range of downstream tasks such as text classification and text clustering, while also leveraging semantic information for language understanding.
Existing methods for generating such embeddings, such as LASER or m~USE, rely on parallel data to map sentences directly from one language to another, encouraging consistency between sentence embeddings.
Although these existing multilingual embedding methods perform well overall across many languages, they often underperform dedicated bilingual models on high-resource languages.
In addition, because of limited model capacity and often poor training-data quality for low-resource languages, it can be difficult to extend a multilingual model to support more languages while maintaining good performance.
Recent work on improving language models includes the development of masked language model (MLM) pre-training, as used by BERT, ALBERT, and RoBERTa. Since this approach requires only monolingual text, it has achieved remarkable results across many languages and a variety of natural language processing tasks.
MLM pre-training has also been extended to the multilingual setting, either by modifying it to operate on concatenated translation pairs, known as translation language modeling (TLM), or by simply introducing pre-training data from multiple languages.
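As a sketch of what a TLM training example looks like, the snippet below concatenates a translation pair and randomly masks tokens on both sides, so the model can use context from one language to predict masked tokens in the other. The helper name, token strings, and masking rate are illustrative assumptions; real implementations also apply extra tricks such as random-word replacement.

```python
import random

MASK, CLS, SEP = "[MASK]", "[CLS]", "[SEP]"

def make_tlm_example(src_tokens, tgt_tokens, mask_prob=0.15, seed=0):
    """Build a toy TLM input: [CLS] source [SEP] target [SEP], with
    random tokens masked; labels record the original masked tokens."""
    rng = random.Random(seed)
    tokens = [CLS] + src_tokens + [SEP] + tgt_tokens + [SEP]
    labels = [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if tok in (CLS, SEP):
            continue  # never mask special tokens
        if rng.random() < mask_prob:
            labels[i] = tok   # prediction target
            tokens[i] = MASK  # replaced in the model input
    return tokens, labels

inp, lbl = make_tlm_example(
    ["the", "cat", "sleeps"], ["le", "chat", "dort"], mask_prob=0.3)
```

Because the pair is fed in as one sequence, a masked French token can be recovered from its English counterpart, which is what pushes the representations of the two languages together.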
However, although the internal representations learned during MLM and TLM training are useful for fine-tuning on downstream tasks, they do not directly produce sentence embeddings, which are critical for translation tasks.
To address this, the researchers propose the multilingual BERT embedding model LaBSE.
The model is trained with MLM and TLM pre-training on 17 billion monolingual sentences and 6 billion bilingual sentence pairs, and it is effective even for low-resource languages for which no data was available during training. In addition, the model performs well on multiple parallel text retrieval tasks.
By collecting training data for 109 supported languages, LaBSE provides extended support for all 109 languages in a single model. In previous work, researchers proposed learning a multilingual sentence embedding space through a translation ranking task: given a sentence in the source language, the model must rank its correct translation above other sentences in the target language.
The translation ranking task is trained with a dual-encoder architecture that uses a shared Transformer, an approach that has allowed bilingual models to achieve state-of-the-art performance on multiple parallel text retrieval tasks.
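A rough sketch of how such a dual encoder can be trained on the ranking task: each source sentence in a batch should score its own translation higher than every other target sentence in the batch. The NumPy loss below uses in-batch negatives with an additive margin on the true pair (the margin value and random toy embeddings are illustrative assumptions, not the paper's exact setup).

```python
import numpy as np

def ranking_loss(src_emb, tgt_emb, margin=0.3):
    """In-batch translation-ranking loss for a dual encoder: softmax
    cross-entropy over cosine similarities, where the diagonal entries
    (true translation pairs) are the correct classes."""
    s = src_emb / np.linalg.norm(src_emb, axis=1, keepdims=True)
    t = tgt_emb / np.linalg.norm(tgt_emb, axis=1, keepdims=True)
    sim = s @ t.T  # rows = sources, cols = targets
    sim[np.diag_indices_from(sim)] -= margin  # additive margin on true pairs
    # numerically stable log-softmax per row
    logits = sim - sim.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

rng = np.random.default_rng(0)
src = rng.normal(size=(4, 8))
# targets close to their sources stand in for good translations
loss = ranking_loss(src, src + 0.01 * rng.normal(size=(4, 8)))
```

In a real training loop the two encoders share Transformer weights, and the gradient of this loss pulls each sentence toward its translation and away from the in-batch negatives.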
However, owing to limitations in model capacity, vocabulary coverage, and training-data quality, the bilingual models did not perform well when extended to support multiple languages (16 languages in the researchers' test case).
For LaBSE, the researchers leverage recent advances in language model pre-training on BERT-like architectures, including MLM and TLM, and then fine-tune the model on the translation ranking task.
To increase model and vocabulary coverage, they use a 12-layer Transformer with a 500k-token vocabulary, pre-trained on 109 languages using MLM and TLM. The resulting LaBSE model supports 109 languages in a single model.
The dual-encoder LaBSE model outperforms LASER on cross-lingual text retrieval
The researchers evaluated the proposed model on the Tatoeba corpus, a dataset containing up to 1,000 English-aligned sentence pairs for 112 languages.
For more than 30 of the languages in the dataset, the model had no training data.
The model's task is to find the nearest-neighbor translation of a given sentence, computed using cosine distance.
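The retrieval step itself is simple: embed the query sentence and all candidate translations, then pick the candidate with the smallest cosine distance (equivalently, the highest cosine similarity). A minimal sketch, using made-up toy vectors in place of real sentence embeddings:

```python
import numpy as np

def nearest_translation(query_emb, candidate_embs):
    """Return the index of the candidate embedding closest to the query
    under cosine distance."""
    q = query_emb / np.linalg.norm(query_emb)
    c = candidate_embs / np.linalg.norm(candidate_embs, axis=1, keepdims=True)
    return int(np.argmax(c @ q))  # highest cosine similarity wins

# toy embeddings: candidate 2 points in nearly the same direction as the query
query = np.array([1.0, 0.2, 0.0])
cands = np.array([[0.0, 1.0, 0.0],
                  [-1.0, 0.0, 0.3],
                  [0.9, 0.25, 0.05]])
idx = nearest_translation(query, cands)  # → 2
```

Accuracy on Tatoeba is then simply the fraction of queries whose nearest neighbor is the true translation.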
To understand how the model performs on languages at the head versus the tail of the training-data distribution, the researchers divided the languages into groups and computed the average accuracy for each group.
The table below compares the average accuracy of the m~USE, LASER, and LaBSE models for each language group.
As shown, all models perform well on the group of 14 languages that covers most major languages.
As the number of languages covered increases, the average accuracy of both LASER and LaBSE decreases.
However, the drop in accuracy for LaBSE is much smaller as the number of languages grows, and it clearly outperforms LASER: on the full set of 112 languages, LaBSE achieves 83.7% accuracy versus 65.5% for LASER.
In addition, LaBSE can be used to mine parallel text from web-scale data. Google researchers have released the pre-trained model to the community through TFHub, including modules that can be used as-is or fine-tuned with domain-specific data.