
Google Brain proposes a new attention model that speeds up translation inference by 20%

via: 博客园     time: 2017/7/10 18:30:54     reads: 1318

The attention model is one of the most important developments in natural language processing in recent years. The concept is borrowed from the model of human attention in cognitive psychology. When people observe the world, they pay different amounts of attention to different objects. For example, when you read one sentence of an article carefully, you can still see the whole page of text, but your attention is focused on that sentence; the other words remain in view, yet they receive very little attention. Natural language processing works the same way: different parts of the input text contribute differently to the output, so they need to be assigned different weights. Using an attention model in this way gives better results.

The standard content-based attention mechanism is mainly used in sequence-to-sequence models. Because it has to compare the encoder states with the decoder state at every decoding step, it requires a large amount of computation. Google Brain researchers Denny Britz, Melody Y. Guan and Minh-Thang Luong have proposed an efficient attention model with a fixed-size memory that speeds up inference on translation tasks by 20%.

The following is a translation of part of the paper.


Sequence-to-sequence models achieve state-of-the-art results on many tasks, such as neural machine translation (NMT), text summarization, speech recognition, image captioning, and dialogue modeling.

The most popular attention approach is built on an encoder-decoder architecture consisting of two recurrent neural networks and an attention mechanism that aligns target symbols with source symbols. The typical attention mechanism used in this architecture computes a new attention context at each decoding step, based on the current state of the decoder. Intuitively, this corresponds to looking back at the whole source sequence after outputting each individual target symbol.
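To make the per-step cost concrete, here is a minimal sketch of content-based attention in plain Python. It assumes dot-product scoring and tiny hand-written state vectors; the function names and the toy values are illustrative assumptions, not the paper's implementation.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_context(decoder_state, encoder_states):
    """Compute attention weights and the context vector for ONE decoding step.

    Assumption: the score is a plain dot product between the decoder state
    and each encoder state (other scoring functions are possible).
    """
    scores = [sum(q * h_d for q, h_d in zip(decoder_state, h))
              for h in encoder_states]
    weights = softmax(scores)
    dim = len(encoder_states[0])
    # The context is the weighted average of all encoder states.
    context = [sum(w * h[d] for w, h in zip(weights, encoder_states))
               for d in range(dim)]
    return weights, context

# Toy values (hypothetical): 3 source positions, 2 decoding steps, dimension 2.
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
decoder_states = [[2.0, 0.0], [0.0, 2.0]]

contexts = []
for step, state in enumerate(decoder_states):
    # Every decoding step scans the whole source sequence again.
    weights, context = attention_context(state, encoder_states)
    contexts.append(context)
    print(step, [round(w, 3) for w in weights], [round(c, 3) for c in context])
```

The loop at the bottom is the point of the sketch: each target step revisits every encoder state, so the work grows with the product of the source and target lengths.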

Inspired by how humans process sentences, the researchers argue that it may not be necessary to look back at the entire source sequence at every step. They therefore propose an alternative mechanism that reduces the computational time complexity. The method predicts K attention context vectors while reading the source, and learns to use a weighted average of these vectors at each decoding step. As a result, it avoids looking back at the source sequence once it has been encoded. The results show that this speeds up inference. At the same time, on a toy dataset and on WMT translation datasets, the method achieves performance similar to the standard attention mechanism. The results also show that the speedup grows as sequences get longer. Finally, by visualizing the attention scores, the researchers verified that the technique learns meaningful alignments and that different attention context vectors focus on different parts of the source.


The figure above shows the structure of the proposed method alongside the standard attention model. K attention vectors are predicted during encoding, and a linear combination of them is chosen during decoding; in the figure, K = 3. The memory-based attention model can be interpreted as predicting, during encoding, the set of attention contexts that a standard attention mechanism would produce. With K = 3 as in the figure, all three attention contexts are predicted in the encoding phase, and the decoder learns to select the appropriate one through a linear combination. This takes less computation than computing the contexts one by one from the decoder and encoder states.
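The two phases described above can be sketched as follows. This is only an illustration under stated assumptions: the projection matrices `W_alpha` and `W_beta` are hypothetical names with hand-picked values (in a real model they would be learned), and the sketch normalizes each memory slot over source positions with softmax, which is one possible choice rather than a detail confirmed by the article.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def encode_memory(encoder_states, W_alpha):
    """Encoding phase: build K context vectors in one pass over the source.

    W_alpha is a K x D matrix (hypothetical, normally learned). Each row
    scores every source position for one memory slot.
    """
    dim = len(encoder_states[0])
    memory = []
    for slot_weights in W_alpha:
        scores = [sum(w * h_d for w, h_d in zip(slot_weights, h))
                  for h in encoder_states]
        alpha = softmax(scores)  # assumption: normalize over source positions
        memory.append([sum(a * h[d] for a, h in zip(alpha, encoder_states))
                       for d in range(dim)])
    return memory  # K vectors of size D, independent of the source length

def decode_context(decoder_state, memory, W_beta):
    """Decoding phase: mix the K stored contexts; the source is not revisited."""
    scores = [sum(w * s_d for w, s_d in zip(row, decoder_state))
              for row in W_beta]
    beta = softmax(scores)
    dim = len(memory[0])
    context = [sum(b * slot[d] for b, slot in zip(beta, memory))
               for d in range(dim)]
    return beta, context

# Toy values (hypothetical): 4 source positions, dimension 2, K = 3 slots.
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
W_alpha = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W_beta = [[1.0, -1.0], [-1.0, 1.0], [0.5, 0.5]]

memory = encode_memory(encoder_states, W_alpha)  # computed once per sentence
for state in [[2.0, 0.0], [0.0, 2.0]]:
    beta, context = decode_context(state, memory, W_beta)
    print([round(b, 3) for b in beta], [round(c, 3) for c in context])
```

Note that `decode_context` touches only the K stored vectors, so its cost per step does not depend on how long the source sentence was.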

Experimental results

Toy dataset results:

Because of the reduced computational time complexity, the method can yield performance gains, especially for longer sequences and for tasks where the source can be compactly represented in a fixed-size memory matrix. To study the trade-off between speed and performance, the researchers compared the method with standard models, with and without attention, on a sequence copy task.

The following table shows the BLEU scores of the models for different sequence lengths and values of K. A larger K allows a more complex source representation to be computed, while K = 1 restricts the source representation to a single vector. Performance consistently improves as K increases, depending on the length of the data: longer sequences require more complex representations. The results with and without positional encoding are almost identical on the toy dataset. Despite its lower representational capacity, the method fits the data as well as the standard attention model. Both beat the non-attention baseline by a significant margin. The last column shows that the method greatly speeds up inference, and that the gap in inference speed widens as the sequences get longer.
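A back-of-envelope count helps explain why the speed gap widens with sequence length. The sketch below counts only the multiply-adds of the attention step itself, assuming dot-product scoring and a hypothetical hidden size of 256; it ignores the recurrent layers, the output softmax and all implementation details, so it is not a prediction of the measured timings.

```python
def standard_attention_cost(src_len, tgt_len, dim):
    """Each target step scores every source state and averages them."""
    return 2 * src_len * tgt_len * dim

def memory_attention_cost(src_len, tgt_len, dim, k):
    """K slots are filled once per source state, then mixed once per target step."""
    return 2 * k * dim * (src_len + tgt_len)

def cost_ratio(length, dim=256, k=32):
    """Standard cost divided by memory-based cost, for equal source/target length."""
    return (standard_attention_cost(length, length, dim)
            / memory_attention_cost(length, length, dim, k))

for length in (50, 100, 200, 400):
    print(length, cost_ratio(length))
```

Under these assumptions the ratio simplifies to length / (2K): it grows linearly with sequence length for a fixed K, and in this simplified count the memory-based attention step only becomes cheaper once the sequence is longer than 2K.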


The left side of the figure shows the learning curves for a sequence length of 200. K = 1 cannot fit the data distribution, while K ∈ {32, 64} fits it almost as quickly as the attention-based model. A larger K leads to faster convergence, while the performance of smaller K is similar to that of the non-attention baseline. The right side of the figure shows the effect of switching the encoder and decoder scoring functions between softmax and sigmoid. All combinations can fit the data, but some converge faster than others.
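For readers unfamiliar with the two scoring functions, the following small sketch contrasts them on the same raw scores. The score values are made up for illustration, and exactly how the paper applies each function inside the encoder and decoder is not spelled out in this article.

```python
import math

def softmax(scores):
    """Scores compete: the outputs are positive and sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(scores):
    """Each score is squashed independently into (0, 1); no normalization."""
    return [1.0 / (1.0 + math.exp(-s)) for s in scores]

raw_scores = [2.0, 1.0, -1.0]  # hypothetical raw attention scores

soft = softmax(raw_scores)
sig = sigmoid(raw_scores)
print("softmax:", [round(v, 3) for v in soft], "sum =", round(sum(soft), 3))
print("sigmoid:", [round(v, 3) for v in sig], "sum =", round(sum(sig), 3))
```

With softmax, raising one score necessarily lowers the weight of the others; with sigmoid, several entries can be close to 1 at the same time.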


Machine translation dataset results:

Next, the researchers tested whether the memory-based attention method could fit complex real-world datasets. They used four large machine translation datasets from WMT'15: English-Czech, English-German, English-Finnish, and English-Turkish.


The table shows that the model decodes faster even on large, complex datasets with a 16K vocabulary. The reported time is the decoding time over the entire validation set, excluding model setup and data loading, averaged over 10 runs. The average sequence length in the data is 35; for other tasks with longer sequences, the method should yield a more significant speedup.


Left: en-fi training curve; right: en-tr training curve


The figure above shows the effect of using the sigmoid and softmax functions in the encoder and decoder. The softmax/softmax combination performs worst; the other combinations perform almost equally well.

Attention visualization:


The figure above shows the attention scores at each decoding step for a sample from the toy dataset with a sequence length of 100 (y axis: source symbols; x axis: target symbols).


The figure above shows the attention scores at each decoding step with K = 4, for a sample with a sequence length of 11 (y axis: source; x axis: target).


The figure above shows the attention scores at each decoding step on the en-de WMT translation task, for a model using the sigmoid scoring function and K = 32. The left subplot shows each individual attention vector, and the right subplot shows the full combined attention.

To learn more about this method, please read the original paper: https://arxiv.org/abs/1707.00110

Compiled by Lei Feng Network
