Xiao Xiao, reporting from Aofei Temple
QbitAI report | Official account: QbitAI
In the latest news, Google has launched BigBird, a new member of the NLP "Sesame Street" family.
This cartoon bird may look merely cute from the outside, but it solves a problem that the full attention mechanism brings to the BERT model: the quadratic dependence on sequence length. As a result, it can take a longer context into account.
Big Bird in Sesame Street
As we all know, Google's BERT was once hailed as the "strongest on Earth" NLP model.
BERT also shares its name with Bert, a character in Sesame Street, the well-known American children's TV show.
Previously, the "Sesame Street" series already had five members (see the paper list at the end of this article for links). The arrival of BigBird means that Google has taken its NLP research a step further.
A little Elmo
Let's see what BigBird does.
Breaking through the limits of the full attention mechanism
Some of the best deep learning models in NLP, such as BERT, use the Transformer as their feature extractor. This model has its limitations, and one of the core ones is the full attention mechanism.
Because every token attends to every other token, the cost of this mechanism grows quadratically with sequence length, mainly in memory.
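To make the quadratic growth concrete, here is a back-of-the-envelope sketch. It assumes a single attention head that stores a dense n x n score matrix in 32-bit floats; the function name and the example lengths are illustrative and do not come from the paper.

```python
def full_attention_bytes(seq_len, bytes_per_score=4):
    """Memory for one dense seq_len x seq_len attention score matrix.

    Assumption: a single head, 4-byte (float32) scores, nothing else counted.
    """
    return seq_len * seq_len * bytes_per_score


for n in (512, 4096):
    mib = full_attention_bytes(n) / 2**20
    print(f"n = {n}: {mib:.0f} MiB per head")
```

Under these assumptions, a sequence 8 times longer (512 to 4096 tokens) needs 64 times the memory for the score matrix.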
To solve this problem, the team proposed a sparse attention mechanism called BigBird.
As a Transformer for longer sequences, BigBird uses sparse attention to reduce the quadratic dependence to a linear one.
The following image shows the building blocks of the attention mechanism used by BigBird.
White cells indicate positions with no attention.
Panel (a) shows random attention with r = 2, panel (b) shows local (sliding window) attention with w = 3, panel (c) shows global attention with g = 2, and panel (d) shows the BigBird model, which combines the first three.
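The three patterns can be sketched as a boolean mask. This is a minimal illustration, not the paper's implementation: it assumes token-level (not blocked) attention, treats the first g tokens as the global tokens, and reads w as the total window width. The function name `bigbird_mask` and the seed parameter are made up for this example.

```python
import random


def bigbird_mask(n, r=2, w=3, g=2, seed=0):
    """Return an n x n boolean mask; mask[i][j] is True if query i attends to key j.

    Assumptions (illustrative only): token-level rather than blocked attention,
    the first g tokens are the global tokens, and w is the total window width.
    """
    rng = random.Random(seed)
    mask = [[False] * n for _ in range(n)]
    half = w // 2
    for i in range(n):
        # Local (sliding window) attention: the token itself and its neighbours.
        for j in range(max(0, i - half), min(n, i + half + 1)):
            mask[i][j] = True
        # Random attention: r keys chosen uniformly at random for each query.
        for j in rng.sample(range(n), min(r, n)):
            mask[i][j] = True
        # Global attention, part 1: every query attends to the g global tokens.
        for j in range(min(g, n)):
            mask[i][j] = True
    # Global attention, part 2: each global token attends to every key.
    for i in range(min(g, n)):
        mask[i] = [True] * n
    return mask


mask = bigbird_mask(12)
for row in mask:
    print("".join("#" if cell else "." for cell in row))
```

In the printed grid, `#` marks an allowed query-key pair and `.` corresponds to the white cells in the figure.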
The team proposed this model because they wanted to reduce the quadratic dependence to linear while keeping BigBird's metrics as close as possible to those of the BERT model.
As the figure below shows, random attention alone, local attention alone, and the combination of the two all fall short.
In other words, fusing random, local, and global attention comes closest to the BERT-base metrics.
Moreover, this sparse attention mechanism includes O(1) global tokens, such as CLS.
With them, the overhead of long-range attention is O(N).
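A rough count shows why such a pattern scales linearly. The sketch below assumes the token-level picture from the figure, with g global tokens that attend to every key and every other token attending to at most w + r + g keys. The function name is made up for illustration, and it assumes n is at least g.

```python
def sparse_pairs_upper_bound(n, r=2, w=3, g=2):
    """Upper bound on attended (query, key) pairs under the sparse pattern."""
    global_rows = g * n                       # global tokens attend to every key
    other_rows = (n - g) * min(n, w + r + g)  # other tokens attend to a constant number
    return global_rows + other_rows


for n in (4096, 8192):
    print(n, sparse_pairs_upper_bound(n), n * n)
```

Doubling n roughly doubles the sparse count, while the dense count n * n quadruples.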
Beating SOTA on NLP question answering and summarization tasks
The four models were trained on the Books, CC-News, Stories, and Wikipedia datasets. Measured on held-out data, BigBird-ETC had the lowest loss.
Judging from the results, BigBird achieves very good accuracy on question answering tasks.
The following figure compares the accuracy of BigBird with that of RoBERTa and Longformer. Both BigBird models show higher accuracy across the datasets.
After fine-tuning, BigBird-ETC surpassed SOTA on Sup for HotpotQA, LA for Natural Questions, Verified for TriviaQA, and on WikiHop.
At the same time, BigBird's performance on summarization tasks is also impressive.
Summarization, as the name implies, means extracting the core ideas of a long text. Here are the results on three long-document datasets: arXiv, PubMed, and BigPatent.
As the figure shows, compared with other highly advanced NLP models, BigBird greatly improves accuracy on the summarization task.
Not only that, BigBird has been proved to be Turing complete. This means it can compute anything that is computable and, in theory, can be used to carry out any algorithm.
In addition, BigBird shows great potential in processing genomic data.
However, some netizens argue that the model is conceptually not fundamentally different from Longformer and cannot be considered a major breakthrough.
What's your opinion?
About the authors
The two co-first authors of the paper are Manzil Zaheer and Guru Guruganesh, both from Google.
△ Manzil Zaheer
Manzil Zaheer holds a Ph.D. in machine learning from CMU and has published three papers at NIPS. He has also published papers at top conferences such as ACL and EMNLP.
△ Guru Guruganesh
Guru Guruganesh holds a Ph.D. in machine learning from CMU and mainly studies approximation algorithms, Ramsey theory, semidefinite programming, and related topics.
List of papers in the "Sesame Street" series: