Google's new model breaks through the limitations of BERT: Big Bird, a new member of the NLP "Sesame Street" family

via: cnblogs (博客园)     time: 2020/8/6 23:18:46     read: 182

By Xiao Xiao, reporting from Aofei Temple

A QbitAI report | Official account: QbitAI

In the latest news, Google has launched Big Bird, a new member of the NLP "Sesame Street" series.

This cartoon bird, which looks rather cute to outsiders, has been transformed into a model that addresses the quadratic dependence on sequence length caused by the full attention mechanism in BERT, so it can take a longer context into account.

Big Bird in Sesame Street

As is well known, Google's BERT was once called the "strongest NLP model on Earth".

BERT also shares its name with Bert, a character in Sesame Street, the well-known American children's show.

Previously, the "Sesame Street" series already had five members (see the Portal section for links to the papers). The arrival of Big Bird means that Google has pushed its NLP research a step further.

A little Elmo

Let's see what Big Bird does.

Breaking through the limitation of the full attention mechanism

Some of the best deep learning models in NLP, such as BERT, use the Transformer as their feature extractor. This architecture has its limitations, and one of the core ones is the full attention mechanism.

This mechanism makes the cost grow quadratically with sequence length, mainly in memory.

To solve this problem, the team proposed a sparse attention mechanism called Big Bird.

As a Transformer for longer sequences, Big Bird uses a sparse attention mechanism to reduce the quadratic dependence to a linear one.
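
To make the quadratic-versus-linear contrast concrete, here is a minimal sketch that counts attention-score entries. It assumes that, in the sparse scheme, each token attends to a fixed budget of `keys_per_token` positions; that budget (64 here) and both function names are illustrative assumptions, not values taken from the paper.

```python
def full_attention_entries(seq_len: int) -> int:
    """Full attention: every token attends to every token (seq_len ** 2 entries)."""
    return seq_len * seq_len


def sparse_attention_entries(seq_len: int, keys_per_token: int) -> int:
    """Sparse attention: each token attends to at most a fixed number of keys,
    so the number of entries grows linearly with seq_len."""
    return seq_len * min(keys_per_token, seq_len)


if __name__ == "__main__":
    # keys_per_token=64 is an assumed budget, chosen only for illustration.
    for n in (512, 1024, 2048, 4096):
        full = full_attention_entries(n)
        sparse = sparse_attention_entries(n, keys_per_token=64)
        print(f"n={n:5d}  full={full:10d}  sparse={sparse:8d}")
```

Doubling the sequence length quadruples the full-attention count but only doubles the sparse one, which is where the memory saving on long inputs comes from.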

The following image shows the building blocks of the attention mechanism used by Big Bird.

White cells indicate that there is no attention between the two positions.

Panel (a) shows random attention with r = 2, panel (b) shows local (sliding-window) attention with w = 3, panel (c) shows global attention with g = 2, and panel (d) is the Big Bird model, which combines the first three.
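
The three patterns described above can be sketched as a boolean attention mask. This is a minimal illustration under stated assumptions, not the authors' implementation: it works on single tokens rather than blocks, treats the first `g` tokens as the global ones, draws the `r` random keys uniformly for each query, and assumes an odd window width `w`. The function name `bigbird_style_mask` is hypothetical.

```python
import random


def bigbird_style_mask(n, r=2, w=3, g=2, seed=0):
    """Return an n x n matrix of booleans; mask[i][j] is True when
    query position i may attend to key position j."""
    rng = random.Random(seed)
    mask = [[False] * n for _ in range(n)]
    half = w // 2  # assumes w is odd, so the window is centred on i

    for i in range(n):
        # Local attention: the w keys around position i, clipped at the edges.
        for j in range(max(0, i - half), min(n, i + half + 1)):
            mask[i][j] = True
        # Random attention: r keys sampled uniformly from the other positions.
        candidates = [j for j in range(n) if j != i]
        for j in rng.sample(candidates, min(r, len(candidates))):
            mask[i][j] = True

    # Global attention (assumption: the first g tokens are the global ones).
    # They attend to every position, and every position attends to them.
    for k in range(min(g, n)):
        for j in range(n):
            mask[k][j] = True
            mask[j][k] = True
    return mask


if __name__ == "__main__":
    # '#' marks an allowed query/key pair, '.' marks no attention.
    for row in bigbird_style_mask(12):
        print("".join("#" if cell else "." for cell in row))
```

Apart from the `g` global rows, each row has at most `w + r + g` allowed keys regardless of `n`, so the total number of allowed pairs grows linearly with the sequence length.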

The team proposed this model because it wanted to reduce the quadratic dependence to a linear one while keeping Big Bird's metrics as close as possible to those of BERT.

As shown in the figure below, neither random attention alone, nor local attention alone, nor the combination of the two works as well as combining all three mechanisms.

That is to say, the fusion of random, local, and global attention comes closest to the BERT-base metrics.

Moreover, this sparse attention mechanism also includes O(1) global tokens, such as CLS.

The long-range attention overhead of this part is O(N).

Surpassing SOTA on NLP question answering and summarization tasks

The four models were trained on the Books, CC-News, Stories, and Wikipedia datasets. Evaluated with the hold-out method, BIGBIRD-ETC had the lowest loss.

Judging from the results, Big Bird shows very good accuracy on question answering tasks.

The following figure compares the accuracy of Big Bird with that of RoBERTa and Longformer. Both BIGBIRD models show higher accuracy across the datasets.

After fine-tuning, BIGBIRD-ETC surpasses SOTA on HotpotQA (Sup), Natural Questions (LA), TriviaQA (Verified), and WikiHop.

At the same time, Big Bird's performance on summarization tasks is also quite impressive.

Summarization, as the name implies, means extracting the core idea of a long text. Here are the results on three long-document datasets: arXiv, PubMed, and BigPatent.

As the figure shows, BIGBIRD greatly improves accuracy on the summarization task compared with other advanced NLP models.

Not only that, Big Bird has been proven to be Turing complete. This means that Big Bird can compute anything that is computable; in theory, it can be used to carry out any algorithm.

In addition, Big Bird shows great potential in processing genomics data.

However, some netizens believe that the model is not conceptually different from Longformer and cannot be regarded as a big breakthrough.

What's your opinion?

About the authors

The two co-authors of the paper are Manzil Zaheer and Guru Guruganesh, both from Google.

△ Manzil Zaheer

Manzil Zaheer holds a Ph.D. in machine learning from CMU and has published three papers at NIPS. He has also published at other top conferences such as ACL and EMNLP.

△ Guru Guruganesh

Guru Guruganesh holds a Ph.D. in machine learning from CMU and mainly studies approximation algorithms, Ramsey theory, semidefinite programming, etc.

Portal

"Sesame Street" series papers list:
