Photo by Anna Vander Stel on Unsplash.
The results of the Google QUEST Q&A Labeling competition, recently hosted by Google, show that the four-person team "Bibimorph", made up of Dmitry Danevskiy, Yury Kashnitsky, Oleg Yaroshevskiy and Dmitry Abulkhanov, took first place.
In the Q&A Labeling competition, participants were asked to build prediction algorithms for different aspects of question-answer pairs. The data set provided by the competition contains thousands of Q&A pairs, most of them from Stack Exchange. The pairs are labeled to reflect the quality of the question and whether the answer is relevant. The results of the competition will help advance the development of Q&A systems.
In this winners' interview, the Bibimorph team shares how it approached the challenge.
Q1: Please share your background, including how you started your Kaggle journey.
I have a background in applied mathematics and physics. Four years ago, I began to focus on machine learning for industrial applications. For two years I worked at a small AI services company, where I was responsible for time-series segmentation, face recognition, speech processing and the implementation of complex deep learning solutions. Now I work at a start-up called Respeecher, where I am one of the lead engineers in cutting-edge audio synthesis research.
I have taken part in various Kaggle competitions for many years, and I think modern deep learning methods are very general and can be applied to almost any unstructured or structured data. This may sound controversial, but I have backed it up in practice by winning gold medals in image, text, audio and tabular data competitions. Winning Google QUEST was really important to me.
I have a PhD in physics and applied mathematics. As a child I was keen on aviation technology, so I enrolled in the aviation program at the Moscow Institute of Physics and Technology. Then I started programming and learned to use Python to process databases for VR applications. After a few years of work in databases and business intelligence, I returned to school for a full-time doctoral program in applied mathematics. Later I joined the Russian IT giant Mail.Ru Group as the company's first data scientist. I currently live and work in the Netherlands, mainly doing R&D work at the Netherlands national laboratory. For the past 3 years I have been leading mlcourse.ai, an open ML course with a strong focus on Kaggle competitions.
As for Kaggle, I went through a long process of learning, struggling and relearning. It wasn't until two years ago that I started taking NLP seriously. For a long time, while my mlcourse.ai students won gold medals again and again, I was only lucky enough to climb to the top of the silver medal zone. Winning the Google QUEST Q&A Labeling competition finally brought me the long-awaited Master title.
(Photo: Elmo with an NVIDIA Quadro P6000 card.)
I have a background in applied statistics and computer science. As a student, I was very interested in the influence of technology on society. Encouraged by Andrej Karpathy's famous article "The Unreasonable Effectiveness of Recurrent Neural Networks", I decided to move from software engineering to machine learning.
As a research engineer, I have been designing deep models for speech processing, machine translation, machine comprehension and other NLP tasks from the start. In July 2017, I learned about transformers, which changed my career. I'm passionate about literature and the art of writing, and I hope to see a play written by artificial intelligence one day.
I am a research engineering consultant and an active Kaggler. I believe Kaggle helps build deep intuition for training deep neural networks and for pattern recognition. I encourage others to try data science competitions and join this rapidly growing community of Kaggle enthusiasts.
I studied at the Moscow Institute of Physics and Technology and the Yandex School of Data Analysis, and have a background in mathematics and physics. As a student, I took part in many data science hackathons. From these competitions I drew one conclusion: given enough time, there is no problem that cannot be solved. I believe that taking part in competitions provides useful expertise for solving all kinds of data science problems.
At present, I am an NLP researcher at Huawei.
Q2: How is your team organized and coordinated?
In the TensorFlow 2.0 Question Answering challenge, we missed out on the title. To prove ourselves, we entered the Google QUEST Q&A Labeling competition.
Fortunately, the format of this competition was the same as that of the two code competitions we had taken part in before. So, in the first two to three weeks of Google QUEST, many of the issues that drove other participants crazy were easy for us.
The four of us merged into one team, and Dmitriy A. proposed a powerful technique for language-model pre-training using Stack Exchange data.
Oleg started from a simple PyTorch baseline based on a public notebook (https://www.kaggle.com/phoenix9032/pytorch-bert-plain) and trained the BART models. Dmitriy A. and Yury mainly worked on language-model pre-training. Dmitriy D. led the model training and developed the validation scheme and the model blending scheme.
We think teamwork is a valuable experience, and winning the championship is the result of everyone's efforts.
Q3: What was your team's most important discovery?
In short: transfer learning. Since the public data set in this competition was very small, making good use of a large amount of unlabeled data was crucial.
More precisely, we had three main secrets:
Language-model pre-training
Pseudo-labeling
Post-processing of predictions
Secret 1: language-model pre-training
We used about 7 million Stack Exchange questions to fine-tune the BERT language model with a masked language modeling (MLM) task and an additional sentence order prediction task.
In addition, we set up auxiliary targets: while fine-tuning the LM, we also predicted five auxiliary indicators.
We also used a custom extended vocabulary, for a simple reason: Stack Exchange questions often contain not only natural language but also math and code. Extending the vocabulary with LaTeX symbols, mathematical formulas and partial code snippets helps capture this.
Overall, LM pre-training played a key role in improving our models:
Transfer learning. Our models had "seen" more than 10 times as much data before being trained on the competition data.
Domain adaptation. Thanks to the custom vocabulary and auxiliary targets used in LM fine-tuning, our pre-trained models are better adapted to the data at hand.
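To make the masked language modeling task mentioned above more concrete, here is a minimal sketch of how MLM training examples can be built. It follows the masking rule from the original BERT recipe (select about 15% of tokens; replace 80% of those with a mask token, 10% with a random token, and leave 10% unchanged). The interview does not say which settings the team used, so the function name `mask_tokens`, the `mask_prob` value and the toy vocabulary are all assumptions for illustration.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, vocab, rng, mask_prob=0.15):
    """Return (inputs, labels) for masked language modeling.

    labels[i] holds the original token where the model must predict it,
    and None where the position is not part of the loss.
    """
    inputs, labels = [], []
    for token in tokens:
        if rng.random() < mask_prob:
            labels.append(token)               # the model must recover this token
            roll = rng.random()
            if roll < 0.8:
                inputs.append(MASK)            # 80%: replace with the mask token
            elif roll < 0.9:
                inputs.append(rng.choice(vocab))  # 10%: replace with a random token
            else:
                inputs.append(token)           # 10%: keep the token unchanged
        else:
            inputs.append(token)
            labels.append(None)
    return inputs, labels

# Toy usage with a hypothetical question; real pipelines work on subword ids.
rng = random.Random(0)
tokens = "how do i sort a list in python".split()
inputs, labels = mask_tokens(tokens, vocab=sorted(set(tokens)), rng=rng)
```

A model pre-trained this way only needs raw text, which is why a large unlabeled question dump can be used.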
Secret 2: pseudo-labeling
Pseudo-labeling used to be a hot topic on Kaggle, but it has now become a well-known, common technique.
Image source: "Pseudo-labeling: a simple semi-supervised learning method" by Vinko Kodžoman
The idea is summarized in the figure from that tutorial; see the tutorial for more information. In short, for an unlabeled data set, model predictions can be used as "pseudo-labels" to extend the labeled training data set.
We used pseudo-labels for 20k and 100k samples from the Stack Exchange question dump to improve three of our four models.
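The pseudo-labeling loop described above can be sketched in a few lines. This is only an illustration of the mechanics, not the team's pipeline: the toy nearest-neighbour "model", the function names and the data are all made up for the example, and a real setup would plug in a transformer fine-tuning routine instead.

```python
def fit_nearest_neighbour(xs, ys):
    """Toy stand-in for a real model: predict the label of the closest training point."""
    pairs = list(zip(xs, ys))

    def predict(x):
        return min(pairs, key=lambda pair: abs(pair[0] - x))[1]

    return predict

def pseudo_label(fit, labeled_x, labeled_y, unlabeled_x):
    """Train a teacher, label the unlabeled data with it, then retrain on everything."""
    teacher = fit(labeled_x, labeled_y)
    pseudo_y = [teacher(x) for x in unlabeled_x]      # predictions used as labels
    extended_x = list(labeled_x) + list(unlabeled_x)
    extended_y = list(labeled_y) + pseudo_y
    student = fit(extended_x, extended_y)             # model trained on the extended set
    return student, pseudo_y

# Hypothetical data: two labeled points and three unlabeled ones.
labeled_x, labeled_y = [0.0, 1.0], [0.0, 1.0]
unlabeled_x = [0.2, 0.4, 0.9]
student, pseudo_y = pseudo_label(fit_nearest_neighbour, labeled_x, labeled_y, unlabeled_x)
print(pseudo_y)  # [0.0, 0.0, 1.0]
```

In this toy case the retrained model predicts 0.0 for the input 0.6, whereas the teacher alone would predict 1.0, which shows how the pseudo-labeled points change the decision.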
Secret 3: post-processing of predictions
The metric chosen for the competition was Spearman correlation. For each of the 30 target labels, the Spearman correlation between the predictions and the true values is computed. The 30 Spearman correlation coefficients are then averaged to produce the final score.
As observed in this Kaggle discussion (https://www.kaggle.com/c/google-quest-challenge/discussion/118724), Spearman correlation is sensitive to whether some predictions are tied:
The example in that discussion shows that "thresholding" a prediction vector b to produce b2 raises its Spearman correlation with a (the true values) from 0.89 to 1.
In fact, this sensitivity of the metric is one of the downsides of the whole competition.
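The metric and its sensitivity to ties can be reproduced with a small, dependency-free sketch. Spearman correlation is the Pearson correlation of the ranks, with tied values sharing their average rank. The vectors below are made-up numbers chosen for illustration; they are not the ones from the linked discussion, and the function names are hypothetical.

```python
import math

def average_ranks(values):
    """Rank values from 1..n; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for i in order[start:end + 1]:
            ranks[i] = mean_rank
        start = end + 1
    return ranks

def pearson(x, y):
    """Pearson correlation; assumes neither input is constant."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def spearman(x, y):
    return pearson(average_ranks(x), average_ranks(y))

def mean_columnwise_spearman(true_columns, predicted_columns):
    """Competition-style score: average Spearman correlation over target columns."""
    scores = [spearman(t, p) for t, p in zip(true_columns, predicted_columns)]
    return sum(scores) / len(scores)

a = [0, 0, 0, 1, 1, 1]                       # true values with many ties
b = [0.1, 0.3, 0.2, 0.7, 0.9, 0.8]           # continuous predictions
b2 = [0 if value < 0.5 else 1 for value in b]  # thresholded predictions

print(round(spearman(a, b), 2))   # 0.88
print(round(spearman(a, b2), 2))  # 1.0
```

The continuous predictions already separate the two groups perfectly, yet they score lower because they impose an order inside each group that the tied true values do not have.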
Instead of thresholding the predictions, we discretized them according to the distribution of the training set. The idea is to match the distribution of predictions for a given target column to the distribution of the same column in the training data set. For additional details, see our shared solution code: https://github.com/oleg-yaroshevskiy/quest_qa_labeling/blob/yorko/step11_final/blending_n_postprocessing.py#L48.
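One way to read the distribution-matching idea is sketched below: sort the predictions, then assign the lowest share of them to the lowest value seen in training, the next share to the next value, and so on, so that each value keeps roughly its training-set frequency. This is an illustrative reconstruction under that assumption, not the code at the link above; the function name and the data are hypothetical.

```python
from collections import Counter

def match_train_distribution(predictions, train_values):
    """Discretize predictions so each training level keeps its training share."""
    n = len(predictions)
    order = sorted(range(n), key=lambda i: predictions[i])  # lowest prediction first
    levels = sorted(Counter(train_values).items())          # (value, count), ascending
    total = len(train_values)
    result = [0.0] * n
    start = 0
    cumulative = 0
    for k, (value, count) in enumerate(levels):
        cumulative += count
        # The last level absorbs any rounding remainder.
        end = n if k == len(levels) - 1 else round(cumulative * n / total)
        for i in order[start:end]:
            result[i] = value
        start = max(start, end)
    return result

# Hypothetical column: 60% of training labels are 0.0, 30% are 0.5, 10% are 1.0.
train_values = [0.0] * 6 + [0.5] * 3 + [1.0]
predictions = [0.31, 0.35, 0.28, 0.90, 0.33, 0.40, 0.36, 0.30, 0.52, 0.45]

print(match_train_distribution(predictions, train_values))
# [0.0, 0.0, 0.0, 1.0, 0.0, 0.5, 0.0, 0.0, 0.5, 0.5]
```

The mapping keeps the rank order of the predictions and only introduces ties, which is exactly what a rank-based metric with tied true values rewards.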
Q4: What does your final solution look like?
Our baseline model is essentially vanilla BERT with a linear layer on top of the mean-pooled hidden states. As input, we passed only the question title, question body and answer body, separated by special tokens.
In addition to the three "secrets" described above, we used a few more techniques, including softmax-normalized weights over the hidden states of all BERT layers and multi-sample dropout.
The final solution is a blend of the fold predictions of four models (two BERT-base models, one RoBERTa-base model and one BART-large model), combined with the three "secrets" mentioned above: language-model pre-training, pseudo-labeling and post-processing of predictions.
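The baseline head described above (softmax-normalized weights over the layers, mean pooling over tokens, then a linear layer) can be sketched with NumPy. The shapes, the random stand-in for encoder outputs and all names here are assumptions for illustration; the team's actual models were built in PyTorch, and multi-sample dropout is left out of this sketch.

```python
import numpy as np

def softmax(x):
    shifted = np.exp(x - np.max(x))
    return shifted / shifted.sum()

def predict_targets(hidden_states, attention_mask, layer_logits, weight, bias):
    """hidden_states: (layers, tokens, dim); attention_mask: (tokens,), 1 for real tokens."""
    layer_weights = softmax(layer_logits)                        # (layers,), sums to 1
    mixed = np.tensordot(layer_weights, hidden_states, axes=1)   # (tokens, dim)
    mask = attention_mask[:, None].astype(float)
    pooled = (mixed * mask).sum(axis=0) / mask.sum()             # mean over real tokens
    return pooled @ weight + bias                                # one output per target

# Random numbers stand in for the encoder outputs of a single example.
rng = np.random.default_rng(0)
hidden_states = rng.normal(size=(13, 6, 8))    # hypothetical: 13 layers, 6 tokens, dim 8
attention_mask = np.array([1, 1, 1, 1, 0, 0])  # the last two tokens are padding
layer_logits = np.zeros(13)                    # learnable in practice; uniform here
weight = rng.normal(size=(8, 30))              # 30 target labels
bias = np.zeros(30)

outputs = predict_targets(hidden_states, attention_mask, layer_logits, weight, bias)
print(outputs.shape)  # (30,)
```

Masking before averaging keeps padding tokens from diluting the pooled representation.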
Q5: What did you learn from this competition?
We learned a lot!
Don't start competing too early. First, build a good technical foundation.
For small training data sets, the emphasis is on making appropriate use of additional large data sets.
Transfer learning is not only suited to computer vision tasks; it is just as important in natural language processing.
With small training data sets, pay particular attention to validation.
Look for teammates who can bring diversity to the final solution in terms of skills, methods, models and so on.
Q6: Do you have any advice for those who are just starting to learn data science?
We can summarize Yury's advice from the video "How to jump into Data Science" (https://www.youtube.com/watch?v=FGuGg9F2VUs).
There are eight main steps:
Python. Learn the basics of this programming language through Kaggle Learn, Dataquest, Codecademy or similar tools. Junior data scientists hardly need advanced Python skills, but it is good to use Python at work.
SQL. Learn the basics; Kaggle Learn covers this as well. Refresh your SQL skills before an interview, and you will learn the rest on the job.
Mathematics. Basic knowledge of calculus, linear algebra, statistics and so on is essential for understanding the tools you will use. The open MIT courses may be the best resource.
Algorithms. How much algorithmic knowledge is needed is a matter of debate, but you can take the classic courses by R. Sedgewick and T. Roughgarden, and LeetCode will also help.
Development skills. A software engineering background is preferred. The title "ML engineer" is actually much more popular now than "data scientist", because a business does not run on Jupyter notebooks and you have to deploy to production. You should at least know how to use git and Docker.
Machine learning. A basic ML course is available at mlcourse.ai. Some Coursera specializations are also a good entry point. As for deep learning, Stanford's cs231n and fast.ai are two good choices.
Projects or competitions. These are good proof that you can build a minimum viable product. You can learn a lot from hands-on projects. Competitions are a good choice too, but don't treat them as a game; make the most of what you learn on Kaggle.
Interviews. Don't just sit at home and study. Do some practice interviews. Try, fail, learn and iterate. You will succeed one day.
Solution shared on GitHub: https://github.com/oleg-yaroshevskiy/quest_qa_labeling/tree/yorko
Kaggle notebook reproducing the inference part of the solution: https://www.kaggle.com/ddanevskyi/1st-place-solution
Write-up of the winning solution on Kaggle: https://www.kaggle.com/c/google-quest-challenge/discussion/129840