
Paper Title: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
Authors: Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova
Date: 2019-05
Link: https://arxiv.org/abs/1810.04805

Paper Review

Short Summary

The authors introduce BERT, a bidirectional, transformer-based pretrained language representation model. The architecture is essentially the Transformer encoder, with a special classification token ([CLS]) added to obtain a sentence-level embedding. Pretraining leverages two tasks, masked token prediction and next sentence prediction; the pretrained model is then fine-tuned end-to-end for various downstream tasks by adding a small task-specific output layer. The paper presents outstanding results on various benchmarks and a thorough ablation study that sheds light on the model's inner workings.
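
As a rough illustration of this setup, here is a minimal sketch using the Hugging Face transformers library (my own choice for brevity, not the authors' released TensorFlow code): encode a sentence pair, take the [CLS] hidden state as the pooled representation, and put a task-specific linear layer on top. In BERT's actual recipe the encoder weights are also updated during fine-tuning.

```python
# Minimal sketch, assuming the Hugging Face `transformers` library and PyTorch.
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
encoder = BertModel.from_pretrained("bert-base-uncased")

# Sentence-pair input: [CLS] sentence A [SEP] sentence B [SEP]
inputs = tokenizer("The man went to the store.",
                   "He bought a gallon of milk.",
                   return_tensors="pt")

with torch.no_grad():
    outputs = encoder(**inputs)

# Hidden state of the [CLS] token (position 0) serves as a pooled
# sentence-pair representation; a small task-specific layer sits on top.
cls_embedding = outputs.last_hidden_state[:, 0, :]        # shape: (1, 768)
classifier = torch.nn.Linear(encoder.config.hidden_size, 2)
logits = classifier(cls_embedding)
```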

Strengths

  • The bidirectional architecture and the chosen pretraining tasks were not novel on their own, but in combination they produce a very powerful text encoder.
  • Important concepts such as the [CLS] classification token proved pivotal for the development of the field and were adopted by later works.
  • The ablation study is a welcome detail.
  • The code and pretrained models were released, and the fine-tuning pipeline is highly applicable to various text-mining tasks that the research community can leverage.

Weaknesses

  • I believe training this model is more difficult than the authors make it seem. For example, how they arrived at these two pretraining tasks (possibly via multiple iterations) is not properly discussed.
  • It is unclear how much of the performance improvement is owed to the model architecture versus the scale of the training data.
  • The model does not tackle text-generation tasks.

Reflection

BERT serves as a major text encoder model, and there is a lot of follow-up work, such as RoBERTa and DistilBERT, that can be examined.

Most interesting thought/idea from reading this paper

The classification token seems unreasonably effective at encoding the information of the entire sentence for downstream tasks. I know that some e-commerce companies leverage the same technique for customer and product embeddings. My initial exploration shows that this approach is not yet well explored in that space. It would be interesting to do a proper study on a large enough dataset; a rough sketch of the idea follows below.
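
Hypothetical sketch of that idea: prepend a learnable [CLS]-style token to a sequence of behaviour/item embeddings and use its output as the entity embedding. All names and dimensions here are illustrative assumptions, not from the paper or any production system.

```python
# Hypothetical sketch (PyTorch): a learnable [CLS]-style token pooled over a sequence.
import torch
import torch.nn as nn

class CLSPooler(nn.Module):
    def __init__(self, d_model=128, nhead=4, num_layers=2):
        super().__init__()
        self.cls = nn.Parameter(torch.randn(1, 1, d_model))   # learnable [CLS] token
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)

    def forward(self, x):                        # x: (batch, seq_len, d_model)
        cls = self.cls.expand(x.size(0), -1, -1)
        h = self.encoder(torch.cat([cls, x], dim=1))
        return h[:, 0]                           # (batch, d_model) pooled embedding

# e.g. 32 customers, each represented by 20 event embeddings of size 128
events = torch.randn(32, 20, 128)
customer_embedding = CLSPooler()(events)         # (32, 128)
```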
