BERT
| Paper Title | BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding |
| --- | --- |
| Authors | Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova |
| Date | 2019-05 |
| Link | https://arxiv.org/abs/1810.04805 |
Paper summary
Paper Review
Short Summary
The authors introduce BERT, a bidirectional, transformer-based pretrained language representation model. The architecture is essentially the original Transformer encoder, with a special classification token ([CLS]) prepended so that a sentence-level embedding can be obtained. Pretraining leverages two tasks, masked token prediction and next-sentence prediction; the model is then fine-tuned end-to-end for various downstream tasks with a lightweight task-specific output layer. The paper presents outstanding results on a range of benchmarks, along with a thorough ablation study that sheds light on the model’s inner workings.
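To make the masked-token pretraining objective concrete, here is a minimal toy sketch of the input corruption (roughly 15% of positions selected as targets, with the paper's 80/10/10 mask/random/keep split). The tiny vocabulary and function name are purely illustrative, not the authors' code.

```python
import random

MASK = "[MASK]"
VOCAB = ["dog", "cat", "runs", "sits"]  # toy vocabulary, for illustration only

def mask_tokens(tokens, mask_prob=0.15):
    """BERT-style masking sketch: each position becomes a prediction target
    with ~15% probability; of those, 80% are replaced with [MASK],
    10% with a random token, and 10% are left unchanged."""
    inputs, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if random.random() < mask_prob:
            labels[i] = tok                       # model must predict the original token here
            r = random.random()
            if r < 0.8:
                inputs[i] = MASK                  # 80%: replace with [MASK]
            elif r < 0.9:
                inputs[i] = random.choice(VOCAB)  # 10%: replace with a random token
            # remaining 10%: keep the original token
    return inputs, labels

print(mask_tokens("the quick brown fox jumps over the lazy dog".split()))
```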
Strengths
- The bidirectional architecture and the chosen pretraining tasks are not novel on their own, but in combination they create a very powerful text encoder.
- Important concepts such as the [CLS] classification token proved pivotal for the development of the field and were adopted by later work.
- Including an ablation study is always a welcome detail.
- The code and pretrained models were released, and the fine-tuning pipeline is highly applicable to a wide range of text-mining tasks that the research community can leverage (a minimal fine-tuning sketch follows this list).
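As a rough illustration of that fine-tuning pipeline, here is a minimal sketch of training a classification head on top of the [CLS] representation. It assumes the HuggingFace `transformers` library rather than the authors' original TensorFlow release, and the two-example batch and labels are hypothetical.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# Toy sentiment-style batch; any short text-classification data would do.
batch = tokenizer(["a great movie", "a dull movie"], padding=True, return_tensors="pt")
labels = torch.tensor([1, 0])

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
model.train()
optimizer.zero_grad()
outputs = model(**batch, labels=labels)  # classification head sits on the [CLS] representation
outputs.loss.backward()
optimizer.step()
```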
Weaknesses
- I believe training this model is more difficult than the authors make it seem. For example, the choice of pretraining tasks, and how they arrived at these two (possibly through multiple iterations), is not properly discussed.
- It is unclear to what degree the performance improvement is owed to the model architecture versus the scale of the training data.
- The model does not tackle text-generation tasks.
Reflection
BERT serves as a major text-encoder model; there is a large body of follow-up work, such as RoBERTa and DistilBERT, that can be examined.
Most interesting thought/idea from reading this paper
The classification token seems to be unreasonably effective at encoding the information of the entire sentence for downstream tasks. I know that some e-commerce companies use the same technique for customer and product embeddings. My initial exploration suggests this approach is not yet well explored in that space. It would be interesting to run a proper study on a large enough dataset.
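To make the idea concrete, here is a minimal sketch of pulling [CLS] embeddings out of a pretrained BERT as generic "entity" vectors. It assumes the HuggingFace `transformers` library; the product descriptions are made up for illustration, and mean-pooling or a trained pooler are common alternatives to the raw [CLS] vector.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

# Hypothetical "entity as text" inputs, e.g. product descriptions.
texts = ["wireless noise-cancelling headphones", "stainless steel water bottle"]
batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    hidden = model(**batch).last_hidden_state  # (batch, seq_len, hidden)
embeddings = hidden[:, 0, :]                   # position 0 is the [CLS] token
print(embeddings.shape)                        # torch.Size([2, 768])
```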