Review LLaMA paper
| Paper Title | LLaMA: Open and Efficient Foundation Language Models |
| --- | --- |
| Informal name | LLaMA paper |
| Date | 2023-02 |
| Link | https://arxiv.org/abs/2302.13971 |
Paper reading notes
Spontaneous Questions
Q1: What are all the tricks employed in LLaMA?
A1: LLaMA employs an array of tricks developed by the community after AIAYN (the original Transformer paper, “Attention Is All You Need”) to improve the model’s performance:
- Architecture (see the first sketch after this list):
    - Pre-normalization [GPT-3]: Stabilizes training
    - SwiGLU activation [PaLM]: Helps the model learn better via a gated activation built on Swish (SiLU)
    - Rotary embeddings [GPTNeo]: Encode positions as relative distances between tokens rather than absolute positions
- Optimizer:
    - AdamW instead of Adam: Decouples weight decay from the adaptive gradient update, so regularization is less entangled with the learning rate
- Efficient engineering:
    - xformers’ memory-efficient causal multi-head attention: avoids materializing the O(n^2) attention matrix, greatly reducing memory use (see the second sketch after this list)
    - Manually implemented backward pass (instead of relying on autograd) to checkpoint activations and overlap computation with all_reduce communication in multi-GPU training
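A minimal PyTorch sketch of the three architecture tweaks noted above (pre-normalization via RMSNorm, a SwiGLU feed-forward block, and rotary position embeddings). The dimensions and wiring are illustrative, not a reproduction of the released LLaMA code.

```python
# Illustrative sketch of LLaMA-style architecture tweaks (not the official code).
import torch
import torch.nn as nn
import torch.nn.functional as F


class RMSNorm(nn.Module):
    """Pre-normalization: normalize the *input* of each sub-layer."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x):
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return x * rms * self.weight


class SwiGLU(nn.Module):
    """Gated feed-forward block: swish(x W1) * (x W3), then project back."""
    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.w1 = nn.Linear(dim, hidden_dim, bias=False)  # gate branch
        self.w3 = nn.Linear(dim, hidden_dim, bias=False)  # value branch
        self.w2 = nn.Linear(hidden_dim, dim, bias=False)  # output projection

    def forward(self, x):
        return self.w2(F.silu(self.w1(x)) * self.w3(x))


def rotary_embedding(x, base: float = 10000.0):
    """Rotate channel pairs by a position-dependent angle so that the dot
    product between two positions depends only on their relative distance.
    x: (batch, seq_len, n_heads, head_dim) with even head_dim."""
    b, t, h, d = x.shape
    half = d // 2
    freqs = base ** (-torch.arange(0, half, dtype=torch.float32, device=x.device) / half)
    angles = torch.arange(t, dtype=torch.float32, device=x.device)[:, None] * freqs[None, :]
    cos = angles.cos()[None, :, None, :]  # (1, t, 1, d/2)
    sin = angles.sin()[None, :, None, :]
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)
```

In the pre-norm wiring, each sub-layer sees a normalized input: h = x + attention(RMSNorm(x)), then out = h + SwiGLU(RMSNorm(h)), with rotary embeddings applied to the query and key tensors inside the attention.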
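And a second sketch, assuming the xformers package is installed and running on a GPU, of how the memory-efficient causal attention can be called; the shapes follow xformers’ (batch, seq_len, n_heads, head_dim) convention, and kernel availability depends on the installed version and hardware.

```python
# Minimal sketch: memory-efficient causal attention via xformers (run on GPU).
import torch
from xformers.ops import memory_efficient_attention, LowerTriangularMask

b, t, h, d = 2, 1024, 8, 64
q = torch.randn(b, t, h, d, device="cuda", dtype=torch.float16)
k = torch.randn(b, t, h, d, device="cuda", dtype=torch.float16)
v = torch.randn(b, t, h, d, device="cuda", dtype=torch.float16)

# The causal mask is passed as an attention bias; the full (t x t) attention
# matrix is never materialized, which is where the memory saving comes from.
out = memory_efficient_attention(q, k, v, attn_bias=LowerTriangularMask())
print(out.shape)  # (2, 1024, 8, 64)
```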
Discussion questions
Q1: One contribution they claim is that the model is trained only on publicly available data. Why is this important to us?
A1: (1) Reproducibility; (2) Transparency: make sure the claims are correct and, in general, vet the system better.
Q2: They claim to make the model more inference-efficient (smaller model, trained longer). Why is that important?
A2: Since the model is open, aggregate inference cost will outweigh the one-time training cost.
Q3: There are a lot of small architecture changes; what would you want to see in such a paper?
A3: An ablation study, but it would be too costly at this scale.
Q4: What other ways could you do an ablation study? A4:
- Study on a smaller model or at a smaller experimental scale
- A subset of data
- See how fast or slowly the model learns over the first 1M tokens (a toy version of this comparison is sketched after this list)
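A toy sketch of that kind of small-scale ablation: train two tiny PyTorch transformer variants (pre-norm vs. post-norm, via the built-in `norm_first` flag) on the same fixed token budget and watch the early training loss. The random tokens below are a stand-in for a real data subset, and no causal mask is applied, so this only illustrates the procedure, not a real result.

```python
# Toy small-scale ablation: pre-norm vs post-norm on a fixed token budget.
# Random tokens stand in for a real corpus subset; swap in actual text
# (and a causal mask) for a meaningful comparison.
import torch
import torch.nn as nn

VOCAB, DIM, SEQ, STEPS = 1000, 128, 64, 200

def make_model(norm_first: bool) -> nn.Module:
    layer = nn.TransformerEncoderLayer(
        d_model=DIM, nhead=4, dim_feedforward=4 * DIM,
        norm_first=norm_first, batch_first=True)
    return nn.Sequential(
        nn.Embedding(VOCAB, DIM),
        nn.TransformerEncoder(layer, num_layers=2),
        nn.Linear(DIM, VOCAB),
    )

for norm_first in (True, False):
    torch.manual_seed(0)                      # same init/data stream per variant
    model = make_model(norm_first)
    opt = torch.optim.AdamW(model.parameters(), lr=3e-4)
    loss_fn = nn.CrossEntropyLoss()
    for step in range(STEPS):
        tokens = torch.randint(0, VOCAB, (8, SEQ + 1))
        logits = model(tokens[:, :-1])        # predict the next token
        loss = loss_fn(logits.reshape(-1, VOCAB), tokens[:, 1:].reshape(-1))
        opt.zero_grad(); loss.backward(); opt.step()
        if step % 50 == 0:
            print(f"norm_first={norm_first} step={step} loss={loss.item():.3f}")
```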
Q5: It’s fairly uncommon to evaluate on this many benchmarks. Why do they do this? A5: They position LLaMA as a foundation model trained only on public data -> trade-off: some papers focus on a new architecture, while this one focuses on creating a new “platform” for open LLM use and fine-tuning.
Paper Review
Summary
Building on the development of the latest LLMs (GPT-3, PaLM, Chinchilla), the authors introduce LLaMA, a collection of pretrained foundation language models, together with their weights, which outperform their peers on a range of standard benchmarks. Their main contributions include composing an open-source-compatible training dataset and combining various ideas in architecture (pre-normalization, SwiGLU activation, rotary embeddings), optimization (AdamW), and efficient engineering (memory-efficient self-attention, a faster multi-GPU backward pass). The paper also evaluates LLaMA and benchmarks it against other models on a wide range of tasks, from reasoning, reading comprehension, question answering, and code generation to instruction fine-tuning, bias, and safety.
Strengths
- The model weights are released, which is useful for the research community to evaluate, replicate, and fine-tune the foundation model.
- The experiments seem sufficient and comprehensive, demonstrating the model’s performance relative to its peers.
- Training on only publicly available data is novel compared to peer models from other AI labs.
Weaknesses
- No novel ideas in terms of architecture, loss function, or optimization
- Does not provide deep-dive findings into scaling laws or the training process of the model. This might create the illusion that training is easy, while in reality it is likely very challenging.
- Does not investigate how the improvements in architecture affect the model’s performance.
- Could do better at describing how they prepare the dataset, which in my opinion is one of the main contributions of this paper.
Reflection
- The Transformer architecture has barely changed in the last 5 years; only a few tweaks have been applied to make training more stable and more efficient.
- With the released model weights, the smaller variants being manageable in size, there are a lot of opportunities for fine-tuning (full, LoRA) on downstream tasks.
Most interesting thought/idea from reading this paper
I can try fine-tuning LLaMA myself using LoRA on a single T4 GPU.
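A hedged sketch of what that could look like with the Hugging Face peft library. The model id, hyperparameters, and 8-bit loading are placeholders/assumptions (a 7B model on a 16 GB T4 generally needs quantization plus LoRA to fit), not a verified recipe.

```python
# Hypothetical sketch: LoRA fine-tuning of a LLaMA checkpoint with peft.
# The model id below is a placeholder; 8-bit loading assumes bitsandbytes
# is installed (a 7B model needs quantization to fit on a 16 GB T4).
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_id = "path/or/hub-id-of-a-llama-checkpoint"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, load_in_8bit=True, device_map="auto")

lora_cfg = LoraConfig(
    r=8, lora_alpha=16, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # LLaMA attention projections
    bias="none", task_type="CAUSAL_LM")
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()  # only the low-rank adapters are trained

# From here, train with any causal-LM loop or the transformers Trainer,
# feeding tokenized instruction/response pairs.
```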