Review LLaMA paper
| Paper Title | LLaMA: Open and Efficient Foundation Language Models |
| --- | --- |
| Informal name | LLaMA paper |
| Date | 2023-02 |
| Link | https://arxiv.org/abs/2302.13971 |
Paper reading notes
Spontaneous Questions
Q1: What are all the tricks employed in LLaMA?
A1: LLaMA employs an array of tricks developed by the community after AIAYN (the original Transformer paper, “Attention Is All You Need”) to improve the model’s performance:
- Architecture (see the first sketch after this list):
    - Pre-normalization [GPT-3]: Stabilizes training
    - SwiGLU activation [PaLM]: Helps the model learn better via a gated activation built on Swish (SiLU)
    - Rotary embeddings [GPTNeo]: Encode positions as relative distances between tokens rather than absolute positions
- Optimizer:
    - AdamW instead of Adam: Decouples weight decay from the adaptive gradient update, so regularization is less entangled with the learning rate
- Efficient engineering:
    - xformers’ memory-efficient causal multi-head attention: avoids materializing the O(n^2) attention matrix, greatly reducing memory use (see the second sketch after this list)
    - Manually implemented backward pass (instead of relying on autograd) to checkpoint activations and overlap computation with all_reduce communication in multi-GPU training
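A minimal PyTorch sketch of the three architecture tweaks noted above (pre-normalization via RMSNorm, a SwiGLU feed-forward block, and rotary position embeddings). The dimensions and wiring are illustrative, not a reproduction of the released LLaMA code.

```python
# Illustrative sketch of LLaMA-style architecture tweaks (not the official code).
import torch
import torch.nn as nn
import torch.nn.functional as F


class RMSNorm(nn.Module):
    """Pre-normalization: normalize the *input* of each sub-layer."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x):
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return x * rms * self.weight


class SwiGLU(nn.Module):
    """Gated feed-forward block: swish(x W1) * (x W3), then project back."""
    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.w1 = nn.Linear(dim, hidden_dim, bias=False)  # gate branch
        self.w3 = nn.Linear(dim, hidden_dim, bias=False)  # value branch
        self.w2 = nn.Linear(hidden_dim, dim, bias=False)  # output projection

    def forward(self, x):
        return self.w2(F.silu(self.w1(x)) * self.w3(x))


def rotary_embedding(x, base: float = 10000.0):
    """Rotate channel pairs by a position-dependent angle so that the dot
    product between two positions depends only on their relative distance.
    x: (batch, seq_len, n_heads, head_dim) with even head_dim."""
    b, t, h, d = x.shape
    half = d // 2
    freqs = base ** (-torch.arange(0, half, dtype=torch.float32, device=x.device) / half)
    angles = torch.arange(t, dtype=torch.float32, device=x.device)[:, None] * freqs[None, :]
    cos = angles.cos()[None, :, None, :]  # (1, t, 1, d/2)
    sin = angles.sin()[None, :, None, :]
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)
```

In the pre-norm wiring, each sub-layer sees a normalized input: h = x + attention(RMSNorm(x)), then out = h + SwiGLU(RMSNorm(h)), with rotary embeddings applied to the query and key tensors inside the attention.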
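And a second sketch, assuming the xformers package is installed and running on a GPU, of how the memory-efficient causal attention can be called; the shapes follow xformers’ (batch, seq_len, n_heads, head_dim) convention, and kernel availability depends on the installed version and hardware.

```python
# Minimal sketch: memory-efficient causal attention via xformers (run on GPU).
import torch
from xformers.ops import memory_efficient_attention, LowerTriangularMask

b, t, h, d = 2, 1024, 8, 64
q = torch.randn(b, t, h, d, device="cuda", dtype=torch.float16)
k = torch.randn(b, t, h, d, device="cuda", dtype=torch.float16)
v = torch.randn(b, t, h, d, device="cuda", dtype=torch.float16)

# The causal mask is passed as an attention bias; the full (t x t) attention
# matrix is never materialized, which is where the memory saving comes from.
out = memory_efficient_attention(q, k, v, attn_bias=LowerTriangularMask())
print(out.shape)  # (2, 1024, 8, 64)
```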
Discussion questions
Q1: One contribution they claim is that the model is trained only on publicly available data. Why is this important to us?
A1: (1) Reproducibility; (2) Transparency: make sure the claims are correct and, in general, vet the system better.
Q2: They claim to make the model more inference-efficient (smaller model, trained longer). Why is that important?
A2: Since the model is open, aggregate inference cost will outweigh the one-time training cost.
Q3: There are a lot of small architecture changes; what would you want to see in such a paper?
A3: An ablation study, but it would be too costly at this scale.
Q4: What other ways could you do an ablation study? A4:
- Study on a smaller model or at a smaller experimental scale
- A subset of data
- See how fast or slowly the model learns over the first 1M tokens (a toy version of this comparison is sketched after this list)
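A toy sketch of that kind of small-scale ablation: train two tiny PyTorch transformer variants (pre-norm vs. post-norm, via the built-in `norm_first` flag) on the same fixed token budget and watch the early training loss. The random tokens below are a stand-in for a real data subset, and no causal mask is applied, so this only illustrates the procedure, not a real result.

```python
# Toy small-scale ablation: pre-norm vs post-norm on a fixed token budget.
# Random tokens stand in for a real corpus subset; swap in actual text
# (and a causal mask) for a meaningful comparison.
import torch
import torch.nn as nn

VOCAB, DIM, SEQ, STEPS = 1000, 128, 64, 200

def make_model(norm_first: bool) -> nn.Module:
    layer = nn.TransformerEncoderLayer(
        d_model=DIM, nhead=4, dim_feedforward=4 * DIM,
        norm_first=norm_first, batch_first=True)
    return nn.Sequential(
        nn.Embedding(VOCAB, DIM),
        nn.TransformerEncoder(layer, num_layers=2),
        nn.Linear(DIM, VOCAB),
    )

for norm_first in (True, False):
    torch.manual_seed(0)                      # same init/data stream per variant
    model = make_model(norm_first)
    opt = torch.optim.AdamW(model.parameters(), lr=3e-4)
    loss_fn = nn.CrossEntropyLoss()
    for step in range(STEPS):
        tokens = torch.randint(0, VOCAB, (8, SEQ + 1))
        logits = model(tokens[:, :-1])        # predict the next token
        loss = loss_fn(logits.reshape(-1, VOCAB), tokens[:, 1:].reshape(-1))
        opt.zero_grad(); loss.backward(); opt.step()
        if step % 50 == 0:
            print(f"norm_first={norm_first} step={step} loss={loss.item():.3f}")
```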
Q5: It’s fairly uncommon to evaluate on this many benchmarks. Why do they do this? A5: They position LLaMA as a foundation model trained only on public data -> trade-off: some papers focus on a new architecture, while this one focuses on creating a new “platform” for open LLM use and fine-tuning.
Paper Review
Summary
Building on the development of the latest LLMs (GPT-3, PaLM, Chinchilla), the authors introduce LLaMA, a collection of pretrained foundation language models, together with their weights, which outperform their peers on a range of standard benchmarks. Their main contributions include composing an open-source-compatible training dataset and combining various ideas in architecture (pre-normalization, SwiGLU activation, rotary embeddings), optimization (AdamW), and efficient engineering (memory-efficient self-attention, a faster multi-GPU backward pass). The paper also evaluates LLaMA and benchmarks it against other models on a wide range of tasks, from reasoning, reading comprehension, question answering, and code generation to instruction fine-tuning, bias, and safety.
Strengths
- The model weights are released, which is useful for the research community to evaluate, replicate, and fine-tune the foundation model.
- The experiments seem sufficient and comprehensive, demonstrating the model’s performance relative to its peers.
- Training on only publicly available data is novel compared to peer models from other AI labs.
Weaknesses
- No novel ideas in terms of architecture, loss function, or optimization
- Does not provide deep-dive findings into scaling laws or the training process of the model. This might create the illusion that training is easy, while in reality it is likely very challenging.
- Does not investigate how the improvements in architecture affect the model’s performance.
- Could do better at describing how they prepare the dataset, which in my opinion is one of the main contributions of this paper.
Reflection
- The Transformer architecture has barely changed in the last 5 years; only a few tweaks have been applied to make training more stable and more efficient.
- With the released model weights, the smaller variants being manageable in size, there are a lot of opportunities for fine-tuning (full, LoRA) on downstream tasks.
Most interesting thought/idea from reading this paper
I can try fine-tuning LLaMA myself using LoRA on a single T4 GPU.
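A hedged sketch of what that could look like with the Hugging Face peft library. The model id, hyperparameters, and 8-bit loading are placeholders/assumptions (a 7B model on a 16 GB T4 generally needs quantization plus LoRA to fit), not a verified recipe.

```python
# Hypothetical sketch: LoRA fine-tuning of a LLaMA checkpoint with peft.
# The model id below is a placeholder; 8-bit loading assumes bitsandbytes
# is installed (a 7B model needs quantization to fit on a 16 GB T4).
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_id = "path/or/hub-id-of-a-llama-checkpoint"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, load_in_8bit=True, device_map="auto")

lora_cfg = LoraConfig(
    r=8, lora_alpha=16, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # LLaMA attention projections
    bias="none", task_type="CAUSAL_LM")
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()  # only the low-rank adapters are trained

# From here, train with any causal-LM loop or the transformers Trainer,
# feeding tokenized instruction/response pairs.
```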