
Paper Title: LLaMA: Open and Efficient Foundation Language Models
Informal name: LLaMA paper
Date: 2023-02
Link: https://arxiv.org/abs/2302.13971

Paper reading notes

Spontaneous Questions

Q1: What are all the tricks employed in LLaMA?
A1: LLaMA employs an array of tricks developed by the community after AIAYN (the “Attention Is All You Need” Transformer paper) to improve the model’s performance:

  • Architecture (a minimal sketch of these three follows this list):
    • Pre-normalization with RMSNorm [GPT-3]: Stabilizes training by normalizing the input of each sub-layer rather than its output.
    • SwiGLU activation [PaLM]: Helps the model learn better by gating the feed-forward layer with the Swish (SiLU) activation.
    • Rotary Embeddings [GPTNeo]: Encode positions so that attention depends on relative distances between tokens rather than absolute positions.
  • Optimizer:
    • AdamW instead of Adam: Decouples weight decay from the gradient update, making regularization less sensitive to the learning rate.
  • Efficient engineering:
    • xformers’ memory-efficient causal multi-head attention: Avoids materializing the full attention matrix, bringing the memory requirement from O(n^2) down toward O(1).
    • Manually implemented backward pass (instead of relying on autograd) to checkpoint the expensive activations and overlap computation with all_reduce communication during multi-GPU training.
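Here is a minimal PyTorch sketch of the three architecture tweaks, written for these notes; the module names, dimensions, and rotary base are illustrative choices of mine, not LLaMA’s actual implementation:

```python
# Minimal sketch of the three architecture tweaks (not the official LLaMA code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """Pre-normalization: each sub-layer's *input* is normalized (RMS variant, no mean subtraction)."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x):
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * (x * rms)

class SwiGLUFeedForward(nn.Module):
    """SwiGLU FFN: the up-projection is gated by a Swish (SiLU)-activated branch."""
    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.w_gate = nn.Linear(dim, hidden_dim, bias=False)
        self.w_up = nn.Linear(dim, hidden_dim, bias=False)
        self.w_down = nn.Linear(hidden_dim, dim, bias=False)

    def forward(self, x):
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))

def apply_rotary(x, base: float = 10000.0):
    """Rotary embeddings: rotate channel pairs by a position-dependent angle.
    x: (batch, seq_len, n_heads, head_dim) with an even head_dim."""
    seq_len, head_dim = x.shape[1], x.shape[-1]
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2, dtype=torch.float32) / head_dim))
    angles = torch.outer(torch.arange(seq_len, dtype=torch.float32), inv_freq)  # (seq, head_dim/2)
    cos, sin = angles.cos()[None, :, None, :], angles.sin()[None, :, None, :]
    x1, x2 = x[..., 0::2], x[..., 1::2]
    # Applying the same rotation to queries and keys makes their dot product
    # depend only on the relative distance between positions.
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

# Quick shape check: 2 sequences, 16 tokens, 8 heads of size 64.
q = apply_rotary(torch.randn(2, 16, 8, 64))
print(q.shape)  # torch.Size([2, 16, 8, 64])
```

In LLaMA the rotary transform is applied to the queries and keys inside each attention head, which is what makes attention scores a function of relative rather than absolute positions.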

Discussion questions

Q1: One contribution they claim is (1) training only on publicly available data. Why is this important to us?
A1: (1) Reproducibility; (2) Transparency: it lets us verify that the claims are correct and vet the system better.

Q2: They claim to make it more inference-efficient (smaller model, trained longer). Why is that important?
A2: Since the model is open and will be run many times, inference cost would outweigh training cost.

Q3: There are a lot of small architecture changes; what would you want to see in such a paper?
A3: An ablation study, but it would be too costly at this scale.

Q4: What other ways can you do an ablation study? A4:

  • Study on a smaller model or a smaller scale experiment
  • A subset of data
  • See how fast or slow the model learns in the first 1M tokens

Q5: It’s fairly uncommon to evaluate on this many benchmarks. Why do they do this?
A5: They claim to be a foundation model trained only on public data. The trade-off: some papers focus on a new architecture; this one focuses on creating a new “platform” for open LLM use and fine-tuning.

Paper Review

Summary

Building on the development of the latest LLMs (GPT-3, PaLM, Chinchilla), the authors introduce LLaMA, a collection of pretrained foundation LMs and their weights, which outperform their peers on a range of standard benchmarks. Their main contributions include composing an open-source-compatible training dataset and combining various ideas in architecture (pre-normalization, SwiGLU activation, rotary embeddings), optimization (AdamW), and efficient engineering (memory-efficient self-attention, a faster multi-GPU backward pass). The paper also evaluates LLaMA and benchmarks it against other models on a wide range of tasks, from reasoning, reading comprehension, question answering, and code generation to instruction fine-tuning, bias, and safety.

Strengths

  • The model weights are released, which is useful for the research community to evaluate, replicate, and fine-tune the foundation model.
  • The experiments seem sufficient and comprehensive, demonstrating the model’s performance relative to its peers.
  • Training on only publicly available data is novel compared to peer models from other AI labs.

Weaknesses

  • No novel ideas in terms of architecture, loss function, or optimization
  • Does not provide deep-dive findings on scaling laws or the training process of the model. This might create the illusion that training is easy, while it is likely very challenging in reality.
  • Does not investigate how the improvements in architecture affect the model’s performance.
  • Could do better at describing how they prepared the dataset, which in my opinion is one of the main contributions of this paper.

Reflection

  • The Transformer architecture has barely changed in the last 5 years; only a few tweaks have been applied to make training more stable and more efficient.
  • With the released model weights, which are small enough, there are many opportunities for fine-tuning (full, LoRA) on downstream tasks.

Most interesting thought/idea from reading this paper

I can try fine-tuning LLaMA myself using LoRA on a single T4 GPU.
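To make that idea concrete for myself, here is a minimal hand-rolled LoRA sketch; the `LoRALinear` wrapper and its rank/alpha defaults are my own illustrative choices, not the paper’s method or any specific library’s API:

```python
# Minimal LoRA sketch: freeze a pretrained linear layer, train only a low-rank update.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wraps a frozen linear layer and adds a trainable low-rank update (B @ A)."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                      # freeze the pretrained weight
        self.lora_a = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, rank))  # zero init: no change at start
        self.scaling = alpha / rank

    def forward(self, x):
        # y = W x + (alpha / r) * B A x ; only A and B receive gradients.
        return self.base(x) + self.scaling * (x @ self.lora_a.T @ self.lora_b.T)

# Example: wrap one 4096x4096 attention projection (LLaMA-7B's hidden size).
layer = LoRALinear(nn.Linear(4096, 4096, bias=False), rank=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # 65536 trainable parameters vs. ~16.8M frozen ones
```

Since only the small A and B matrices get gradients and optimizer state, the memory budget is dominated by the frozen weights; in practice, fitting the 7B model on a 16 GB T4 would also likely require loading those frozen weights in reduced (8-bit or 4-bit) precision.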
