Plan:
- Guess what the paper is about
- LLM summary
- YouTube if needed
- Read over paper, take notes on nuances
My Guess:
Past papers on deep convnets discussed how you could just keep adding layers and the model would perform better. This paper probably discusses the nature of scaling language models. (This paper is from 2020, not too long ago. In 2024 it seems like the current sentiment is that you can just keep making networks bigger, and train them more, and they keep getting better.) Whoa, Benjamin CHESS wrote this??? Dude I love chess, huge fan of this guy's work.
LLM Summary Notes:
- Language model performance scales as a power law wrt dataset size, compute for training, and model size (num parameters)
- I assume this means log(loss) scales linearly wrt log(resource), i.e. a straight line on a log-log plot
- Predictable - I'd have expected that for model size and compute, but predictability wrt dataset size is a new idea to me....
- Model shape doesn't matter
- Performance scales wrt num of parameters, whether the network is wide or deep.
- Larger models are more sample efficient - they achieve the same performance WITH FEWER OPTIMIZATION STEPS AND DATA POINTS - so to make a better generalizer, add way more params? wtf? I feel like I'm misinterpreting this because intuitively that sounds wrong. If I had a model with 1e100 parameters and I put in 10 data points, wouldn't it just memorize all of them? But wait - maybe the memorization is fine, because what actually matters wrt overfitting is performance on the population distribution, and maybe these networks perform great on the population distribution while only training on (and memorizing) a small sample of, say, 10 data points.
- There are simple equations describing overfitting as a func of model size and dataset size (cool, not obvious - see the L(N, D) sketch after this list)
- There are also EQs for the dependence of training speed on model size (obvious)
- For a fixed compute budget, optimal training: very large model + relatively modest amount of data + stopping well before convergence. Compute-efficient and sample-efficient
- Transfer / generalization to other distributions improves at a similar rate to training performance (roughly a constant offset in loss)
- Larger models are more sample efficient (this is super super cool to me, I wonder what the limits of this are)
- model_performance(params=1e1000, dataset_size=x) = model_performance(params=1e6, dataset_size=1e6), solve for x, yields what? There are EQs in the paper for this - the L(N, D) sketch after this list inverts one of them
- Batch size: the critical batch size (past which bigger batches give diminishing returns) is a function of the loss alone and can be determined by measuring the gradient noise scale (roughly, the gradient's variance relative to its magnitude - it estimates how many samples you need before signal beats noise; see the batch-size sketch after this list)
- Paper suggests that these scaling laws apply to image, audio, video models
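A hedged sketch of the loss equations I believe are in the paper (all constants quoted from memory, so treat the exact numbers as approximate): loss as a power law in model size N and dataset size D, combined into a single L(N, D). The second function inverts it to answer the "solve for x" question above.

```python
# Hedged sketch of the paper's loss-vs-scale fits. The exponents and constants below are
# quoted from memory and should be treated as approximate, not authoritative.
ALPHA_N, ALPHA_D = 0.076, 0.095   # power-law exponents for params and data (approximate)
N_C, D_C = 8.8e13, 5.4e13         # fitted constants (approximate)

def loss(N: float, D: float) -> float:
    """Predicted cross-entropy loss (nats/token) for N non-embedding params and D tokens."""
    return ((N_C / N) ** (ALPHA_N / ALPHA_D) + D_C / D) ** ALPHA_D

def data_for_equal_loss(N_big: float, N_small: float, D_small: float) -> float:
    """Solve loss(N_big, x) = loss(N_small, D_small) for x."""
    target = loss(N_small, D_small)
    inner = target ** (1 / ALPHA_D) - (N_C / N_big) ** (ALPHA_N / ALPHA_D)
    # inner <= 0 means N_big can't reach the target loss even with unlimited data
    return D_C / inner if inner > 0 else float("inf")

# Example: a 1e9-param model matches a 1e8-param model trained on 1e10 tokens
# with roughly 10x fewer tokens - the "sample efficiency" claim in numbers.
print(loss(1e8, 1e10))
print(data_for_equal_loss(1e9, 1e8, 1e10))
```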
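On the batch-size bullet: my reading (hedged; the constants are from memory) is that the paper fits a "critical batch size" that depends only on the current loss, and the gradient noise scale is the measurable quantity that predicts it. A rough sketch:

```python
# Rough sketch of the critical-batch-size relation as I understand it.
# B_STAR and ALPHA_B are quoted from memory and should be treated as approximate.
B_STAR = 2e8       # tokens
ALPHA_B = 0.21

def critical_batch_size(loss_nats: float) -> float:
    """Batch size (in tokens) at the point of diminishing returns, given the current loss."""
    return B_STAR / loss_nats ** (1 / ALPHA_B)

# As the loss falls, the gradient noise scale grows, so progressively larger batches pay off.
for L in (4.0, 3.0, 2.0):
    print(L, f"{critical_batch_size(L):.2e}")
```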
My Thoughts:
The authors are OpenAI. Speculation: this paper found justification for more investment into training larger models. -> a couple of years later we get ChatGPT
YouTube Notes:
Video: YouTube Video
- Overfitting is governed (roughly) by the ratio N^0.74 / D (N=parameters, D=data), so data only needs to grow sublinearly with model size - roughly D ∝ N^0.74 (see the sketch after these video notes)
- Data is more valuable than 1 extra param
- nxm problem: optimal solution is n=m
- GPipe splits models depth-wise (pipeline parallelism: different layers on different devices)
- Idk didn't really add anything, I wouldn't recommend watching the video
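If I'm reading the N^0.74 bullet right, it's the paper's rule of thumb for how much data keeps overfitting negligible as you grow N. A tiny sketch (the 5e3 constant is from memory and approximate):

```python
# Rule of thumb (constant from memory; approximate): to keep overfitting negligible,
# dataset size should grow like D >~ 5e3 * N^0.74 - i.e. sublinearly in model size.
def tokens_needed(n_params: float) -> float:
    """Approximate tokens needed so overfitting stays negligible for n_params (non-embedding)."""
    return 5e3 * n_params ** 0.74

for n in (1e8, 1e9, 1e10):
    print(f"{n:.0e} params -> ~{tokens_needed(n):.1e} tokens")
```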
Paper Notes:
- Ok so they actually did do this on transformers
- N (model size) and D need to be increased together, otherwise you get overfitting
- Training curve can be extrapolated to roughly predict the loss that would be achieved by training for longer (wow!) simple but cool - see the extrapolation sketch after these notes
- Transfer improves the more train performance improves (transfer = evaluating the model on a distribution / data set different from the one it was trained on)
- Convergence is a waste of time and money. For a set amount of $$, make the model bigger and stop earlier (this is actually crazy useful knowledge wtf?) - see the compute-allocation sketch after these notes
- Loss can be predicted for given dataset size, param size, compute budget
- You can also calculate the compute needed to reach a certain loss
- Comparison of compute budget spend: convergence vs authors' method
- Typical researchers train to convergence
- Authors stop before convergence but make the models bigger
- To achieve the same loss, the authors' early stopping method requires 65% LESS COMPUTE (!!!!!)
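On the training-curve extrapolation point: a minimal, generic sketch (not the paper's exact parametrization, and the numbers below are made up) - fit a power law to the early part of a loss-vs-steps curve in log-log space and extrapolate:

```python
# Generic power-law extrapolation of a training curve (hypothetical data, not from the paper).
import numpy as np

steps  = np.array([1e3, 2e3, 4e3, 8e3, 1.6e4])    # steps logged so far (made up)
losses = np.array([5.1, 4.6, 4.2, 3.85, 3.55])    # corresponding losses (made up)

# Power law loss ~ a * steps^slope is a straight line in log-log space.
slope, intercept = np.polyfit(np.log(steps), np.log(losses), 1)
predict = lambda s: np.exp(intercept) * s ** slope   # slope comes out negative

print(predict(1e5))   # rough prediction of the loss after 100k steps
```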
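And to make "make the model bigger and stop earlier" concrete, a hedged sketch of how the paper's fits say a growing compute budget should be split (exponents from memory, approximate): almost all of the extra compute goes into parameters, comparatively little into data and serial steps.

```python
# Hedged sketch of compute-optimal allocation (exponents quoted from memory; approximate).
# On the compute-efficient frontier, roughly: N_opt ~ C^0.73, batch ~ C^0.24, steps ~ C^0.03.
# With the standard C ~ 6*N*D FLOPs approximation, tokens then grow like D ~ C^0.27.
def scale_up(compute_factor: float) -> dict:
    """How each training quantity should grow when compute grows by `compute_factor`."""
    return {
        "params": compute_factor ** 0.73,
        "tokens": compute_factor ** 0.27,   # implied by C ~ 6*N*D
        "batch":  compute_factor ** 0.24,
        "steps":  compute_factor ** 0.03,
    }

print(scale_up(10))   # 10x compute -> ~5.4x params, ~1.9x tokens, only ~1.1x serial steps
```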