Plan:
- Guess what the paper is about
- LLM summary
- YouTube if needed
- Read over paper, take notes on nuances
My Guess:
Past papers on deep convnets discussed how you could just keep adding layers and the model would perform better. This paper probably discusses the nature of scaling language models. (This paper is from 2020, not too long ago. In 2024 it seems like the current sentiment is that you can just keep making networks bigger, and train them more, and they keep getting better.) Whoa, Benjamin CHESS wrote this??? Dude I love chess, huge fan of this guy's work.
LLM Summary Notes:
- Language model performance scales as a power law wrt dataset size, compute for training, and model size (num parameters)
- I assume this means log(loss) scales linearly wrt log(resource), i.e. a straight line on a log-log plot
- Predictable - I'd have expected that for model size and compute, but predictability wrt dataset size is a new idea to me....
- Model shape doesn't matter
- Performance scales wrt num of parameters, whether the network is wide or deep.
- Larger models are more sample efficient - they achieve the same performance WITH FEWER OPTIMIZATION STEPS AND DATA POINTS - so to make a better generalizer, add way more params? wtf? I feel like I'm misinterpreting this because intuitively that sounds wrong. If I had a model with 1e100 parameters and I put in 10 data points, wouldn't it just memorize all of them? But wait - maybe the memorization is fine, because what actually matters wrt overfitting is performance on the population distribution, and maybe these networks perform great on the population distribution while only training on (and memorizing) a small sample of, say, 10 data points.
- There are simple equations describing overfitting as a func of model size and dataset size (cool, not obvious - see the L(N, D) sketch after this list)
- There are also EQs for the dependence of training speed on model size (obvious)
- For a fixed compute budget, optimal training: very large model + relatively modest amount of data + stopping well before convergence. Compute-efficient and sample-efficient
- Transfer / generalization to other distributions improves at a similar rate to training performance (roughly a constant offset in loss)
- Larger models are more sample efficient (this is super super cool to me, I wonder what the limits of this are)
- model_performance(params=1e1000, dataset_size=x) = model_performance(params=1e6, dataset_size=1e6), solve for x, yields what? There are EQs in the paper for this - the L(N, D) sketch after this list inverts one of them
- Batch size: the critical batch size (past which bigger batches give diminishing returns) is a function of the loss alone and can be determined by measuring the gradient noise scale (roughly, the gradient's variance relative to its magnitude - it estimates how many samples you need before signal beats noise; see the batch-size sketch after this list)
- Paper suggests that these scaling laws apply to image, audio, video models
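A hedged sketch of the loss equations I believe are in the paper (all constants quoted from memory, so treat the exact numbers as approximate): loss as a power law in model size N and dataset size D, combined into a single L(N, D). The second function inverts it to answer the "solve for x" question above.

```python
# Hedged sketch of the paper's loss-vs-scale fits. The exponents and constants below are
# quoted from memory and should be treated as approximate, not authoritative.
ALPHA_N, ALPHA_D = 0.076, 0.095   # power-law exponents for params and data (approximate)
N_C, D_C = 8.8e13, 5.4e13         # fitted constants (approximate)

def loss(N: float, D: float) -> float:
    """Predicted cross-entropy loss (nats/token) for N non-embedding params and D tokens."""
    return ((N_C / N) ** (ALPHA_N / ALPHA_D) + D_C / D) ** ALPHA_D

def data_for_equal_loss(N_big: float, N_small: float, D_small: float) -> float:
    """Solve loss(N_big, x) = loss(N_small, D_small) for x."""
    target = loss(N_small, D_small)
    inner = target ** (1 / ALPHA_D) - (N_C / N_big) ** (ALPHA_N / ALPHA_D)
    # inner <= 0 means N_big can't reach the target loss even with unlimited data
    return D_C / inner if inner > 0 else float("inf")

# Example: a 1e9-param model matches a 1e8-param model trained on 1e10 tokens
# with roughly 10x fewer tokens - the "sample efficiency" claim in numbers.
print(loss(1e8, 1e10))
print(data_for_equal_loss(1e9, 1e8, 1e10))
```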
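On the batch-size bullet: my reading (hedged; the constants are from memory) is that the paper fits a "critical batch size" that depends only on the current loss, and the gradient noise scale is the measurable quantity that predicts it. A rough sketch:

```python
# Rough sketch of the critical-batch-size relation as I understand it.
# B_STAR and ALPHA_B are quoted from memory and should be treated as approximate.
B_STAR = 2e8       # tokens
ALPHA_B = 0.21

def critical_batch_size(loss_nats: float) -> float:
    """Batch size (in tokens) at the point of diminishing returns, given the current loss."""
    return B_STAR / loss_nats ** (1 / ALPHA_B)

# As the loss falls, the gradient noise scale grows, so progressively larger batches pay off.
for L in (4.0, 3.0, 2.0):
    print(L, f"{critical_batch_size(L):.2e}")
```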
My Thoughts:
The authors are OpenAI. Speculation: this paper found justification for more investment into training larger models. -> a couple of years later we get ChatGPT
YouTube Notes:
Video: YouTube Video
- Overfitting is governed (roughly) by the ratio N^0.74 / D (N=parameters, D=data), so data only needs to grow sublinearly with model size - roughly D ∝ N^0.74 (see the sketch after these video notes)
- Data is more valuable than 1 extra param
- nxm problem: optimal solution is n=m
- GPipe splits models depth-wise (pipeline parallelism: different layers on different devices)
- Idk didn't really add anything, I wouldn't recommend watching the video
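If I'm reading the N^0.74 bullet right, it's the paper's rule of thumb for how much data keeps overfitting negligible as you grow N. A tiny sketch (the 5e3 constant is from memory and approximate):

```python
# Rule of thumb (constant from memory; approximate): to keep overfitting negligible,
# dataset size should grow like D >~ 5e3 * N^0.74 - i.e. sublinearly in model size.
def tokens_needed(n_params: float) -> float:
    """Approximate tokens needed so overfitting stays negligible for n_params (non-embedding)."""
    return 5e3 * n_params ** 0.74

for n in (1e8, 1e9, 1e10):
    print(f"{n:.0e} params -> ~{tokens_needed(n):.1e} tokens")
```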
Paper Notes:
- Ok so they actually did do this on transformers
- N (model size) and D need to be increased together, otherwise you get overfitting
- Training curve can be extrapolated to roughly predict the loss that would be achieved by training for longer (wow!) simple but cool - see the extrapolation sketch after these notes
- Transfer improves the more train performance improves (transfer = evaluating the model on a distribution / data set different from the one it was trained on)
- Convergence is a waste of time and money. For a set amount of $$, make the model bigger and stop earlier (this is actually crazy useful knowledge wtf?) - see the compute-allocation sketch after these notes
- Loss can be predicted for given dataset size, param size, compute budget
- You can also calculate the compute needed to reach a certain loss
- Comparison of compute budget spend: convergence vs authors' method
- Typical researchers train to convergence
- Authors stop before convergence but make the models bigger
- To achieve the same loss, the authors' early stopping method requires 65% LESS COMPUTE (!!!!!)
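On the training-curve extrapolation point: a minimal, generic sketch (not the paper's exact parametrization, and the numbers below are made up) - fit a power law to the early part of a loss-vs-steps curve in log-log space and extrapolate:

```python
# Generic power-law extrapolation of a training curve (hypothetical data, not from the paper).
import numpy as np

steps  = np.array([1e3, 2e3, 4e3, 8e3, 1.6e4])    # steps logged so far (made up)
losses = np.array([5.1, 4.6, 4.2, 3.85, 3.55])    # corresponding losses (made up)

# Power law loss ~ a * steps^slope is a straight line in log-log space.
slope, intercept = np.polyfit(np.log(steps), np.log(losses), 1)
predict = lambda s: np.exp(intercept) * s ** slope   # slope comes out negative

print(predict(1e5))   # rough prediction of the loss after 100k steps
```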
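And to make "make the model bigger and stop earlier" concrete, a hedged sketch of how the paper's fits say a growing compute budget should be split (exponents from memory, approximate): almost all of the extra compute goes into parameters, comparatively little into data and serial steps.

```python
# Hedged sketch of compute-optimal allocation (exponents quoted from memory; approximate).
# On the compute-efficient frontier, roughly: N_opt ~ C^0.73, batch ~ C^0.24, steps ~ C^0.03.
# With the standard C ~ 6*N*D FLOPs approximation, tokens then grow like D ~ C^0.27.
def scale_up(compute_factor: float) -> dict:
    """How each training quantity should grow when compute grows by `compute_factor`."""
    return {
        "params": compute_factor ** 0.73,
        "tokens": compute_factor ** 0.27,   # implied by C ~ 6*N*D
        "batch":  compute_factor ** 0.24,
        "steps":  compute_factor ** 0.03,
    }

print(scale_up(10))   # 10x compute -> ~5.4x params, ~1.9x tokens, only ~1.1x serial steps
```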