On Emergence, Scaling and Inductive Bias
Disclaimer: Views are my own and do not express the opinions of my employer.
Language models are one of the most impactful research directions that drive modern NLP and AI today.
To this end, there have been tons of discussions revolving around how abilities emerge with scale, especially for large language models.
Basically, emergent abilities are capabilities that are present in large models but not in small models. This “emergent phenomenon” cannot be predicted at smaller scales but abruptly appears at a larger scale. In short, this looks like a graph that is flat-lined for the most part and suddenly spikes at a certain scale. Jason Wei writes an amazing blogpost here recapitulating common emergent abilities.
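As a purely illustrative sketch (toy numbers, not from any paper), here is roughly what such a flat-then-spike curve looks like when a hypothetical task metric sits at random chance until some compute threshold:

```python
import numpy as np
import matplotlib.pyplot as plt

# Toy illustration only: accuracy hovers near chance (25% on a 4-way task)
# across most of the compute range, then rises sharply past some threshold.
compute = np.logspace(18, 24, 100)  # hypothetical training FLOPs
chance = 0.25
accuracy = chance + (0.95 - chance) / (1 + np.exp(-4 * (np.log10(compute) - 22.5)))

plt.semilogx(compute, accuracy, label="toy 'emergent' task")
plt.axhline(chance, linestyle="--", label="random chance")
plt.xlabel("training FLOPs (log scale)")
plt.ylabel("task accuracy")
plt.legend()
plt.show()
```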
There is sufficient empirical evidence that emergent abilities are really a thing and they are becoming well-established in the field of scaling large language models.
Fundamentally, emergent abilities are mostly about a scaling curve w.r.t. a certain downstream task, so it is useful to keep a scaling law perspective in mind when discussing emergent abilities.
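For concreteness, scaling laws are often fit as a power law of compute plus an irreducible term; a minimal sketch with made-up coefficients (not fitted to any real model family):

```python
def power_law_loss(compute_flops, a=25.0, alpha=0.05, irreducible=1.7):
    """Toy scaling-law fit: loss = a * C^(-alpha) + irreducible.
    The coefficients here are made up for illustration, not real fitted values."""
    return a * compute_flops ** (-alpha) + irreducible

for flops in (1e20, 1e21, 1e22, 1e23):
    print(f"{flops:.0e} FLOPs -> predicted loss {power_law_loss(flops):.2f}")
```

Emergence, in this framing, is a downstream metric that refuses to follow such a smooth, predictable fit.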
Up until now, the x-axis in the scaling equation has mostly been compute (FLOPs), with everything else held standard, e.g., a vanilla causal Transformer-based language model. But what if other variables are altered?
There are a couple of perspectives and camps on this. One of the major camps believes in:
Scale is all you need.
This perspective postulates that in the grand scheme of things, only scale matters. It doesn’t matter what the underlying model is, i.e., you could scale a 1 trillion parameter MLP and it would do very well. This perspective also tries to downplay the effect of “inductive biases” or specialized architectures.
I get where this is coming from, given that even after 6 whole years, the vanilla Transformer still holds up quite well as the “best de facto sequence model”. Many modifications to Transformers do not scale, as we also empirically demonstrated in our paper on Scaling Laws vs Model Architectures.
Before moving on, it would be great to clear the air on:
Inductive bias?
On scientific Twitter, and perhaps just anecdotally, it appears to me that some folks do not vibe well with the term “inductive bias”. Sometimes, folks really do not like this term because it’s vague and ambiguous. Or they could just be ‘biased’ against it (no pun intended). Every time I try to answer a question with “inductive bias” as the answer, I get polarized responses: some people get it and some don’t, but that’s fine. As a hardcore empirical deep learning person, though, I find the term inductive bias perfect for describing “the one or few things you did to the model that enable it to do much better on some downstream task”. It’s hard to explain or find a bucket term for the conv-thingie or attention flowy thingie, so let’s just use “inductive bias” for now.
It also seems like the ability to comprehend the mystical inductive bias argument is itself an emergent ability: after training models for N hours and staring at M tensorboard logs, one will, in emergent fashion, acquire the ability to understand the term inductive bias in a very special way, yet be unable to explain what it means to anyone else. :)
How are convolutions different from self-attention? Inductive bias. How are masked language models different from causal language models? Inductive bias. How are encoder-decoders different from decoder-only models? Inductive bias…
Now that I’m done with my few-shot prompting exercise (ahem, grounded scientific explanation), I assume the reader is now well-conditioned.
So what about emergence?
In our recent paper “Transcending Scaling Laws with 0.1% Extra Compute”, we continued training PaLM with UL2’s mixture-of-denoisers objective and showed that this drastically changes the scaling curves across 8B -> 62B -> 540B on a suite of hard BIG-Bench tasks. (UL2 proposes a mix of objectives, such as prefix language modeling and long/short span corruption, that complements the original causal LM objective.) Again, this clearly shows how a different inductive bias (bidirectional attention plus a series of bidirectional span corruption pretraining tasks) can impact the acquisition of new capabilities.
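To make the “mixture-of-denoisers” idea a bit more concrete, here is a very rough sketch of the flavor of such an objective. This is my own toy framing, not the actual UL2 recipe or its hyperparameters: each example is routed to one of several corruption schemes, so the same model sees prefix-LM-style targets alongside short- and long-span infilling targets.

```python
import random

# Toy sketch only: not the actual UL2 implementation or its hyperparameters.
DENOISERS = [
    {"name": "prefix_lm",   "prob": 0.5},                   # predict a suffix given a prefix
    {"name": "short_spans", "prob": 0.25, "mean_span": 3},  # T5-style short span corruption
    {"name": "long_spans",  "prob": 0.25, "mean_span": 32}, # more aggressive, longer spans
]

def make_example(tokens):
    """Turn a token sequence into (inputs, targets) under a randomly sampled denoiser."""
    choice = random.choices(DENOISERS, weights=[d["prob"] for d in DENOISERS])[0]
    if choice["name"] == "prefix_lm":
        split = random.randint(1, len(tokens) - 1)
        return tokens[:split], tokens[split:]
    # Span corruption: mask one contiguous span and ask the model to fill it in.
    span_len = min(max(1, int(random.expovariate(1 / choice["mean_span"]))), len(tokens) - 1)
    start = random.randint(0, len(tokens) - span_len)
    inputs = tokens[:start] + ["<mask>"] + tokens[start + span_len:]
    targets = tokens[start:start + span_len]
    return inputs, targets

print(make_example("the quick brown fox jumps over the lazy dog".split()))
```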
The figure below (snapped from the above-mentioned paper) shows how U-PaLM 8B or U-PaLM 62B can sometimes match PaLM 540B on certain tasks just by a simple change in inductive bias. Moreover, this shows “early emergence”, i.e., emergent abilities appearing at a much smaller scale than purely scale-induced emergence.
Furthermore, in “Inverse scaling can become U-shaped”, we show how chain-of-thought prompting defends against inverse scaling. The Inverse Scaling Prize recently surfaced tasks that get worse as models are scaled up; using CoT prompting defends against this type of “inverse scaling”. Are new prompting methods inductive biases? I would think so. Hence, this reinforces the argument that scaling laws are tightly coupled with inductive bias, and here the prompting techniques are the inductive biases.
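As a concrete illustration of what this prompting-level “inductive bias” looks like (the exemplar below is adapted from the usual chain-of-thought examples, not necessarily the exact prompts used in the paper), the only change is that the few-shot answer spells out intermediate reasoning steps:

```python
standard_prompt = """Q: Roger has 5 tennis balls. He buys 2 more cans of 3 tennis balls each.
How many tennis balls does he have now?
A: 11

Q: {question}
A:"""

cot_prompt = """Q: Roger has 5 tennis balls. He buys 2 more cans of 3 tennis balls each.
How many tennis balls does he have now?
A: Roger starts with 5 balls. 2 cans of 3 balls each is 6 balls. 5 + 6 = 11. The answer is 11.

Q: {question}
A:"""

# Same task, same exemplar; the CoT version simply demonstrates the reasoning,
# nudging the model to generate its own intermediate steps before answering.
print(cot_prompt.format(question="A juggler has 16 balls. Half of them are golf balls. How many golf balls are there?"))
```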
The TL;DR is that inductive biases can significantly influence scaling, and this is probably both good and bad news. The good? We are probably not at the maximum potential yet, and new advances could still be complete game-changers. The bad? The search space just became insane.
Overall, this is a nascent research area and it is challenging for many obvious reasons. Firstly, the scaling properties of different inductive biases are difficult to predict, and it is costly to evaluate scaling curves for every new idea. Hence, new ideas are most often presented as a single point on a scaling curve. This is scary because ideas that are excellent at small scale could fail miserably in larger compute regions. Conversely, good ideas could be tossed out early because they did not work (or looked neutral-ish) at small scale but could have been critical for unlocking new capabilities at large scale.
Secondly, it is inherently hard to evaluate at large scale (200B+ parameters), and there are intrinsic challenges of scale, e.g., did the new idea not work because of some issue with large-scale training itself? Moreover, evaluation is challenging: it is not only tough to create benchmarks that probe for emergent abilities, but it is also easy to mess up evaluation by over-indexing or under-indexing on certain tasks/datasets.
The compositional effect of scale × inductive bias is extremely challenging to study.
Unfortunately, there isn’t really any go-to method for extrapolating performance across different compute regions. Meanwhile, the brute-force approach of systematically scaling and evaluating models in a “tournament” style (scale all ideas at 100M, then 1B, then 10B, etc., eliminating the k ideas that don’t pass each round; see the sketch after the list below) could be extremely computationally prohibitive. Open research questions remain:
What is the best way to design inductive biases that scale?
How can we predict emergent capabilities? Given that we know inductive bias matters for capabilities, this can be costly because we need to spend huge resources to find out!
How do we know what we’re looking for? How do we systematically probe for capabilities?
Do we gain some (emergent) capabilities while losing some?
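For what it’s worth, the “tournament” style approach mentioned above would look something like the sketch below. This is my own toy framing, with hypothetical idea names and random scores standing in for what would really be very expensive training-plus-evaluation runs:

```python
import random

def tournament(ideas, evaluate_fn, tiers=("100M", "1B", "10B", "100B"), k=2):
    """Scale every surviving idea to the next compute tier; drop the k weakest each round."""
    surviving = list(ideas)
    for tier in tiers:
        if len(surviving) <= k:
            break
        scores = {idea: evaluate_fn(idea, tier) for idea in surviving}
        # keep everything except the k lowest-scoring ideas at this tier
        surviving = sorted(scores, key=scores.get, reverse=True)[:-k]
    return surviving

# Toy usage: random scores stand in for full pretraining + benchmark runs.
ideas = ["vanilla", "new_attention", "new_objective", "new_norm", "new_ffn"]
print(tournament(ideas, lambda idea, tier: random.random()))
```

Even in this toy form it is clear why this is prohibitive: the cost is dominated by the largest tiers, which is exactly where you can least afford to evaluate every surviving idea.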
These are really exciting times for large language model research but also perplexing at the same time!
Apart from “just scaling up”, understanding what scales and what doesn’t is also a highly important endeavor.
Leave a comment, quote this tweet or email me at ytay017 (gmail account) for any thoughts.
See you again next time!
Acknowledgements:
Special thanks to Jason Wei, Mostafa Dehghani, Vinh Tran and Hyung Won for feedback and comments on the draft!
—-
This is my first blog post, I’ll be blogging more. Do subscribe to the blog to not miss a post!