What happened to BERT & T5? On Transformer Encoders, PrefixLM and Denoising Objectives

People who worked on language and NLP 5+ years ago are left scratching their heads about where all the encoder models went. If BERT worked so well, why not scale it? What happened to encoder-decoders or encoder-only models? 

Today, I try to unpack all that is going on in this new era of LLMs. I hope this post will be helpful.

A few months ago, I was writing a long tweet-reply to this tweet by @srush_nlp. Then the reply got lost because I closed the tab by accident. ¯\_(ツ)_/¯ I promised to write it up as a blog post some day.

So here it is!

This will be the first part of a series of blog posts I plan to write about model architectures in the era of LLMs (I hope). 

A quick primer (Skip connection to next section if you feel confident)

There have been mainly three overarching paradigms of model architectures in the past couple of years: encoder-only models (e.g., BERT), encoder-decoder models (e.g., T5) and decoder-only models (e.g., the GPT series). People get confused a lot about this and often have tons of misconceptions about these dichotomies and architectures, so I’m hoping this post will help.

The first thing to really understand is that encoder-decoder models are actually still autoregressive models. The decoder in an encoder-decoder model is literally and fundamentally still a causal decoder. Instead of pre-filling a decoder model, some text can be offloaded to an encoder, whose representations are then sent to the decoder via cross-attention. Yes, T5 models are also language models!

A variant of this is the Prefix Language Model, or PrefixLM architecture, which does almost the same thing minus the cross-attention (and some other small details, like sharing weights between encoder/decoder and not having an encoder bottleneck). PrefixLMs are also sometimes known as non-causal decoders. In short, encoder-decoders, decoder-only models and PrefixLMs are not that different altogether!
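
To make the attention-mask distinction concrete, here is a minimal NumPy sketch (my own toy code, not from any of these papers) of a causal mask versus a PrefixLM mask, where `True` means "position i may attend to position j":

```python
import numpy as np

def causal_mask(seq_len: int) -> np.ndarray:
    # Standard decoder-only mask: token i attends only to tokens <= i.
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))

def prefix_lm_mask(seq_len: int, prefix_len: int) -> np.ndarray:
    # Same as causal, except tokens inside the prefix attend to the whole
    # prefix bidirectionally; the target/suffix part stays causal.
    mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))
    mask[:prefix_len, :prefix_len] = True
    return mask

print(causal_mask(4).astype(int))
print(prefix_lm_mask(4, prefix_len=2).astype(int))
```

An encoder-decoder gets a similar effect through a separate encoder stack plus cross-attention instead of a single shared stack with a modified mask.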

In his latest excellent lecture, Hyung Won masterfully explains the relationships between these models. You can check it out here. Good stuff.

Meanwhile, encoder-only models such as the OG BERT do denoising differently (i.e., in-place) and, to some extent, rely on classification “task” heads to do anything useful with the base model after pretraining. The denoising objective was later adopted in models like T5 in an “adapted style” using the sequence-to-sequence format.

To this end, it is worth noting that denoising in T5 is not exactly a new objective function per se (in the machine learning sense) but rather a data transformation across inputs; i.e., you can also train the span corruption objective with a causal decoder!

People often assume encoder-decoder models have to be denoising models, partly because of the overly representative T5 model. However, this is not always true. You can train an encoder-decoder with a regular language modeling task (i.e., CLM). Conversely, you can also train a causal decoder with the span corruption task. As I’ve said earlier, this is mostly a data transformation.
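
As a rough illustration of the “data transformation” point, here is a toy sketch of T5-style span corruption (the sentinel naming follows T5’s `<extra_id_*>` convention, but the span positions are hand-picked here rather than sampled, and the helper is hypothetical). The resulting (input, target) pair can be fed to an encoder-decoder, or simply concatenated and trained with a causal decoder with the loss restricted to the target portion:

```python
def span_corrupt(tokens, spans):
    """Toy T5-style span corruption. spans: list of (start, length) to corrupt."""
    span_starts = {start: length for start, length in spans}
    inputs, targets = [], []
    i, sentinel = 0, 0
    while i < len(tokens):
        if i in span_starts:
            length = span_starts[i]
            inputs.append(f"<extra_id_{sentinel}>")          # sentinel replaces the span
            targets.append(f"<extra_id_{sentinel}>")         # target repeats the sentinel...
            targets.extend(tokens[i:i + length])             # ...followed by the hidden span
            sentinel += 1
            i += length
        else:
            inputs.append(tokens[i])
            i += 1
    targets.append(f"<extra_id_{sentinel}>")                 # closing sentinel, T5-style
    return inputs, targets

toks = "the quick brown fox jumps over the lazy dog".split()
inp, tgt = span_corrupt(toks, spans=[(1, 2), (6, 1)])
print("encoder input :", inp)  # ['the', '<extra_id_0>', 'fox', 'jumps', 'over', '<extra_id_1>', 'lazy', 'dog']
print("decoder target:", tgt)  # ['<extra_id_0>', 'quick', 'brown', '<extra_id_1>', 'the', '<extra_id_2>']
# Decoder-only variant: train a causal LM on inp + tgt, scoring only the tgt part.
```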

It is also worth noting that, generally speaking, an encoder-decoder of 2N parameters has the same compute cost as a decoder-only model of N parameters, which gives it a different FLOP-to-parameter-count ratio. This is somewhat like “model sparsity” that is split across inputs and targets.
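
A back-of-the-envelope sketch of this point, using the common ~2 × params FLOPs-per-token approximation and ignoring cross-attention (the lengths here are hypothetical):

```python
N = 1_000_000_000            # decoder-only model with N parameters
enc = dec = 1_000_000_000    # encoder-decoder with 2N total parameters (N per side)

L_in, L_out = 512, 512       # hypothetical input/target lengths

# Decoder-only: every token (inputs + targets) passes through all N parameters.
flops_decoder_only = 2 * N * (L_in + L_out)

# Encoder-decoder: input tokens only pass through the encoder and target tokens
# only through the decoder, so per-token compute matches the N-parameter model.
flops_encoder_decoder = 2 * enc * L_in + 2 * dec * L_out

print(flops_decoder_only == flops_encoder_decoder)  # True: same FLOPs, double the parameters
```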

This is nothing new, and I didn’t come up with anything new here. It’s already in the 2019 T5 paper and reemphasized in the UL2 paper. 

For now, glad to get this out of the way. Now onto objectives.

On the denoising objective (does it not work? does it not scale? is it too easy?)

The denoising objective I’m referring to is any variation of the “span corruption” task. This is sometimes known as “infilling” or “fill in the blank”. There are variations on how to express it (e.g., span length, randomness, sentinel tokens, etc.), but you get the gist.

While the denoising objective in BERT-style models is mostly “in-place” (e.g., a classification head on top of the mask tokens), the slightly more modern way is to do it “T5-style”, i.e., as a data transformation that can be processed by an encoder-decoder or decoder-only model. In such a data transformation, masked tokens are just “moved to the back” for the model to predict.

The primary goal of pretraining is to build a useful internal representation that can be aligned for downstream tasks in the most efficient and effective way possible. The better the internal representations, the easier it is to use them for anything useful later. The simple next-word prediction “causal language modeling” objective is known to do this very well and has served as the bread and butter of the LLM revolution. The question now at hand is whether the denoising objective is just as good.

From publicly available information, we know that T5-11B works pretty well even after being aligned/SFT-ed (Flan-T5 XXL’s MMLU score is 55+, which is more than decent for a model of this scale and of that time). Hence, we can conclude that the transfer process (pretraining -> alignment) of the denoising objective works reasonably well at this scale.

My take is that denoising objectives are great but pretty insufficient as a standalone objective. A big drawback is what we could call reduced “loss exposure”. In denoising objectives, only a small fraction of tokens is masked and learned from (i.e., taken into account in the loss). Conversely, in regular language modeling, this is close to 100%. This makes for pretty low sample efficiency per FLOP, which puts denoising objectives at a huge disadvantage in a FLOP-for-FLOP comparison.
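
To put a rough number on the “loss exposure” point (15% is T5’s default corruption rate; the sequence length here is arbitrary):

```python
seq_len = 512
corruption_rate = 0.15      # T5's default span-corruption rate

# Causal LM: essentially every position in the sequence contributes to the loss.
clm_supervised = seq_len

# Span corruption: only the corrupted spans (moved to the target side) are scored.
denoise_supervised = int(seq_len * corruption_rate)

print(clm_supervised, denoise_supervised)      # 512 vs 76
print(denoise_supervised / clm_supervised)     # ~0.15 of the loss signal per sequence
```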

Another drawback is that denoising objectives are more unnatural than regular language modeling, since they reformat the input/output in a somewhat strange way, making them a little awkward for few-shot learning. (It’s still possible to massage these models to do reasonably okay on few-shot tasks, though.) Hence, I believe denoising objectives should pretty much only be used as a complementary objective to regular language modeling.

The early days of unification and why xBERTs went extinct

The gradual phasing out of BERT-like models was an interesting transition that not many people talk about these days. It was subtle. This could also explain why we don’t see any mega large-scale BERT models running around anymore. The reason? It was largely a matter of unification and a shift in task/modeling paradigms. BERT-style models are cumbersome, but the real reason BERT models got deprecated was that people wanted to do all tasks at once, which led to a better way of doing denoising: using autoregressive models.

During 2018-2021, there was an implicit paradigm shift from single-task finetuning to massively multi-task models. This slowly gravitated us towards the unified “SFT” models that we see today, which are universal and general purpose. It was simply so hard to do this with BERT. I don’t think this has much to do with “denoising” at all. People simply found a way to re-express denoising pretraining tasks if they wanted to use such a model (i.e., T5), which made BERT-style models pretty much deprecated at this point because there was a strictly better alternative.

To be even more concrete, encoder-decoder and decoder-only models were able to express multiple tasks at once without the need for task-specific classification heads. For encoder-decoders, if the decoder was getting in the way, researchers and engineers also began to find that yanking out the encoder and using it on its own performed just as competitively as a BERT encoder. Moreover, it retains the same bidirectional attention benefit that made BERT competitive over GPT models at small (often production) scale.

The value of denoising objectives

A denoising pretraining objective also learns to predict the next word, in a way similar to regular language modeling. However, unlike regular causal language modeling, a data transformation is applied to the sequence such that the model learns to “fill in the blanks” instead of simply predicting the naturally occurring left-to-right text.

Notably, denoising objectives are sometimes also called “infilling tasks” and are sometimes mixed into pretraining alongside regular language modeling tasks.

While the exact configuration and implementation details can vary, modern LLMs today may use a combination of language modeling and infilling in some capacity. It is actually interesting how this mixture of LM + infilling seems to have propagated around the same time (e.g., UL2, FIM, GLM, CM3), with many groups bringing their own flavor of the mixture in some way. On a side note, the largest publicly disclosed & reported model trained in this style is likely the PaLM-2 model.

It is also worth noting that pretraining task mixtures could also be stacked sequentially and do not necessarily have to be mixed concurrently; i.e., Flan-T5 originally trains on 1T span corruption tokens and then switches to 100B tokens of the prefix language modeling objective before Flan instruction tuning. To some extent, this qualifies as a mixed denoising/LM objective model. To be clear, the prefix language modeling objective (not to be confused with the architecture) is simply causal language modeling with a randomly determined split point, where everything before the split is sent to the input side (with no loss and non-causal masking).
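
A minimal sketch of this prefix LM objective (the function name and split policy are mine): take a regular causal-LM sequence, pick a random split point, send the left part to the input side with no loss (and, per the architecture discussion earlier, bidirectional attention over it), and keep the right part as targets:

```python
import random

def to_prefix_lm_example(tokens, seed=0):
    """Turn a causal-LM sequence into a prefix-LM training example."""
    rng = random.Random(seed)
    split = rng.randint(1, len(tokens) - 1)   # random split point
    prefix, targets = tokens[:split], tokens[split:]
    # Loss mask: 0 over the prefix (no loss), 1 over the targets.
    loss_mask = [0] * len(prefix) + [1] * len(targets)
    # The attention mask would be bidirectional over the prefix
    # (see the prefix_lm_mask sketch earlier) and causal over the targets.
    return prefix, targets, loss_mask

toks = "the quick brown fox jumps over the lazy dog".split()
print(to_prefix_lm_example(toks))
```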

On a side note, infilling could have originated from the world of code LLMs, where filling in the blank was more of a feature desired by coding applications. Meanwhile, UL2 was motivated more by unifying the class of tasks that denoising objectives and bidirectional LLMs do well on with inherently generative tasks (such as summarization or open-ended generation). An advantage of this autoregressive-style “shift to the back” denoising is that it allows the model not only to learn longer-range dependencies but also to implicitly benefit from non-explicit bidirectional attention (since you would already have seen the future in order to fill in the blank).

Anecdotal experience is that denoising objectives learn representations that are better at certain classes of tasks, sometimes in a more sample-efficient way. In the U-PaLM paper, we showed how a small amount of span corruption up-training changes behavior and emergence on a set of BIG-Bench tasks. On top of that, finetuning models trained with this objective generally results in better supervised fine-tuned models, especially at smaller scales.

When it comes to single-task finetuning, you can see the OG PaLM-1 62B model gets defeated by a much smaller T5 model. Bidirectional attention + denoising objective packs a punch at a relatively small scale! I’m sure many practitioners see this happen these days as well, especially in production.

What about bidirectional attention?

Bidirectional attention is an interesting “inductive bias” for language models - one that is commonly conflated with objectives and model backbones. The usefulness of an inductive bias changes across compute regimes and could have different effects on scaling curves in different regimes. That said, it could be true that bidirectional attention doesn’t matter that much at larger scales compared to smaller scales, or that it has different impacts on different tasks or modalities.

As Hyung Won also points out in his lecture, PrefixLM models (decoder-only models with bidirectional attention over the prefix) have an issue with caching, which is an intrinsic drawback of this type of architecture. However, I think there are many ways to work around this flaw, which are out of scope for this post.
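
To see the caching issue, compare the first few rows of a PrefixLM mask before and after the prefix grows (e.g., when a model’s previous turn becomes part of the next turn’s prefix). With a plain causal mask these rows never change, so cached keys/values remain valid; with a PrefixLM mask they do change, so the affected positions have to be re-encoded. A toy sketch, reusing the mask helper from earlier:

```python
import numpy as np

def prefix_lm_mask(seq_len: int, prefix_len: int) -> np.ndarray:
    mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))
    mask[:prefix_len, :prefix_len] = True
    return mask

# Causal decoder: extending the sequence never changes earlier rows of the mask,
# so old tokens' cached keys/values stay valid.
# PrefixLM: extending the *prefix* changes what earlier prefix positions attend to.
print(prefix_lm_mask(5, prefix_len=3)[:3].astype(int))
print(prefix_lm_mask(6, prefix_len=4)[:3].astype(int))  # rows 0-2 now differ from above
```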

Pros and Cons of Encoder-Decoder architectures

Encoder-decoder architectures actually have some pros versus regular decoder-only models. The first is that the encoder side is not restricted by a causal mask. To some extent, you can go crazy with the attention layers by doing aggressive pooling or any form of linear attention without worrying about the autoregressive design restriction. This is a good way to offload less important “context” to an encoder. You can also make the encoder smaller, which is neat.

One example where the encoder-decoder architecture was necessary was Charformer, where we could go crazy on the encoder and mitigate the speed drawbacks of byte-level models. Encoder-side innovations allow quick wins without worrying about major redesigns of the causal mask.

Meanwhile, one negative point of the encoder-decoder versus the PrefixLM is that inputs and targets have to have fixed allocated budgets. For instance, if the input budget is 1024 tokens, the encoder side has to be padded to this value, which creates a lot of potential for wasted compute. Conversely, in PrefixLM, inputs and targets can just be directly concatenated, which mitigates this problem.
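
A toy illustration of the padding point (the budgets and lengths are made up, and this ignores sequence packing, which can claw back some of the waste):

```python
input_budget, target_budget = 1024, 1024     # fixed encoder/decoder budgets
actual_input_len, actual_target_len = 300, 150

# Encoder-decoder: both sides are padded up to their fixed budgets.
enc_dec_positions = input_budget + target_budget        # 2048 positions processed
useful_tokens = actual_input_len + actual_target_len    # 450 of them carry real tokens
print(f"encoder-decoder utilization: {useful_tokens / enc_dec_positions:.0%}")  # ~22%

# PrefixLM: inputs and targets are simply concatenated, so (packing aside)
# only the real tokens occupy positions.
prefix_lm_positions = actual_input_len + actual_target_len
print(f"prefixlm utilization: {useful_tokens / prefix_lm_positions:.0%}")       # 100%
```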

Relevance to models today and key takeaways

The ability to reason about inductive biases, both from an architectural and a pretraining perspective, is a critical aspect of being a competent LLM researcher and practitioner today. Understanding the fundamental nuances helps one extrapolate and continue to innovate.

Here are my key takeaways:

  1. Encoder-decoder and decoder-only models are both autoregressive models that have implementation-level differences and pros/cons. They carry subtly different inductive biases. Optimal usage really depends on the downstream use case and, pretty much, application constraints. Meanwhile, for most LLM usage, and niche use cases aside, BERT-style encoder models are mostly considered deprecated.

  2. Denoising objectives are mostly complementary to CLM. They have simply found their way in as “supporting objectives” in pretraining. Training with CLM plus denoising objectives usually helps in some way. While this happens very frequently in code models (i.e., code infilling), it is not uncommon (though not mandatory) for general-purpose models today to pretrain with CLM plus some denoising objective.

  3. Bidirectional attention helps a lot at smaller scales but is generally optional at larger scales. This is mostly anecdotal. I see bidirectional attention as a form of inductive bias, just like many other types of modifications to Transformer models.

Finally, to recap: we don’t see any scaled-up xBERTs running around because BERT models got deprecated in favor of the more flexible denoising (autoregressive) T5 models. This is largely due to paradigm unification, where people would like to perform any task with a general-purpose model (as opposed to a task-specific model). Meanwhile, denoising sometimes gets folded in as a side objective for causal language models.

Final words & Acknowledgements

It’s really fun to write about model architecture. I hope to write more about LLMs and AI research in general.

I thank Hyung Won Chung and Vinh Tran for feedback on this post.

Yi Tay

Chief Scientist & Cofounder at Reka. Formerly Senior Research Scientist at Google Brain.

https://yitay.net