An exploration of how the transition from RNNs to transformers and the discovery of double descent enabled the scaling of large language models like GPT.
Back in 2017, artificial intelligence was dominated by relatively small neural networks. The golden rule of machine learning was simple: do not make your models too big. The common wisdom was that if you built a model with too many parameters, it would just memorize the training data instead of actually learning how to solve new problems.
When OpenAI built the first Generative Pre-trained Transformer, or GPT-1, in 2018, it was tiny. It was sized just to fit on a single development computer. Google followed suit with its own model, BERT, keeping it the same size for comparison.
Then came a revolutionary concept called double descent. Traditional wisdom said that as a model grows, it gets better, then worse as it begins to overfit. But researchers decided to keep going anyway. They discovered that if you make a model massively larger, past a critical threshold, something incredible happens. The performance stops degrading and suddenly starts getting much, much better. The model actually begins to generalize.
This eureka moment changed everything. It proved that scaling worked. OpenAI immediately pushed the limits, eventually building GPT-3 with one hundred and seventy-five billion parameters. This unlocked brand-new, emergent abilities, triggering the massive explosion of AI models we see today, from ChatGPT and Gemini to major breakthroughs in computer vision and biology.
Picture this. It’s 2017, the era dominated by Recurrent Neural Networks (RNNs) and Convolutional Neural Networks (CNNs), and LSTMs are cutting edge. These models are tiny, and the common wisdom is that overparameterized models don’t generalize well because they memorize everything. Then the transformer paper comes out, and one year later, GPT-1.
In 2018, Alec Radford decided GPT-1 would have 12 layers and 768 dimensions. Why? At that size, the model would fit on his dev box and training would take about a month, which was at the edge of his patience.
That’s it. No other reason.
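For a sense of scale, here is a back-of-the-envelope parameter count for a 12-layer, 768-dimension GPT-style transformer. Treat it as a rough sketch, not Radford's code: the vocabulary size (40,478), the context length (512) and the 4x feed-forward width are assumptions taken from the commonly published GPT-1 configuration rather than from anything above, and `gpt_param_count` is a made-up helper name.

```python
def gpt_param_count(n_layers=12, d_model=768, vocab_size=40478, n_ctx=512):
    """Rough parameter count for a GPT-style decoder-only transformer.

    Assumes the output layer reuses the token embedding matrix, so it adds
    no parameters of its own.
    """
    d_ff = 4 * d_model  # assumed feed-forward width (4x the hidden size)

    # Token embeddings plus learned position embeddings.
    embeddings = vocab_size * d_model + n_ctx * d_model

    # Self-attention: Q, K, V and output projections, each with a bias.
    attention = 4 * (d_model * d_model + d_model)

    # Feed-forward block: two linear layers, each with a bias.
    feed_forward = (d_model * d_ff + d_ff) + (d_ff * d_model + d_model)

    # Two layer norms per block, each with a scale and a bias vector.
    layer_norms = 2 * (2 * d_model)

    per_layer = attention + feed_forward + layer_norms
    return embeddings + n_layers * per_layer


if __name__ == "__main__":
    total = gpt_param_count()
    print(f"{total:,} parameters (~{total / 1e6:.0f}M)")
```

Under those assumptions the count lands at roughly 117M parameters. At 4 bytes per parameter that is under half a gigabyte of weights.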
Google then went, “oh cool,” and trained BERT.
Guess how many dimensions?

“BERT BASE was chosen to have the same model size as OpenAI GPT for comparison purposes.” That line is straight from the BERT paper.
Both these models were so small they’d fit on your laptop and run just fine. Because who in their right mind would go against common sense and purposely overparameterize a model to see what happens? Right?
Mikhail Belkin did.
I first heard of his paper in a 2019 video that flagged it as ‘really interesting’, but the concept completely blew my mind.
It goes something like this. Put model size on the X axis and loss on the Y axis (smaller is better). As the model grows, the loss first drops, then starts climbing back up. That’s called overfitting, and it’s like memorizing everything for the exam but not being able to apply the knowledge outside of the exact exam material.
Then Mikhail goes, “to hell with it, let’s keep going and see what happens, like whatever”, but he said it using the smart words computer scientists use in their papers, like “bias-variance trade-off curve” and so on.
The larger the model, the worse it gets, until suddenly... BOOM! It starts to generalise.
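You can watch a toy version of this curve on your own laptop. The sketch below is not the experiment from the paper. It is a small linear regression where "model size" is just the number of input features the model is allowed to use, fitted with minimum-norm least squares via `np.linalg.pinv`. The function name `double_descent_curve` and every number in it (40 training points, 200 features, the noise level) are assumptions picked for illustration.

```python
import numpy as np


def double_descent_curve(n_train=40, n_total_features=200, noise_std=0.5,
                         feature_counts=(5, 10, 20, 30, 40, 50, 80, 120, 200),
                         n_trials=20, n_test=2000, seed=0):
    """Average test MSE of min-norm least squares for each model size."""
    rng = np.random.default_rng(seed)
    # The true signal is spread evenly over all features (unit norm).
    beta = np.ones(n_total_features) / np.sqrt(n_total_features)
    errors = {p: [] for p in feature_counts}

    for _ in range(n_trials):
        X_train = rng.standard_normal((n_train, n_total_features))
        X_test = rng.standard_normal((n_test, n_total_features))
        y_train = X_train @ beta + noise_std * rng.standard_normal(n_train)
        y_test = X_test @ beta + noise_std * rng.standard_normal(n_test)

        for p in feature_counts:
            # The model only sees the first p features. Once p exceeds
            # n_train it can fit the training set exactly, and pinv picks
            # the smallest-norm solution among all exact fits.
            w = np.linalg.pinv(X_train[:, :p]) @ y_train
            errors[p].append(np.mean((X_test[:, :p] @ w - y_test) ** 2))

    return {p: float(np.mean(v)) for p, v in errors.items()}


if __name__ == "__main__":
    for p, err in double_descent_curve().items():
        print(f"{p:4d} features  test MSE = {err:10.2f}")
```

The test error should spike when the number of features matches the number of training points (40 here), which is the point where the model can just barely memorize the training set, and then come back down as more features are added. Whether the second dip ends up below the first one depends on the setup, so treat the exact numbers as illustrative.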
And it was this precise EUREKA! moment that kickstarted the new AI revolution. Both Google and OpenAI now knew that scaling models is in fact possible, and that’s all they needed: the reassurance that it can be done.
OpenAI is like… what if we go crazy big? So with GPT-2 they go from 117M to 1.5B parameters. Lunatics! The model demonstrates emergent abilities and sparks a debate about whether to release it. Google’s taking notes.

June 2020 | GPT-3 (175B params) – Shows in-context learning, few-shot capabilities.

The double descent paper was the spark that ignited the million-model explosion.

2021-2024 | Explosion: T5, PaLM, ChatGPT, GPT-4, Claude, Llama, Gemini, and countless others.
The roughly three-year period from 2017 to 2020 completely transformed NLP, and the transformer architecture has since conquered vision (ViT), protein folding (AlphaFold), and basically every domain that involves sequences or sets.