Going Deeper with AI: How LLMs Work
An LLM is called large mainly because of its enormous number of parameters, trained on a very large dataset, and because it can accept a large input / context. After Google published the Attention Is All You Need paper, the transformer architecture became the gold standard that LLMs are built on.
As far as I can tell, prior to the transformer architecture, language models were usually built on recurrent neural networks (RNNs).
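The heart of that transformer architecture is scaled dot-product attention. A minimal sketch with NumPy (the shapes and random embeddings here are illustrative, not taken from any real model):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Core transformer operation from "Attention Is All You Need".
    Each output row is a weighted average of V's rows, with the weights
    derived from how well each query matches each key."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # similarity of every query to every key
    # Softmax over the keys (stabilised by subtracting the row max)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

# Toy self-attention: 3 tokens with 4-dimensional embeddings, Q = K = V
rng = np.random.default_rng(0)
x = rng.normal(size=(3, 4))
out = scaled_dot_product_attention(x, x, x)
print(out.shape)  # (3, 4): one context-aware vector per token
```

Each token's output vector ends up blended with the vectors of the tokens it attends to, which is how the model relates words to one another across the whole input at once, rather than one step at a time like an RNN.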
How they learn
Inside an LLM lie its parameters: the weights and biases that decide how much influence every part of the input has on the output.
Imagine we have the phrase “What’s the story? Morning glory”.
As we train the LLM, it'll encounter the phrase over and over again – (What's the Story) Morning Glory? is arguably Oasis' magnum opus anyway. On each training pass, the LLM's parameters adjust the weight of each word: how predictable it is, and how strongly each word correlates with the others.
Given an input like so: What’s the story?
The LLM is “educated” enough to predict that the output should be Morning glory. It won’t generate Nightly gloomy as the output! Is there even a famous text stating “What’s the story? Nightly gloomy”? Probably not!
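A crude way to get intuition for this is a bigram counter: tally which word follows which in a corpus, then predict the most frequent successor. This is a tiny stand-in for the statistical associations an LLM's parameters encode, and the miniature corpus below is invented for illustration:

```python
from collections import Counter, defaultdict

# Hypothetical miniature corpus; real LLMs train on billions of phrases.
corpus = [
    "what's the story morning glory",
    "what's the story morning glory",
    "what's the story morning glory",
    "tell me a story about dragons",
]

# Count which word follows which (a bigram model).
follows = defaultdict(Counter)
for line in corpus:
    words = line.split()
    for a, b in zip(words, words[1:]):
        follows[a][b] += 1

def predict_next(word):
    """Return the most frequently observed next word."""
    return follows[word].most_common(1)[0][0]

print(predict_next("story"))    # 'morning' — seen 3 times vs 'about' once
print(predict_next("morning"))  # 'glory'
```

Because "morning" follows "story" far more often than anything else in the training data, the model predicts it; "nightly" never appears after "story" at all, so it is never generated. Real LLMs do something far richer over entire contexts, but the principle – frequent associations win – is the same.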
Because “What’s the story? Morning glory” is culturally associated with Oasis, depending on our prompt, the LLM would also include the band in its output. Now, if the LLM thinks the phrase has a positive association with Blur, one ought to be worried…
Trade-offs of a large number of parameters
- Computing can be more expensive
- Output can take longer to generate
This may well be why we have LLMs designed for specific use cases, e.g. coding, science, or literature.