Going Deeper with AI: A Brief on Tokenisation
Tokenisation is the process of splitting text into smaller pieces known as tokens. LLMs and humans interpret text differently.
Given the following phrase: What’s the story? Morning glory.
An LLM could view it as:
- What’s
- the
- story?
- Morning
- glory.
Each of these represents a single token, and each token is assigned a unique ID.
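A toy sketch of this idea (this is just illustrative whitespace splitting, not how a real LLM tokenizer works):

```python
# Toy illustration: split on whitespace and assign each
# distinct token a unique integer ID.
phrase = "What’s the story? Morning glory."
tokens = phrase.split()

vocab = {}  # token -> ID
for tok in tokens:
    if tok not in vocab:
        vocab[tok] = len(vocab)

ids = [vocab[tok] for tok in tokens]
print(tokens)  # ['What’s', 'the', 'story?', 'Morning', 'glory.']
print(ids)     # [0, 1, 2, 3, 4]
```

Real tokenizers build their vocabularies from huge corpora rather than one phrase, but the token-to-ID mapping works the same way.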
A different LLM could view it as:
- What
- ’s
- the
- story?
- Morn
- ing
- glory.
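The subword splits above can be sketched with a greedy longest-match tokenizer, a simplified stand-in for how BPE-style tokenizers behave (the vocabulary here is made up for the example):

```python
# Toy greedy longest-match subword tokenizer. VOCAB is a hypothetical
# vocabulary chosen to reproduce the splits shown above.
VOCAB = {"What", "’s", "the", "story?", "Morn", "ing", "glory."}

def tokenize(word):
    # At each position, match the longest vocabulary entry.
    pieces, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in VOCAB:
                pieces.append(word[i:j])
                i = j
                break
        else:
            pieces.append(word[i])  # fall back to a single character
            i += 1
    return pieces

print(tokenize("Morning"))  # ['Morn', 'ing']
print(tokenize("What’s"))   # ['What', '’s']
```

Because "Morning" isn't in this vocabulary as a whole word, it gets split into the pieces "Morn" and "ing" that are.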
It seems to me that splitting words into subword pieces like this helps the LLM improve its pattern recognition.
Here’s an article by Sean Trott that goes in-depth on tokenisation: https://seantrott.substack.com/p/tokenization-in-large-language-models.
LLM providers use token counts to price usage: you are charged for both input and output tokens.
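The arithmetic is straightforward. The rates below are invented for illustration, not any provider's actual prices:

```python
# Hypothetical per-token rates (made up for this example).
input_rate = 3.00 / 1_000_000    # $ per input token
output_rate = 15.00 / 1_000_000  # $ per output token

input_tokens, output_tokens = 1_200, 400
cost = input_tokens * input_rate + output_tokens * output_rate
print(f"${cost:.4f}")  # $0.0096
```

Output tokens are typically priced higher than input tokens, which is why long generated responses can dominate the bill.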
For my purposes right now as a newbie AI orchestrator, knowing what tokens are should suffice.