Going Deeper with AI: A Brief on Tokenisation

Tokenisation is the process of splitting text into smaller pieces known as tokens. LLMs and we humans interpret text differently.

Given the following phrase: What’s the story? Morning glory.

An LLM could view it as:

  • What’s
  • the
  • story?
  • Morning
  • glory.

Each of these is a single token, and each token is assigned a unique ID.

A different LLM could view it as:

  • What
  • ’s
  • the
  • story?
  • Morn
  • ing
  • glory.
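The two splits above can be sketched in code. This is a toy illustration, not a real LLM tokenizer: the whitespace split and the hypothetical subword split below stand in for the two tokenizers, and the IDs are just assigned in order of first appearance.

```python
phrase = "What’s the story? Morning glory."

# Tokenizer A: a simple whitespace split, like the first example above
tokens_a = phrase.split()

# Tokenizer B: a hypothetical subword split, like the second example above
tokens_b = ["What", "’s", "the", "story?", "Morn", "ing", "glory."]

def assign_ids(tokens):
    """Map each distinct token to a unique integer ID."""
    ids = {}
    for tok in tokens:
        if tok not in ids:
            ids[tok] = len(ids)
    return ids

print(tokens_a)            # ['What’s', 'the', 'story?', 'Morning', 'glory.']
print(assign_ids(tokens_a))
print(assign_ids(tokens_b))
```

Real tokenizers learn their splits from data (e.g. byte-pair encoding) rather than using fixed rules like these, but the token-to-ID mapping works the same way.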

It seems to me that splitting words into subwords like this helps the LLM recognise patterns shared across related words.

Here’s an article by Sean Trott that goes in-depth on tokenisation: https://seantrott.substack.com/p/tokenization-in-large-language-models.

LLM providers use token counts to price both input and output.
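Here is a rough sketch of how that pricing works. The token counts and per-million-token rates below are made-up numbers for illustration, not any provider's actual rates.

```python
# Hypothetical request: counts and prices are assumed, not real rates
input_tokens = 1_200
output_tokens = 300

price_per_million_input = 3.00    # dollars per 1M input tokens (assumed)
price_per_million_output = 15.00  # dollars per 1M output tokens (assumed)

# Cost scales linearly with the number of tokens in each direction
cost = (input_tokens / 1_000_000) * price_per_million_input \
     + (output_tokens / 1_000_000) * price_per_million_output

print(f"${cost:.4f}")
```

Output tokens are typically priced higher than input tokens, which is why the two rates are separate.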

For my purposes right now, as a newbie AI orchestrator, knowing what tokens are should suffice.