What is Tokenization?

Hello👋, I'm Ashraful Malik, a web developer and web designer. I create websites that not only look great but also function smoothly.

Imagine you have a toy box full of letters. To play with them, you give each letter a number, like this:

  • A → 1

  • B → 2

  • C → 3

  • D → 4

So if you write HELLO, it turns into the numbers 8, 5, 12, 12, 15.
That’s called tokenization → changing text into numbers that a computer (or AI) can understand.
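To make the toy box idea concrete, here is a minimal sketch of that letter-to-number mapping. The function names `letterToNumber` and `toyEncode` are made up for this illustration, and the sketch assumes only uppercase A to Z matter.

```javascript
// Toy mapping from the example: A → 1, B → 2, ..., Z → 26.
// Assumption: only uppercase A-Z is handled; anything else becomes 0.
function letterToNumber(letter) {
  const code = letter.charCodeAt(0);
  const isUppercase = code >= 65 && code <= 90; // "A" is 65, "Z" is 90
  return isUppercase ? code - 64 : 0;
}

// Turn every character of the text into its toy number.
function toyEncode(text) {
  return [...text].map(letterToNumber);
}

console.log(toyEncode("HELLO")); // [8, 5, 12, 12, 15]
```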

Three Ways to Tokenize

  1. Characters → each character gets a number, such as its ASCII code (H → 72, e → 101).

  2. Words → each whole word gets a number (Hello → 1, World → 2).

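  3. Subwords → break words into small parts (Hello → "Hel" + "lo").
    AI models like ChatGPT use subwords because they strike a balance: sequences stay shorter than with characters, and the vocabulary stays smaller than with whole words while still covering words the model has never seen.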
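The three approaches can be compared side by side on the same text. This is only a rough sketch: the helper names `wordTokenize` and `subwordTokenize` and the `pieces` list are hypothetical, and real subword tokenizers learn their pieces from data instead of using a hand-written list.

```javascript
const sample = "Hello World";

// 1. Characters: one token per character, using its character code.
const charTokens = [...sample].map((ch) => ch.charCodeAt(0));

// 2. Words: one token per whitespace-separated word.
//    Ids are handed out in order of first appearance, starting at 1.
function wordTokenize(input) {
  const vocab = new Map();
  return input.split(/\s+/).filter(Boolean).map((word) => {
    if (!vocab.has(word)) vocab.set(word, vocab.size + 1);
    return vocab.get(word);
  });
}

// 3. Subwords: greedy longest-match against a hypothetical piece list.
const pieces = ["Hel", "lo", "Wor", "ld", " "];
function subwordTokenize(input, pieceList) {
  const out = [];
  let pos = 0;
  while (pos < input.length) {
    // Pick the longest piece that matches at the current position.
    let match = null;
    for (const piece of pieceList) {
      const fits = input.startsWith(piece, pos);
      if (fits && (match === null || piece.length > match.length)) match = piece;
    }
    // Fall back to a single character when no piece matches.
    if (match === null) match = input[pos];
    out.push(match);
    pos += match.length;
  }
  return out;
}

console.log(charTokens); // [72, 101, 108, 108, 111, 32, 87, 111, 114, 108, 100]
console.log(wordTokenize(sample)); // [1, 2]
console.log(subwordTokenize(sample, pieces)); // ["Hel", "lo", " ", "Wor", "ld"]
```

Eleven character tokens, two word tokens, and five subword pieces for the same eleven characters: subwords sit in between the other two.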

🛠️ Build Your Own (Character Tokenizer)

I built a character-level tokenizer that encodes and decodes printable ASCII characters.

Here’s a little JavaScript program that makes tokens for letters and symbols.

function buildVocab() {
  const vocab = {};
  const invVocab = {}; // inverse mapping: id → character
  let id = 0;

  for (let i = 32; i < 127; i++) {
    // printable ASCII (codes 32 to 126)
    vocab[String.fromCharCode(i)] = id; // convert the character code into its character
    invVocab[id] = String.fromCharCode(i);
    id++;
  }
  return { vocab, invVocab };
}

// Turn text → tokens
function encode(text, vocab) {
  // Spreading a string iterates by code point, so an emoji counts as one character.
  return [...text].map((ch) => vocab[ch] ?? -1); // -1 for an unknown character
}

// Turn tokens → text
function decode(tokens, invVocab) {
  return tokens.map((id) => invVocab[id] ?? "?").join(""); // "?" for an unknown id
}

// Example
const { vocab, invVocab } = buildVocab();
const tokens = encode("Hello! my name is Ashraful", vocab);
console.log(tokens); // starts with [40, 69, 76, 76, 79, 1, ...] for "Hello!"
console.log(decode(tokens, invVocab)); // "Hello! my name is Ashraful"

Limitations

  • This code only works for printable ASCII characters (codes 32 to 126). Emojis and letters from other scripts are encoded as -1 and decoded as "?".

  • It’s character-level → every character is a token. Real models usually use subwords, so they need fewer tokens for the same text.
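The first limitation is easy to see in a quick experiment. The snippet below rebuilds the same printable-ASCII vocabulary so it runs on its own; the names `buildAsciiVocab`, `encodeAscii`, and `decodeAscii` are just for this sketch. The test string is written with escapes so the exact characters are unambiguous.

```javascript
// Same idea as the tokenizer in this post: printable ASCII (codes 32 to 126) → ids 0 to 94.
function buildAsciiVocab() {
  const vocab = {};
  const invVocab = {};
  for (let code = 32; code < 127; code++) {
    const ch = String.fromCharCode(code);
    vocab[ch] = code - 32;
    invVocab[code - 32] = ch;
  }
  return { vocab, invVocab };
}

const ascii = buildAsciiVocab();
const encodeAscii = (text) => [...text].map((ch) => ascii.vocab[ch] ?? -1);
const decodeAscii = (tokens) => tokens.map((id) => ascii.invVocab[id] ?? "?").join("");

const input = "Hi \u{1F44B} caf\u00e9"; // "Hi 👋 café"
const ids = encodeAscii(input);

console.log(ids); // [40, 73, 0, -1, 0, 67, 65, 70, -1]
console.log(decodeAscii(ids)); // "Hi ? caf?"
```

The emoji and the accented letter both collapse to -1, so the original text cannot be recovered after decoding.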

Real AI Tokenization

Big AI models don’t use our simple character mapping.
Instead, they:

  • Split text into subwords.

  • Use dedicated tokenizer libraries like tiktoken (OpenAI's tokenizer).

  • Example: "Ashraful" might become ["Ash", "ra", "ful"].
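Where do pieces like "Ash" or "ful" come from? tiktoken is based on byte pair encoding (BPE), which repeatedly merges the most frequent pair of neighbouring symbols. The sketch below is a heavily simplified toy version that works on characters of a four-word corpus instead of bytes of a huge dataset; the function names and the corpus are invented for illustration and are not tiktoken's API.

```javascript
// Count how often each adjacent pair of symbols occurs across all words.
function countPairs(words) {
  const counts = new Map();
  for (const symbols of words) {
    for (let i = 0; i < symbols.length - 1; i++) {
      const pair = symbols[i] + "\u0000" + symbols[i + 1]; // "\u0000" separates the halves
      counts.set(pair, (counts.get(pair) ?? 0) + 1);
    }
  }
  return counts;
}

// Replace every occurrence of the pair (left, right) with one merged symbol.
function mergePair(symbols, left, right) {
  const out = [];
  for (let i = 0; i < symbols.length; i++) {
    if (i < symbols.length - 1 && symbols[i] === left && symbols[i + 1] === right) {
      out.push(left + right);
      i++; // skip the right half, it is now part of the merged symbol
    } else {
      out.push(symbols[i]);
    }
  }
  return out;
}

// Learn up to `numMerges` merges, starting from single characters.
function learnMerges(corpus, numMerges) {
  let words = corpus.map((word) => [...word]);
  const merges = [];
  for (let step = 0; step < numMerges; step++) {
    let best = null;
    let bestCount = 0;
    for (const [pair, count] of countPairs(words)) {
      if (count > bestCount) { best = pair; bestCount = count; } // ties keep the first pair seen
    }
    if (best === null || bestCount < 2) break; // nothing repeats, so stop
    const [left, right] = best.split("\u0000");
    merges.push(left + right);
    words = words.map((symbols) => mergePair(symbols, left, right));
  }
  return { merges, words };
}

const result = learnMerges(["low", "lower", "lowest", "slow"], 2);
console.log(result.merges); // ["lo", "low"]
console.log(result.words);  // [["low"], ["low","e","r"], ["low","e","s","t"], ["s","low"]]
```

After two merges, "low" has become a single piece that is shared by all four words, which is the same effect that lets a real tokenizer reuse pieces across many words.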

You can try it yourself here:
Live Tokenizer Tool - tiktokenizer