What is Tokenization?

Hello👋, I'm Ashraful Malik, a web developer and web designer. I create websites that not only look great but also function smoothly.
Imagine you have a toy box full of letters. To play with them, you give each letter a number, like this:
A → 1
B → 2
C → 3
D → 4
…
So if you write HELLO, it turns into the numbers 8, 5, 12, 12, 15.
That’s called tokenization → changing text into numbers that a computer (or AI) can understand.
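Here is a minimal sketch of that toy-box mapping in JavaScript. The helper name letterToNumber is made up for this example, and it assumes the input only contains the letters A to Z.

```javascript
// Toy mapping from the text: A → 1, B → 2, ..., Z → 26.
// letterToNumber is a hypothetical helper, not part of any library.
function letterToNumber(letter) {
  // "A" has character code 65, so subtracting 64 gives 1.
  return letter.toUpperCase().charCodeAt(0) - 64;
}

// Spread the string into single letters, then map each one to its number.
const numbers = [..."HELLO"].map(letterToNumber);
console.log(numbers); // [8, 5, 12, 12, 15]
```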
Three Ways to Tokenize
Characters → each letter gets a number (H → 72, e → 101).
Words → each whole word gets a number (Hello → 1, World → 2).
Subwords → break words into small parts (Hello → "Hel" + "lo").
AI models like ChatGPT use subwords because they strike a balance: common words stay short (often a single token), while rare or brand-new words can still be built from smaller pieces.
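To compare the three approaches side by side, here is a rough sketch. The subword vocabulary below is hand-picked just for this example (real models learn theirs from data), and splitSubwords is a hypothetical helper that always takes the longest matching piece.

```javascript
const text = "Hello World";

// 1. Character level: use each character's code point as its ID.
const charTokens = [...text].map((ch) => ch.codePointAt(0));

// 2. Word level: give each new word the next free ID, starting at 1.
const wordVocab = new Map();
const wordTokens = text.split(" ").map((word) => {
  if (!wordVocab.has(word)) wordVocab.set(word, wordVocab.size + 1);
  return wordVocab.get(word);
});

// 3. Subword level: greedy longest match against a tiny, hand-picked
//    vocabulary (an assumption for this sketch).
const subwordVocab = ["Hel", "lo", "Wor", "ld", " "];

function splitSubwords(input, pieces) {
  const out = [];
  let i = 0;
  while (i < input.length) {
    // Find the longest piece that matches at position i.
    let match = null;
    for (const piece of pieces) {
      if (input.startsWith(piece, i) && (match === null || piece.length > match.length)) {
        match = piece;
      }
    }
    // Fall back to a single character when no piece matches.
    if (match === null) match = input[i];
    out.push(match);
    i += match.length;
  }
  return out;
}

console.log(charTokens); // [72, 101, 108, 108, 111, 32, 87, 111, 114, 108, 100]
console.log(wordTokens); // [1, 2]
console.log(splitSubwords(text, subwordVocab)); // ["Hel", "lo", " ", "Wor", "ld"]
```

Eleven character tokens, two word tokens, five subword tokens: the subword version sits in between, which is the trade-off described above.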
🛠️ Build Your Own (Character Tokenizer)
I tried building one myself, using the printable ASCII characters for character-level encoding and decoding.
Here’s a little JavaScript program that makes tokens for letters and symbols.
function buildVocab() {
  const vocab = {}; // character → token ID
  const invVocab = {}; // token ID → character (inverse mapping)
  let id = 0;
  // Printable ASCII runs from code 32 (space) to 126 (~)
  for (let i = 32; i < 127; i++) {
    const ch = String.fromCharCode(i); // turn the character code into a one-character string
    vocab[ch] = id;
    invVocab[id] = ch;
    id++;
  }
  return { vocab, invVocab };
}

// Turn text → tokens
function encode(text, vocab) {
  return [...text].map((ch) => vocab[ch] ?? -1); // -1 for an unknown character
}

// Turn tokens → text
function decode(tokens, invVocab) {
  return tokens.map((id) => invVocab[id] ?? "?").join(""); // "?" for an unknown ID
}

// Example
const { vocab, invVocab } = buildVocab();
const tokens = encode("Hello! my name is Ashraful", vocab);
console.log(tokens); // [40, 69, 76, 76, 79, 1, ...] (26 tokens, one per character)
console.log(decode(tokens, invVocab)); // Hello! my name is Ashraful
Limitations
This code only works for printable ASCII characters (codes 32 to 126). Emojis, accented letters, and other scripts all encode to -1 and decode back as "?".
It’s character-level → every letter is a token. Real models usually use subwords, so they need fewer tokens.
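Both limitations are easy to see in a quick sketch. It rebuilds the same printable-ASCII mapping in a compact form so it can run on its own; the names asciiVocab and encodeAscii are made up for this example.

```javascript
// Same idea as the character tokenizer: printable ASCII (codes 32..126) → IDs 0..94.
const asciiVocab = {};
for (let code = 32; code < 127; code++) {
  asciiVocab[String.fromCharCode(code)] = code - 32;
}

// Anything outside the vocabulary becomes -1.
const encodeAscii = (text) => [...text].map((ch) => asciiVocab[ch] ?? -1);

// Limitation 1: emojis and non-ASCII characters are unknown.
console.log(encodeAscii("Hi 👋")); // [40, 73, 0, -1]
console.log(encodeAscii("¡Hola!")); // [-1, 40, 79, 76, 65, 1]

// Limitation 2: one token per character, so even a single word costs many tokens.
console.log(encodeAscii("tokenization").length); // 12
```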
Real AI Tokenization
Big AI models don’t use our simple character mapping.
Instead, they:
Split text into subwords.
Use special libraries like tiktoken.
Example: "Ashraful" might become ["Ash", "ra", "ful"].
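Where do those subword pieces come from? tiktoken is a byte-pair encoding (BPE) tokenizer, and the core idea of BPE is to keep merging the most frequent neighbouring pair of symbols. The sketch below is a toy version of that idea, not tiktoken's actual code: it works on characters instead of bytes, the four-word corpus is invented, and countPairs, mergePair and learnMerges are hypothetical helper names.

```javascript
// Count how often each adjacent pair of symbols appears across all words.
function countPairs(words) {
  const counts = new Map();
  for (const symbols of words) {
    for (let i = 0; i < symbols.length - 1; i++) {
      // Join the pair with a NUL separator so it can be split apart again later.
      const key = symbols[i] + "\u0000" + symbols[i + 1];
      counts.set(key, (counts.get(key) ?? 0) + 1);
    }
  }
  return counts;
}

// Replace every occurrence of (left, right) in one word with the merged symbol.
function mergePair(symbols, left, right) {
  const merged = [];
  let i = 0;
  while (i < symbols.length) {
    if (i < symbols.length - 1 && symbols[i] === left && symbols[i + 1] === right) {
      merged.push(left + right);
      i += 2;
    } else {
      merged.push(symbols[i]);
      i += 1;
    }
  }
  return merged;
}

// Start from single characters and apply up to numMerges merges.
function learnMerges(corpus, numMerges) {
  let words = corpus.map((word) => [...word]);
  const merges = [];
  for (let step = 0; step < numMerges; step++) {
    const counts = countPairs(words);
    if (counts.size === 0) break; // nothing left to merge
    // Pick the most frequent pair (the first one seen wins a tie).
    let best = null;
    let bestCount = 0;
    for (const [key, count] of counts) {
      if (count > bestCount) {
        best = key;
        bestCount = count;
      }
    }
    const [left, right] = best.split("\u0000");
    merges.push([left, right]);
    words = words.map((symbols) => mergePair(symbols, left, right));
  }
  return { merges, words };
}

const learned = learnMerges(["hello", "help", "helmet", "yellow"], 3);
console.log(learned.merges); // [["e", "l"], ["h", "el"], ["l", "o"]]
console.log(learned.words[0]); // ["hel", "lo"]
```

After only three merges, "hello" is already down from five character tokens to two subword tokens. Real tokenizers run many thousands of merges over huge amounts of text.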
You can try it yourself here:
Live Tokenizer Tool - tiktokenizer


