Tokenization Visualizer

How to Use the Tokenization Visualizer

Pick the model encoding

Choose o200k_base for GPT-5, GPT-4o and GPT-4.1, or cl100k_base for GPT-4, GPT-3.5 and the text-embedding models. The encoding decides exactly where the token boundaries fall.

Paste your text

Type or paste any prompt, sentence or snippet of code. The visualizer runs OpenAI's exact tokenizer entirely in your browser and redraws the chips live as you edit.

Read the coloured chips

Each chip is one token, showing the actual text the model sees. Adjacent tokens use different colours so boundaries are easy to spot — a middle dot marks a space, an arrow marks a newline.

Learn the quirks

Watch how a leading space joins a word, how rare words shatter into pieces, and how numbers and emoji fragment. Hover a chip to see its position and token id.

Tokenization: How a Model Actually Reads Your Text

From characters to tokens

A large language model never sees your text the way you do. Before a single word reaches the network, a tokenizer chops the raw string into a sequence of tokens: small, reusable chunks drawn from a fixed vocabulary (roughly 100,000 entries for cl100k_base and roughly 200,000 for o200k_base). Each token maps to an integer id, and it is that stream of integers, not the letters, that the model reads and predicts, and that you are billed for. Tokenization is the invisible first step that shapes everything downstream: how long your prompt counts as, how much it costs, and even how well the model reasons about the pieces of a word.

The dominant scheme for modern models is Byte Pair Encoding, or BPE. Training starts from individual bytes and repeatedly merges the most frequent adjacent pair into a new symbol; the resulting merge list is learned once, over a giant training corpus, and then frozen. Common words like "the", "and" or "token" survive as a single token because they appear constantly; rarer or longer words get assembled from several sub-word fragments. This is a deliberate trade-off: a vocabulary small enough to be efficient, yet expressive enough to spell out any word, name, or typo it has never seen by falling back to smaller pieces, all the way down to raw bytes if needed. Nothing is ever truly out-of-vocabulary; worst case, a string is encoded one byte at a time.
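The merge loop described above can be sketched in a few lines of Python. This is a toy illustration, not OpenAI's implementation: the function names train_bpe, apply_merge and encode are hypothetical, the corpus is a single short string, and production tokenizers add steps this sketch leaves out (for example, pre-splitting the text before any merging happens).

```python
from collections import Counter


def apply_merge(symbols, pair):
    """Replace each non-overlapping occurrence of `pair` with one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged


def train_bpe(corpus: bytes, num_merges: int):
    """Learn up to `num_merges` merge rules from `corpus` (toy byte-level BPE)."""
    # Start from individual bytes, each held as a length-1 bytes object.
    symbols = [bytes([b]) for b in corpus]
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter(zip(symbols, symbols[1:]))
        if not pair_counts:
            break
        best, count = pair_counts.most_common(1)[0]
        if count < 2:
            break  # no pair repeats, so there is nothing worth merging
        merges.append(best)
        symbols = apply_merge(symbols, best)
    return merges


def encode(text: str, merges):
    """Tokenize `text` by replaying the learned merges in training order."""
    symbols = [bytes([b]) for b in text.encode("utf-8")]
    for pair in merges:
        symbols = apply_merge(symbols, pair)
    return symbols


# A tiny stand-in for the "giant training corpus".
merges = train_bpe(b"the the the the", num_merges=10)
print(encode("the", merges))    # frequent word: one token
print(encode("zebra", merges))  # unseen word: falls back to single bytes
```

With this corpus the frequent word "the" comes out as a single token, while the unseen word "zebra" falls back to five one-byte tokens. Joining the returned pieces always reproduces the original UTF-8 bytes, which is the sense in which nothing is out-of-vocabulary.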

Spacing matters more than people expect. In these encodings a leading space is usually glued to the following word, so "token" at the start of a string and " token" mid-sentence are different tokens with different ids. That is why this visualizer marks spaces with a middle dot and newlines with an arrow: the whitespace is part of the token, not a gap between tokens, and seeing it explains a lot of otherwise baffling counts.
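As a rough sketch of how whitespace can be made visible on a chip, the snippet below swaps spaces and newlines for marker glyphs. The function name chip_label is hypothetical, the exact arrow glyph is an assumption, and the list of pieces is a hand-written example split rather than real tokenizer output.

```python
SPACE_MARK = "\u00b7"    # middle dot
NEWLINE_MARK = "\u21b5"  # downwards arrow with corner (assumed glyph)


def chip_label(piece: str) -> str:
    """Return the text of one token with its whitespace made visible."""
    return piece.replace(" ", SPACE_MARK).replace("\n", NEWLINE_MARK)


# Hand-written example split: the space travels with the word that follows it.
pieces = ["token", " token", "\n", "Hello", " world"]
labels = [chip_label(p) for p in pieces]
print(labels)
```

Because the marker is drawn inside the chip, "token" and " token" produce visibly different labels, mirroring the fact that they are different tokens.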

"You write in words. The model reads in tokens. Tokenization is the translation layer — and it is rarely as tidy as you'd guess."

Why "strawberry" splits oddly — and why it matters

The famous example is strawberry. Ask a model how many letter "r"s it contains and it often stumbles, not because it can't count, but because it never sees the individual letters. Depending on the encoding, strawberry is split into a few sub-word chunks such as str + aw + berry, so the three "r"s are scattered across token boundaries the model can't easily peer inside. Paste it into the box above and watch it fracture. The same effect makes models weak at spelling, reversing strings, and character-level arithmetic: those tasks live below the resolution of the token.
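To make the point concrete, here is a small sketch that contrasts the character view with the token view. It uses the illustrative split str + aw + berry from the paragraph above; the real split depends on the encoding, and the integer ids are made up for the example.

```python
word = "strawberry"

# Illustrative split taken from the text; real splits depend on the encoding.
pieces = ["str", "aw", "berry"]
assert "".join(pieces) == word

# Hypothetical ids: the model receives integers like these, not letters.
fake_ids = {piece: 1000 + i for i, piece in enumerate(pieces)}
model_view = [fake_ids[p] for p in pieces]

# Counting letters is trivial when you can see characters...
total_r = word.count("r")

# ...but the "r"s are spread across separate tokens.
r_per_piece = {p: p.count("r") for p in pieces}

print(model_view)
print(total_r, r_per_piece)
```

The three "r"s land in two different chunks, and nothing in the id sequence itself says how many letters each id contains.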

This has very practical consequences. Token boundaries decide your real cost, since APIs charge per token, and your real context limit, since the window is measured in tokens, not words. Non-English text, especially scripts far from the training data, can use several tokens per character and quietly blow past budgets. Code tokenizes unevenly because of symbols and indentation. Even a stray emoji can expand into many tokens. Seeing the split — rather than trusting a rough "four characters per token" rule — lets you trim prompts intelligently, design better chunking, and understand why a model behaves the way it does on the letters inside a word. This tool shows you the genuine article, computed locally from the same encoding tables the OpenAI API uses.
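When an exact count is not at hand, two quick bounds help with budgeting. The sketch below compares the "four characters per token" heuristic with the hard ceiling for a byte-level BPE, where every UTF-8 byte becomes its own token. The function names are hypothetical, and neither number replaces an exact count from the tokenizer.

```python
import math


def rough_estimate(text: str, chars_per_token: float = 4.0) -> int:
    """Heuristic token count from the 'four characters per token' rule."""
    return math.ceil(len(text) / chars_per_token)


def byte_upper_bound(text: str) -> int:
    """Worst case for byte-level BPE: one token per UTF-8 byte."""
    return len(text.encode("utf-8"))


for sample in ["hello world", "こんにちは", "👍"]:
    print(repr(sample), rough_estimate(sample), byte_upper_bound(sample))
```

For plain English the heuristic is often close, but for the Japanese sample it suggests 2 tokens while the ceiling is 15, and for a single emoji it suggests 1 against a ceiling of 4. The true count lies somewhere at or below the ceiling, which is why looking at the actual split is more reliable than the rule of thumb.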

10 Facts About Tokenization

01

Models read tokens, not letters or words — a tokenizer splits your text before the model ever sees it.

02

The dominant method is Byte Pair Encoding (BPE), which merges frequent adjacent byte pairs into reusable chunks.

03

Common words are a single token; rare or long words shatter into several sub-word pieces.

04

A leading space is part of the token — "token" and " token" get different ids.

05

strawberry often splits into pieces like str + aw + berry, which is why models miscount its "r"s.

06

Nothing is truly out-of-vocabulary — worst case, text falls back to one byte per token.

07

GPT-5, GPT-4o and GPT-4.1 use o200k_base; GPT-4 and GPT-3.5 use cl100k_base.

08

Non-English text and code often use more tokens per character, quietly raising cost and context use.

09

Claude, Gemini and Llama use different tokenizers, so they split the same text differently.

10

This visualizer runs entirely in your browser — your text never touches a server or model.

Frequently Asked Questions

What is tokenization?

Tokenization is the step that splits your raw text into tokens (small chunks from a fixed vocabulary) before a language model reads it. The model processes the resulting sequence of token ids, not the original letters. Tokens are also the unit you are billed in and the unit that fills the context window.

What is Byte Pair Encoding (BPE)?

BPE is the algorithm behind most modern tokenizers. It starts from individual bytes and repeatedly merges the most frequent adjacent pair into a new symbol, with the merges learned over a huge corpus. The result keeps common words whole while breaking rare words into reusable sub-word pieces: efficient, yet able to spell out anything.

Why does strawberry split into pieces?

Because BPE merges by frequency, a word like strawberry is stored as a few common sub-word chunks rather than one entry. The model sees those chunks, not the individual letters, which is why it struggles to count the "r"s. Paste it above to watch it fracture into pieces.

Which encoding should I choose?

Use o200k_base for GPT-5, GPT-4o and GPT-4.1. Use cl100k_base for GPT-4, GPT-3.5-Turbo and the text-embedding models. The same text can split differently under each, so match the encoding to the model you'll actually call.

Why are spaces shown as a middle dot?

In these encodings a leading space is usually glued to the word that follows it, so it's part of the token rather than a gap between tokens. We render it as a middle dot (and newlines as an arrow) so you can see exactly where the whitespace lives inside each chip.

Are the token counts exact?

Yes, for OpenAI models. The visualizer runs OpenAI's own tiktoken encodings (cl100k_base and o200k_base) in your browser, the same tables the API uses, so both the count and the boundaries are exact. Other vendors use different tokenizers, so their counts will differ.

Do Claude, Gemini and Llama split text the same way?

No. Anthropic's Claude, Google's Gemini and Meta's Llama each use their own vocabulary and merge rules, so the same sentence can produce a different number of tokens and different boundaries. Use this tool for OpenAI models and the provider's own tools for an exact count elsewhere.

Is my text uploaded anywhere?

No. The tokenizer runs entirely in your browser using a locally-served library. Your text is never uploaded to any server, model, or third party, and nothing is stored. The only network request is your browser fetching the tokenizer file from our own domain.

Why are only some chips drawn for long text?

To keep the page responsive we draw the first 2,000 token chips and summarise the rest as "+N more". The headline token count still reflects every token in your text; only the visual chips are capped.

Is the visualizer free?

Completely free, with no account or sign-up, and no limit on use. It runs in your browser and collects no data.
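The chip cap mentioned in the FAQ above (draw the first 2,000 chips, summarise the rest, keep the full count) can be sketched as follows. This is a Python stand-in for logic that actually runs in the browser; the function name chips_to_draw and its return shape are assumptions made for illustration.

```python
MAX_CHIPS = 2000  # cap stated in the FAQ


def chips_to_draw(pieces, max_chips=MAX_CHIPS):
    """Return (chips to render, summary label, full token count)."""
    shown = pieces[:max_chips]
    hidden = len(pieces) - len(shown)
    # Only the drawing is capped; the headline count covers every token.
    summary = f"+{hidden} more" if hidden > 0 else ""
    return shown, summary, len(pieces)


shown, summary, total = chips_to_draw(["x"] * 2500)
print(len(shown), summary, total)
```

Keeping the total separate from the rendered slice is what lets the headline count stay exact even when the chip view is truncated.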
