Tokenizer

A tokenizer is the component that turns raw text into tokens (subwords, sometimes whole words or single characters), the discrete units an LLM processes. The reverse step runs at the end: detokenization converts the model's output tokens back into readable text.

Main algorithms and implementations:
(1) **BPE (Byte-Pair Encoding)**: used by the GPT family and RoBERTa. Iteratively merges the most frequent pair of adjacent characters or subwords until the target vocabulary size is reached (e.g. 50k tokens).
(2) **WordPiece**: used by BERT and DistilBERT. A variation of BPE that scores candidate merges differently (by likelihood gain rather than raw pair frequency).
(3) **SentencePiece**: used by T5, LLaMA, Gemini. Language-agnostic: it works on the raw character stream, so no whitespace pre-tokenization is required, which suits languages such as Chinese and Japanese that do not separate words with spaces. Supports both BPE and Unigram models.
(4) **tiktoken**: OpenAI's optimized BPE tokenizer (Rust core with Python bindings). Its encodings include cl100k_base (GPT-3.5, GPT-4) and o200k_base (GPT-4o).
(5) **Tokenizers** library (Hugging Face): fast Rust implementations of BPE, WordPiece, and Unigram.
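To make the BPE merge loop from item (1) concrete, here is a minimal character-level sketch. It is a toy illustration, not how any production tokenizer is implemented: real tokenizers typically work on bytes, apply pre-tokenization rules, and are heavily optimized. The function names (`learn_bpe_merges`, `merge_pair`, `encode`) are made up for this example.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Fuse every adjacent occurrence of `pair` in a tuple of symbols."""
    out = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe_merges(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words."""
    # Each distinct word becomes a tuple of single characters, weighted by frequency.
    vocab = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair across the whole corpus.
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        # Pick the most frequent pair (ties go to the pair seen first).
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite the corpus with that pair fused into one symbol.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def encode(word, merges):
    """Tokenize a word by replaying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


corpus = ["low"] * 5 + ["lower"] * 2 + ["newest"] * 6 + ["widest"] * 3
merges = learn_bpe_merges(corpus, num_merges=2)
print(merges)                    # [('e', 's'), ('es', 't')]
print(encode("newest", merges))  # ['n', 'e', 'w', 'est']
```

Each merge adds one entry to the vocabulary, so the target vocabulary size is simply the number of base symbols plus the number of merges performed.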

Rule-of-thumb token-to-word ratios: (1) English: about 1.3 tokens per word (roughly 4 characters per token on average); (2) French/Spanish: about 1.5-2 tokens per word; (3) Chinese, Japanese, Korean: more tokens per character, because older tokenizers were less optimized for these scripts; newer models (Gemini, GPT-4o) do much better; (4) code: generally denser in tokens than prose, since punctuation, indentation, and identifiers often split into several tokens each (the count per line depends heavily on the language and the tokenizer).
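The ratios above can be turned into a quick back-of-the-envelope estimator when no real tokenizer is at hand. This is only a sketch: the `TOKENS_PER_WORD` table and `estimate_tokens` function are hypothetical names, the numbers are the rough figures quoted above, and whitespace splitting is a crude word count that does not work for languages written without spaces.

```python
# (low, high) tokens-per-word ratios, taken from the rule of thumb above.
TOKENS_PER_WORD = {
    "english": (1.3, 1.3),
    "french": (1.5, 2.0),
    "spanish": (1.5, 2.0),
}


def estimate_tokens(text, language="english"):
    """Return a (low, high) token estimate based on a whitespace word count."""
    low, high = TOKENS_PER_WORD[language]
    words = len(text.split())
    return round(words * low), round(words * high)


sample = "one two three four five six seven eight nine ten"
print(estimate_tokens(sample, "english"))  # (13, 13)
print(estimate_tokens(sample, "french"))   # (15, 20)
```

For billing or context-limit decisions, an estimate like this should be replaced by an exact count from the model's own tokenizer.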

Example of GPT-4 tokenization (cl100k_base encoding): "Hello, world!" = 4 tokens: ["Hello", ",", " world", "!"]. Counting tokens is crucial for: (1) cost estimation (LLM APIs bill per token); (2) fitting within the context window; (3) prompt optimization.
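Once token counts are known, the first two uses are simple arithmetic. The sketch below assumes per-million-token pricing with separate input and output rates, which is a common billing scheme; the function names, prices, and context window size are placeholders, not real figures for any provider.

```python
def estimate_cost(input_tokens, output_tokens,
                  price_in_per_million, price_out_per_million):
    """Cost of one request, given prices expressed per million tokens."""
    return (input_tokens * price_in_per_million
            + output_tokens * price_out_per_million) / 1_000_000


def fits_context(prompt_tokens, max_output_tokens, context_window):
    """True if the prompt plus the reserved output budget fits in the window."""
    return prompt_tokens + max_output_tokens <= context_window


# Placeholder numbers for illustration only.
cost = estimate_cost(input_tokens=12_000, output_tokens=800,
                     price_in_per_million=2.0, price_out_per_million=8.0)
print(f"{cost:.4f}")
print(fits_context(prompt_tokens=12_000, max_output_tokens=800,
                   context_window=8_192))
```

Reserving the output budget up front matters because the context window generally has to hold the prompt and the generated tokens together.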

Counting tools: (1) **tiktoken** (Python: `tiktoken.encoding_for_model("gpt-4o")`); (2) **OpenAI tokenizer playground** (platform.openai.com/tokenizer); (3) **Anthropic token counting API endpoint** (`count_tokens`); (4) **Hugging Face tokenizers** library.
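As an illustration of the first tool, here is a small wrapper around tiktoken. It assumes the `tiktoken` package is installed (`pip install tiktoken`) and that the encoding file can be downloaded and cached on first use. The helper names (`load_encoding`, `count_tokens`, `show_tokens`) are made up for this sketch; only `encoding_for_model`, `encode`, and `decode` come from the library.

```python
def load_encoding(model="gpt-4o"):
    """Return the tiktoken encoding associated with an OpenAI model name."""
    import tiktoken  # imported lazily so the helpers below work without it
    return tiktoken.encoding_for_model(model)


def count_tokens(text, encoding):
    """Count tokens using any object that exposes `encode(text) -> list`."""
    return len(encoding.encode(text))


def show_tokens(text, encoding):
    """Return the text piece behind each token id, for inspection.

    Pieces of multi-byte characters split across tokens may not decode
    cleanly on their own.
    """
    return [encoding.decode([token_id]) for token_id in encoding.encode(text)]
```

Typical usage would be `enc = load_encoding("gpt-4o")` followed by `count_tokens("Hello, world!", enc)`. Because the helpers only rely on `encode` and `decode`, a different tokenizer object with the same two methods could be passed in instead.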

Why it matters: (1) **multilingual support**: good tokenization of non-English languages is key to both quality and cost equity; (2) **code understanding**: code-friendly tokenizers (DeepSeek-Coder, CodeLlama) handle indentation and syntax better; (3) **special tokens**: chat formatting (<|im_start|>, <|im_end|> for ChatML), tool-calling tokens, image tokens. Related certification skills: AI-102, PMLE.
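To show what the ChatML special tokens look like in practice, the sketch below renders a list of messages into a ChatML-style prompt string. The `to_chatml` helper and its `add_generation_prompt` parameter are hypothetical names for this example, and the exact template (whitespace, system prompt handling, tool tokens) varies between models, so in real use the model's own chat template should be preferred.

```python
IM_START = "<|im_start|>"
IM_END = "<|im_end|>"


def to_chatml(messages, add_generation_prompt=True):
    """Render [{'role': ..., 'content': ...}, ...] as a ChatML-style string."""
    parts = [
        f"{IM_START}{message['role']}\n{message['content']}{IM_END}\n"
        for message in messages
    ]
    if add_generation_prompt:
        # Open an assistant turn so the model continues from here.
        parts.append(f"{IM_START}assistant\n")
    return "".join(parts)


prompt = to_chatml([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"},
])
print(prompt)
```

The tokenizer maps each special marker to a single reserved token id, which is how the model can tell turn boundaries apart from ordinary text.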
