Tokenization in LLMs

Definition

Tokenization is the process of splitting text into smaller units (tokens) such as words, subwords, or characters, which serve as inputs to language models.
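
As a quick illustration (a minimal sketch in plain Python, not any model's actual tokenizer), the same sentence can be split at the word or character level; subword tokenization requires a learned vocabulary and is illustrated later in this entry.

    # Illustrative only: naive word- and character-level splits in pure Python.
    text = "Tokenization splits text into tokens."

    word_tokens = text.split()   # word-level: split on whitespace
    char_tokens = list(text)     # character-level: one token per character

    print(word_tokens)      # ['Tokenization', 'splits', 'text', 'into', 'tokens.']
    print(char_tokens[:8])  # ['T', 'o', 'k', 'e', 'n', 'i', 'z', 'a']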

Purpose

Tokenization standardizes raw text into a finite set of units that can be mapped to numerical IDs, making text manageable for both training and inference in LLMs.

Importance

  • A fundamental preprocessing step in NLP pipelines.
  • Determines vocabulary size and sequence length, and therefore memory and compute cost.
  • Tokenization choices affect accuracy, especially on rare words, code, and non-English text.
  • Each token ID indexes an embedding vector, so the tokenizer directly shapes the model's input representation during training.

How It Works

  1. Choose a tokenization scheme (word, subword, or character level).
  2. Apply the tokenizer to the input text.
  3. Map each token to a numerical ID from the vocabulary.
  4. Feed the token IDs into the model for processing.
  5. Convert output token IDs back into text; a minimal end-to-end sketch follows this list.
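
A minimal sketch of steps 2 through 5, assuming the Hugging Face transformers library is installed and using bert-base-uncased purely as an example vocabulary (any pretrained tokenizer would work the same way):

    # Sketch of steps 2-5 with a pretrained WordPiece tokenizer.
    # Assumes: pip install transformers
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # example choice

    text = "Tokenization in LLMs"
    tokens = tokenizer.tokenize(text)                # step 2: text -> tokens
    ids = tokenizer.convert_tokens_to_ids(tokens)    # step 3: tokens -> numerical IDs
    # step 4: the IDs (plus special tokens) would be fed to the model here
    decoded = tokenizer.decode(ids)                  # step 5: IDs -> text

    print(tokens)    # subword pieces, e.g. ['token', '##ization', ...]
    print(ids)       # the corresponding vocabulary indices
    print(decoded)   # reconstructed text (lowercased by this tokenizer)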

Examples (Real World)

  • Byte Pair Encoding (BPE), used in the GPT family of models; a toy merge loop is sketched after this list.
  • WordPiece, used in BERT.
  • SentencePiece, used widely in multilingual NLP (e.g., T5 and mT5).
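
To make the BPE entry concrete, here is a toy merge loop in pure Python following the algorithm in Sennrich et al.; the word frequencies are invented for illustration, and production tokenizers (such as GPT's byte-level BPE) add many refinements on top of this core idea.

    # Toy BPE training loop: repeatedly merge the most frequent adjacent symbol pair.
    from collections import Counter

    def get_pair_counts(vocab):
        """Count adjacent symbol pairs across all words, weighted by word frequency."""
        pairs = Counter()
        for word, freq in vocab.items():
            symbols = word.split()
            for i in range(len(symbols) - 1):
                pairs[(symbols[i], symbols[i + 1])] += freq
        return pairs

    def merge_pair(pair, vocab):
        """Merge every adjacent occurrence of `pair` into a single new symbol."""
        new_vocab = {}
        for word, freq in vocab.items():
            symbols = word.split()
            merged, i = [], 0
            while i < len(symbols):
                if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[" ".join(merged)] = freq
        return new_vocab

    # Words written as space-separated characters with an end-of-word marker.
    vocab = {"l o w </w>": 5, "l o w e r </w>": 2,
             "n e w e s t </w>": 6, "w i d e s t </w>": 3}

    for step in range(10):  # the number of merges sets the subword vocabulary budget
        pairs = get_pair_counts(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        vocab = merge_pair(best, vocab)
        print(f"merge {step + 1}: {best}")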

References / Further Reading

  • Sennrich, R., Haddow, B., & Birch, A. (2016). "Neural Machine Translation of Rare Words with Subword Units." ACL.
  • Google SentencePiece documentation: https://github.com/google/sentencepiece
  • Jurafsky, D., & Martin, J. H. Speech and Language Processing.