Pre-training

Definition

Pre-training is the initial training of a machine learning model on large general-purpose datasets before fine-tuning on specific tasks.

Purpose

The purpose is to provide models with broad representations that transfer to multiple tasks, reducing data and compute requirements for downstream adaptation.

Importance

  • Foundation for modern LLMs and vision models.
  • Improves performance across diverse tasks.
  • Costly in terms of data and computation.
  • Requires careful dataset curation to avoid bias.

How It Works

  1. Collect massive general datasets (text, images).
  2. Define unsupervised or self-supervised learning tasks.
  3. Train models to learn general features.
  4. Save pre-trained weights for reuse.
  5. Fine-tune on smaller task-specific datasets (a minimal end-to-end sketch follows this list).
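
The sketch below walks through the five steps on toy data, assuming PyTorch is installed. The TinyLM model, the random token batches, and the file name pretrained.pt are illustrative placeholders for a real architecture, corpus, and checkpoint store, not any specific production recipe.

    import torch
    import torch.nn as nn

    VOCAB, DIM, SEQ = 100, 32, 16

    class TinyLM(nn.Module):
        def __init__(self):
            super().__init__()
            self.embed = nn.Embedding(VOCAB, DIM)
            self.encoder = nn.GRU(DIM, DIM, batch_first=True)  # stand-in for a Transformer stack
            self.lm_head = nn.Linear(DIM, VOCAB)                # predicts the next token

        def forward(self, tokens):
            hidden, _ = self.encoder(self.embed(tokens))
            return self.lm_head(hidden), hidden

    # Steps 1-3: self-supervised pre-training by predicting the next token on unlabeled text.
    model = TinyLM()
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    for step in range(100):
        batch = torch.randint(0, VOCAB, (8, SEQ))               # placeholder for real corpus batches
        logits, _ = model(batch[:, :-1])
        loss = nn.functional.cross_entropy(logits.reshape(-1, VOCAB), batch[:, 1:].reshape(-1))
        opt.zero_grad(); loss.backward(); opt.step()

    # Step 4: save the pre-trained weights for reuse.
    torch.save(model.state_dict(), "pretrained.pt")

    # Step 5: reload the weights and fine-tune a small task head on labeled data.
    model.load_state_dict(torch.load("pretrained.pt"))
    head = nn.Linear(DIM, 2)                                    # e.g. a binary sentiment label
    opt = torch.optim.Adam(list(model.parameters()) + list(head.parameters()), lr=1e-4)
    for step in range(20):
        batch = torch.randint(0, VOCAB, (8, SEQ))               # placeholder for labeled task data
        labels = torch.randint(0, 2, (8,))
        _, hidden = model(batch)
        loss = nn.functional.cross_entropy(head(hidden[:, -1]), labels)
        opt.zero_grad(); loss.backward(); opt.step()

The same weights serve two objectives: the expensive, general next-token loss runs once, while the cheap task head can be retrained for each downstream dataset.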

Examples (Real World)

  • BERT pre-trained on English Wikipedia and BooksCorpus (loaded in the sketch after this list).
  • CLIP trained on image–text pairs.
  • GPT models pre-trained on large-scale internet text.
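
For models like BERT, the pre-trained weights can be reused directly. The brief sketch below loads the published bert-base-uncased checkpoint with the Hugging Face transformers library (assuming it is installed) and attaches an untrained classification head ready for fine-tuning.

    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    # Load the published pre-trained weights; only the 2-label head starts untrained.
    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

    inputs = tokenizer("Pre-training pays off downstream.", return_tensors="pt")
    logits = model(**inputs).logits   # ready for task-specific fine-tuning

Fine-tuning then proceeds as in the earlier sketch, but starting from representations learned on Wikipedia and BooksCorpus rather than from random weights.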

References / Further Reading

  • Devlin et al. “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.” NAACL 2019.
  • Brown et al. “Language Models are Few-Shot Learners.” NeurIPS 2020.
  • OpenAI. “GPT-4 Technical Report.” 2023.