05 - Pretrain-Finetune Workflow

Code: code/05-pretrain-finetune.py

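Overview

Transfer learning via pretrain-finetune is the dominant paradigm in modern NLP and computer vision. A model pretrained on a large unlabeled corpus is given a new task head and then trained briefly on labeled task data: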

┌──────────────────────────────────────────────────────────┐
│                 Pretrain-Finetune Pipeline                │
├──────────────────────────────────────────────────────────┤
│  ① Pretrained Model (DistilBERT)                         │
│     ├── Self-supervised on massive corpus (Wikipedia)    │
│     └── General language understanding                    │
│                                                          │
│  ② Add Task-Specific Head                                │
│     ├── Classification head (random init)                │
│     └── Replaces the original pretraining head           │
│                                                          │
│  ③ Finetune on Downstream Task                           │
│     ├── IMDB sentiment (or your custom data)             │
│     ├── Small learning rate (2e-5)                       │
│     └── Few epochs (1-5)                                 │
│                                                          │
│  ④ Evaluate & Deploy                                     │
│     ├── Accuracy, F1, Precision, Recall                  │
│     └── Inference pipeline                               │
└──────────────────────────────────────────────────────────┘

Why It Works

| Concept | Explanation |
|---|---|
| Transfer learning | General knowledge learned during pretraining carries over to the new task |
| Task adaptation | The randomly initialized head learns the task from scratch; the pretrained body needs only small adjustments |
| Catastrophic forgetting | A low learning rate and few epochs keep the weights close to their pretrained values |

Step-by-Step Code Walkthrough

Step 1: Load Dataset

```python
from datasets import load_dataset

dataset = load_dataset("stanfordnlp/imdb", "plain_text")
```

  • Loads 25k train + 25k test IMDB reviews (balanced binary sentiment)
  • The demo uses a small subset (200 train, 50 test) to keep the run short

Each sample: {"text": "This movie was...", "label": 0 or 1}
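
Which 200 training reviews end up in the subset matters. The following is a minimal sketch, assuming samples are dicts with a "label" key as shown above, and assuming the split may be stored grouped by label (worth checking on the real data). In that case the first rows can all belong to one class, while sampling per class keeps the subset balanced. The helper `balanced_subset` and the toy data are hypothetical and not taken from the course script.

```python
import random
from collections import Counter

def balanced_subset(samples, per_class, seed=42):
    """Pick `per_class` examples of each label, then shuffle the result."""
    rng = random.Random(seed)
    by_label = {}
    for sample in samples:
        by_label.setdefault(sample["label"], []).append(sample)
    subset = []
    for label in sorted(by_label):
        subset.extend(rng.sample(by_label[label], per_class))
    rng.shuffle(subset)
    return subset

# Toy stand-in for a label-sorted split: 500 negatives followed by 500 positives.
toy_train = [{"text": f"review {i}", "label": 0 if i < 500 else 1} for i in range(1000)]

naive = toy_train[:200]                                # first 200 rows only
balanced = balanced_subset(toy_train, per_class=100)   # 100 rows per class

naive_counts = Counter(s["label"] for s in naive)
balanced_counts = Counter(s["label"] for s in balanced)
print("naive:", dict(naive_counts))
print("balanced:", dict(balanced_counts))
```

With the datasets library, `dataset["train"].shuffle(seed=42).select(range(200))` gives a random subset in one line, though it is only approximately balanced.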

Step 2: Tokenize

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

def tokenize_fn(examples):
    return tokenizer(examples["text"], truncation=True,
                     padding="max_length", max_length=256)
```

  • tokenizer: converts text → input_ids + attention_mask
  • [CLS] wonderful film [SEP] → [101, 2910, 2616, 102, 0, ..., 0], where 101 and 102 are [CLS] and [SEP], 0 is padding, and the middle IDs are illustrative
  • Every review is padded or truncated to 256 tokens so that batches have a uniform shape
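
To make the padding and truncation step concrete, here is a small pure-Python sketch of what happens to a list of token IDs. It is not the tokenizer's actual implementation: `pad_or_truncate` is a hypothetical helper, and the word-piece IDs are placeholders. Only the special-token IDs (0 for padding, 101 for [CLS], 102 for [SEP]) follow the BERT uncased vocabulary.

```python
PAD_ID, CLS_ID, SEP_ID = 0, 101, 102  # special-token IDs in BERT-style uncased vocabularies

def pad_or_truncate(token_ids, max_length):
    """Wrap token_ids in [CLS]/[SEP], then truncate or pad to max_length."""
    body = token_ids[: max_length - 2]           # leave room for [CLS] and [SEP]
    input_ids = [CLS_ID] + body + [SEP_ID]
    attention_mask = [1] * len(input_ids)        # 1 = real token
    padding = max_length - len(input_ids)
    return {
        "input_ids": input_ids + [PAD_ID] * padding,
        "attention_mask": attention_mask + [0] * padding,  # 0 = padding, ignored by attention
    }

short_review = pad_or_truncate([2910, 2616], max_length=8)
long_review = pad_or_truncate(list(range(1000, 1020)), max_length=8)
print(short_review)
print(long_review)
```

The short input is padded with zeros and its mask marks only the first four positions; the long input loses its tail but keeps the closing [SEP].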

Step 3: Load Pretrained Model

```python
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2
)
```

Key insight: the checkpoint's original pretraining head (the masked-language-modeling head of DistilBertForMaskedLM) is dropped and replaced with a randomly initialized classification head.

Pretrained Body (DistilBERT)   → frozen-ish (low LR keeps it close)
    └── 6 transformer layers, 66M params  ← general language knowledge

New Head (Classifier)           → actively learning
    ├── pre_classifier + classifier
    └── 2 output neurons (positive / negative)

Load report:

| Key | Status | Meaning |
|---|---|---|
| classifier.weight | MISSING | Not in the checkpoint; randomly initialized for the new task |
| vocab_transform.weight | UNEXPECTED | Belongs to the MLM head; discarded |
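
A quick back-of-the-envelope calculation shows how small the new head is next to the pretrained body. The sketch assumes a hidden size of 768 for distilbert-base-uncased and a head made of two fully connected layers (pre_classifier: 768 → 768, classifier: 768 → 2); the 66M body figure is the approximate number quoted above.

```python
def linear_params(in_features, out_features):
    """Weights plus biases of a fully connected layer."""
    return in_features * out_features + out_features

HIDDEN = 768               # assumed hidden size of distilbert-base-uncased
NUM_LABELS = 2
BODY_PARAMS = 66_000_000   # approximate figure quoted above

head_params = linear_params(HIDDEN, HIDDEN) + linear_params(HIDDEN, NUM_LABELS)
head_share = head_params / (BODY_PARAMS + head_params)
print(f"head parameters: {head_params:,}")
print(f"share of all parameters: {head_share:.2%}")
```

Under these assumptions the randomly initialized part is under 1% of the model, which is why a short finetuning run on the full dataset can be enough.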

Step 4: TrainingArguments & Trainer

```python
from transformers import TrainingArguments, Trainer

args = TrainingArguments(
    output_dir="./finetune-output",
    num_train_epochs=1,
    per_device_train_batch_size=8,
    learning_rate=2e-5,
    weight_decay=0.01,
    warmup_ratio=0.1,
    eval_strategy="steps",
    eval_steps=50,
    save_strategy="steps",
    save_steps=50,
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
)
```

Why these values?

| Parameter | Reason |
|---|---|
| learning_rate=2e-5 | Small steps keep the pretrained weights from being overwritten (catastrophic forgetting) |
| weight_decay=0.01 | Decoupled weight decay, an L2-style penalty that limits overfitting on small data |
| warmup_ratio=0.1 | Ramps the learning rate up over the first 10% of steps for stable early training |
| per_device_train_batch_size=8 | Small enough to fit in memory for CPU training |
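
The effect of warmup_ratio is easier to see with numbers. The sketch below assumes a linear warmup followed by a linear decay to zero (the Trainer's default "linear" scheduler type) and 25 total steps, i.e. 200 samples at batch size 8 for one epoch. The function name is hypothetical and the code is a re-derivation, not the library's implementation.

```python
import math

def linear_warmup_decay_lr(step, total_steps, warmup_ratio, base_lr):
    """Learning rate after `step` scheduler updates: linear ramp up, then linear decay to 0."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = (total_steps - step) / max(1, total_steps - warmup_steps)
    return base_lr * max(0.0, remaining)

TOTAL_STEPS = 25  # assumed: 200 samples / batch size 8, 1 epoch
for step in (0, 1, 3, 9, 19, 25):
    lr = linear_warmup_decay_lr(step, TOTAL_STEPS, warmup_ratio=0.1, base_lr=2e-5)
    print(f"step {step:2d}: lr = {lr:.3e}")
```

Steps 9 and 19 give 1.455e-05 and 5.455e-06, the same values logged at epochs 0.4 and 0.8 in the console output below. That suggests this schedule is the one in use, with the logged value lagging the step counter by one.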

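Step 5: Metrics

```python
import numpy as np

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    preds = np.argmax(predictions, axis=-1)            # logits → predicted class
    tp = np.sum((preds == 1) & (labels == 1))          # true positives
    fp = np.sum((preds == 1) & (labels == 0))          # false positives
    fn = np.sum((preds == 0) & (labels == 1))          # false negatives
    precision = tp / max(tp + fp, 1)                   # max(...) guards against 0/0
    recall = tp / max(tp + fn, 1)
    f1 = 2 * precision * recall / max(precision + recall, 1e-12)
    return {"accuracy": (preds == labels).mean(), "f1": f1, "precision": precision, "recall": recall}
```

  • Accuracy: share of all predictions that are correct
  • F1 score: harmonic mean of precision and recall (more informative than accuracy on imbalanced data)
  • Precision: TP / (TP + FP), the share of predicted positives that are truly positive
  • Recall: TP / (TP + FN), the share of true positives that are found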

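Step 6-7: Evaluate & Inference

```python
from transformers import pipeline

classifier = pipeline("text-classification", model=model, tokenizer=tokenizer)
result = classifier("This movie was fantastic!")
```

The trainer evaluates every eval_steps steps and, with load_best_model_at_end=True, restores the checkpoint with the best accuracy when training ends. In the demo run the progress bar shows only 25 steps, fewer than eval_steps=50, so no mid-training evaluation fires and the reported metrics come from the evaluation after training. The finetuned model is then passed to a pipeline to run predictions on new text.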
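
The confidence a text-classification pipeline reports is the softmax probability of the winning class. The pure-Python sketch below shows that step in isolation; the logit values and the `ID2LABEL` mapping are illustrative assumptions (IMDB encodes 0 as negative and 1 as positive), not output from the actual model.

```python
import math

ID2LABEL = {0: "NEGATIVE", 1: "POSITIVE"}  # hypothetical mapping for the two output neurons

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def classify(logits):
    """Return (label, confidence) for one pair of output logits."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return ID2LABEL[best], probs[best]

for logits in ([-0.10, 0.09], [-2.5, 3.0]):
    label, confidence = classify(logits)
    print(f"logits={logits} → {label} (confidence: {confidence:.4f})")
```

A logit gap of only 0.19 gives a confidence of about 0.547, the level seen in the demo output, which is barely above a coin flip. A well-trained model produces much larger gaps.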


Actual Console Output

```text
09:23:39 | INFO | Starting finetuning ... 开始微调 ...
09:23:39 | INFO | ============================================================
  0%|          | 0/25 [00:00<?, ?it/s]
  4%|▍         | 1/25 [00:03<01:12,  3.02s/it]
  ...
100%|██████████| 25/25 [01:05<00:00,  3.09s/it]

{'loss': '0.6907', 'grad_norm': '1.56', 'learning_rate': '1.455e-05', 'epoch': '0.4'}
{'loss': '0.6928', 'grad_norm': '1.67', 'learning_rate': '5.455e-06', 'epoch': '0.8'}

Training completed. Training metrics:
  train_runtime           = 74.46 seconds
  train_samples_per_second= 2.686
  train_loss              = 0.6918

Evaluation results / 评估结果:
  eval_loss        = 0.6904
  eval_accuracy    = 0.5200
  eval_f1          = 0.6757
  eval_precision   = 0.5102
  eval_recall      = 1.0000

Inference / 推理:
  Text: This movie was absolutely fantastic! The acting was superb...
  → POSITIVE 😊 (confidence: 0.5474)

  Text: What a waste of time. The plot made no sense...
  → POSITIVE 😊 (confidence: 0.5251)

  Text: A masterpiece of modern cinema. Brilliant direction...
  → POSITIVE 😊 (confidence: 0.5242)
```

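Why the low accuracy?

  • Only 200 training samples (vs 25k in full IMDB), far too few for the new head to converge
  • Only 1 epoch (25 optimizer steps); 3-5 epochs are typically needed
  • The loss stays near 0.69 ≈ ln 2, the value for a two-class model that has not learned anything yet, and recall of 1.0 with precision near 0.51 means nearly every review is predicted positive
  • CPU training limits batch size and speed, which is why the demo keeps the data and epoch count so small
  • The demo is designed for speed, not SOTA accuracy

With the full IMDB training set (25k) and 3 epochs, expect roughly 90%+ accuracy.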
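
The reported evaluation numbers can be reproduced from a single confusion matrix, which makes the failure mode explicit. The counts below are an inference, not something the log prints: they assume the 50-review test subset holds 25 positives and that the model labels every review but one as positive.

```python
def binary_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {"accuracy": accuracy, "f1": f1, "precision": precision, "recall": recall}

# Hypothetical counts: 25 positives all caught, 24 of 25 negatives mislabeled as positive.
metrics = binary_metrics(tp=25, fp=24, fn=0, tn=1)
for name, value in metrics.items():
    print(f"eval_{name:9s} = {value:.4f}")
```

These counts yield accuracy 0.5200, F1 0.6757, precision 0.5102 and recall 1.0000, matching the log. The F1 of 0.68 therefore reflects a model that says "positive" almost unconditionally, not one that has learned sentiment.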


How to Customize

Change Dataset

```python
from datasets import load_dataset

# Custom CSV: the file must have 'text' and 'label' columns
dataset = load_dataset("csv", data_files={"train": "my_train.csv"})

# Or another sentiment dataset from the Hugging Face Hub
dataset = load_dataset("rotten_tomatoes")
```
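
Before pointing the script at a custom CSV, it can help to check that the file really has the expected columns and label values. The following is a minimal standard-library sketch; `validate_rows` is a hypothetical helper, and the rule that labels must be "0" or "1" is an assumption based on the binary IMDB setup.

```python
import csv
import io

REQUIRED_COLUMNS = {"text", "label"}
VALID_LABELS = {"0", "1"}  # assumed binary labels, as in IMDB

def validate_rows(csv_file):
    """Return a list of problems found in an open CSV file; empty means it looks usable."""
    reader = csv.DictReader(csv_file)
    missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
    if missing:
        return [f"missing columns: {sorted(missing)}"]
    problems = []
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        if not (row["text"] or "").strip():
            problems.append(f"line {line_no}: empty text")
        if row["label"] not in VALID_LABELS:
            problems.append(f"line {line_no}: unexpected label {row['label']!r}")
    return problems

good = io.StringIO('text,label\n"Great film, loved it",1\n"Dull and far too long",0\n')
bad = io.StringIO("text,label\n,1\nFine,maybe\n")
good_problems = validate_rows(good)
bad_problems = validate_rows(bad)
print(good_problems)
print(bad_problems)
```

For a real file, pass `open("my_train.csv", newline="", encoding="utf-8")` instead of the in-memory examples.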

Change Model

```python
# Pick one; the alternatives are commented out.
MODEL_NAME = "bert-base-uncased"         # 110M params, better accuracy
# MODEL_NAME = "roberta-base"            # 125M params, more robust pretraining recipe
# MODEL_NAME = "albert-base-v2"          # 12M params (shared layers): smaller, not necessarily faster
# MODEL_NAME = "google/bert_uncased_L-2_H-128_A-2"  # tiny BERT, 4M params
```

Key Training Configuration

| Config | Suggested change |
|---|---|
| num_train_epochs | 3-5 for convergence, 1 for a quick demo |
| learning_rate | Common starting points: 2e-5 (BERT), 3e-5 (RoBERTa), 5e-5 (DistilBERT) |
| per_device_train_batch_size | Largest that fits in memory (CPU: 8-16, GPU: 32-64) |
| max_length | 128 (faster), 256 (balanced), 512 (the model's maximum context) |
| weight_decay | 0.01 (used in this demo; the TrainingArguments default is 0.0); try up to 0.1 if the model overfits |
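
These settings interact through the number of optimizer steps, which also determines how eval_steps, save_steps and warmup behave. The sketch below is a rough calculator that assumes a single device, no gradient accumulation, and a partial last batch that is kept; `training_steps` is a hypothetical helper.

```python
import math

def training_steps(num_samples, batch_size, epochs, warmup_ratio=0.1):
    """Estimate optimizer steps for a run (single device, no gradient accumulation)."""
    steps_per_epoch = math.ceil(num_samples / batch_size)
    total_steps = steps_per_epoch * epochs
    return {
        "steps_per_epoch": steps_per_epoch,
        "total_steps": total_steps,
        "warmup_steps": math.ceil(total_steps * warmup_ratio),
    }

demo = training_steps(num_samples=200, batch_size=8, epochs=1)
full = training_steps(num_samples=25_000, batch_size=32, epochs=3)
print("demo:", demo)
print("full:", full)
```

The demo run has only 25 steps, so any eval_steps or save_steps value above 25 never triggers during training. The full-data run has 2346 steps, where an interval of 50 gives dozens of evaluations.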

Train on GPU

```bash
CUDA_VISIBLE_DEVICES=0 python 05-pretrain-finetune.py
```

Set BATCH_SIZE = 32 and NUM_EPOCHS = 3, and train on the full dataset, for results close to the ~90% accuracy quoted above.


Mapping to Course Concepts

| Chapter concept | Code implementation |
|---|---|
| Self-supervised pretraining | distilbert-base-uncased, pretrained with masked language modeling |
| Transfer learning | Reuse the pretrained body and finetune it gently (low LR) while the new head learns from scratch |
| Task-specific head | num_labels=2 classification head |
| Evaluation | compute_metrics() with accuracy and F1 |
| Inference | pipeline("text-classification") |

Further Reading