Tutorial: Hyperparameter Tuning

Systematically optimize your topic models for best performance.

Duration: ~15 minutes

Prerequisites: Tutorial: Model Validation & Evaluation

Key Hyperparameters

Three main hyperparameters control model quality and training:

  1. num_topics (K): How many topics to discover

  2. learning_rate (lr): Optimization step size

  3. batch_size: Documents per training iteration

Lesser-used but important:

  • num_iterations: Training steps (usually 100-1000); the examples below pass this as num_steps to train_step

  • random_seed: Reproducibility
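Reproducibility comes down to fixing every source of randomness before a run. The sketch below uses only Python's standard library to show the idea. How PF itself consumes a seed is not shown in this tutorial, and the helper name sample_config is hypothetical:

```python
import random

def sample_config(seed):
    """Draw one (num_topics, batch_size) pair from a seeded generator."""
    rng = random.Random(seed)  # local generator, so global state is untouched
    k = rng.choice([10, 15, 20, 30, 50])
    bs = rng.choice([32, 64, 128, 256])
    return k, bs

# The same seed always yields the same configuration
first = sample_config(seed=42)
second = sample_config(seed=42)
print(first, second, first == second)
```

Recording the seed next to each result makes it possible to rerun a trial exactly.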

num_topics: The Critical Parameter

The number of topics affects results the most.

Too few topics:

  • ❌ Vague, broad themes

  • ❌ Everything maps to few topics

  • ❌ Loss of granularity

Too many topics:

  • ❌ Redundant/overlapping topics

  • ❌ Low coherence

  • ❌ Noise capture

Optimal:

  • ✓ Coherent, interpretable themes

  • ✓ No obvious redundancy

  • ✓ Good downstream performance

Selecting num_topics:

# Strategy: try multiple values
# (PF, counts, and vocab are assumed to be already defined, e.g. from the prerequisite tutorial)
import pandas as pd

num_topics_to_try = [5, 10, 15, 20, 30, 50, 75, 100]

results = {}
models = {}  # keep trained models separate so the results table stays numeric
for k in num_topics_to_try:
    print(f"Training with {k} topics...")
    model = PF(counts, vocab, num_topics=k, batch_size=64)
    model.train_step(num_steps=200, lr=0.01)

    coherence_df = model.compute_topic_coherence()
    coherence = coherence_df['coherence'].values
    results[k] = {
        'coherence_mean': coherence.mean(),
        'coherence_std': coherence.std(),
        'coherence_min': coherence.min(),
    }
    models[k] = model

# Analyze results (one row per K)
df = pd.DataFrame.from_dict(results, orient='index')
print(df)

# Best by coherence
best_k = df['coherence_mean'].idxmax()
print(f"\nOptimal num_topics: {best_k}")
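Picking the single highest mean can favor a larger K on a near-tie, which risks the redundancy described earlier. One possible refinement, offered here as an assumption rather than as part of this tutorial's method, is to take the smallest K whose mean coherence is within a small tolerance of the best. The function name, the tolerance, and the scores below are all illustrative:

```python
def pick_num_topics(mean_coherence, tolerance=0.01):
    """Return the smallest K whose mean coherence is within `tolerance` of the best."""
    best = max(mean_coherence.values())
    candidates = [k for k, score in mean_coherence.items() if score >= best - tolerance]
    return min(candidates)

# Placeholder numbers for illustration only, not real results
scores = {5: 0.31, 10: 0.42, 20: 0.455, 30: 0.46, 50: 0.44}
chosen_k = pick_num_topics(scores, tolerance=0.01)
print(chosen_k)
```

With tolerance=0.0 the function reduces to a plain argmax, so the tolerance controls how strongly smaller models are preferred.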

Practical guideline:

  • Small corpus (<10k docs): Start with K=5-20

  • Medium corpus (10k-100k docs): K=20-50

  • Large corpus (100k+ docs): K=50-200
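The guideline above can be written as a small lookup that gives a starting range for a sweep. This is only a sketch: the function name suggest_topic_range is hypothetical, and the ranges are starting points rather than hard limits:

```python
def suggest_topic_range(num_docs):
    """Map corpus size to the starting K range from the guideline above."""
    if num_docs < 10_000:
        return (5, 20)
    if num_docs < 100_000:
        return (20, 50)
    return (50, 200)

for n in (2_000, 40_000, 500_000):
    low, high = suggest_topic_range(n)
    print(f"{n:>7} docs: try K in [{low}, {high}]")
```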

learning_rate: Optimization Speed

Controls how fast the model learns.

Too low (0.001):

  • Learning is very slow

  • Many iterations needed

  • More stable but inefficient

Too high (0.5):

  • Learning is erratic

  • Loss may increase

  • Unstable convergence

Just right (0.01-0.05):

  • Steady decrease in loss

  • Converges in reasonable time

  • Reproducible results
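The three regimes above can be reproduced on a toy problem. The sketch below runs plain gradient descent on f(x) = 5x², which is not the PF objective; it is chosen only because its curvature makes 0.001 slow, 0.05 well behaved, and 0.5 divergent. The exact thresholds for a real model will differ:

```python
def run_gradient_descent(lr, steps=50, x0=1.0):
    """Minimize the toy loss f(x) = 5 * x**2 and return the loss after each step."""
    x = x0
    losses = []
    for _ in range(steps):
        grad = 10.0 * x          # derivative of 5 * x**2
        x = x - lr * grad        # plain gradient descent update
        losses.append(5.0 * x * x)
    return losses

initial_loss = 5.0  # loss at x0 = 1.0
for lr in (0.001, 0.05, 0.5):
    final = run_gradient_descent(lr)[-1]
    print(f"lr={lr}: final loss {final:.3g} (initial {initial_loss})")
```

The too-high rate overshoots the minimum on every step, so the loss grows instead of shrinking.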

Finding optimal learning rate:

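lrs_to_try = [0.001, 0.005, 0.01, 0.05, 0.1]

for lr in lrs_to_try:
    print(f"\nTraining with learning_rate={lr}")
    model = PF(counts, vocab, num_topics=20, batch_size=64)
    model.train_step(num_steps=200, lr=lr)
    # Report the final loss so the runs can be compared
    print(f"  final loss: {model.Metrics.loss[-1]:.1f}")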

Default recommendation: Start with 0.01

batch_size: Gradient Stability

Batch size affects gradient noise and GPU utilization.

Too small (16):

  • Very noisy gradients

  • Unstable training

  • Each iteration is fast, but many are needed

  • Not efficient on GPU

Too large (1024):

  • Stable gradients

  • Few iterations needed

  • Slower per iteration

  • May not fit in GPU memory

Balanced (64-256):

  • Good stability

  • Good GPU utilization

  • Efficient training
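The trade-off between iteration count and batch size is simple arithmetic: one pass over the corpus takes ceil(num_docs / batch_size) iterations. A short sketch, with a hypothetical corpus size of 10,000 documents:

```python
import math

def steps_per_epoch(num_docs, batch_size):
    """Number of iterations needed to see every document once."""
    return math.ceil(num_docs / batch_size)

num_docs = 10_000  # hypothetical corpus size
for bs in (16, 64, 256, 1024):
    print(f"batch_size={bs:4d}: {steps_per_epoch(num_docs, bs)} steps per epoch")
```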

Choosing batch_size:

import time

# Rule of thumb: experiment with powers of 2
batch_sizes = [32, 64, 128, 256, 512]

for bs in batch_sizes:
    model = PF(counts, vocab, num_topics=20, batch_size=bs)

    t0 = time.perf_counter()
    model.train_step(num_steps=50, lr=0.01)
    elapsed = time.perf_counter() - t0

    print(f"batch_size={bs:3d}: {elapsed:.1f}s")
    # Look for the sweet spot between speed and quality

With GPU:

  • Start with batch_size=256 or 512

  • Increase until you hit a GPU out-of-memory error

  • Then reduce by half
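The doubling procedure above can be automated. The sketch below assumes a hypothetical callable try_batch(bs) that runs a short training step and raises MemoryError when the batch does not fit. Real GPU frameworks raise their own exception types, so the except clause would need adapting:

```python
def fits(try_batch, bs):
    """Return True if a trial run with this batch size succeeds."""
    try:
        try_batch(bs)
        return True
    except MemoryError:
        return False

def find_max_batch_size(try_batch, start=256, limit=65536):
    """Double the batch size until it fails, then keep the last size that worked."""
    bs = start
    # If even the starting size fails, halve until something fits
    while bs >= 1 and not fits(try_batch, bs):
        bs //= 2
    if bs < 1:
        return None
    # Keep doubling while the next size still fits
    while bs * 2 <= limit and fits(try_batch, bs * 2):
        bs *= 2
    return bs

# Stand-in for a real training step: pretend anything above 3000 runs out of memory
def fake_try_batch(bs):
    if bs > 3000:
        raise MemoryError("simulated out-of-memory")

print(find_max_batch_size(fake_try_batch))
```

Because the size doubles on each attempt, the last working value is exactly half of the first failing one, which matches the "reduce by half" rule.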

Random Search (More Efficient Than Grid Search)

import numpy as np

# Sample 20 random combinations from space
n_trials = 20
results = []

for trial in range(n_trials):
    # Random parameters
    k = int(np.random.choice([10, 15, 20, 30, 50]))
    lr = 10 ** np.random.uniform(-3, -1)  # log-uniform over [0.001, 0.1]
    bs = int(np.random.choice([32, 64, 128, 256]))

    model = PF(counts, vocab, num_topics=k, batch_size=bs)
    model.train_step(num_steps=200, lr=lr)

    coherence_df = model.compute_topic_coherence()
    results.append({
        'num_topics': k,
        'learning_rate': lr,
        'batch_size': bs,
        'coherence': coherence_df['coherence'].mean()
    })

    print(f"Trial {trial+1}/{n_trials}: coherence={results[-1]['coherence']:.3f}")

# Best configuration
best_idx = np.argmax([r['coherence'] for r in results])
best_config = results[best_idx]
print(f"Best: {best_config}")
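Learning rates span orders of magnitude, which is why a log scale is recommended when sampling them. The sketch below, using only the standard library, compares log-uniform and plain uniform sampling over [0.001, 0.1]. The helper name sample_log_uniform is hypothetical:

```python
import math
import random

def sample_log_uniform(rng, low, high):
    """Sample so that each decade between low and high is equally likely."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))

rng = random.Random(0)
samples = [sample_log_uniform(rng, 0.001, 0.1) for _ in range(10_000)]
uniform = [rng.uniform(0.001, 0.1) for _ in range(10_000)]

# Share of draws that land in the lower decade [0.001, 0.01)
share_log = sum(s < 0.01 for s in samples) / len(samples)
share_uniform = sum(u < 0.01 for u in uniform) / len(uniform)
print(f"below 0.01: log-uniform {share_log:.0%}, uniform {share_uniform:.0%}")
```

Plain uniform sampling spends most of its trials in the upper decade, so small learning rates are rarely tested.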

Practical Tuning Strategy

Step 1: Find good num_topics (most important)

# Try 5 values: rough search
for k in [10, 20, 35, 50, 75]:
    model = PF(counts, vocab, num_topics=k, batch_size=64)
    model.train_step(num_steps=200, lr=0.01)
    coherence_df = model.compute_topic_coherence()
    print(f"K={k}: {coherence_df['coherence'].mean():.3f}")

Step 2: Refine around best K

# If K=35 was best, try nearby values
best_k = 35
for k in range(best_k - 5, best_k + 6):  # 30-40
    model = PF(counts, vocab, num_topics=k, batch_size=64)
    model.train_step(num_steps=200, lr=0.01)
    coherence_df = model.compute_topic_coherence()
    print(f"K={k}: {coherence_df['coherence'].mean():.3f}")

Step 3: Tune lr and batch_size

# With best K, try different lr values; repeat the same loop over batch_size afterwards
best_k = 35  # from previous step

for lr in [0.005, 0.01, 0.02, 0.05]:
    model = PF(counts, vocab, num_topics=best_k, batch_size=64)
    model.train_step(num_steps=200, lr=lr)
    coherence_df = model.compute_topic_coherence()
    print(f"lr={lr}: {coherence_df['coherence'].mean():.3f}")

Step 4: Final validation

# Train final model with best parameters
final_model = PF(counts, vocab, num_topics=35, batch_size=128)
final_model.train_step(num_steps=500, lr=0.02)  # More steps for final

# Validate
coherence_df = final_model.compute_topic_coherence()
print(f"Final model coherence: {coherence_df['coherence'].mean():.3f}")
final_model.summary()
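The check above scores the model on the same corpus it was trained on. To validate on held-out documents, the document indices can be split before training. This is a minimal sketch that assumes the rows of counts can be selected by index; the function name and the 20% fraction are illustrative:

```python
import random

def split_indices(num_docs, held_out_fraction=0.2, seed=0):
    """Shuffle document indices and split them into train and held-out lists."""
    indices = list(range(num_docs))
    random.Random(seed).shuffle(indices)
    n_held_out = int(num_docs * held_out_fraction)
    return indices[n_held_out:], indices[:n_held_out]

train_idx, held_out_idx = split_indices(num_docs=1_000)
print(len(train_idx), len(held_out_idx))
```

Fixing the seed keeps the split identical across tuning runs, so scores from different configurations stay comparable.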

Early Stopping

Stop training when loss plateaus:

model = PF(counts, vocab, num_topics=20, batch_size=64)

loss_history = []
patience = 10  # Stop after 10 consecutive checks (10 steps each) without improvement
best_loss = float('inf')
patience_counter = 0

for epoch in range(100):
    params = model.train_step(num_steps=10, lr=0.01)
    current_loss = model.Metrics.loss[-1] if model.Metrics.loss else float('nan')
    loss_history.append(current_loss)

    print(f"Epoch {epoch+1}: loss={current_loss:.1f}")

    # Check for improvement
    if current_loss < best_loss - 1.0:  # Improvement threshold
        best_loss = current_loss
        patience_counter = 0
        print("  ✓ Improvement!")
    else:
        patience_counter += 1
        print(f"  No improvement ({patience_counter}/{patience})")

    if patience_counter >= patience:
        print("Early stopping!")
        break
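The patience logic above can be factored into a small reusable helper and tested without training a model. The class name EarlyStopper and the loss values below are illustrative, not part of PF:

```python
class EarlyStopper:
    """Track a loss series and signal when it has stopped improving."""

    def __init__(self, patience=10, min_delta=1.0):
        self.patience = patience      # checks allowed without improvement
        self.min_delta = min_delta    # minimum decrease that counts as improvement
        self.best_loss = float("inf")
        self.bad_checks = 0

    def should_stop(self, loss):
        if loss < self.best_loss - self.min_delta:
            self.best_loss = loss
            self.bad_checks = 0
        else:
            self.bad_checks += 1
        return self.bad_checks >= self.patience

# Synthetic loss curve for illustration only
losses = [100.0, 90.0, 85.0, 84.5, 84.4, 84.3, 84.2]
stopper = EarlyStopper(patience=3, min_delta=1.0)
stopped_at = None
for i, loss in enumerate(losses):
    if stopper.should_stop(loss):
        stopped_at = i
        break
print(stopped_at)
```

In a training loop, should_stop would be called once per check with the latest loss value.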

Documenting Experiments

Track your hyperparameter explorations:

import logging

logging.basicConfig(
    filename='hyperparameter_log.txt',
    level=logging.INFO,
    format='%(asctime)s - %(message)s'
)

for k in [20, 30, 50]:
    model = PF(counts, vocab, num_topics=k, batch_size=64)
    model.train_step(num_steps=200, lr=0.01)
    coherence_df = model.compute_topic_coherence()

    logging.info(f"K={k}: coherence={coherence_df['coherence'].mean():.3f}")
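A plain-text log is easy to read but awkward to analyze later. One alternative, sketched here with the standard library only, is to append each trial as a JSON line so the full configuration can be loaded back. The function names, file name, and coherence values are placeholders:

```python
import json
import os
import tempfile

def log_trial(path, config, metrics):
    """Append one trial as a single JSON line."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps({"config": config, "metrics": metrics}) + "\n")

def load_trials(path):
    """Read all logged trials back as a list of dicts."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

# Demo with placeholder values, written to a temporary directory
log_path = os.path.join(tempfile.mkdtemp(), "trials.jsonl")
log_trial(log_path, {"num_topics": 20, "lr": 0.01, "batch_size": 64}, {"coherence": 0.0})
log_trial(log_path, {"num_topics": 30, "lr": 0.01, "batch_size": 64}, {"coherence": 0.0})
trials = load_trials(log_path)
print(len(trials), trials[0]["config"])
```

Appending rather than overwriting means an interrupted search keeps every trial completed so far.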

Common Mistakes & Solutions

Mistake: Tuning learning_rate too aggressively

Solution: It’s usually not the bottleneck. Focus on K first.

Mistake: Grid search over too many combinations

Solution: Use random search or tune one parameter at a time.

Mistake: Not tracking which configurations you’ve tried

Solution: Keep a log with timestamps and results.

Mistake: Overfitting to coherence on one dataset

Solution: Validate on held-out documents and, where possible, on multiple datasets.

Mistake: Not using GPU

Solution: Enable the GPU. It makes hyperparameter search much faster.
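The grid-search mistake above can be made concrete by counting training runs. The sketch below builds the full grid from the candidate lists used earlier in this tutorial and compares it with the 20-trial random search budget:

```python
import itertools

# Candidate values taken from the sweeps earlier in this tutorial
num_topics_values = [5, 10, 15, 20, 30, 50, 75, 100]
learning_rates = [0.001, 0.005, 0.01, 0.05, 0.1]
batch_sizes = [32, 64, 128, 256, 512]

# Every combination of the three lists
grid = list(itertools.product(num_topics_values, learning_rates, batch_sizes))
n_random_trials = 20  # same budget as the random search example

print(f"Full grid: {len(grid)} training runs")
print(f"Random search budget: {n_random_trials} training runs")
```

Each extra parameter multiplies the grid size, while a random search budget stays fixed regardless of how many parameters are sampled.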

Tuning Checklist

  • ✓ Focus on num_topics first (most impact)

  • ✓ Try at least 5 different values

  • ✓ Use GPU to enable faster experimentation

  • ✓ Document all trials

  • ✓ Validate on held-out data

  • ✓ Use early stopping when possible

  • ✓ Final training: more iterations than tuning

Summary

  1. num_topics is the most important parameter

  2. A learning rate of 0.01 is usually a good starting point

  3. Batch size mainly affects training speed and gradient stability

  4. Use GPU to enable rapid experimentation

  5. Track all experiments for reproducibility

  6. Stop training when loss plateaus