Tutorial: Model Validation & Evaluation
How to assess the quality of your trained topic models.
Duration: ~10 minutes
Prerequisites: Tutorial: Training Your First Topic Model
Validation Approaches
Three complementary approaches to validate models:
Qualitative: Manual inspection of topics
Quantitative: Metrics (coherence, perplexity)
Downstream: Performance on actual tasks
The Coherence Metric
Coherence measures whether the top words of a topic are semantically related.
coherence_df = model.compute_topic_coherence()
coherence = coherence_df['coherence'].values
print(f"Coherence per topic: {coherence}")
print(f"Average: {coherence.mean():.3f}")
# Topic diversity: are topics distinct?
diversity = model.compute_topic_diversity()
print(f"Topic diversity: {diversity:.3f}")
# 1.0 = all unique words, 0.0 = identical top words
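To make the diversity score less of a black box, here is a minimal sketch of one common definition: the fraction of unique words among all topics' top words. The function name `topic_diversity` is hypothetical and this is not necessarily the formula behind `model.compute_topic_diversity()`. Under this particular definition, K topics with identical top words score 1/K rather than exactly 0.0, so the library may rescale or use a different measure.

```python
def topic_diversity(top_words_per_topic):
    """Fraction of unique words among all topics' top words.

    top_words_per_topic: list of word lists, one list per topic.
    """
    all_words = [w for words in top_words_per_topic for w in words]
    return len(set(all_words)) / len(all_words)


# Two topics with no shared words vs. two topics with identical words
distinct = topic_diversity([["cat", "dog"], ["car", "road"]])
overlapping = topic_diversity([["cat", "dog"], ["cat", "dog"]])
print(f"distinct={distinct:.2f}, overlapping={overlapping:.2f}")
```

A low score here is a cheap early signal that several topics share most of their top words.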
Interpreting values (rough guidelines; the scale depends on the coherence measure and the corpus):
0.6+: Excellent coherence (words form clear themes)
0.4-0.6: Good (topics are interpretable)
0.2-0.4: Fair (some coherence but noisy)
<0.2: Poor (topics are incoherent)
Finding low-coherence topics:
import numpy as np
# Indices of the five topics with the lowest coherence
worst_topics = np.argsort(coherence)[:5]
top_words = model.return_top_words_per_topic(n=10)
for topic_id in worst_topics:
    words = top_words[topic_id]
    print(f"Topic {topic_id} (coherence={coherence[topic_id]:.3f}):")
    print(f"  {', '.join(words)}")
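To make the idea of "semantically related top words" concrete, the sketch below computes NPMI (normalized pointwise mutual information) from document co-occurrence, which is one common coherence measure. The tutorial does not say which measure `compute_topic_coherence()` uses, so treat `npmi_coherence` as a hypothetical illustration. NPMI ranges from -1 to 1, so the thresholds listed above may not carry over to it.

```python
import math
from itertools import combinations


def npmi_coherence(top_words, documents):
    """Mean NPMI over all pairs of a topic's top words.

    top_words: list of at least two words for one topic.
    documents: list of token collections, one per document.
    """
    doc_sets = [set(doc) for doc in documents]
    n_docs = len(doc_sets)

    def doc_freq(*words):
        # Fraction of documents that contain every word in `words`
        return sum(all(w in d for w in words) for d in doc_sets) / n_docs

    scores = []
    for w1, w2 in combinations(top_words, 2):
        p1, p2, p12 = doc_freq(w1), doc_freq(w2), doc_freq(w1, w2)
        if p12 == 0:
            scores.append(-1.0)  # the pair never co-occurs: minimum NPMI
        elif p12 == 1:
            scores.append(0.0)   # both words are in every document: no signal
        else:
            pmi = math.log(p12 / (p1 * p2))
            scores.append(pmi / -math.log(p12))
    return sum(scores) / len(scores)


# Toy corpus: "cat"/"dog" always co-occur, "cat"/"car" never do
docs = [["cat", "dog"], ["cat", "dog"], ["car", "road"], ["car", "road"]]
related = npmi_coherence(["cat", "dog"], docs)
unrelated = npmi_coherence(["cat", "car"], docs)
print(f"related={related:.2f}, unrelated={unrelated:.2f}")
```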
Qualitative Inspection
Manual topic interpretation:
def evaluate_topics_manually(model, num_to_show=10):
    """Show the top words of each topic and collect a 1-5 rating for it."""
    top_words = model.return_top_words_per_topic(n=20)
    ratings = {}
    for topic_id, words in top_words.items():
        print(f"\n=== Topic {topic_id} ===")
        print(f"Top words: {', '.join(words[:num_to_show])}")
        # Rate quality: 1=bad, 2=poor, 3=fair, 4=good, 5=excellent
        rating = input("Rate this topic (1-5, q=quit): ")
        if rating.lower() == 'q':
            break
        # Ignore anything that is not a rating between 1 and 5
        if rating in {'1', '2', '3', '4', '5'}:
            ratings[topic_id] = int(rating)
    return ratings
# Use it
ratings = evaluate_topics_manually(model)
if ratings:
    avg_rating = sum(ratings.values()) / len(ratings)
    print(f"\nAverage rating: {avg_rating:.1f} / 5")
Checklist for topic quality:
✓ Top words form coherent theme
✓ You can give topic a meaningful label
✓ Topic isn't all stopwords or common terms
✓ Topic doesn't duplicate another topic
✓ Topic isn't a garbage catch-all
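The stopword item on this checklist can be partly automated. The sketch below flags topics whose top words are mostly stopwords. The helper names, the tiny stopword set, and the 0.5 threshold are all hypothetical, and it assumes top words come as a dict of topic ID to word list, as the `.items()` call above suggests.

```python
def stopword_fraction(words, stopwords):
    """Share of a topic's top words that are stopwords."""
    words = list(words)
    return sum(w.lower() in stopwords for w in words) / len(words)


def flag_stopword_topics(top_words, stopwords, threshold=0.5):
    """Return IDs of topics whose top words are mostly stopwords."""
    return [
        topic_id
        for topic_id, words in top_words.items()
        if stopword_fraction(words, stopwords) >= threshold
    ]


# Illustrative data only; substitute your own stopword list
example_stopwords = {"the", "and", "of", "to"}
example_top_words = {
    0: ["the", "and", "of", "market"],
    1: ["market", "stock", "price", "trade"],
}
flagged = flag_stopword_topics(example_top_words, example_stopwords)
print(f"Topics dominated by stopwords: {flagged}")
```

Flagged topics still deserve a manual look; the check only narrows down where to start.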
Comparative Evaluation
Compare multiple model configurations:
results = {}
# PF, counts, and vocab are assumed to be defined as in the training tutorial
# Try different numbers of topics
for num_topics in [5, 10, 20, 50]:
    model = PF(counts, vocab, num_topics=num_topics, batch_size=32)
    model.train_step(num_steps=200, lr=0.01)
    coherence_df = model.compute_topic_coherence()
    coherence = coherence_df['coherence'].values
    results[num_topics] = {
        'coherence_mean': coherence.mean(),
        'coherence_std': coherence.std(),
        'diversity': model.compute_topic_diversity(),
        'model': model
    }
# Display results
print("Performance by number of topics:")
for k, v in results.items():
    print(f"  K={k}: coherence={v['coherence_mean']:.3f} ± {v['coherence_std']:.3f}, diversity={v['diversity']:.3f}")
# Pick the configuration with the highest mean coherence
best_k = max(results, key=lambda x: results[x]['coherence_mean'])
print(f"\nBest configuration: {best_k} topics")
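Picking the single highest mean coherence can favor a larger model over a nearly equivalent smaller one. As one possible refinement, the sketch below chooses the smallest number of topics whose score is within a tolerance of the best. The function name, the tolerance, and the scores are made up for illustration.

```python
def pick_num_topics(summary, tolerance=0.02):
    """Smallest K whose mean coherence is within `tolerance` of the best.

    summary: dict mapping number of topics -> mean coherence.
    """
    best = max(summary.values())
    candidates = [k for k, score in summary.items() if best - score <= tolerance]
    return min(candidates)


# Made-up scores, not results from a real run
summary = {5: 0.41, 10: 0.52, 20: 0.53, 50: 0.47}
chosen = pick_num_topics(summary)
print(f"Chosen number of topics: {chosen}")
```

With the loop above, `summary` could be built as `{k: v['coherence_mean'] for k, v in results.items()}`.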
Downstream Task Evaluation
If you have a downstream task, evaluate model performance there:
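# Example: use topics as features for document classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
# Get document-topic representations
_, e_theta = model.return_topics()
# Hold out a test set so accuracy is not measured on the training documents
X_train, X_test, y_train, y_test = train_test_split(
    e_theta, labels, test_size=0.2, random_state=0
)  # labels = ground truth
# Train classifier on topic features
clf = RandomForestClassifier(random_state=0)
clf.fit(X_train, y_train)
# Evaluate on the held-out documents
accuracy = clf.score(X_test, y_test)
print(f"Classification accuracy: {accuracy:.3f}")
# Record the score so it can be compared with other models
downstream_results = {'accuracy': accuracy}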
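An accuracy figure means little without a reference point. A minimal sketch of one such reference is the majority-class baseline: the accuracy obtained by always predicting the most frequent label. The function name and the example labels are hypothetical.

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Accuracy of always predicting the most frequent label."""
    counts = Counter(labels)
    return counts.most_common(1)[0][1] / len(labels)


# Illustrative labels only
example_labels = ["sports", "sports", "sports", "politics"]
baseline = majority_baseline_accuracy(example_labels)
print(f"Majority-class baseline: {baseline:.3f}")
```

If the classifier trained on topic features does not clearly beat this baseline, the topics carry little information for the task.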
Topic Similarity Analysis
Are topics overlapping? Check similarity:
from sklearn.metrics.pairwise import cosine_similarity
beta = model.return_beta()
similarity = cosine_similarity(beta.values.T)
np.fill_diagonal(similarity, 0)
# Find similar pairs
similar = np.where(similarity > 0.7)
top_words = model.return_top_words_per_topic(n=5)
for i, j in zip(similar[0], similar[1]):
    if i < j:
        print(f"Topic {i} and {j} are similar (sim={similarity[i, j]:.3f})")
        print(f"  Topic {i}: {', '.join(top_words[i])}")
        print(f"  Topic {j}: {', '.join(top_words[j])}")
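Cosine similarity compares full topic-word vectors. A lighter complementary check is the Jaccard overlap of the top-word lists, sketched below. The helper names and the 0.5 threshold are hypothetical, and top words are assumed to be a dict of topic ID to a non-empty word list.

```python
from itertools import combinations


def jaccard(a, b):
    """Size of the intersection divided by size of the union."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)


def overlapping_topic_pairs(top_words, threshold=0.5):
    """Return (topic_i, topic_j, score) for pairs at or above the threshold."""
    pairs = []
    for (i, wi), (j, wj) in combinations(sorted(top_words.items()), 2):
        score = jaccard(wi, wj)
        if score >= threshold:
            pairs.append((i, j, score))
    return pairs


# Illustrative data: topics 0 and 1 share two of their three top words
example_top_words = {
    0: ["cat", "dog", "pet"],
    1: ["cat", "dog", "vet"],
    2: ["car", "road", "fuel"],
}
pairs = overlapping_topic_pairs(example_top_words)
print(pairs)
```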
Document Coverage
Do all documents get meaningful topic assignments?
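_, e_theta = model.return_topics()
# Normalize each row to sum to 1 so it can be read as a distribution
# (has no effect if e_theta is already normalized)
theta = np.asarray(e_theta, dtype=float)
theta = theta / theta.sum(axis=1, keepdims=True)
# Topic concentration per document
doc_entropy = -np.sum(theta * np.log(theta + 1e-10), axis=1)
max_probability = theta.max(axis=1)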
print(f"Document topic concentration:")
print(f" Max topic probability: {max_probability.mean():.3f} ± {max_probability.std():.3f}")
print(f" Entropy: {doc_entropy.mean():.3f} ± {doc_entropy.std():.3f}")
# Low entropy = document in few topics (concentrated)
# High entropy = document spread across topics (diffuse)
# Are we getting good coverage?
if max_probability.mean() < 0.3:
    print("Warning: Documents don't concentrate on topics")
    print("  → Consider increasing num_topics or more training")
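Raw entropy grows with the number of topics, which makes values hard to compare across models. One way around this, sketched here under the assumption that topic weights are non-negative, is to divide by the maximum possible entropy, log(K), so the result lies between 0 (all weight on one topic) and 1 (uniform). The function name is hypothetical.

```python
import math


def normalized_entropy(weights):
    """Entropy of the normalized weights divided by log(K).

    weights: non-negative topic weights for one document (K >= 2).
    """
    total = sum(weights)
    probs = [w / total for w in weights]
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    return entropy / math.log(len(probs))


diffuse = normalized_entropy([1, 1, 1, 1])       # weight spread evenly
concentrated = normalized_entropy([5, 0, 0, 0])  # all weight on one topic
print(f"diffuse={diffuse:.2f}, concentrated={concentrated:.2f}")
```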
Visualization for Validation
import matplotlib.pyplot as plt
fig, axes = plt.subplots(2, 2, figsize=(12, 10))
# 1. Coherence distribution
coherence_df = model.compute_topic_coherence()
coherence = coherence_df['coherence'].values
axes[0, 0].hist(coherence, bins=20, edgecolor='black')
axes[0, 0].set_xlabel('Coherence')
axes[0, 0].set_ylabel('Number of topics')
axes[0, 0].set_title('Topic Coherence Distribution')
axes[0, 0].axvline(coherence.mean(), color='red', linestyle='--', label='Mean')
axes[0, 0].legend()
# 2. Topic prevalence (or use built-in: model.plot_topic_prevalence())
_, e_theta = model.return_topics()
avg_topics = e_theta.mean(axis=0)
axes[0, 1].bar(range(len(avg_topics)), avg_topics)
axes[0, 1].set_xlabel('Topic ID')
axes[0, 1].set_ylabel('Average intensity')
axes[0, 1].set_title('Topic Prevalence')
# 3. Document entropy
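# Normalize rows first so entropy is computed on a distribution
theta = np.asarray(e_theta, dtype=float)
theta = theta / theta.sum(axis=1, keepdims=True)
doc_entropy = -np.sum(theta * np.log(theta + 1e-10), axis=1)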
axes[1, 0].hist(doc_entropy, bins=30, edgecolor='black')
axes[1, 0].set_xlabel('Entropy')
axes[1, 0].set_ylabel('Number of documents')
axes[1, 0].set_title('Document Topic Dispersion')
# 4. Best vs worst topics by coherence
top_topics = np.argsort(coherence)[-5:]
bottom_topics = np.argsort(coherence)[:5]
axes[1, 1].barh(range(5), coherence[bottom_topics], alpha=0.5, label='Worst')
axes[1, 1].barh(range(5, 10), coherence[top_topics], alpha=0.5, label='Best')
axes[1, 1].set_yticks(range(10))
axes[1, 1].set_yticklabels(list(bottom_topics) + list(top_topics))
axes[1, 1].set_xlabel('Coherence')
axes[1, 1].set_title('Best vs Worst Topics')
axes[1, 1].legend()
plt.tight_layout()
plt.show()
Validation Checklist
Before deploying a model:
✓ Average coherence > 0.4
✓ No garbage topics (all stopwords)
✓ Topics aren’t highly overlapping
✓ Manual inspection: topics make sense
✓ Downstream task performance acceptable
✓ Coverage: documents get meaningful topics
✓ Reproducibility: same seed → same results
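The quantitative items on this checklist can be bundled into a single function. The sketch below reuses the thresholds that appear in this tutorial (0.4 for mean coherence, 0.7 for topic similarity, 0.3 for mean maximum topic probability). The function name and argument names are hypothetical, and the qualitative items still need a human.

```python
def run_validation_checks(mean_coherence, max_topic_similarity, mean_max_probability):
    """Return the names of the quantitative checks that failed."""
    checks = {
        "coherence": mean_coherence > 0.4,
        "overlap": max_topic_similarity <= 0.7,
        "coverage": mean_max_probability >= 0.3,
    }
    return [name for name, passed in checks.items() if not passed]


# Made-up summary statistics for two models
passing = run_validation_checks(0.45, 0.50, 0.60)
failing = run_validation_checks(0.35, 0.80, 0.60)
print(f"passing model failed: {passing}")
print(f"failing model failed: {failing}")
```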
Red Flags
Model probably needs improvement if:
❌ Most topics have low coherence (<0.3)
❌ Can’t label most topics meaningfully
❌ Many topics are duplicates
❌ Some topics are all stopwords/garbage
❌ Downstream task performance is poor
❌ Many documents have flat topic distribution
Next steps when validation fails:
Try more training iterations
Adjust learning rate
Change number of topics
Improve data preprocessing
Try guided/seeded variant (SPF)
Add covariates if available (CPF)
See Tutorial: Hyperparameter Tuning for optimization strategies.
Validation Workflow
1. Train model with initial config
2. Compute coherence
3. Visualize and inspect topics
4. Check for duplicates
5. Evaluate downstream performance
If quality acceptable: ✓ Done
If not:
6. Adjust configuration
7. Retrain and repeat from step 2
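The workflow above can be sketched as a loop that tries configurations in order and stops at the first acceptable one. Everything here is a placeholder: `train_and_score` stands in for your own train-then-evaluate code, and the scores are made up.

```python
def validate_until_acceptable(configs, train_and_score, min_coherence=0.4):
    """Try configs in order; return the first that meets the threshold.

    train_and_score: hypothetical callable, config -> mean coherence.
    Returns (accepted_config_or_None, history of (config, score)).
    """
    history = []
    for config in configs:
        score = train_and_score(config)
        history.append((config, score))
        if score >= min_coherence:
            return config, history
    return None, history


# Stand-in scorer with made-up numbers instead of real training
fake_scores = {5: 0.31, 10: 0.38, 20: 0.46}
accepted, history = validate_until_acceptable(
    [{"num_topics": k} for k in (5, 10, 20)],
    lambda cfg: fake_scores[cfg["num_topics"]],
)
print(f"Accepted config: {accepted} after {len(history)} attempts")
```

Keeping the full history makes it easy to see which adjustments helped, even when no configuration passes.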
Version Tracking
Keep records of model evaluations:
import json
from datetime import datetime
def save_evaluation(model_name, config, results):
    """Save model evaluation results."""
    eval_record = {
        'timestamp': datetime.now().isoformat(),
        'model_name': model_name,
        'config': config,
        'results': {
            'mean_coherence': float(results['coherence'].mean()),
            'std_coherence': float(results['coherence'].std()),
            'num_low_quality': results.get('low_quality_count', 0),
        }
    }
    # Append one JSON object per line (JSON Lines), so read the file line by line
    with open('evaluations.json', 'a') as f:
        f.write(json.dumps(eval_record) + '\n')
# Use it
config = {'num_topics': 20, 'learning_rate': 0.01}
save_evaluation('pf_model_v1', config, {'coherence': coherence})
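Because each record is appended as its own line, the file has to be read back line by line rather than with a single `json.load`. A minimal sketch of loading the records and picking the best one follows. The helper names are hypothetical and the demo records contain made-up scores.

```python
import json
import os
import tempfile


def load_evaluations(path):
    """Read one JSON record per non-empty line."""
    records = []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


def best_evaluation(records):
    """Record with the highest mean coherence."""
    return max(records, key=lambda r: r["results"]["mean_coherence"])


# Demo with made-up records written to a temporary file
demo_records = [
    {"model_name": "pf_model_v1", "results": {"mean_coherence": 0.42}},
    {"model_name": "pf_model_v2", "results": {"mean_coherence": 0.51}},
]
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "evaluations.json")
    with open(path, "w") as f:
        for record in demo_records:
            f.write(json.dumps(record) + "\n")
    loaded = load_evaluations(path)
    best = best_evaluation(loaded)
print(f"Best model so far: {best['model_name']}")
```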
Next Steps
Satisfied? Move to Tutorial: Hyperparameter Tuning for fine-tuning
Need to optimize? See How-To Guides
Want production-ready? Check Contributing Guide for best practices