Tutorial: Training Your First Topic Model
This tutorial covers the complete workflow for training a topic model and interpreting results.
Duration: ~15 minutes
Level: Beginner
Prerequisites: Getting Started
Step 1: Prepare Your Data
First, organize your text data into a document-term matrix.
import numpy as np
from scipy.sparse import csr_matrix
from poisson_topicmodels import PF
# In practice you would load a pre-processed document-term matrix here.
# For this tutorial, we create synthetic data instead.
np.random.seed(42)
# Document-term matrix: 500 documents, 1,000 vocabulary words
num_docs, num_words = 500, 1000
# Synthetic counts: every entry is Poisson(1), so ~1,000 tokens per document on average
counts = csr_matrix(np.random.poisson(1, (num_docs, num_words)).astype(np.float32))
# Vocabulary: one word per column of the matrix
vocab = np.array([f'word_{i}' for i in range(num_words)])
print(f"Dataset shape: {counts.shape}")
print(f"Sparsity: {1 - counts.nnz / (counts.shape[0] * counts.shape[1]):.1%}")
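If you start from raw text rather than a ready-made matrix, the document-term matrix has to be built first. The sketch below shows one minimal way to do that with the standard library, NumPy and SciPy. The `build_dtm` helper and its whitespace tokenizer are illustrative assumptions, not part of poisson_topicmodels; a real pipeline would normally add stop-word removal and a proper tokenizer.

```python
from collections import Counter

import numpy as np
from scipy.sparse import csr_matrix


def build_dtm(docs):
    """Build a sparse document-term count matrix from raw strings.

    Hypothetical helper for illustration: lowercases and splits on whitespace.
    """
    tokenized = [doc.lower().split() for doc in docs]
    # A sorted vocabulary gives a stable word -> column mapping
    vocab = sorted({tok for toks in tokenized for tok in toks})
    col_of = {word: j for j, word in enumerate(vocab)}

    rows, cols, vals = [], [], []
    for i, toks in enumerate(tokenized):
        for word, count in Counter(toks).items():
            rows.append(i)
            cols.append(col_of[word])
            vals.append(count)

    counts = csr_matrix(
        (np.array(vals, dtype=np.float32), (rows, cols)),
        shape=(len(docs), len(vocab)),
    )
    return counts, np.array(vocab)


toy_docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "stocks fell as markets closed",
]
toy_counts, toy_vocab = build_dtm(toy_docs)
print(f"Toy dataset shape: {toy_counts.shape}")
```

The returned pair has the same layout as `counts` and `vocab` above: one row per document, one column per vocabulary word.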
Step 2: Initialize the Model
Create a topic model with initial configuration.
# Start with a reasonable number of topics
num_topics = 15 # Adjust based on your data and goals
model = PF(
    counts=counts,
    vocab=vocab,
    num_topics=num_topics,
    batch_size=64,   # 64 documents per training batch
    random_seed=42,  # for reproducibility
)
print(f"Initialized PF model with {num_topics} topics")
Step 3: Train the Model
Run inference to learn topic and document-topic distributions.
# Train for 200 steps with moderate learning rate
params = model.train_step(
    num_steps=200,
    lr=0.01,
)
print("Training complete!")
Monitor training: Watch the loss values. They should decrease steadily and then plateau.
Time expectations:
- 500 docs × 1000 words: ~30 seconds on CPU, ~5 seconds on GPU
- Larger datasets: scale accordingly
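"Decrease steadily then plateau" can be checked numerically. The sketch below assumes you have collected the per-step loss values in a plain list or array; how to obtain them from the model is not shown in this tutorial. The `has_plateaued` helper, its window size and its tolerance are hypothetical choices for illustration.

```python
import numpy as np


def has_plateaued(losses, window=20, rel_tol=0.01):
    """Return True when the mean loss over the last `window` steps improved
    by less than `rel_tol` (relative) compared with the window before it."""
    losses = np.asarray(losses, dtype=float)
    if losses.size < 2 * window:
        return False  # not enough history to compare two windows
    previous = losses[-2 * window:-window].mean()
    recent = losses[-window:].mean()
    improvement = (previous - recent) / max(abs(previous), 1e-12)
    return bool(improvement < rel_tol)


# Synthetic loss curve: exponential decay towards a floor of 100
steps = np.arange(200)
fake_losses = 100 + 900 * np.exp(-steps / 20)

print("Plateau after 60 steps: ", has_plateaued(fake_losses[:60]))
print("Plateau after 200 steps:", has_plateaued(fake_losses))
```

If the check is still false at the end of training, more steps are likely to help; if the loss rises instead of falling, the learning rate may be too high.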
Step 4: Extract Results
Get the learned topics and document-topic distributions.
# 1. Get word-topic associations
beta = model.return_beta() # DataFrame: vocab_size × num_topics
print(f"Beta shape: {beta.shape}")
# 2. Get document-topic distributions
categories, e_theta = model.return_topics()
print(f"Document-topic shape: {e_theta.shape}")
# 3. Get top words per topic
top_words = model.return_top_words_per_topic(n=15)
print(f"Number of topics: {len(top_words)}")
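To see how top words relate to the word-topic matrix, the sketch below ranks words by weight within each topic column of a toy array. It is a manual illustration only: the library's own ranking inside `return_top_words_per_topic` may differ, and `top_words_from_beta` is a hypothetical helper. With the DataFrame from `return_beta()`, you could pass `beta.values` and the DataFrame's index in the same way.

```python
import numpy as np

# Toy word-topic matrix: 6 words (rows) x 2 topics (columns)
toy_vocab = np.array(["goal", "match", "team", "vote", "party", "election"])
toy_beta = np.array([
    [0.9, 0.1],
    [0.7, 0.0],
    [0.8, 0.2],
    [0.1, 0.9],
    [0.0, 0.6],
    [0.2, 0.8],
])


def top_words_from_beta(beta, vocab, n=3):
    """Return {topic_index: [n highest-weight words]} for a vocab x topics array."""
    result = {}
    for k in range(beta.shape[1]):
        # argsort is ascending, so reverse it and keep the first n indices
        top_idx = np.argsort(beta[:, k])[::-1][:n]
        result[k] = vocab[top_idx].tolist()
    return result


toy_top = top_words_from_beta(toy_beta, toy_vocab)
for k, words in toy_top.items():
    print(f"Topic {k}: {', '.join(words)}")
```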
Step 5: Interpret Topics
Human interpretation is crucial for evaluating topic quality.
# Print the model summary first
model.summary()
# Display top words for each topic
print("=" * 60)
print("DISCOVERED TOPICS")
print("=" * 60)
top_words = model.return_top_words_per_topic(n=15)
for topic_id, words in top_words.items():
    print(f"\nTopic {topic_id}:")
    print(f"  Top words: {', '.join(words[:10])}")
# Ask yourself:
# - Do the top words form a coherent theme?
# - Can you give each topic a human-readable label?
# - Do any topics look like garbage?
# - Are any topics too similar?
Step 6: Analyze Document-Topic Distribution
Understand how topics are distributed across documents.
import matplotlib.pyplot as plt
# 1. Topic distribution in specific documents
categories, e_theta = model.return_topics()
doc_0_topics = e_theta[0]
print(f"\nDocument 0 topic distribution:")
sorted_topics = np.argsort(doc_0_topics)[::-1]
for topic_id in sorted_topics[:5]:
    intensity = doc_0_topics[topic_id]
    print(f"  Topic {topic_id}: {intensity:.3f}")
# 2. Overall topic prevalence in corpus (built-in plot)
model.plot_topic_prevalence()
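# 3. Find the most focused and the most scattered documents
# Normalize each row to proportions so that the entropy is well defined
theta_norm = e_theta / e_theta.sum(axis=1, keepdims=True)
doc_entropy = -np.sum(theta_norm * np.log(theta_norm + 1e-10), axis=1)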
focused_docs = np.argsort(doc_entropy)[:5]
scattered_docs = np.argsort(doc_entropy)[-5:]
print(f"\nMost focused documents (highest topic concentration): {focused_docs}")
print(f"Most scattered documents (most mixed): {scattered_docs}")
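The entropy ranking is easier to trust after trying it on a case where the answer is obvious. The toy matrix below is invented for illustration: two documents are dominated by one topic and one is spread evenly. Rows are normalized to proportions first, since entropy is only meaningful for values that sum to 1.

```python
import numpy as np

# Toy document-topic intensities (rows: documents, columns: topics), unnormalized
toy_theta = np.array([
    [9.0, 0.5, 0.5],   # dominated by topic 0
    [3.0, 3.0, 3.0],   # evenly spread
    [0.2, 0.2, 5.0],   # dominated by topic 2
])

# Convert each row to proportions, then compute Shannon entropy per document
proportions = toy_theta / toy_theta.sum(axis=1, keepdims=True)
toy_entropy = -np.sum(proportions * np.log(proportions + 1e-10), axis=1)

# A uniform distribution over K topics has the maximum entropy, log(K)
max_entropy = np.log(toy_theta.shape[1])

for doc_id, h in enumerate(toy_entropy):
    print(f"Document {doc_id}: entropy = {h:.3f} ({h / max_entropy:.0%} of maximum)")
print("Most focused document:", int(np.argmin(toy_entropy)))
print("Most scattered document:", int(np.argmax(toy_entropy)))
```

Dividing by `log(K)` puts entropy on a 0 to 1 scale, which makes values comparable between models with different numbers of topics.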
Step 7: Advanced Analysis
Deeper exploration of results:
# 1. Topic similarity (built-in heatmap)
model.plot_topic_correlation()
# Or compute manually:
from sklearn.metrics.pairwise import cosine_similarity
beta = model.return_beta()
topic_similarity = cosine_similarity(beta.values.T)
np.fill_diagonal(topic_similarity, 0) # Remove self-similarity
# Find most similar topic pairs
for _ in range(3):
    i, j = np.unravel_index(topic_similarity.argmax(), topic_similarity.shape)
    if topic_similarity[i, j] > 0:
        print(f"Topic {i} and {j} are similar (sim={topic_similarity[i, j]:.3f})")
    # The matrix is symmetric: zero both entries so the pair is not reported twice
    topic_similarity[i, j] = topic_similarity[j, i] = 0
# 2. Topic specialization
# How strongly does the single largest topic dominate each document?
_, e_theta = model.return_topics()
doc_dominance = (e_theta.max(axis=1) / e_theta.sum(axis=1))
print(f"\nAverage document topic dominance: {doc_dominance.mean():.3f}")
print(f" → Values close to 1: documents focus on few topics")
print(f" → Values close to {1/num_topics:.3f}: documents spread across topics")
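Because a cosine similarity matrix is symmetric, each topic pair appears twice in it. An alternative to zeroing entries in a loop is to rank only the upper triangle, where every unordered pair occurs exactly once. The sketch below does this with plain NumPy on an invented toy matrix; it illustrates the idea and is not the library's `plot_topic_correlation` implementation.

```python
import numpy as np

# Toy word-topic matrix: 5 words (rows) x 4 topics (columns)
toy_beta = np.array([
    [1.0, 0.9, 0.0, 0.1],
    [0.8, 1.0, 0.1, 0.0],
    [0.1, 0.0, 1.0, 0.2],
    [0.0, 0.1, 0.9, 0.1],
    [0.0, 0.0, 0.1, 1.0],
])

# Cosine similarity between topic columns: scale to unit length, then dot products
unit = toy_beta / np.linalg.norm(toy_beta, axis=0, keepdims=True)
sim = unit.T @ unit

# Indices of the strict upper triangle: each pair (i < j) exactly once
rows, cols = np.triu_indices(sim.shape[0], k=1)
order = np.argsort(sim[rows, cols])[::-1]
top_pairs = [
    (int(rows[o]), int(cols[o]), float(sim[rows[o], cols[o]]))
    for o in order[:3]
]

for i, j, s in top_pairs:
    print(f"Topic {i} and {j}: sim={s:.3f}")
```

Pairs with very high similarity are candidates for merging, which usually means trying a smaller `num_topics`.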
Step 8: Quality Metrics
Evaluate model quality programmatically.
# Coherence: do top words of a topic correlate?
coherence_df = model.compute_topic_coherence()
coherence = coherence_df['coherence'].values
print(f"Topic coherence (per topic):")
print(f" Mean: {coherence.mean():.3f}")
print(f" Std: {coherence.std():.3f}")
print(f" Range: [{coherence.min():.3f}, {coherence.max():.3f}]")
# Topic diversity
diversity = model.compute_topic_diversity()
print(f"\nTopic diversity: {diversity:.3f}")
# Which topics are most coherent?
best_topics = np.argsort(coherence)[-5:]
worst_topics = np.argsort(coherence)[:5]
print(f"\nMost coherent topics: {best_topics}")
print(f"Least coherent topics: {worst_topics}")
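To make these two metrics less of a black box, the sketch below computes one common formulation of each on a tiny invented corpus: a UMass-style coherence based on document co-occurrence, and diversity as the share of unique words among all topics' top words. These definitions are assumptions for illustration; `compute_topic_coherence` and `compute_topic_diversity` may use different formulas, so the numbers are not comparable with theirs.

```python
import numpy as np

# Toy corpus as a binary document-word matrix (rows: documents, columns: words)
# Column order: goal, team, match, vote, party
presence = np.array([
    [1, 1, 1, 0, 0],
    [1, 1, 0, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 0, 0, 1, 1],
    [0, 0, 0, 1, 1],
    [1, 0, 0, 1, 0],
])


def umass_coherence(word_ids, presence):
    """Mean of log((D(w_i, w_j) + 1) / D(w_j)) over ordered word pairs,
    where D counts the documents that contain the word(s)."""
    doc_freq = presence.sum(axis=0)
    scores = []
    for pos, wi in enumerate(word_ids[1:], start=1):
        for wj in word_ids[:pos]:
            co_docs = np.sum(presence[:, wi] * presence[:, wj])
            scores.append(np.log((co_docs + 1) / doc_freq[wj]))
    return float(np.mean(scores))


def topic_diversity(top_words):
    """Fraction of unique words among all topics' top words."""
    all_words = [w for words in top_words for w in words]
    return len(set(all_words)) / len(all_words)


coherent_score = umass_coherence([0, 1, 2], presence)  # goal, team, match
mixed_score = umass_coherence([2, 3, 4], presence)     # match, vote, party
toy_diversity = topic_diversity([["goal", "team", "match"], ["vote", "party", "goal"]])

print(f"Coherent topic: {coherent_score:.3f}")
print(f"Mixed topic:    {mixed_score:.3f}")
print(f"Diversity:      {toy_diversity:.3f}")
```

Scores closer to zero mean the top words co-occur more often. A diversity below 1 means some top words are shared between topics.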
Next: Validation and Optimization
Your model is trained! Now consider:
Validate model quality → Tutorial: Model Validation & Evaluation
Optimize hyperparameters → Tutorial: Hyperparameter Tuning
Scale to bigger data → Tutorial: GPU Acceleration
Solve specific problems → How-To Guides
Quick Checklist
✓ Data loaded and formatted as document-term matrix
✓ Model initialized with reasonable parameters
✓ Training completed and loss decreased
✓ Topics extracted and interpreted
✓ Document-topic distributions explored
✓ Quality metrics computed
What’s Next?
Improve results: Try Tutorial: Model Validation & Evaluation to assess quality
Lots of data?: See Tutorial: GPU Acceleration
Fine-tune model: Read Tutorial: Hyperparameter Tuning
Specific task?: Browse How-To Guides
Common Issues
Q: Loss isn’t decreasing
A: Try a higher learning rate (0.05-0.1) or reduce the batch size

Q: Topics look random
A: You may need more topics or more training iterations

Q: Training is really slow
A: Use a GPU (see Tutorial: GPU Acceleration) or reduce the vocabulary size

Q: Memory error
A: Reduce batch_size or use a sparse matrix format
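Reducing the vocabulary size, as suggested for slow training, usually means dropping words that appear in very few or in almost all documents. The sketch below filters the columns of a sparse matrix by document frequency. The `prune_vocabulary` helper and its thresholds are hypothetical and should be tuned to your corpus.

```python
import numpy as np
from scipy.sparse import csr_matrix


def prune_vocabulary(counts, vocab, min_docs=2, max_doc_share=0.9):
    """Keep words that occur in at least `min_docs` documents and in at most
    `max_doc_share` of all documents. Hypothetical helper for illustration."""
    num_docs = counts.shape[0]
    # Document frequency: number of documents with a nonzero count per word
    doc_freq = np.asarray((counts > 0).sum(axis=0)).ravel()
    keep = (doc_freq >= min_docs) & (doc_freq <= max_doc_share * num_docs)
    keep_idx = np.flatnonzero(keep)
    return counts[:, keep_idx], vocab[keep_idx]


toy_counts = csr_matrix(np.array([
    [2, 1, 0, 0],
    [1, 0, 3, 0],
    [4, 0, 1, 0],
    [1, 0, 0, 1],
], dtype=np.float32))
toy_vocab = np.array(["the", "zebra", "model", "qwerty"])

small_counts, small_vocab = prune_vocabulary(toy_counts, toy_vocab)
print(f"Kept words: {list(small_vocab)}")
print(f"New shape: {small_counts.shape}")
```

Here "the" is dropped because it appears in every document, and "zebra" and "qwerty" are dropped because each appears in only one. The pruned matrix and vocabulary can be passed to the model in place of the originals.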
See Fundamentals for more details on each model.