Tutorial: Training Your First Topic Model

This tutorial covers the complete workflow for training a topic model and interpreting results.

Duration: ~15 minutes
Level: Beginner
Prerequisites: Getting Started

Step 1: Prepare Your Data

First, organize your text data into a document-term matrix.

import numpy as np
from scipy.sparse import csr_matrix
from poisson_topicmodels import PF

# In practice you would load pre-processed counts here.
# For this tutorial, create synthetic data instead.
np.random.seed(42)

# Document-term matrix: 500 documents, 1000 vocabulary
num_docs, num_words = 500, 1000
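# Synthetic counts: every entry is drawn from Poisson(1), so each document has
# ~1000 tokens in total and roughly 63% of its entries are non-zero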
counts = csr_matrix(np.random.poisson(1, (num_docs, num_words)).astype(np.float32))

# Vocabulary: one word label per column of the matrix
vocab = np.array([f'word_{i}' for i in range(num_words)])

print(f"Dataset shape: {counts.shape}")
print(f"Sparsity: {1 - counts.nnz / (counts.shape[0] * counts.shape[1]):.1%}")
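
The synthetic matrix above stands in for real data. If you start from raw text instead, the following is a minimal sketch of building a document-term count matrix with the standard library and NumPy. The three example sentences, the whitespace tokenizer, and all variable names are illustrative assumptions; real pipelines usually add stop-word removal and rare-word filtering.

```python
from collections import Counter

import numpy as np

# Hypothetical toy corpus; replace with your own documents.
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "stocks fell as markets closed",
]

# Naive whitespace tokenization (an assumption; use a real tokenizer in practice).
tokenized = [doc.lower().split() for doc in docs]

# A sorted vocabulary gives a stable word -> column mapping.
vocab_list = sorted({tok for doc in tokenized for tok in doc})
word_to_col = {word: col for col, word in enumerate(vocab_list)}

# Fill a dense count matrix: rows are documents, columns are vocabulary words.
dtm = np.zeros((len(docs), len(vocab_list)), dtype=np.float32)
for row, tokens in enumerate(tokenized):
    for word, count in Counter(tokens).items():
        dtm[row, word_to_col[word]] = count

print(dtm.shape)
print(dtm[0, word_to_col["the"]])
```

Wrapping the result as `csr_matrix(dtm)` together with `np.array(vocab_list)` gives inputs in the same form as `counts` and `vocab` above.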

Step 2: Initialize the Model

Create a topic model with an initial configuration.

# Start with a reasonable number of topics
num_topics = 15  # Adjust based on your data and goals

model = PF(
    counts=counts,
    vocab=vocab,
    num_topics=num_topics,
    batch_size=64,  # 64 documents per training batch
    random_seed=42  # for reproducibility
)

print(f"Initialized PF model with {num_topics} topics")
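
To get a feel for what `batch_size=64` implies, the arithmetic below relates batch size to passes over the corpus. It assumes that each training step processes exactly one batch of `batch_size` documents, which this tutorial does not state explicitly, and it uses the 200 steps from the training call in Step 3.

```python
import math

num_docs = 500      # documents in the synthetic corpus
batch_size = 64     # documents per training batch (from the model config)
num_steps = 200     # training steps used in Step 3 of this tutorial

# Batches needed to see every document once (the last batch is partial).
batches_per_epoch = math.ceil(num_docs / batch_size)

# Assumption: one batch per step, so total document visits = steps * batch size.
docs_seen = num_steps * batch_size
approx_epochs = docs_seen / num_docs

print(f"Batches per pass over the data: {batches_per_epoch}")
print(f"Documents processed in {num_steps} steps: {docs_seen}")
print(f"Approximate passes over the corpus: {approx_epochs:.1f}")
```

Under that assumption, halving the batch size without changing the number of steps also halves how many documents the model sees in total.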

Step 3: Train the Model

Run inference to learn topic and document-topic distributions.

# Train for 200 steps with a moderate learning rate
params = model.train_step(
    num_steps=200,
    lr=0.01,
)

print("Training complete!")

Monitor training: Watch the loss values. They should decrease steadily and then plateau.

Time expectations:

- 500 docs × 1000 words: ~30 seconds on CPU, ~5 seconds on GPU
- Larger datasets: scale accordingly
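
One way to make "decrease, then plateau" concrete is to compare the average loss over the most recent window of steps with the window before it. The sketch below runs on a made-up loss curve. How the library exposes its loss history is not covered in this tutorial, so `fake_losses`, `has_plateaued`, and the thresholds are all hypothetical.

```python
import numpy as np


def has_plateaued(losses, window=20, rel_tol=0.01):
    """Return True if the mean loss improved by less than rel_tol between
    the previous window and the most recent window.
    Note: this also returns True if the loss got worse."""
    losses = np.asarray(losses, dtype=float)
    if losses.size < 2 * window:
        return False  # not enough history to compare two windows
    previous = losses[-2 * window:-window].mean()
    recent = losses[-window:].mean()
    relative_improvement = (previous - recent) / abs(previous)
    return bool(relative_improvement < rel_tol)


# Made-up loss curve: exponential decay toward a floor of 100.
steps = np.arange(200)
fake_losses = 100 + 900 * np.exp(-steps / 20)

print(has_plateaued(fake_losses[:40]))   # early in training
print(has_plateaued(fake_losses))        # after 200 steps
```

A check like this is only a heuristic; plotting the loss curve remains the most reliable way to judge convergence.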

Step 4: Extract Results

Get the learned topics and document-topic distributions.

# 1. Get word-topic associations
beta = model.return_beta()  # DataFrame: vocab_size × num_topics
print(f"Beta shape: {beta.shape}")

# 2. Get document-topic distributions
categories, e_theta = model.return_topics()
print(f"Document-topic shape: {e_theta.shape}")

# 3. Get top words per topic
top_words = model.return_top_words_per_topic(n=15)
print(f"Number of topics: {len(top_words)}")
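
Conceptually, the top words of a topic are the vocabulary entries with the largest weights in that topic's column of beta. The sketch below shows that ranking on a small hand-made matrix with NumPy. It illustrates the idea and is not necessarily how `return_top_words_per_topic` is implemented; the toy vocabulary, the weights, and the helper name are invented.

```python
import numpy as np

# Invented vocabulary and word-topic weights (rows: words, columns: topics).
toy_vocab = np.array(["goal", "match", "team", "vote", "party", "election"])
toy_beta = np.array([
    [0.9, 0.1],
    [0.7, 0.2],
    [0.8, 0.1],
    [0.1, 0.6],
    [0.2, 0.9],
    [0.1, 0.8],
])


def top_words_from_beta(beta, vocab, n=3):
    """Map each topic index to its n highest-weighted words, best first."""
    result = {}
    for topic in range(beta.shape[1]):
        # argsort is ascending, so reverse it and keep the first n indices.
        order = np.argsort(beta[:, topic])[::-1][:n]
        result[topic] = vocab[order].tolist()
    return result


toy_top_words = top_words_from_beta(toy_beta, toy_vocab, n=3)
for topic, words in toy_top_words.items():
    print(f"Topic {topic}: {', '.join(words)}")
```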

Step 5: Interpret Topics

Human interpretation is crucial for evaluating topic quality.

# Print the model summary first
model.summary()

# Display top words for each topic
print("=" * 60)
print("DISCOVERED TOPICS")
print("=" * 60)

top_words = model.return_top_words_per_topic(n=15)
for topic_id, words in top_words.items():
    print(f"\nTopic {topic_id}:")
    print(f"  Top words: {', '.join(words[:10])}")

# Ask yourself:
# - Do the top words form a coherent theme?
# - Can you give each topic a human-readable label?
# - Do any topics look like garbage?
# - Are any topics too similar?
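
The last question, whether any topics are too similar, can be given a rough numeric answer by measuring how much two topics' top-word lists overlap. Below is a minimal sketch using Jaccard overlap on a hand-written dictionary shaped like `top_words` (topic id mapped to a list of words). The words and the 0.5 threshold are illustrative assumptions.

```python
from itertools import combinations

# Invented example in the same shape as `top_words`: topic id -> list of words.
example_top_words = {
    0: ["goal", "match", "team", "league", "coach"],
    1: ["vote", "party", "election", "poll", "senate"],
    2: ["team", "coach", "league", "season", "goal"],
}


def jaccard(words_a, words_b):
    """Size of the intersection divided by size of the union of two word lists."""
    set_a, set_b = set(words_a), set(words_b)
    return len(set_a & set_b) / len(set_a | set_b)


overlaps = {
    (a, b): jaccard(example_top_words[a], example_top_words[b])
    for a, b in combinations(sorted(example_top_words), 2)
}

threshold = 0.5  # arbitrary cut-off for "suspiciously similar"
for (a, b), score in overlaps.items():
    flag = "  <- possible duplicate" if score >= threshold else ""
    print(f"Topics {a} and {b}: overlap {score:.2f}{flag}")
```

Pairs flagged this way are candidates for merging or for retraining with fewer topics, but the final call still needs a human reading of the word lists.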

Step 6: Analyze Document-Topic Distribution

Understand how topics are distributed across documents.

import matplotlib.pyplot as plt

# 1. Topic distribution in specific documents
categories, e_theta = model.return_topics()
doc_0_topics = e_theta[0]
print("\nDocument 0 topic distribution:")
sorted_topics = np.argsort(doc_0_topics)[::-1]
for topic_id in sorted_topics[:5]:
    intensity = doc_0_topics[topic_id]
    print(f"  Topic {topic_id}: {intensity:.3f}")

# 2. Overall topic prevalence in corpus (built-in plot)
model.plot_topic_prevalence()

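# 3. Find the most focused and the most mixed documents
# Entropy is defined on probabilities, so normalize each row to sum to 1 first
# (assumes e_theta is a NumPy array of non-negative intensities)
theta_norm = e_theta / e_theta.sum(axis=1, keepdims=True)
doc_entropy = -np.sum(theta_norm * np.log(theta_norm + 1e-10), axis=1)
focused_docs = np.argsort(doc_entropy)[:5]
scattered_docs = np.argsort(doc_entropy)[-5:]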

print(f"\nMost focused documents (highest topic concentration): {focused_docs}")
print(f"Most scattered documents (most mixed): {scattered_docs}")
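
To see why low entropy means a focused document, it helps to run the calculation on a tiny example. The sketch below normalizes each row of an invented document-topic matrix to sum to 1 and then computes Shannon entropy. The numbers are made up and the helper name is hypothetical.

```python
import numpy as np

# Invented document-topic intensities: 3 documents, 4 topics.
toy_theta = np.array([
    [9.7, 0.1, 0.1, 0.1],   # dominated by one topic
    [2.5, 2.5, 2.5, 2.5],   # spread evenly over all topics
    [5.0, 5.0, 0.0, 0.0],   # split between two topics
])


def row_entropy(theta, eps=1e-10):
    """Shannon entropy (in nats) of each row after normalizing it to sum to 1."""
    probs = theta / theta.sum(axis=1, keepdims=True)
    return -np.sum(probs * np.log(probs + eps), axis=1)


toy_entropy = row_entropy(toy_theta)
for doc_id, value in enumerate(toy_entropy):
    print(f"Document {doc_id}: entropy = {value:.3f}")

# The maximum possible entropy for 4 topics is log(4).
print(f"Upper bound: {np.log(4):.3f}")
```

The evenly spread document reaches the upper bound, the two-topic document sits at log(2), and the single-topic document is close to zero.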

Step 7: Advanced Analysis

Deeper exploration of results:

# 1. Topic similarity (built-in heatmap)
model.plot_topic_correlation()

# Or compute manually:
from sklearn.metrics.pairwise import cosine_similarity
beta = model.return_beta()
topic_similarity = cosine_similarity(beta.values.T)
np.fill_diagonal(topic_similarity, 0)  # Remove self-similarity

# Find the three most similar topic pairs
for _ in range(3):
    i, j = np.unravel_index(topic_similarity.argmax(), topic_similarity.shape)
    if topic_similarity[i, j] > 0:
        print(f"Topic {i} and {j} are similar (sim={topic_similarity[i, j]:.3f})")
    # The matrix is symmetric, so clear both entries to avoid reporting a pair twice
    topic_similarity[i, j] = topic_similarity[j, i] = 0

# 2. Topic specialization
# How strongly does each document's largest topic dominate?
_, e_theta = model.return_topics()
doc_dominance = e_theta.max(axis=1) / e_theta.sum(axis=1)
print(f"\nAverage document topic dominance: {doc_dominance.mean():.3f}")
print("  → Values close to 1: a single topic dominates each document")
print(f"  → Values close to {1/num_topics:.3f}: documents spread evenly across topics")
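
For readers who want to see what the cosine-similarity step computes, here is a minimal NumPy-only sketch on an invented beta matrix. It compares topic columns and lists each pair exactly once by looking only at the upper triangle of the similarity matrix. The matrix values and variable names are assumptions for illustration.

```python
import numpy as np

# Invented word-topic weights: 5 words (rows) by 3 topics (columns).
toy_beta = np.array([
    [1.0, 0.0, 0.9],
    [0.8, 0.1, 1.0],
    [0.0, 1.0, 0.1],
    [0.1, 0.9, 0.0],
    [0.5, 0.5, 0.4],
])

# Scale each topic column to unit length; dot products then equal cosines.
unit_columns = toy_beta / np.linalg.norm(toy_beta, axis=0, keepdims=True)
similarity = unit_columns.T @ unit_columns   # shape: topics x topics

# Upper triangle (k=1 skips the diagonal) lists each unordered pair once.
rows, cols = np.triu_indices(similarity.shape[0], k=1)
pair_scores = similarity[rows, cols]
ranking = np.argsort(pair_scores)[::-1]

for idx in ranking:
    print(f"Topics {rows[idx]} and {cols[idx]}: sim = {pair_scores[idx]:.3f}")

most_similar_pair = (int(rows[ranking[0]]), int(cols[ranking[0]]))
```

Working from the upper triangle avoids having to zero out entries of a symmetric matrix while searching for the top pairs.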

Step 8: Quality Metrics

Evaluate model quality programmatically.

# Coherence: do top words of a topic correlate?
coherence_df = model.compute_topic_coherence()
coherence = coherence_df['coherence'].values
print("Topic coherence (per topic):")
print(f"  Mean: {coherence.mean():.3f}")
print(f"  Std: {coherence.std():.3f}")
print(f"  Range: [{coherence.min():.3f}, {coherence.max():.3f}]")

# Topic diversity
diversity = model.compute_topic_diversity()
print(f"\nTopic diversity: {diversity:.3f}")

# Which topics are most coherent?
best_topics = np.argsort(coherence)[-5:]
worst_topics = np.argsort(coherence)[:5]
print(f"\nMost coherent topics: {best_topics}")
print(f"Least coherent topics: {worst_topics}")
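
The exact definitions behind `compute_topic_coherence` and `compute_topic_diversity` are not spelled out in this tutorial. As a rough intuition, the sketch below implements two definitions that are common in the topic-modeling literature on an invented example: topic diversity as the fraction of unique words among all topics' top words, and a UMass-style coherence based on how often top words appear in the same documents. These may differ from what the library computes, and all names and data here are hypothetical.

```python
import math
from itertools import combinations

# Invented top words per topic and a tiny corpus of documents as word sets.
toy_topics = {
    0: ["goal", "team", "match"],
    1: ["vote", "party", "team"],
}
toy_docs = [
    {"goal", "team", "match"},
    {"goal", "team"},
    {"vote", "party"},
    {"vote", "party", "election"},
    {"team", "match"},
]


def topic_diversity(topics):
    """Fraction of unique words among all top words (1.0 = no overlap)."""
    all_words = [w for words in topics.values() for w in words]
    return len(set(all_words)) / len(all_words)


def umass_coherence(words, docs):
    """UMass-style score: sum over ordered word pairs of
    log((co-document count + 1) / document count of the earlier word).
    Every word must occur in at least one document."""
    score = 0.0
    for earlier, later in combinations(words, 2):
        df_earlier = sum(earlier in doc for doc in docs)
        co_df = sum(earlier in doc and later in doc for doc in docs)
        score += math.log((co_df + 1) / df_earlier)
    return score


print(f"Diversity: {topic_diversity(toy_topics):.3f}")
toy_coherence = {t: umass_coherence(w, toy_docs) for t, w in toy_topics.items()}
for topic, score in toy_coherence.items():
    print(f"Topic {topic} coherence: {score:.3f}")
```

In this toy case topic 1 scores lower because "team" never appears in the same document as "vote" or "party".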

Next: Validation and Optimization

Your model is trained. Next, consider:

  1. Validate model quality (see Tutorial: Model Validation & Evaluation)

  2. Optimize hyperparameters (see Tutorial: Hyperparameter Tuning)

  3. Scale to bigger data (see Tutorial: GPU Acceleration)

  4. Solve specific problems (see How-To Guides)

Quick Checklist

✓ Data loaded and formatted as document-term matrix
✓ Model initialized with reasonable parameters
✓ Training completed and loss decreased
✓ Topics extracted and interpreted
✓ Document-topic distributions explored
✓ Quality metrics computed

Common Issues

Q: Loss isn’t decreasing
A: Try a higher learning rate (0.05-0.1) or reduce the batch size

Q: Topics look random
A: You may need more topics or more training iterations. Note that the synthetic data in this tutorial is independent Poisson noise, so its topics will look random by construction.

Q: Training is really slow
A: Use a GPU (see Tutorial: GPU Acceleration) or reduce the vocabulary size

Q: Memory error
A: Reduce batch_size or use a sparse matrix format
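
Reducing vocabulary size usually means dropping words that appear in very few or in almost all documents. The following is a minimal sketch of that filter using NumPy on a small dense matrix. The thresholds `min_docs` and `max_doc_frac` and the helper name are assumptions; for large sparse matrices you would compute document frequencies on the sparse representation instead.

```python
import numpy as np


def prune_vocabulary(counts, vocab, min_docs=2, max_doc_frac=0.9):
    """Keep words that occur in at least min_docs documents and in at most
    max_doc_frac of all documents. Returns the filtered matrix and vocabulary."""
    doc_freq = (counts > 0).sum(axis=0)   # number of documents containing each word
    keep = (doc_freq >= min_docs) & (doc_freq <= max_doc_frac * counts.shape[0])
    return counts[:, keep], vocab[keep]


# Invented 4-document, 5-word example.
toy_vocab = np.array(["the", "cat", "dog", "zebra", "runs"])
toy_counts = np.array([
    [3, 1, 0, 0, 1],
    [2, 0, 1, 0, 1],
    [4, 1, 1, 1, 0],
    [1, 1, 0, 0, 0],
])

small_counts, small_vocab = prune_vocabulary(toy_counts, toy_vocab)
print(small_vocab)
print(small_counts.shape)
```

Here "the" is dropped for appearing in every document and "zebra" for appearing in only one, which shrinks the matrix from five columns to three.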

See Fundamentals for more details on each model.