Getting Started
Welcome to poisson-topicmodels! This guide will get you up and running in about 5 minutes.
Quickstart: Your First Topic Model
Let’s walk through a complete example from data preparation to model interpretation.
Step 1: Import Required Libraries
import numpy as np
from scipy.sparse import csr_matrix
from poisson_topicmodels import PF
Step 2: Prepare Your Data
Topic models work with a document-term matrix (documents × vocabulary terms) and a vocabulary list.
# Create sample data: 100 documents, 500 vocabulary terms
# In practice, you would load your own text data
np.random.seed(42)
counts = csr_matrix(np.random.poisson(2, (100, 500)).astype(np.float32))
vocab = np.array([f'word_{i}' for i in range(500)])
print(f"Document-term matrix shape: {counts.shape}")
print(f"Vocabulary size: {len(vocab)}")
Data format: The document-term matrix should be a sparse matrix (rows = documents, columns = vocabulary terms) with non-negative integer counts.
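If you are starting from raw text rather than random counts, a matrix in this format can be assembled with NumPy and SciPy alone. The following is a minimal sketch, not part of poisson-topicmodels: the toy corpus, the whitespace tokenizer, and the name word_to_col are illustrative assumptions, and real data usually needs proper tokenization and stop-word handling.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Toy corpus (illustrative only); replace with your own documents.
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "dogs and cats make good pets",
]

# Naive tokenization: lowercase and split on whitespace.
tokenized = [doc.lower().split() for doc in docs]

# Sorted vocabulary plus a word -> column index lookup.
vocab = np.array(sorted({tok for doc in tokenized for tok in doc}))
word_to_col = {word: col for col, word in enumerate(vocab)}

# One (row, col) pair per token occurrence.
rows, cols = [], []
for doc_id, tokens in enumerate(tokenized):
    for tok in tokens:
        rows.append(doc_id)
        cols.append(word_to_col[tok])

# csr_matrix sums duplicate (row, col) entries, which yields the word counts.
data = np.ones(len(rows), dtype=np.float32)
counts = csr_matrix((data, (rows, cols)), shape=(len(docs), len(vocab)))

print(f"Document-term matrix shape: {counts.shape}")
print(f"Count of 'the' in first document: {counts[0, word_to_col['the']]}")
```

The resulting counts and vocab have the same types as the random example above (a SciPy CSR matrix and a NumPy array of strings), so they can be used in its place.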
Step 3: Initialize and Train the Model
# Create a Poisson Factorization model with 10 topics
model = PF(
    counts=counts,
    vocab=vocab,
    num_topics=10,
    batch_size=32,
    random_seed=42
)
# Train for 200 steps with learning rate 0.01
params = model.train_step(
    num_steps=200,
    lr=0.01
)
Step 4: Extract and Interpret Results
# Quick summary of the fitted model
model.summary()
# Get word-topic associations
beta = model.return_beta() # DataFrame: words × topics
print(f"Beta shape: {beta.shape}")
# Get top words for each topic
top_words = model.return_top_words_per_topic(n=10)
print("\nTop 10 words per topic:")
for topic_id, words in top_words.items():
    print(f"Topic {topic_id}: {', '.join(words)}")
# Get document-topic distributions
categories, e_theta = model.return_topics()
print(f"\nDocument-topic matrix shape: {e_theta.shape}")
print(f"Dominant topic for first doc: {categories[0]}")
Understanding the Output
- Beta (return_beta())
DataFrame of shape (vocabulary_size, num_topics). Each column is a topic: word-level association weights.
- Topics (return_topics())
Returns (categories, E_theta): the dominant topic per document and the full document-topic proportions matrix.
- Top Words (return_top_words_per_topic(n))
A dict mapping topic identifiers to their top-n words, useful for interpretation.
- Summary (summary())
Prints a formatted overview of the model: loss, top words, and model-specific details.
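To make the relationship between these outputs concrete, here is a small sketch that uses random stand-in arrays instead of a fitted model. It assumes that the dominant topic is the column with the largest document-topic proportion and that top words are the largest entries in each topic's column; the library's own methods may compute or label these differently.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = np.array([f"word_{i}" for i in range(8)])

# Stand-ins for fitted quantities (random values, NOT produced by the library):
#   beta:    (vocabulary_size, num_topics) word-topic weights
#   e_theta: (num_docs, num_topics) document-topic proportions
beta = rng.gamma(1.0, 1.0, size=(8, 3))
e_theta = rng.dirichlet(np.ones(3), size=5)

# Assumed definition: dominant topic = column with the largest proportion.
dominant = e_theta.argmax(axis=1)

# Assumed definition: top-n words = largest weights in each topic's column.
n = 3
top_words = {
    k: vocab[np.argsort(beta[:, k])[::-1][:n]].tolist()
    for k in range(beta.shape[1])
}

print("Dominant topic per document:", dominant)
for k, words in top_words.items():
    print(f"Topic {k}: {', '.join(words)}")
```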
Next Steps
Now that you have a working model, explore:
Model Variants (Fundamentals): Explore all models, such as seeded PF (SPF) for guided discovery.
Training & Configuration (Tutorials): Understand training options, hyperparameters, and GPU acceleration.
Practical Recipes (How-To Guides): Common tasks and advanced workflows.
Advanced Usage (How-To Guides): Extract results, customize inference, and integrate with your pipeline.
Complete Example with Real-ish Data
Here’s a more realistic example with synthetic documents that have meaningful structure:
import numpy as np
from scipy.sparse import csr_matrix
from poisson_topicmodels import PF
# Create synthetic documents with 3 underlying topics
np.random.seed(42)
num_docs = 200
num_words = 1000
num_topics = 3
# Define some "topic-specific" words
topic_words = {
    0: list(range(0, 100)),    # Words 0-99 for topic 0
    1: list(range(100, 200)),  # Words 100-199 for topic 1
    2: list(range(200, 300)),  # Words 200-299 for topic 2
}
# Generate documents with topic structure
counts_list = []
for doc_id in range(num_docs):
    # Each document is a mixture of topics
    topic_dist = np.random.dirichlet([1] * num_topics)
    word_counts = np.zeros(num_words)
    for topic_id in range(num_topics):
        topic_weight = topic_dist[topic_id]
        words_in_topic = topic_words[topic_id]
        for _ in range(int(50 * topic_weight)):
            word_id = np.random.choice(words_in_topic)
            word_counts[word_id] += 1
    counts_list.append(word_counts)
counts = csr_matrix(np.array(counts_list).astype(np.float32))
vocab = np.array([f'word_{i}' for i in range(num_words)])
# Train model with matching number of topics
model = PF(counts, vocab, num_topics=3, batch_size=32)
params = model.train_step(num_steps=200, lr=0.01, random_seed=42)
# The model should discover the 3 underlying topics
model.summary()
top_words = model.return_top_words_per_topic(n=20)
for topic_id, words in top_words.items():
    print(f"\nTopic {topic_id}:")
    print(f" {', '.join(words[:10])}")
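Because the synthetic vocabulary encodes each word's index in its name, you can check how cleanly each learned topic lines up with one of the planted word ranges. The helper names planted_block and block_purity below are hypothetical, not part of the library, and the example dict is a hand-written stand-in for the output of return_top_words_per_topic.

```python
from collections import Counter

def planted_block(word, block_size=100, num_blocks=3):
    """Map 'word_<i>' to its planted topic (i // block_size), or None if unused."""
    idx = int(word.split("_")[1])
    block = idx // block_size
    return block if block < num_blocks else None

def block_purity(top_words):
    """For each topic, return (most common planted block, share of top words in it)."""
    result = {}
    for topic_id, words in top_words.items():
        blocks = Counter(planted_block(w) for w in words)
        block, hits = blocks.most_common(1)[0]
        result[topic_id] = (block, hits / len(words))
    return result

# Hand-written stand-in for model.return_top_words_per_topic(n=4).
example = {
    0: ["word_12", "word_45", "word_7", "word_130"],
    1: ["word_150", "word_101", "word_199", "word_188"],
    2: ["word_250", "word_201", "word_299", "word_5"],
}

for topic_id, (block, purity) in block_purity(example).items():
    print(f"Topic {topic_id}: planted block {block}, purity {purity:.2f}")
```

A purity near 1.0 for every topic, with each topic mapped to a different block, suggests the planted structure was recovered. Learned topic indices need not match the planted indices, since topic order is arbitrary.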
Key Concepts
- Document-Term Matrix
The core input format: a sparse matrix where rows are documents and columns are vocabulary terms, containing word counts.
- Topics
Latent variables representing abstract themes. Each topic is a distribution over words.
- Topic Modeling
Statistical technique to discover and analyze latent topics in text data.
- Stochastic Variational Inference (SVI)
Efficient training method that processes documents in small batches (mini-batch training).
- GPU Acceleration
Computations run on GPU (if available) for significant speedups on large datasets.
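The mini-batch idea behind SVI can be illustrated without the library. The sketch below uses made-up parameters and plain NumPy. It shows that the Poisson log-likelihood computed on a random batch, rescaled by num_docs / batch_size, matches the full-data value on average. This is the generic rescaling used in stochastic variational inference; the library's internal implementation may differ.

```python
import numpy as np

rng = np.random.default_rng(0)
num_docs, num_words, num_topics, batch_size = 200, 50, 3, 32

# Made-up document-topic and word-topic weights (not fitted by any model).
theta = rng.gamma(1.0, 1.0, size=(num_docs, num_topics))
beta = rng.gamma(1.0, 1.0, size=(num_words, num_topics))
rate = theta @ beta.T            # Poisson rate for every (document, word) pair
counts = rng.poisson(rate)       # synthetic document-term counts

def poisson_loglik(x, rate):
    """Poisson log-likelihood, dropping the constant -log(x!) term."""
    return np.sum(x * np.log(rate) - rate)

full = poisson_loglik(counts, rate)

def minibatch_estimate():
    """Log-likelihood of a random batch, rescaled to the full dataset size."""
    idx = rng.choice(num_docs, size=batch_size, replace=False)
    return (num_docs / batch_size) * poisson_loglik(counts[idx], rate[idx])

estimates = np.array([minibatch_estimate() for _ in range(2000)])
print(f"Full-data log-likelihood:    {full:.1f}")
print(f"Mean of mini-batch estimates: {estimates.mean():.1f}")
```

Each individual estimate is noisy, which is one reason larger batches tend to give a smoother loss curve.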
Common Parameters
num_topics: Number of topics to discover
batch_size: Documents processed per training step
num_steps: Number of training steps (passed to train_step)
lr: Learning rate, the step size for optimization (passed to train_step)
random_seed: For reproducibility
Tips for Best Results
Tune the Number of Topics: Start with 10-20 topics, adjust based on coherence
Use a Good Batch Size: Larger batches (256+) for stability, smaller (32) for faster iterations
Monitor Training: Check that loss decreases smoothly
Validate Topics: Read top words to verify topics make sense
Use GPU: If available, GPU acceleration can provide a 10-100x speedup on large datasets, depending on hardware
Set Random Seed: For reproducibility in research
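One common way to put a number on "coherence" when tuning the number of topics is the UMass score, which rewards topics whose top words tend to appear in the same documents. The function below is a minimal sketch of that general metric on a tiny hand-made matrix. The name umass_coherence is hypothetical and is not claimed to be part of poisson-topicmodels; it assumes every top word occurs in at least one document.

```python
import numpy as np
from scipy.sparse import csr_matrix

def umass_coherence(counts, top_word_ids):
    """UMass coherence for one topic.

    top_word_ids: column indices of the topic's top words, best first.
    Scores closer to zero indicate words that co-occur more often.
    """
    present = (counts > 0).astype(np.float64)       # 1 where a word occurs in a doc
    sub = present[:, top_word_ids]
    co_doc_freq = (sub.T @ sub).toarray()           # documents containing both words
    doc_freq = np.diag(co_doc_freq)                 # documents containing each word
    score = 0.0
    for m in range(1, len(top_word_ids)):
        for l in range(m):
            score += np.log((co_doc_freq[m, l] + 1.0) / doc_freq[l])
    return score

# Tiny example: words 0 and 1 co-occur, words 0 and 2 never do.
counts = csr_matrix(np.array([
    [1, 2, 0, 0],
    [3, 1, 0, 0],
    [0, 0, 1, 0],
    [1, 0, 0, 4],
], dtype=np.float32))

print("Co-occurring pair:    ", umass_coherence(counts, [0, 1]))
print("Non-co-occurring pair:", umass_coherence(counts, [0, 2]))
```

To compare candidate values of num_topics, you could average this score over all topics of each fitted model, mapping top words back to column indices through the vocabulary.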
What’s Next?
Read more examples: See Examples & Applications
Explore other models: Check Fundamentals for seeded, covariate, and embedded variants
Learn advanced techniques: Visit Tutorials
Check the API: Refer to API Reference for detailed documentation
Having Issues?
Check Installation for installation troubleshooting
Read How-To Guides for common tasks
Explore the full API Reference