Poisson Factorization (PF)
The Poisson Factorization (PF) model is the foundational, unsupervised topic model in poisson-topicmodels. It automatically discovers topics without any prior guidance.
When to Use PF
Use Poisson Factorization when:
✓ You want to discover topics without prior knowledge ✓ You have document-term matrices as input ✓ You need interpretable word-topic associations ✓ You want a fast baseline model ✓ You’re exploring new document collections
Consider other models if:
✗ You have prior knowledge about expected topics (→ use SPF) ✗ You have document-level metadata (→ use CPF/CSPF) ✗ You need to estimate author positions (→ use TBIP) ✗ You want to leverage pre-trained embeddings (→ use ETM)
The Model
Generative Process:
For a corpus with D documents and V vocabulary terms:
For each document d:
Draw document-topic intensity: $theta_d sim text{Gamma}(alpha, alpha)^K$
For each word position in document:
Draw topic: $z_n sim text{Discrete}(text{softmax}(theta_d))$
Draw word: $w_n sim text{Discrete}(beta_{z_n})$
For each topic k:
Draw topic-word distribution: $beta_k sim text{Dirichlet}(eta)$
Key Properties:
Unsupervised: No labels or guidance needed
Flexible: Works with any document collection
Interpretable: Topics are directly interpretable as word distributions
Scalable: Mini-batch SVI enables large-scale inference
Example: Basic Usage
import numpy as np
from scipy.sparse import csr_matrix
from poisson_topicmodels import PF
# Prepare data
counts = csr_matrix(np.random.poisson(2, (100, 500)).astype(np.float32))
vocab = np.array([f'word_{i}' for i in range(500)])
# Create model
model = PF(
counts=counts,
vocab=vocab,
num_topics=10,
batch_size=32,
random_seed=42
)
# Train
params = model.train_step(
num_steps=200,
lr=0.01
)
# Extract results
categories, e_theta = model.return_topics() # dominant topic + proportions
beta = model.return_beta() # word-topic DataFrame
top_words = model.return_top_words_per_topic(n=10) # top words per topic
Interpreting Results
Word-Topic Matrix (β):
Access via model.return_beta() — a pd.DataFrame (vocab_size × num_topics)
Each column k represents topic k:
beta = model.return_beta()
topic_5 = beta.iloc[:, 5] # Topic 5 weights over vocabulary
Interpretation: - Higher weight means the word is more associated with the topic - Top words with highest weights characterize the topic
Document Topics (θ):
Access via model.return_topics() — returns (categories, E_theta)
categories, e_theta = model.return_topics()
# categories: dominant topic per document
# e_theta: full document-topic matrix (num_docs × num_topics)
doc_3_topics = e_theta[3, :] # Topic mixture for document 3
dominant = categories[3] # Dominant topic for document 3
Interpretation:
- categories gives the argmax topic for each document
- e_theta[d, k] is the intensity of topic k in document d
Top Words:
Access via model.return_top_words_per_topic(n=10)
top_words = model.return_top_words_per_topic(n=10)
# dict: {topic_id: [word1, word2, ...]}
print(top_words[2])
# ['research', 'data', 'experiment', 'analysis', ...]
Human Interpretation: - Read top ~10-20 words for each topic - Does the topic make sense thematically? - Can you give it a meaningful label?
Example Interpretation Workflow
model = PF(counts, vocab, num_topics=5, batch_size=32)
model.train_step(num_steps=200, lr=0.01)
# Examine each topic
top_words = model.return_top_words_per_topic(n=15)
for topic_id, words in top_words.items():
print(f"\n=== Topic {topic_id} ===")
print(f"Top words: {', '.join(words)}")
# Find documents dominated by this topic
categories, e_theta = model.return_topics()
top_docs = np.argsort(e_theta[:, topic_id])[-3:]
print(f"Top documents: {top_docs}")
Hyperparameter Selection
Number of Topics (K)
Start with 10-20 topics. Adjust based on:
Coherence: Do top words form meaningful themes?
Interpretability: Can you label each topic?
Downstream task: Does it improve your application?
# Try different numbers of topics
for k in [5, 10, 20, 50]:
model = PF(counts, vocab, num_topics=k, batch_size=32)
model.train_step(num_steps=200, lr=0.01)
# Evaluate quality (e.g., via coherence or manual inspection)
Learning Rate (lr)
Controls optimization step size. Default: 0.01
0.001: Very conservative, slow convergence
0.01: Standard, good for most cases
0.1: Aggressive, may overshoot
1.0+: Usually too large, diverges
Recommended: Start with 0.01, adjust if needed
Batch Size
Controls documents per iteration. Default: 32
16/32: Small, noisier gradients, fast iterations
64/128: Medium, balanced gradients, standard
256/512: Large, stable gradients, slower iterations
Recommended: 32-128 for balance
Iterations
How long to train. Default: 100
Monitor loss:
params = model.train_step(
num_steps=200,
lr=0.01,
)
# Then inspect: model.plot_model_loss()
Suggested: Train until loss plateaus (visual inspection)
Training Tips
Use GPU: Set JAX to use GPU for 10-100x speedup
# In Python, set before importing JAX
export JAX_PLATFORMS=gpu
python script.py
Reproducibility: Set random seed
params = model.train_step(num_steps=200, lr=0.01, random_seed=42)
# Same seed → same results (good for research)
Progress Monitoring: Check loss trajectory
model.train_step(num_steps=200, lr=0.01)
model.plot_model_loss() # visualize loss curve
Early stopping: Stop if loss plateaus
# Check loss after training
model.train_step(num_steps=200, lr=0.01)
model.plot_model_loss() # visually inspect convergence
# If loss hasn't plateaued, train more steps
Common Issues and Solutions
Problem: Topics look similar or contain generic words
Solution: - Could be too many topics - reduce K - Improve preprocessing (better stopword removal) - Look at more top words to find differences
Problem: Some topics are all garbage words
Solution: - Preprocess better (remove URLs, unicode artifacts, numbers) - Reduce number of topics - Check document-term matrix for data issues
Problem: Training is slow
Solution:
- Use GPU: export JAX_PLATFORMS=gpu
- Increase batch size (more docs per iteration)
- Reduce vocabulary size (remove rare words)
Problem: Loss not decreasing
Solution: - Increase learning rate (try 0.05-0.1) - Check data: ensure proper document-term matrix format - Try different random seed
Evaluation Metrics
See API Reference for available metrics:
Coherence: Do top words of a topic correlate?
Topic diversity: Are topics distinct?
Example:
coherence_df = model.compute_topic_coherence()
print(f"Average coherence: {coherence_df['coherence'].mean():.3f}")
diversity = model.compute_topic_diversity()
print(f"Topic diversity: {diversity:.3f}")
# Built-in visualizations
model.plot_model_loss() # Training loss curve
model.plot_topic_prevalence() # Topic prevalence bar chart
model.plot_topic_correlation() # Topic similarity heatmap
Next Steps
Add guidance: Use Seeded Models (SPF & Keywords) to incorporate domain knowledge
Model metadata: Try Covariate Models (CPF & CSPF) if you have document attributes
Advanced: Explore Tutorials for advanced topics
API details: See API Reference for full documentation