Seeded Models (SPF & Keywords)
Seeded Poisson Factorization (SPF) extends the basic PF model by incorporating domain knowledge through keyword priors. If you have ideas about what topics should look like, seeding guides the model toward discovering those topics.
When to Use Seeded Models
Use SPF when:
✓ You have prior knowledge about expected topics
✓ You can define a few keywords per expected topic
✓ You want to guide discovery without full supervision
✓ You need interpretable results aligned with expectations
Consider unsupervised PF if:
✗ You have no prior knowledge
✗ You want purely exploratory analysis
✗ You want to avoid bias from expectations
The Model
Extension of PF:
SPF adds keyword guidance via Dirichlet priors:
For seeded topics: Place stronger prior on seed words
For unseeded topics: Use standard prior (same as PF)
Generative Process:
Similar to PF, but topic-word distribution draws from informed priors:
For each topic k:
  - If topic k has seeds:
      β_k ~ Dirichlet(η_seed)   # η_seed has higher values at seed positions
  - Else:
      β_k ~ Dirichlet(η)        # Standard prior
This makes seed words more likely in their designated topics.
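The generative step above can be illustrated with plain NumPy. This is a minimal sketch, not the library's internal code: the toy vocabulary, `base_eta`, `seed_boost`, and the helper `seeded_concentration` are all hypothetical names and values chosen for illustration.

```python
import numpy as np

# Hypothetical toy vocabulary and seed list, for illustration only.
vocab = ["virus", "vaccine", "infection", "climate", "carbon", "market"]
seed_words = ["virus", "vaccine", "infection"]

base_eta = 0.5     # assumed concentration for every word (the "standard" prior)
seed_boost = 10.0  # assumed extra concentration added at seed positions


def seeded_concentration(vocab, seed_words, base_eta, seed_boost):
    """Build a Dirichlet concentration vector with extra mass on seed words."""
    eta = np.full(len(vocab), base_eta, dtype=float)
    seed_set = set(seed_words)
    for j, word in enumerate(vocab):
        if word in seed_set:
            eta[j] += seed_boost
    return eta


eta_seed = seeded_concentration(vocab, seed_words, base_eta, seed_boost)
eta_plain = np.full(len(vocab), base_eta)

rng = np.random.default_rng(0)
beta_seeded = rng.dirichlet(eta_seed)     # one draw of β_k for a seeded topic
beta_unseeded = rng.dirichlet(eta_plain)  # one draw of β_k for an unseeded topic

# The prior mean of a Dirichlet is eta / eta.sum(), so seed words dominate a priori.
prior_mean_seeded = eta_seed / eta_seed.sum()
for word, p in zip(vocab, prior_mean_seeded):
    print(f"{word:10s} prior mean = {p:.3f}")
```

With these assumed values, the three seed words carry most of the prior mass in the seeded topic, while the unseeded topic treats all words symmetrically.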
Basic Usage
from poisson_topicmodels import SPF

# `counts` (document-term count matrix) and `vocab` (vocabulary words)
# must already be defined before this point.

# Define seed words for each topic
seeds = {
    0: ['research', 'data', 'experiment'],   # Science topic
    1: ['president', 'congress', 'vote'],    # Politics topic
    2: ['recipe', 'cooking', 'flavor'],      # Food topic
}

model = SPF(
    counts=counts,
    vocab=vocab,
    keywords=seeds,
    residual_topics=0,   # no unseeded topics
    batch_size=32,
)
params = model.train_step(num_steps=200, lr=0.01, random_seed=42)

# Results are accessed the same way as for PF
top_words = model.return_top_words_per_topic(n=10)
How Seeding Works
Step 1: Define Seeds
Seeds are keywords you want associated with each topic:
seeds = {
    0: ['virus', 'vaccine', 'infection'],     # Medical
    1: ['climate', 'carbon', 'greenhouse'],   # Environment
    2: ['economy', 'trade', 'market'],        # Economics
}
Step 2: Seeds Influence Prior
The model places higher prior probability on seed words:
# Without seeds: all words equally likely a priori
# With seeds: seed words have boosted probability
Step 3: Model Learns
Training combines the informative prior with data:
Data pulls topics toward observed word distributions
Prior pulls topics toward seed words
Result: Topics incorporate seeds + learned patterns
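To make the interplay of prior and data concrete, here is a small sketch that uses the textbook Dirichlet-multinomial posterior mean, (counts + η) / (sum of counts + sum of η). It is an illustration only: the word counts and the boost of 20 are made up, and the library's actual inference procedure is not shown in this guide and may differ.

```python
import numpy as np

vocab = ["virus", "vaccine", "infection", "disease", "patients", "market"]

# Hypothetical word counts attributed to one topic by the data.
word_counts = np.array([40.0, 25.0, 5.0, 60.0, 45.0, 2.0])

eta_plain = np.full(len(vocab), 1.0)  # assumed symmetric prior
eta_seed = eta_plain.copy()
eta_seed[:3] += 20.0                  # assumed boost on the three seed words


def posterior_mean(counts, eta):
    """Posterior mean of a Dirichlet prior updated with multinomial counts."""
    return (counts + eta) / (counts.sum() + eta.sum())


post_plain = posterior_mean(word_counts, eta_plain)
post_seed = posterior_mean(word_counts, eta_seed)

for word, p0, p1 in zip(vocab, post_plain, post_seed):
    print(f"{word:10s} unseeded={p0:.3f}  seeded={p1:.3f}")
```

In this toy setup the rarely observed seed word 'infection' gains probability from the prior, while 'disease' and 'patients' stay prominent because the data support them.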
Step 4: Interpret Results
Top words typically include most seeds, plus additional related words:
Input seeds: ['virus', 'vaccine', 'infection']
Learned top words: ['virus', 'vaccine', 'infection', 'disease',
'patients', 'treatment', 'symptoms', ...]
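When reading a topic's top words, it can help to separate recovered seeds from words the model added on its own. The helper below is a hypothetical convenience function (it is not part of the library) and assumes the top words for a topic are available as a plain list of strings.

```python
def split_top_words(top_words, seed_words):
    """Split a topic's top words into recovered seeds and newly learned words."""
    seed_set = set(seed_words)
    top_set = set(top_words)
    recovered = [w for w in top_words if w in seed_set]
    learned = [w for w in top_words if w not in seed_set]
    missing = [w for w in seed_words if w not in top_set]
    return recovered, learned, missing


# Values taken from the example above.
input_seeds = ["virus", "vaccine", "infection"]
learned_top_words = ["virus", "vaccine", "infection", "disease",
                     "patients", "treatment", "symptoms"]

recovered, learned, missing = split_top_words(learned_top_words, input_seeds)
print("Recovered seeds:", recovered)
print("Learned from data:", learned)
print("Seeds missing from top words:", missing)
```

A non-empty `missing` list is a cue to revisit those seeds.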
Advanced: Seed Strength
Control how strongly seeds influence the model via seed_strength:
# Weak seeding: gentle guidance
model = SPF(counts, vocab, keywords=seeds, residual_topics=0, batch_size=32, seed_strength=1.0)

# Medium seeding: standard (default = 10.0)
model = SPF(counts, vocab, keywords=seeds, residual_topics=0, batch_size=32, seed_strength=10.0)

# Strong seeding: seeds dominate
model = SPF(counts, vocab, keywords=seeds, residual_topics=0, batch_size=32, seed_strength=100.0)
Guidelines:
Lower values (1-5): Seeds as gentle suggestions
Medium values (10-50): Moderate influence (recommended)
High values (100+): Seeds strongly constrain topics
Choose a value based on how much weight you want prior knowledge to carry relative to the data.
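As a rough intuition for these ranges, the sketch below computes how much prior mass lands on the seed words when the strength is added to their concentration. How `seed_strength` enters the model internally is an assumption here, and `vocab_size`, `num_seeds`, and `base_eta` are hypothetical values.

```python
import numpy as np

vocab_size = 1000  # hypothetical vocabulary size
num_seeds = 3      # hypothetical number of seed words in one topic
base_eta = 1.0     # assumed baseline concentration per word


def prior_seed_mass(seed_strength):
    """Share of prior mass on seed words when strength is added at seed positions."""
    eta = np.full(vocab_size, base_eta)
    eta[:num_seeds] += seed_strength
    return eta[:num_seeds].sum() / eta.sum()


for strength in (1.0, 10.0, 100.0):
    mass = prior_seed_mass(strength)
    print(f"seed_strength={strength:>5}: prior mass on seeds = {mass:.3f}")
```

Under this assumption, the same strength value has a weaker effect on a larger vocabulary, so a value that works for one corpus may need adjusting for another.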
Designing Good Seeds
Do’s:
✓ Use 3-10 words per topic (avoid too few or too many)
✓ Use words characteristic of the topic
✓ Use actual vocabulary words from your corpus
✓ Ensure seeds don’t overlap across topics
✓ Choose frequent words (not rare/obscure)
Don’ts:
✗ Don’t use generic/stopwords as seeds
✗ Don’t use words not in your vocabulary
✗ Don’t repeat seeds across topics
✗ Don’t use too many seeds (>20 per topic)
✗ Don’t seed every topic (leave some unseeded for discovery)
Example Good Seeds:
seeds = {
    0: ['neural', 'learning', 'network', 'algorithm'],
    1: ['legislation', 'congress', 'bill', 'committee'],
    2: ['earnings', 'profit', 'revenue', 'dividend'],
}
Example Bad Seeds:
# Bad: Too generic
seeds = {
    0: ['the', 'is', 'and'],   # Stopwords
    1: ['thing', 'stuff'],     # Too generic
}

# Bad: Not in vocabulary
seeds = {
    0: ['xyz123', 'nonexistent_word'],   # Not in vocab
}

# Bad: Overlapping
seeds = {
    0: ['research', 'data'],
    1: ['research', 'experiment'],   # 'research' in both!
}
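Several of these rules can be checked automatically before training. The function below is a hypothetical helper, not part of the library. It assumes seeds are a dict mapping topic ids to word lists and that `vocab` is an iterable of strings. The stopword list is a tiny placeholder.

```python
from collections import Counter

# Tiny illustrative stopword list (an assumption; substitute your own).
STOPWORDS = {"the", "is", "and", "a", "of", "to", "in"}


def check_seeds(seeds, vocab, min_words=3, max_words=10):
    """Return a list of human-readable problems found in a seed dictionary."""
    vocab_set = set(vocab)
    problems = []
    for topic_id, words in seeds.items():
        if not min_words <= len(words) <= max_words:
            problems.append(
                f"topic {topic_id}: {len(words)} seed words "
                f"(guideline is {min_words}-{max_words})"
            )
        for word in words:
            if word not in vocab_set:
                problems.append(f"topic {topic_id}: '{word}' is not in the vocabulary")
            if word in STOPWORDS:
                problems.append(f"topic {topic_id}: '{word}' is a stopword")
    # Each word counts once per topic, so n > 1 means it is shared across topics.
    usage = Counter(word for words in seeds.values() for word in set(words))
    for word, n in usage.items():
        if n > 1:
            problems.append(f"'{word}' is used as a seed in {n} topics")
    return problems


toy_vocab = ["research", "data", "experiment", "president", "congress", "vote", "the"]
good_seeds = {0: ["research", "data", "experiment"], 1: ["president", "congress", "vote"]}
bad_seeds = {0: ["research", "data"], 1: ["research", "the", "xyz123"]}

print(check_seeds(good_seeds, toy_vocab))
for problem in check_seeds(bad_seeds, toy_vocab):
    print(problem)
```

The check cannot judge whether a word is characteristic of its topic; that still needs a human read.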
Mixing Seeded and Unseeded Topics
You can seed only some topics:
# Topics 0 and 1 are seeded, topic 2 is discovered freely
seeds = {
    0: ['virus', 'vaccine', 'infection'],
    1: ['climate', 'carbon', 'warming'],
    # Topic 2 has no seeds - discovered from data
}

model = SPF(
    counts=counts,
    vocab=vocab,
    keywords=seeds,
    residual_topics=1,   # one additional unseeded topic
    batch_size=32,
)
params = model.train_step(num_steps=200, lr=0.01)
Use case: When you have ideas about some topics but want other topics discovered.
Iterative Seeding
1. Train an unsupervised PF model
2. Inspect the top words and identify coherent topics
3. Design seeds based on those top words
4. Train SPF with those seeds
5. Compare results and refine the seeds if needed
from poisson_topicmodels import PF, SPF

# Step 1: Unsupervised discovery
pf_model = PF(counts, vocab, num_topics=5, batch_size=32)
pf_model.train_step(num_steps=200, lr=0.01)

# Step 2: Inspect and design seeds
top_words_pf = pf_model.return_top_words_per_topic(n=10)
print("Top words from unsupervised model:")
for topic_id, words in top_words_pf.items():
    print(f"Topic {topic_id}: {', '.join(words)}")

# Step 3: Define seeds based on patterns
seeds = {
    0: list(top_words_pf[0][:5]),
    1: list(top_words_pf[1][:5]),
}

# Step 4: Train seeded model (2 seeded + 3 residual topics)
spf_model = SPF(counts, vocab, keywords=seeds, residual_topics=3, batch_size=32)
spf_model.train_step(num_steps=200, lr=0.01)

# Step 5: Compare and evaluate
top_words_spf = spf_model.return_top_words_per_topic(n=10)
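For the comparison step, one simple option is to match topics across the two models by the overlap of their top words, since topic indices need not line up between runs. This is a sketch using Jaccard similarity. The helpers `jaccard` and `best_matches` are hypothetical, and the toy dictionaries stand in for the `{topic_id: [words]}` output that the loops above assume.

```python
def jaccard(words_a, words_b):
    """Jaccard similarity between two word lists (0 = disjoint, 1 = identical sets)."""
    a, b = set(words_a), set(words_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def best_matches(top_words_a, top_words_b):
    """For each topic in A, find the most similar topic in B and its score."""
    matches = {}
    for id_a, words_a in top_words_a.items():
        scores = {id_b: jaccard(words_a, words_b)
                  for id_b, words_b in top_words_b.items()}
        best_id = max(scores, key=scores.get)
        matches[id_a] = (best_id, scores[best_id])
    return matches


# Toy stand-ins for the PF and SPF top-word dictionaries.
toy_pf = {
    0: ["virus", "vaccine", "disease", "patients"],
    1: ["market", "trade", "stock", "price"],
}
toy_spf = {
    0: ["market", "trade", "economy", "stock"],
    1: ["virus", "vaccine", "infection", "disease"],
}

for pf_id, (spf_id, score) in best_matches(toy_pf, toy_spf).items():
    print(f"PF topic {pf_id} best matches SPF topic {spf_id} (Jaccard = {score:.2f})")
```

Low best-match scores suggest that seeding changed a topic substantially, which is worth inspecting by hand before refining the seeds.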
Practical Example
Seeding a corpus of news articles:
from poisson_topicmodels import SPF

# Define themes you expect in news
news_seeds = {
    0: ['election', 'vote', 'candidate', 'campaign'],   # Politics
    1: ['stock', 'market', 'trade', 'investment'],      # Business
    2: ['hurricane', 'flood', 'weather', 'storm'],      # Weather
    3: ['covid', 'virus', 'pandemic', 'vaccine'],       # Health
}

model = SPF(
    counts=counts,
    vocab=vocab,
    keywords=news_seeds,
    residual_topics=0,
    batch_size=64,
)
params = model.train_step(num_steps=200, lr=0.01, random_seed=42)

# Expected: Topics strongly align with seed themes
# but include additional related words from data
model.summary()
top_words = model.return_top_words_per_topic(n=15)
for topic_id, words in top_words.items():
    print(f"Topic {topic_id}: {', '.join(words)}")

# Visualize how well seeds influenced their topics
model.plot_seed_effectiveness()
Troubleshooting Seeds
Problem: Seeds don’t appear in top words
Solution:
- Check that every seed is in the vocabulary: all(word in vocab for words in seeds.values() for word in words)
- Increase seed_strength
- Ensure seed words actually appear in documents
- Check seed words aren’t too rare
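The last two checks can be done directly on the count matrix. The function below is a hypothetical helper, not a library call. It assumes `counts` is a documents-by-words matrix (a dense NumPy array or a SciPy sparse matrix) whose columns follow the order of `vocab`.

```python
import numpy as np


def seed_frequencies(counts, vocab, seeds):
    """Map (topic_id, word) to (total count, document frequency), or None if out of vocab."""
    word_to_col = {word: j for j, word in enumerate(vocab)}
    total = np.asarray(counts.sum(axis=0)).ravel()           # occurrences per word
    doc_freq = np.asarray((counts > 0).sum(axis=0)).ravel()  # documents containing each word
    report = {}
    for topic_id, words in seeds.items():
        for word in words:
            j = word_to_col.get(word)
            if j is None:
                report[(topic_id, word)] = None
            else:
                report[(topic_id, word)] = (int(total[j]), int(doc_freq[j]))
    return report


# Toy corpus: 3 documents, 3 vocabulary words.
toy_counts = np.array([[2, 0, 1],
                       [0, 0, 3],
                       [1, 0, 0]])
toy_vocab = ["virus", "vaccine", "market"]
toy_seeds = {0: ["virus", "vaccine", "infection"]}

for (topic_id, word), stats in seed_frequencies(toy_counts, toy_vocab, toy_seeds).items():
    print(topic_id, word, stats)
```

A result of `None` means the word is missing from the vocabulary, and `(0, 0)` means it is in the vocabulary but never occurs in the corpus.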
Problem: Non-seeded topics disappear
Solution:
- Reduce seed strength
- Use fewer seeds per topic
- Ensure sufficient data per topic
Problem: Seeds make topics less coherent
Solution:
- Your seeds might not match data patterns
- Review actual top words from unsupervised PF
- Design seeds that align with data
Validation
How to validate seeded models:
from poisson_topicmodels import PF

# 1. Check top words include seeds
top_words = model.return_top_words_per_topic(n=20)
for topic_id, words in top_words.items():
    if topic_id in news_seeds:
        topic_seeds = [s for s in news_seeds[topic_id] if s in words]
        coverage = len(topic_seeds) / len(news_seeds[topic_id])
        print(f"Topic {topic_id} seed coverage: {coverage:.1%}")

# 2. Measure coherence
coherence_df = model.compute_topic_coherence()
print(f"Average coherence: {coherence_df['coherence'].mean():.3f}")

# 3. Visualize seed effectiveness
model.plot_seed_effectiveness()

# 4. Compare with unsupervised (same number of topics as the seeded model)
pf_model = PF(counts, vocab, num_topics=4, batch_size=32)
pf_model.train_step(num_steps=200, lr=0.01, random_seed=42)
pf_coherence = pf_model.compute_topic_coherence()
print(f"PF coherence: {pf_coherence['coherence'].mean():.3f} vs "
      f"SPF: {coherence_df['coherence'].mean():.3f}")
Comparison with Unsupervised
| Aspect | PF (Unsupervised) | SPF (Seeded) |
|---|---|---|
| Prior knowledge | Not used | Used as priors |
| Bias | None | Toward seeds |
| Interpretability | Variable | Usually better |
| Time to insights | Requires reading top words | Fast (seeds guide) |
| Flexibility | High | Guided |
Next Steps
Covariate Models (CPF & CSPF) - Add metadata to models
How-To Guides - Practical guides
API Reference - SPF API reference