Seeded Models (SPF & Keywords)

Seeded Poisson Factorization (SPF) extends the basic PF model by incorporating domain knowledge through keyword priors. If you have ideas about what topics should look like, seeding guides the model toward discovering those topics.

When to Use Seeded Models

Use SPF when:

✓ You have prior knowledge about expected topics
✓ You can define a few keywords per expected topic
✓ You want to guide discovery without full supervision
✓ You need interpretable results aligned with expectations

Consider unsupervised PF if:

✗ You have no prior knowledge
✗ You want purely exploratory analysis
✗ You want to avoid bias from expectations

The Model

Extension of PF:

SPF adds keyword guidance via Dirichlet priors:

For seeded topics: Place stronger prior on seed words
For unseeded topics: Use standard prior (same as PF)

Generative Process:

Similar to PF, but topic-word distribution draws from informed priors:

For each topic k:
- If topic k has seeds:
    β_k ~ Dirichlet(η_seed)  # η_seed has higher values at seed positions
- Else:
    β_k ~ Dirichlet(η)       # Standard prior

This makes seed words more likely in their designated topics.
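
As a rough illustration of the informed prior, the sketch below builds η_seed for a toy vocabulary and draws one topic from it using only the standard library. The names build_eta and sample_dirichlet, and the base and boost values, are hypothetical and not part of poisson_topicmodels; the library's internal parameterization may differ.

```python
import random

def build_eta(vocab, seed_words, base=0.1, boost=5.0):
    """Return a Dirichlet concentration vector with extra mass on seed words.

    `base` and `boost` are illustrative values, not library defaults.
    """
    seed_set = set(seed_words)
    return [base + boost if word in seed_set else base for word in vocab]

def sample_dirichlet(eta, rng):
    """Draw from Dirichlet(eta) by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(alpha, 1.0) for alpha in eta]
    total = sum(draws)
    return [d / total for d in draws]

vocab = ["virus", "vaccine", "market", "trade", "storm", "flood"]
eta_seed = build_eta(vocab, seed_words=["virus", "vaccine"])  # seeded topic
eta_plain = build_eta(vocab, seed_words=[])                   # unseeded topic

# The prior mean of word v is eta[v] / sum(eta)
prior_mean_seed = [a / sum(eta_seed) for a in eta_seed]
prior_mean_plain = [a / sum(eta_plain) for a in eta_plain]

rng = random.Random(0)
beta_k = sample_dirichlet(eta_seed, rng)  # one draw of a seeded topic

for word, seeded, plain in zip(vocab, prior_mean_seed, prior_mean_plain):
    print(f"{word:8s} seeded prior mean={seeded:.3f}  plain prior mean={plain:.3f}")
```

Under the unseeded prior every word has the same prior mean, while under η_seed most of the prior mass sits on the two seed words.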

Basic Usage

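from poisson_topicmodels import SPF

# counts (document-term count matrix) and vocab (its vocabulary) are assumed
# to have been built from your corpus beforehand.

# Define seed words for each topic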
seeds = {
    0: ['research', 'data', 'experiment'],        # Science topic
    1: ['president', 'congress', 'vote'],         # Politics topic
    2: ['recipe', 'cooking', 'flavor'],           # Food topic
}

model = SPF(
    counts=counts,
    vocab=vocab,
    keywords=seeds,
    residual_topics=0,
    batch_size=32,
)

params = model.train_step(num_steps=200, lr=0.01, random_seed=42)

# Results similar to PF
top_words = model.return_top_words_per_topic(n=10)

How Seeding Works

Step 1: Define Seeds

Seeds are keywords you want associated with each topic:

seeds = {
    0: ['virus', 'vaccine', 'infection'],    # Medical
    1: ['climate', 'carbon', 'greenhouse'],   # Environment
    2: ['economy', 'trade', 'market'],        # Economics
}

Step 2: Seeds Influence Prior

The model places higher prior probability on seed words:

# Without seeds: all words equally likely a priori
# With seeds: seed words have boosted probability

Step 3: Model Learns

Training combines the informative prior with data:

Data pulls topics toward observed word distributions
Prior pulls topics toward seed words
Result: Topics incorporate seeds + learned patterns
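
The pull between prior and data can be made concrete with a conjugate Dirichlet-multinomial toy calculation, where the posterior mean of word v is (η_v + n_v) / Σ(η + n). This is only an analogy for intuition, not the inference procedure SPF runs; the function name and all numbers are made up for illustration.

```python
def posterior_mean(eta, word_counts):
    """Posterior mean of a Dirichlet prior after observing multinomial counts."""
    total = sum(eta) + sum(word_counts)
    return [(a + n) / total for a, n in zip(eta, word_counts)]

vocab = ["virus", "vaccine", "disease", "market"]
eta_seed = [10.0, 10.0, 0.1, 0.1]  # 'virus' and 'vaccine' are seeded

# Little data: the prior dominates and a seed word stays on top
few = posterior_mean(eta_seed, [1, 0, 3, 0])

# Much more data: observed counts dominate and 'disease' overtakes the seeds
many = posterior_mean(eta_seed, [40, 20, 90, 5])

top_few = vocab[few.index(max(few))]
top_many = vocab[many.index(max(many))]
print("top word with little data:", top_few)
print("top word with more data:  ", top_many)
```

With only four observed tokens the seeded word still ranks first; once the counts grow, a frequently observed non-seed word moves ahead of the seeds.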

Step 4: Interpret Results

Top words typically include most seeds, plus additional related words:

Input seeds: ['virus', 'vaccine', 'infection']

Learned top words: ['virus', 'vaccine', 'infection', 'disease',
                    'patients', 'treatment', 'symptoms', ...]

Advanced: Seed Strength

Control how strongly seeds influence the model via seed_strength:

# Same constructor arguments as in Basic Usage, plus seed_strength

# Weak seeding: gentle guidance
model = SPF(counts, vocab, keywords=seeds, residual_topics=0, batch_size=32, seed_strength=1.0)

# Medium seeding: standard (default = 10.0)
model = SPF(counts, vocab, keywords=seeds, residual_topics=0, batch_size=32, seed_strength=10.0)

# Strong seeding: seeds dominate
model = SPF(counts, vocab, keywords=seeds, residual_topics=0, batch_size=32, seed_strength=100.0)

Guidelines:

Lower values (1-5): Seeds as gentle suggestions
Medium values (10-50): Moderate influence (recommended)
High values (100+): Seeds strongly constrain topics

Choose a value based on the balance you want between prior knowledge and data.
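
To get a feel for these ranges, the sketch below computes the share of prior mass that falls on the seed words for several strengths. It assumes, purely for illustration, that the strength is added to a base concentration at seed positions; how seed_strength actually enters the model is not specified in this guide, and seed_prior_mass is a hypothetical helper.

```python
def seed_prior_mass(vocab_size, num_seeds, seed_strength, base=1.0):
    """Fraction of total prior concentration sitting on the seed words.

    Assumes an additive boost at seed positions (an illustrative choice).
    """
    seed_mass = num_seeds * (base + seed_strength)
    other_mass = (vocab_size - num_seeds) * base
    return seed_mass / (seed_mass + other_mass)

# 3 seed words in a 1,000-word vocabulary
for strength in (1.0, 10.0, 100.0):
    share = seed_prior_mass(vocab_size=1000, num_seeds=3, seed_strength=strength)
    print(f"seed_strength={strength:>5}: {share:.1%} of prior mass on seeds")
```

The share grows with the strength but also depends on vocabulary size, so a value that works for a small vocabulary may be too weak for a large one.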

Designing Good Seeds

Do’s:

✓ Use 3-10 words per topic (avoid too few or too many)
✓ Use words characteristic of the topic
✓ Use actual vocabulary words from your corpus
✓ Ensure seeds don’t overlap across topics
✓ Choose frequent words (not rare/obscure)

Don’ts:

✗ Don’t use generic/stopwords as seeds
✗ Don’t use words not in your vocabulary
✗ Don’t repeat seeds across topics
✗ Don’t use too many seeds (>20 per topic)
✗ Don’t seed every topic (leave some unseeded for discovery)

Example Good Seeds:

seeds = {
    0: ['neural', 'learning', 'network', 'algorithm'],
    1: ['legislation', 'congress', 'bill', 'committee'],
    2: ['earnings', 'profit', 'revenue', 'dividend'],
}

Example Bad Seeds:

# Bad: Too generic
seeds = {
    0: ['the', 'is', 'and'],  # Stopwords
    1: ['thing', 'stuff'],    # Too generic
}

# Bad: Not in vocabulary
seeds = {
    0: ['xyz123', 'nonexistent_word'],  # Not in vocab
}

# Bad: Overlapping
seeds = {
    0: ['research', 'data'],
    1: ['research', 'experiment'],  # 'research' in both!
}
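
The do's and don'ts above can be checked mechanically before training. The helper below is a minimal sketch using only the standard library; check_seeds, its thresholds, and the toy vocabulary are hypothetical and not part of poisson_topicmodels.

```python
def check_seeds(seeds, vocab, stopwords=frozenset(), min_words=3, max_words=10):
    """Return a list of human-readable problems found in a seed dictionary."""
    vocab_set = set(vocab)
    problems = []
    owner = {}  # seed word -> first topic that used it
    for topic_id, words in seeds.items():
        if not min_words <= len(words) <= max_words:
            problems.append(
                f"topic {topic_id}: {len(words)} seeds (want {min_words}-{max_words})"
            )
        for word in words:
            if word not in vocab_set:
                problems.append(f"topic {topic_id}: '{word}' not in vocabulary")
            if word in stopwords:
                problems.append(f"topic {topic_id}: '{word}' is a stopword")
            if word in owner and owner[word] != topic_id:
                problems.append(f"'{word}' appears in topics {owner[word]} and {topic_id}")
            owner.setdefault(word, topic_id)
    return problems

vocab = ["research", "data", "experiment", "the", "congress", "vote", "bill"]
bad_seeds = {
    0: ["research", "data", "the"],          # contains a stopword
    1: ["research", "congress", "xyz123"],   # overlap + out-of-vocabulary word
}
problems = check_seeds(bad_seeds, vocab, stopwords={"the", "is", "and"})
for problem in problems:
    print(problem)
```

An empty list means none of these particular checks failed; it does not guarantee that the seeds match patterns in the data.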

Mixing Seeded and Unseeded Topics

You can seed only some topics:

# Topics 0 and 1 are seeded; topic 2 is discovered freely
seeds = {
    0: ['virus', 'vaccine', 'infection'],
    1: ['climate', 'carbon', 'warming'],
    # Topic 2 has no seeds - discovered from data
}

model = SPF(
    counts=counts,
    vocab=vocab,
    keywords=seeds,
    residual_topics=1,
    batch_size=32,
)

params = model.train_step(num_steps=200, lr=0.01)

Use case: When you have ideas about some topics but want other topics discovered.
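
The examples in this guide suggest that seeded topics come first and residual topics follow, so the total number of topics is len(seeds) + residual_topics. The helper below only spells out that bookkeeping; topic_layout is a hypothetical name, and the index ordering is an assumption drawn from the example above rather than a documented guarantee.

```python
def topic_layout(seeds, residual_topics):
    """Label each topic index as seeded or residual.

    Assumes seed keys are 0..len(seeds)-1 and residual topics are appended.
    """
    layout = {topic_id: "seeded" for topic_id in sorted(seeds)}
    start = len(seeds)
    for offset in range(residual_topics):
        layout[start + offset] = "residual"
    return layout

seeds = {
    0: ["virus", "vaccine", "infection"],
    1: ["climate", "carbon", "warming"],
}
layout = topic_layout(seeds, residual_topics=1)
for topic_id, kind in layout.items():
    print(f"Topic {topic_id}: {kind}")
```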

Iterative Seeding

1. Train an unsupervised PF model
2. Inspect top words and identify coherent topics
3. Design seeds based on those top words
4. Train SPF with those seeds
5. Compare results and refine seeds if needed

from poisson_topicmodels import PF, SPF

# Step 1: Unsupervised discovery
pf_model = PF(counts, vocab, num_topics=5, batch_size=32)
pf_model.train_step(num_steps=200, lr=0.01)

# Step 2: Inspect and design seeds
top_words_pf = pf_model.return_top_words_per_topic(n=10)
print("Top words from unsupervised model:")
for topic_id, words in top_words_pf.items():
    print(f"Topic {topic_id}: {', '.join(words)}")

# Step 3: Define seeds based on patterns
seeds = {
    0: list(top_words_pf[0][:5]),
    1: list(top_words_pf[1][:5]),
}

# Step 4: Train seeded model
spf_model = SPF(counts, vocab, keywords=seeds, residual_topics=3, batch_size=32)
spf_model.train_step(num_steps=200, lr=0.01)

# Step 5: Compare and evaluate
top_words_spf = spf_model.return_top_words_per_topic(n=10)
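
One simple way to do the comparison in step 5 is to match topics across the two models by the overlap of their top words. The sketch below assumes the top-word results are dictionaries mapping a topic id to a list of words, as the loops in this guide suggest; jaccard and best_matches are hypothetical helpers, and the toy dictionaries stand in for real model output.

```python
def jaccard(words_a, words_b):
    """Jaccard similarity between two collections of words."""
    a, b = set(words_a), set(words_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def best_matches(top_words_a, top_words_b):
    """For each topic in A, find the most similar topic in B by Jaccard overlap."""
    matches = {}
    for topic_a, words_a in top_words_a.items():
        scores = {
            topic_b: jaccard(words_a, words_b)
            for topic_b, words_b in top_words_b.items()
        }
        best = max(scores, key=scores.get)
        matches[topic_a] = (best, scores[best])
    return matches

# Toy stand-ins for the PF and SPF top-word dictionaries
top_words_pf = {
    0: ["virus", "vaccine", "disease", "patients"],
    1: ["market", "trade", "stock", "bank"],
}
top_words_spf = {
    0: ["market", "stock", "trade", "economy"],
    1: ["virus", "vaccine", "infection", "disease"],
}

matches = best_matches(top_words_pf, top_words_spf)
for pf_topic, (spf_topic, score) in matches.items():
    print(f"PF topic {pf_topic} ~ SPF topic {spf_topic} (Jaccard {score:.2f})")
```

Topic indices need not line up between runs, which is why the sketch matches by content rather than by id.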

Practical Example

Seeding a corpus of news articles:

from poisson_topicmodels import SPF

# Define themes you expect in news
news_seeds = {
    0: ['election', 'vote', 'candidate', 'campaign'],  # Politics
    1: ['stock', 'market', 'trade', 'investment'],     # Business
    2: ['hurricane', 'flood', 'weather', 'storm'],     # Weather
    3: ['covid', 'virus', 'pandemic', 'vaccine'],      # Health
}

model = SPF(
    counts=counts,
    vocab=vocab,
    keywords=news_seeds,
    residual_topics=0,
    batch_size=64,
)

params = model.train_step(num_steps=200, lr=0.01, random_seed=42)

# Expected: Topics strongly align with seed themes
# but include additional related words from data
model.summary()
top_words = model.return_top_words_per_topic(n=15)
for topic_id, words in top_words.items():
    print(f"Topic {topic_id}: {', '.join(words)}")

# Visualize how well seeds influenced their topics
model.plot_seed_effectiveness()

Troubleshooting Seeds

Problem: Seeds don’t appear in top words

Solution:
- Check that seeds are in the vocabulary: [w for words in seeds.values() for w in words if w not in vocab] should be empty
- Increase seed_strength
- Ensure seed words actually appear in documents
- Check that seed words aren’t too rare
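
To check whether seed words appear in documents and how rare they are, you can count the number of documents containing each seed. The sketch below uses a plain list-of-lists count matrix for illustration; seed_document_frequency is a hypothetical helper, and a real counts object (for example a sparse matrix) would need a different access pattern.

```python
def seed_document_frequency(counts, vocab, seeds):
    """Count, for each seed word, how many documents contain it at least once.

    Returns None for seed words that are missing from the vocabulary.
    """
    index = {word: i for i, word in enumerate(vocab)}
    report = {}
    for topic_id, words in seeds.items():
        for word in words:
            if word not in index:
                report[(topic_id, word)] = None
            else:
                col = index[word]
                report[(topic_id, word)] = sum(1 for row in counts if row[col] > 0)
    return report

# Toy document-term matrix: rows are documents, columns follow vocab order
vocab = ["virus", "vaccine", "climate", "carbon"]
counts = [
    [2, 1, 0, 0],
    [0, 0, 3, 0],
    [1, 0, 0, 0],
]
seeds = {0: ["virus", "vaccine", "infection"], 1: ["climate", "carbon"]}

report = seed_document_frequency(counts, vocab, seeds)
for (topic_id, word), doc_freq in report.items():
    status = "not in vocabulary" if doc_freq is None else f"in {doc_freq} document(s)"
    print(f"Topic {topic_id} seed '{word}': {status}")
```

Seeds that are missing or that occur in very few documents are candidates for replacement with more frequent, topic-specific words.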

Problem: Non-seeded topics disappear

Solution:
- Reduce seed strength
- Use fewer seeds per topic
- Ensure sufficient data per topic

Problem: Seeds make topics less coherent

Solution:
- Your seeds might not match data patterns
- Review actual top words from unsupervised PF
- Design seeds that align with data

Validation

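How to validate a seeded model (this continues from the practical example above, so model and news_seeds are already defined):

from poisson_topicmodels import PF  # used for the comparison in step 4

# 1. Check that top words include the seeds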
top_words = model.return_top_words_per_topic(n=20)
for topic_id, words in top_words.items():
    if topic_id in news_seeds:
        topic_seeds = [s for s in news_seeds[topic_id] if s in words]
        coverage = len(topic_seeds) / len(news_seeds[topic_id])
        print(f"Topic {topic_id} seed coverage: {coverage:.1%}")

# 2. Measure coherence
coherence_df = model.compute_topic_coherence()
print(f"Average coherence: {coherence_df['coherence'].mean():.3f}")

# 3. Visualize seed effectiveness
model.plot_seed_effectiveness()

# 4. Compare with unsupervised
pf_model = PF(counts, vocab, num_topics=4, batch_size=32)
pf_model.train_step(num_steps=200, lr=0.01, random_seed=42)
pf_coherence = pf_model.compute_topic_coherence()
print(f"PF coherence: {pf_coherence['coherence'].mean():.3f} vs "
      f"SPF: {coherence_df['coherence'].mean():.3f}")

Comparison with Unsupervised

PF vs SPF Comparison
Aspect	PF (Unsupervised)	SPF (Seeded)
Prior knowledge	Not used	Used as priors
Bias from expectations	None	Toward seeds
Interpretability	Variable	Usually better
Time to insights	Slower (requires reading top words)	Fast (seeds guide)
Flexibility	High	Guided

Next Steps

Covariate Models (CPF & CSPF) - Add metadata to models
How-To Guides - Practical guides
API Reference - SPF API reference