.. _seeded_models: ================================================================================ Seeded Models (SPF & Keywords) ================================================================================ **Seeded Poisson Factorization (SPF)** extends the basic PF model by incorporating domain knowledge through **keyword priors**. If you have ideas about what topics should look like, seeding guides the model toward discovering those topics. When to Use Seeded Models ========================== Use SPF when: ✓ You have prior knowledge about expected topics ✓ You can define a few keywords per expected topic ✓ You want to guide discovery without full supervision ✓ You need interpretable results aligned with expectations Consider unsupervised PF if: ✗ You have no prior knowledge ✗ You want purely exploratory analysis ✗ You want to avoid bias from expectations The Model ========= **Extension of PF**: SPF adds keyword guidance via `Dirichlet priors `_: - For seeded topics: Place stronger prior on seed words - For unseeded topics: Use standard prior (same as PF) **Generative Process**: Similar to PF, but topic-word distribution draws from informed priors: .. code-block:: text For each topic k: - If topic k has seeds: β_k ~ Dirichlet(η_seed) # η_seed has higher values at seed positions - Else: β_k ~ Dirichlet(η) # Standard prior This makes seed words more likely in their designated topics. Basic Usage =========== .. code-block:: python from poisson_topicmodels import SPF import numpy as np # Define seed words for each topic seeds = { 0: ['research', 'data', 'experiment'], # Science topic 1: ['president', 'congress', 'vote'], # Politics topic 2: ['recipe', 'cooking', 'flavor'], # Food topic } model = SPF( counts=counts, vocab=vocab, keywords=seeds, residual_topics=0, batch_size=32, ) params = model.train_step(num_steps=200, lr=0.01, random_seed=42) # Results similar to PF top_words = model.return_top_words_per_topic(n=10) How Seeding Works ================= **Step 1: Define Seeds** Seeds are keywords you want associated with each topic: .. code-block:: python seeds = { 0: ['virus', 'vaccine', 'infection'], # Medical 1: ['climate', 'carbon', 'greenhouse'], # Environment 2: ['economy', 'trade', 'market'], # Economics } **Step 2: Seeds Influence Prior** The model places higher prior probability on seed words: .. code-block:: python # Without seeds: all words equally likely a priori # With seeds: seed words have boosted probability **Step 3: Model Learns** Training combines the informative prior with data: - Data pulls topics toward observed word distributions - Prior pulls topics toward seed words - Result: Topics incorporate seeds + learned patterns **Step 4: Interpret Results** Top words typically include most seeds, plus additional related words: .. code-block:: text Input seeds: ['virus', 'vaccine', 'infection'] Learned top words: ['virus', 'vaccine', 'infection', 'disease', 'patients', 'treatment', 'symptoms', ...] Advanced: Seed Strength ======================== Control how strongly seeds influence the model via `seed_strength`: .. code-block:: python # Weak seeding: gentle guidance model = SPF(counts, vocab, num_topics=3, seeds=seeds, seed_strength=1.0) # Medium seeding: standard (default = 10.0) model = SPF(counts, vocab, num_topics=3, seeds=seeds, seed_strength=10.0) # Strong seeding: seeds dominate model = SPF(counts, vocab, num_topics=3, seeds=seeds, seed_strength=100.0) Guidelines: - **Lower values** (1-5): Seeds as gentle suggestions - **Medium values** (10-50): Moderate influence (recommended) - **High values** (100+): Seeds strongly constrain topics Choose based on balance desired between prior knowledge and data. Designing Good Seeds ==================== **Do's**: ✓ Use 3-10 words per topic (avoid too few or too many) ✓ Use words characteristic of the topic ✓ Use actual vocabulary words from your corpus ✓ Ensure seeds don't overlap across topics ✓ Choose frequent words (not rare/obscure) **Don'ts**: ✗ Don't use generic/stopwords as seeds ✗ Don't use words not in your vocabulary ✗ Don't repeat seeds across topics ✗ Don't use too many seeds (>20 per topic) ✗ Don't seed every topic (leave some unseeded for discovery) Example Good Seeds: .. code-block:: python seeds = { 0: ['neural', 'learning', 'network', 'algorithm'], 1: ['legislation', 'congress', 'bill', 'committee'], 2: ['earnings', 'profit', 'revenue', 'dividend'], } Example Bad Seeds: .. code-block:: python # Bad: Too generic seeds = { 0: ['the', 'is', 'and'], # Stopwords 1: ['thing', 'stuff'], # Too generic } # Bad: Not in vocabulary seeds = { 0: ['xyz123', 'nonexistent_word'], # Not in vocab } # Bad: Overlapping seeds = { 0: ['research', 'data'], 1: ['research', 'experiment'], # 'research' in both! } Mixing Seeded and Unseeded Topics ================================== You can seed only some topics: .. code-block:: python # Topic 0 and 1 are seeded, topic 2 is discovered freely seeds = { 0: ['virus', 'vaccine', 'infection'], 1: ['climate', 'carbon', 'warming'], # Topic 2 has no seeds - discovered from data } model = SPF( counts=counts, vocab=vocab, keywords=seeds, residual_topics=1, batch_size=32, ) params = model.train_step(num_steps=200, lr=0.01) **Use case**: When you have ideas about some topics but want other topics discovered. Iterative Seeding ================= 1. Train unsupervised PF model 2. Inspect top words - identify coherent topics 3. Design seeds based on top words 4. Train SPF with those seeds 5. Compare results and refine seeds if needed .. code-block:: python # Step 1: Unsupervised discovery pf_model = PF(counts, vocab, num_topics=5, batch_size=32) pf_model.train_step(num_steps=200, lr=0.01) # Step 2: Inspect and design seeds top_words_pf = pf_model.return_top_words_per_topic(n=10) print("Top words from unsupervised model:") for topic_id, words in top_words_pf.items(): print(f"Topic {topic_id}: {', '.join(words)}") # Step 3: Define seeds based on patterns seeds = { 0: list(top_words_pf[0][:5]), 1: list(top_words_pf[1][:5]), } # Step 4: Train seeded model spf_model = SPF(counts, vocab, keywords=seeds, residual_topics=3, batch_size=32) spf_model.train_step(num_steps=200, lr=0.01) # Step 5: Compare and evaluate top_words_spf = spf_model.return_top_words_per_topic(n=10) Practical Example ================= Seeding a corpus of news articles: .. code-block:: python from poisson_topicmodels import SPF # Define themes you expect in news news_seeds = { 0: ['election', 'vote', 'candidate', 'campaign'], # Politics 1: ['stock', 'market', 'trade', 'investment'], # Business 2: ['hurricane', 'flood', 'weather', 'storm'], # Weather 3: ['covid', 'virus', 'pandemic', 'vaccine'], # Health } model = SPF( counts=counts, vocab=vocab, keywords=news_seeds, residual_topics=0, batch_size=64, ) params = model.train_step(num_steps=200, lr=0.01, random_seed=42) # Expected: Topics strongly align with seed themes # but include additional related words from data model.summary() top_words = model.return_top_words_per_topic(n=15) for topic_id, words in top_words.items(): print(f"Topic {topic_id}: {', '.join(words)}") # Visualize how well seeds influenced their topics model.plot_seed_effectiveness() Troubleshooting Seeds ===================== **Problem**: Seeds don't appear in top words *Solution*: - Check seeds are in vocabulary: ``vocab in [word in seed for word in seeds]`` - Increase seed_strength - Ensure seed words actually appear in documents - Check seed words aren't too rare **Problem**: Non-seeded topics disappear *Solution*: - Reduce seed strength - Use fewer seeds per topic - Ensure sufficient data per topic **Problem**: Seeds make topics less coherent *Solution*: - Your seeds might not match data patterns - Review actual top words from unsupervised PF - Design seeds that align with data Validation ========== How to validate seeded models: .. code-block:: python # 1. Check top words include seeds top_words = model.return_top_words_per_topic(n=20) for topic_id, words in top_words.items(): if topic_id in news_seeds: topic_seeds = [s for s in news_seeds[topic_id] if s in words] coverage = len(topic_seeds) / len(news_seeds[topic_id]) print(f"Topic {topic_id} seed coverage: {coverage:.1%}") # 2. Measure coherence coherence_df = model.compute_topic_coherence() print(f"Average coherence: {coherence_df['coherence'].mean():.3f}") # 3. Visualize seed effectiveness model.plot_seed_effectiveness() # 4. Compare with unsupervised pf_model = PF(counts, vocab, num_topics=4, batch_size=32) pf_model.train_step(num_steps=200, lr=0.01, random_seed=42) pf_coherence = pf_model.compute_topic_coherence() print(f"PF coherence: {pf_coherence['coherence'].mean():.3f} vs " f"SPF: {coherence_df['coherence'].mean():.3f}") Comparison with Unsupervised ============================= .. list-table:: PF vs SPF Comparison :widths: 25 35 35 :header-rows: 1 * - Aspect - PF (Unsupervised) - SPF (Seeded) * - Prior knowledge needed? - Not used - Used as priors * - Bias? - None - Toward seeds * - Interpretability - Variable - Usually better * - Time to insights - Requires reading top words - Fast (seeds guide) * - Flexibility - High - Guided Next Steps ========== - :doc:`covariate_models` - Add metadata to models - :doc:`../how_to_guides/index` - Practical guides - :doc:`../api/index` - SPF API reference