Fundamentals
This section covers the core concepts and models in poisson-topicmodels, providing a deeper understanding of topic modeling and the different model variants available.
Overview of Available Models
The poisson-topicmodels package provides several related models addressing different use cases:
Unsupervised Baseline
Poisson Factorization (PF): Discover topics without guidance or external information
Guided Discovery
Seeded PF (SPF): Incorporate domain knowledge through keyword priors
Covariate Modeling
Covariate PF (CPF): Model how topics are influenced by document-level covariates
Covariate Seeded PF (CSPF): Combine covariate effects with keyword guidance
Advanced Models
Text-Based Ideal Points (TBIP): Estimate author positions on latent dimensions
Embedded Topic Models (ETM): Integrate pre-trained word embeddings
Which Model Should I Use?
Your question: “I want to discover topics in my text data”
→ Use: Poisson Factorization (PF) or Seeded PF (SPF)
PF if you have no prior knowledge about topics
SPF if you can define some keyword seeds for expected topics
Your question: “I want to understand how topics vary by document attributes”
→ Use: Covariate PF (CPF) or Covariate Seeded PF (CSPF)
Use when you have metadata (author, date, category) and want to model how topics are affected by these attributes
Your question: “I want to estimate author or speaker positions”
→ Use: Text-Based Ideal Points (TBIP)
Model ideal points (positions on a latent scale) based on language use
Useful for studying political polarization or comparing sentiment across authors
Your question: “I want to use pre-trained word embeddings”
→ Use: Embedded Topic Models (ETM)
Incorporates semantic information from embeddings like Word2Vec or FastText
Often produces more semantically coherent topics
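The decision guide above can be condensed into a small helper. This is an illustrative sketch only: `suggest_model` and its flags are hypothetical names, not part of the poisson-topicmodels API, and the order in which the questions are checked is an assumption. The returned strings are the model abbreviations used in this section.

```python
def suggest_model(
    has_seed_words: bool = False,
    has_covariates: bool = False,
    wants_ideal_points: bool = False,
    has_embeddings: bool = False,
) -> str:
    """Return the abbreviation of the model suggested by the guide above.

    Hypothetical helper for illustration; not part of the package API.
    """
    # Author or speaker positions are a distinct goal, so check them first.
    if wants_ideal_points:
        return "TBIP"
    # Pre-trained embeddings point to the embedded topic model.
    if has_embeddings:
        return "ETM"
    # Document-level metadata calls for a covariate model,
    # optionally combined with keyword seeds.
    if has_covariates:
        return "CSPF" if has_seed_words else "CPF"
    # Plain topic discovery, with or without keyword guidance.
    return "SPF" if has_seed_words else "PF"


print(suggest_model())                                          # PF
print(suggest_model(has_seed_words=True, has_covariates=True))  # CSPF
```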
Model Comparison Table
| Model | Unsupervised? | Guides? | Covariates? | Embeddings? |
|---|---|---|---|---|
| PF | ✓ | | | |
| SPF | ✓ (guided) | ✓ | | |
| CPF | ✓ | | ✓ | |
| CSPF | ✓ (guided) | ✓ | ✓ | |
| TBIP | ✓ | | | |
| ETM | ✓ | | | ✓ |
Common Patterns
All models in poisson-topicmodels follow a consistent API:
Create: Initialize with data and parameters
model = PF(counts=counts, vocab=vocab, num_topics=10, batch_size=32)
Train: Fit the model to data
params = model.train_step(num_steps=200, lr=0.01, random_seed=42)
Summarize: Get a quick overview of the fitted model
model.summary()
Extract: Get interpretable results
categories, e_theta = model.return_topics()
beta = model.return_beta()
top_words = model.return_top_words_per_topic(n=10)
Evaluate: Quantitative diagnostics
coherence_df = model.compute_topic_coherence()
diversity = model.compute_topic_diversity()
Visualize: Built-in publication-ready plots
model.plot_model_loss()
model.plot_topic_prevalence()
model.plot_topic_correlation()
model.plot_document_topic_heatmap()
model.plot_topic_wordclouds()
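The Create step takes `counts` and `vocab` as inputs. The following is a minimal sketch of how such inputs might be built from raw text, assuming `counts` is a documents × vocabulary matrix of non-negative integer word counts and `vocab` is the matching array of terms. The exact types the package expects are not stated in this section, so check the API reference before relying on this. `build_counts` is a hypothetical helper, not a package function.

```python
from collections import Counter

import numpy as np


def build_counts(docs):
    """Turn whitespace-tokenized documents into (counts, vocab).

    Hypothetical helper: a real pipeline would add proper tokenization,
    stop-word removal, and rare-word filtering.
    """
    tokenized = [doc.lower().split() for doc in docs]
    # A sorted vocabulary gives a stable column order.
    terms = sorted({tok for doc in tokenized for tok in doc})
    index = {word: j for j, word in enumerate(terms)}

    counts = np.zeros((len(docs), len(terms)), dtype=np.int64)
    for d, doc in enumerate(tokenized):
        for word, n in Counter(doc).items():
            counts[d, index[word]] = n
    return counts, np.array(terms)


docs = ["taxes and budget and spending", "budget vote", "health care vote"]
counts, vocab = build_counts(docs)
print(vocab)
print(counts)
```

Each row of `counts` is one document and each column one vocabulary term. If a sparse matrix is required, a dense array like this can be converted with `scipy.sparse.csr_matrix(counts)`.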
Probabilistic Background
All models in this package are built on Poisson Factorization, a probabilistic framework for count data. Here’s the core idea:
Poisson Factorization Model
For each document d and word w:
Observed: word count $C_{dw}$ (how many times word w appears in document d)
Latent: topic z (which topic generated this word)
Model: $C_{dw} \sim \text{Poisson}(\sum_z \beta_z^w \theta_d^z)$
Where:
$\beta_z^w$ = weight of word w in topic z
$\theta_d^z$ = intensity of topic z in document d
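To make the model concrete, the generative process can be simulated with NumPy. This is a sketch under stated assumptions: the Gamma shape and scale values and the array sizes are arbitrary choices for illustration, not the priors or defaults used by the package.

```python
import numpy as np

rng = np.random.default_rng(0)
num_docs, num_words, num_topics = 5, 8, 3

# Gamma draws keep both factors non-negative; the shape/scale values
# are arbitrary choices for this sketch, not the package's priors.
theta = rng.gamma(shape=0.3, scale=1.0, size=(num_docs, num_topics))  # theta_d^z
beta = rng.gamma(shape=0.3, scale=1.0, size=(num_topics, num_words))  # beta_z^w

# Poisson rate for every (document, word) pair: sum_z beta_z^w * theta_d^z.
rate = theta @ beta

# Observed counts: C_dw ~ Poisson(rate_dw).
counts = rng.poisson(rate)

print(rate.shape, counts.shape)
```

The matrix product `theta @ beta` computes the sum over topics for all document-word pairs at once, which is why the model is called a factorization of the count matrix.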
Why Poisson?
Natural for count data
Mathematically convenient: the Poisson distribution belongs to the exponential family
Computationally efficient when fitted with SVI
Flexible foundation for extensions
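One reason the Poisson is convenient here is its superposition property, a standard result stated below in the notation of this section. The per-topic count $C_{dwz}$ and the number of topics $K$ are symbols introduced only for this illustration.

```latex
C_{dwz} \sim \text{Poisson}\left(\beta_z^w \theta_d^z\right)
\ \text{independently for } z = 1, \dots, K
\quad\Longrightarrow\quad
C_{dw} = \sum_z C_{dwz} \sim \text{Poisson}\left(\sum_z \beta_z^w \theta_d^z\right)

(C_{dw1}, \dots, C_{dwK}) \mid C_{dw}
\sim \text{Multinomial}\left(C_{dw},\ \left(\frac{\beta_z^w \theta_d^z}{\sum_{z'} \beta_{z'}^w \theta_d^{z'}}\right)_{z=1}^{K}\right)
```

The first line says that summing independent per-topic counts gives exactly the model for $C_{dw}$. The second says that, given the observed count, the split across topics is multinomial, which is one way to read the latent topic z as "which topic generated this word".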
Bayesian Inference
We use Stochastic Variational Inference (SVI) to estimate posterior distributions:
Place prior distributions on $\beta$ and $\theta$
Learn approximate posterior through variational optimization
Mini-batch training for scalability
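Mini-batch training relies on the fact that the log-likelihood of a random batch of documents, rescaled by the ratio of corpus size to batch size, is an unbiased estimate of the full-data log-likelihood. The sketch below checks this numerically with fixed (not learned) values of $\theta$ and $\beta$. It illustrates the rescaling idea only and is not the package's training loop; all sizes and Gamma parameters are arbitrary assumptions.

```python
import math

import numpy as np

rng = np.random.default_rng(1)
num_docs, num_words, num_topics, batch_size = 12, 6, 2, 4

# Fixed factors and simulated counts (arbitrary values for this sketch).
theta = rng.gamma(1.0, 1.0, size=(num_docs, num_topics))
beta = rng.gamma(1.0, 1.0, size=(num_topics, num_words))
counts = rng.poisson(theta @ beta)

# log(c!) = lgamma(c + 1), applied element-wise.
log_factorial = np.vectorize(math.lgamma)


def poisson_loglik(c, rate):
    """Sum of Poisson log-probabilities log p(c | rate) over all entries."""
    return float(np.sum(c * np.log(rate) - rate - log_factorial(c + 1.0)))


full = poisson_loglik(counts, theta @ beta)

# Split the documents into disjoint batches and rescale each batch
# log-likelihood by num_docs / batch size.
order = rng.permutation(num_docs)
estimates = []
for start in range(0, num_docs, batch_size):
    idx = order[start:start + batch_size]
    scale = num_docs / len(idx)
    estimates.append(scale * poisson_loglik(counts[idx], theta[idx] @ beta))

# Averaging the rescaled estimates over all equal-sized batches
# recovers the full-data value.
print(full, np.mean(estimates))
```

In actual SVI the same rescaling is applied to the variational objective rather than to a likelihood with fixed parameters, so each gradient step touches only one batch of documents.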
Learn More
Each model variant has dedicated documentation:
Core Concepts - Detailed statistical background
Poisson Factorization (PF) - PF and understanding results
Seeded Models (SPF & Keywords) - SPF for incorporating domain knowledge
Covariate Models (CPF & CSPF) - CPF and CSPF for structured data
Ideal Points Models (TBIP) - TBIP for position estimation
Embedded Topic Models (ETM) - ETM for embedding integration
Ready to dive deeper? Start with Core Concepts.