Fundamentals

This section covers the core concepts and models in poisson-topicmodels: the available model variants, how to choose among them, the API they share, and the probabilistic background they build on.

Overview of Available Models

The poisson-topicmodels package provides several related models addressing different use cases:

Unsupervised Baseline

Poisson Factorization (PF): Discover topics without guidance or external information

Guided Discovery

Seeded PF (SPF): Incorporate domain knowledge through keyword priors

Covariate Modeling

Covariate PF (CPF): Model how topics are influenced by document-level covariates
Covariate Seeded PF (CSPF): Combine covariate effects with keyword guidance

Advanced Models

Text-Based Ideal Points (TBIP): Estimate author positions on latent dimensions
Structured Text-Based Scaling (STBS): Topic-specific ideal points with author-level covariates
Embedded Topic Models (ETM): Integrate pre-trained word embeddings

Which Model Should I Use?

Your question: “I want to discover topics in my text data”

→ Use: Poisson Factorization (PF) or Seeded PF (SPF)

PF if you have no prior knowledge about topics
SPF if you can define some keyword seeds for expected topics

Your question: “I want to understand how topics vary by document attributes”

→ Use: Covariate PF (CPF) or Covariate Seeded PF (CSPF)

Use when you have metadata (author, date, category) and want to model how topics are affected by these attributes

Your question: “I want to estimate author or speaker positions”

→ Use: Text-Based Ideal Points (TBIP) or Structured Text-Based Scaling (STBS)

Model ideal points (positions on a latent scale) based on language use
Use TBIP for a single latent position per author
Use STBS for topic-specific author positions and author-level covariate effects

Your question: “I want to use pre-trained word embeddings”

→ Use: Embedded Topic Models (ETM)

Incorporates semantic information from embeddings like Word2Vec or FastText
Often produces more semantically coherent topics

Model Comparison Table

Model	Unsupervised?	Keyword guidance?	Covariates?	Ideal points?	Embeddings?
PF	✓	-	-	-	-
SPF	✓ (guided)	✓	-	-	-
CPF	✓	-	✓	-	-
CSPF	✓ (guided)	✓	✓	-	-
TBIP	✓	-	-	✓	-
STBS	✓	-	✓ (author-level)	✓ (topic-specific)	-
ETM	✓	-	-	-	✓

Common Patterns

All models in poisson-topicmodels follow a consistent API:

Create: Initialize with data and parameters

model = PF(counts=counts, vocab=vocab, num_topics=10, batch_size=32)

Train: Fit the model to data

params = model.train_step(num_steps=200, lr=0.01, random_seed=42)

Summarize: Get a quick overview of the fitted model

model.summary()

Extract: Get interpretable results

categories, e_theta = model.return_topics()
beta = model.return_beta()
top_words = model.return_top_words_per_topic(n=10)

Evaluate: Quantitative diagnostics

coherence_df = model.compute_topic_coherence()
diversity = model.compute_topic_diversity()

Visualize: Built-in publication-ready plots

model.plot_model_loss()
model.plot_topic_prevalence()
model.plot_topic_correlation()
model.plot_document_topic_heatmap()
model.plot_topic_wordclouds()
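
The inputs to the Create step and the quantities behind the Extract and Evaluate steps can be illustrated with plain NumPy. The sketch below is not the package's implementation. It assumes that `counts` is a document-by-word count matrix (the package may expect a sparse matrix rather than the dense array used here), that the topic-word matrix has shape (num_topics, vocab_size), and that topic diversity is the share of unique words among all topics' top words, which is one common definition and may differ from what `compute_topic_diversity()` reports. The names `top_words` and `topic_diversity` and the randomly drawn `beta` are hypothetical stand-ins for illustration.

```python
import numpy as np

# Toy corpus; in practice these would be your own tokenized documents.
docs = [
    "tax budget tax economy",
    "health care hospital budget",
    "economy jobs tax",
]

# vocab: sorted unique tokens. counts[d, w]: occurrences of vocab[w] in document d.
vocab = np.array(sorted({tok for doc in docs for tok in doc.split()}))
word_index = {word: i for i, word in enumerate(vocab)}
counts = np.zeros((len(docs), len(vocab)), dtype=np.int64)
for d, doc in enumerate(docs):
    for tok in doc.split():
        counts[d, word_index[tok]] += 1

# Hypothetical stand-in for a fitted topic-word matrix (2 topics), drawn at random.
rng = np.random.default_rng(0)
beta = rng.gamma(shape=0.3, scale=1.0, size=(2, len(vocab)))


def top_words(beta, vocab, n):
    """Return the n highest-weight words for each topic (each row of beta)."""
    order = np.argsort(-beta, axis=1)[:, :n]
    return [[str(w) for w in vocab[row]] for row in order]


def topic_diversity(top):
    """Fraction of unique words among the top words of all topics."""
    all_words = [w for topic in top for w in topic]
    return len(set(all_words)) / len(all_words)


top = top_words(beta, vocab, n=3)
print(top)
print(topic_diversity(top))
```

A diversity of 1.0 means no word appears in more than one topic's top list; values near 0 indicate that the topics largely share the same top words.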

Probabilistic Background

All models in this package are built on Poisson Factorization, a probabilistic framework for count data. Here’s the core idea:

Poisson Factorization Model

For each document d and word w:

Observed: word count $C_{dw}$ (how many times word w appears in document d)
Latent: topic z (which topic generated this word)
Model: $C_{dw} \sim \text{Poisson}(\sum_z \beta_z^w \theta_d^z)$

Where:

$\beta_z^w$ = weight of word w in topic z (a non-negative intensity rather than a normalized probability)
$\theta_d^z$ = intensity of topic z in document d
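
The latent-topic view and the summed rate in the model describe the same distribution, because a sum of independent Poisson variables is itself Poisson with the summed rate. The following NumPy sketch simulates counts both ways. It assumes Gamma priors on the topic weights and intensities, with shape and scale values chosen only for illustration; the priors and array layout used by the package may differ.

```python
import numpy as np

rng = np.random.default_rng(0)
num_docs, num_words, num_topics = 500, 20, 3

# Assumed Gamma priors; the shape and scale values are illustrative only.
theta = rng.gamma(shape=0.5, scale=1.0, size=(num_docs, num_topics))  # theta[d, z]
beta = rng.gamma(shape=0.5, scale=1.0, size=(num_topics, num_words))  # beta[z, w]

# Direct view: C_dw ~ Poisson(sum_z beta_z^w * theta_d^z).
rate = theta @ beta
counts_direct = rng.poisson(rate)

# Latent view: each topic contributes its own Poisson count, summed over topics.
per_topic_rate = theta[:, :, None] * beta[None, :, :]  # shape (docs, topics, words)
counts_latent = rng.poisson(per_topic_rate).sum(axis=1)

# Both sample means should be close to the mean rate.
print(counts_direct.mean(), counts_latent.mean(), rate.mean())
```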

Why Poisson?

Natural for count data
Mathematically convenient: the Poisson distribution belongs to the exponential family
Computationally efficient with SVI
Flexible foundation for extensions

Bayesian Inference

We use Stochastic Variational Inference (SVI) to estimate posterior distributions:

Place prior distributions on $\beta$ and $\theta$
Learn an approximate posterior through variational optimization
Mini-batch training for scalability
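
Mini-batch training works because a rescaled sum over a random subset of documents is an unbiased estimate of the sum over all documents. The sketch below illustrates only that rescaling, applied to the Poisson log-likelihood with parameters held fixed; it is not the package's SVI implementation and it omits the prior and variational terms of the full objective. The helper names `poisson_loglik` and `minibatch_estimate` are hypothetical.

```python
import numpy as np
from math import lgamma


def poisson_loglik(counts, rate):
    """Sum of Poisson log-probabilities over all matrix entries."""
    log_factorial = np.vectorize(lgamma)(counts + 1.0)
    return float(np.sum(counts * np.log(rate) - rate - log_factorial))


rng = np.random.default_rng(0)
num_docs, num_words, num_topics, batch_size = 200, 15, 3, 32

# Fixed illustrative parameters and counts simulated from them.
theta = rng.gamma(1.0, 1.0, size=(num_docs, num_topics))
beta = rng.gamma(1.0, 1.0, size=(num_topics, num_words))
counts = rng.poisson(theta @ beta)

# Log-likelihood over the full corpus.
full = poisson_loglik(counts, theta @ beta)


def minibatch_estimate(rng):
    """Estimate the full log-likelihood from one random batch of documents."""
    idx = rng.choice(num_docs, size=batch_size, replace=False)
    scale = num_docs / batch_size  # rescale the batch sum to corpus size
    return scale * poisson_loglik(counts[idx], theta[idx] @ beta)


estimates = [minibatch_estimate(rng) for _ in range(2000)]
print(full, np.mean(estimates))
```

Individual batch estimates are noisy, but their average approaches the full-corpus value, which is what allows each optimization step to touch only `batch_size` documents.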

Learn More

Each model variant has dedicated documentation:

Core Concepts - Detailed statistical background
Poisson Factorization (PF) - PF and understanding results
Seeded Models (SPF & Keywords) - SPF for incorporating domain knowledge
Covariate Models (CPF & CSPF) - CPF and CSPF for structured data
Ideal Points Models (TBIP & STBS) - TBIP and STBS for position estimation
Embedded Topic Models (ETM) - ETM for embedding integration

Ready to dive deeper? Start with Core Concepts.