Fundamentals

This section covers the core concepts and models in poisson-topicmodels, providing a deeper understanding of topic modeling and the different model variants available.

Overview of Available Models

The poisson-topicmodels package provides several related models addressing different use cases:

Unsupervised Baseline

  • Poisson Factorization (PF): Discover topics without guidance or external information

Guided Discovery

  • Seeded PF (SPF): Incorporate domain knowledge through keyword priors

Covariate Modeling

  • Covariate PF (CPF): Model how topics are influenced by document-level covariates

  • Covariate Seeded PF (CSPF): Combine covariate effects with keyword guidance

Advanced Models

  • Text-Based Ideal Points (TBIP): Estimate author positions on latent dimensions

  • Structured Text-Based Scaling (STBS): Topic-specific ideal points with author-level covariates

  • Embedded Topic Models (ETM): Integrate pre-trained word embeddings

Which Model Should I Use?

Your question: “I want to discover topics in my text data”

Use: Poisson Factorization (PF) or Seeded PF (SPF)

  • PF if you have no prior knowledge about topics

  • SPF if you can define some keyword seeds for expected topics

Your question: “I want to understand how topics vary by document attributes”

Use: Covariate PF (CPF) or Covariate Seeded PF (CSPF)

  • Use when you have metadata (author, date, category) and want to model how topics are affected by these attributes

Your question: “I want to estimate author or speaker positions”

Use: Text-Based Ideal Points (TBIP) or Structured Text-Based Scaling (STBS)

  • Model ideal points (positions on a latent scale) based on language use

  • Use TBIP for a single latent position per author

  • Use STBS for topic-specific author positions and author-level covariate effects

Your question: “I want to use pre-trained word embeddings”

Use: Embedded Topic Models (ETM)

  • Incorporates semantic information from embeddings like Word2Vec or FastText

  • Often produces more semantically coherent topics

Model Comparison Table

Model | Unsupervised? | Guides?     | Covariates?      | Ideal points?      | Embeddings?
PF    | ✓             |             |                  |                    |
SPF   |               | ✓ (guided)  |                  |                    |
CPF   |               |             | ✓                |                    |
CSPF  |               | ✓ (guided)  | ✓                |                    |
TBIP  |               |             |                  | ✓                  |
STBS  |               |             | ✓ (author-level) | ✓ (topic-specific) |
ETM   |               |             |                  |                    | ✓

Common Patterns

All models in poisson-topicmodels follow a consistent API:

Create: Initialize with data and parameters

from poisson_topicmodels import PF  # module name assumed from the package name
model = PF(counts=counts, vocab=vocab, num_topics=10, batch_size=32)
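The `counts` and `vocab` arguments are not defined in this section. As a rough illustration, assuming `counts` is a document-by-word count matrix whose columns line up with `vocab`, a toy corpus could be brought into that shape with the standard library alone. The helper name `build_counts` is hypothetical, and the package may expect a specific array or sparse-matrix type that is not stated here.

```python
from collections import Counter


def build_counts(docs):
    """Return (counts, vocab) for a list of raw text documents.

    counts[d][w] is how often vocab[w] occurs in document d.
    Hypothetical helper: lowercases and splits on whitespace only.
    """
    tokenized = [doc.lower().split() for doc in docs]
    # A sorted vocabulary gives a stable word -> column mapping.
    vocab = sorted({token for tokens in tokenized for token in tokens})
    index = {word: w for w, word in enumerate(vocab)}
    counts = []
    for tokens in tokenized:
        row = [0] * len(vocab)
        for word, n in Counter(tokens).items():
            row[index[word]] = n
        counts.append(row)
    return counts, vocab


docs = ["tax cuts and tax reform", "health care reform", "care for health"]
counts, vocab = build_counts(docs)
```

Real corpora usually need proper tokenization, stop-word removal, and a sparse representation; this sketch only shows the shape of the two inputs.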

Train: Fit the model to data

params = model.train_step(num_steps=200, lr=0.01, random_seed=42)

Summarize: Get a quick overview of the fitted model

model.summary()

Extract: Get interpretable results

categories, e_theta = model.return_topics()
beta = model.return_beta()
top_words = model.return_top_words_per_topic(n=10)
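To make the idea of "top words per topic" concrete, here is a minimal sketch that ranks words by their weight in a topic-by-word matrix. It is an illustration of the concept, not the package's implementation: the function `top_words_per_topic`, the toy `beta`, and the list-of-lists return shape are all assumptions.

```python
def top_words_per_topic(beta, vocab, n=3):
    """Return the n highest-weight words for each topic.

    beta[z][w] is the non-negative weight of word vocab[w] in topic z.
    """
    result = []
    for weights in beta:
        # Sort word indices by descending weight; ties keep vocabulary order.
        order = sorted(range(len(vocab)), key=lambda w: -weights[w])
        result.append([vocab[w] for w in order[:n]])
    return result


vocab = ["tax", "budget", "health", "care", "reform"]
beta = [
    [2.0, 1.5, 0.1, 0.1, 0.8],  # topic 0: fiscal words dominate
    [0.1, 0.2, 1.8, 1.6, 0.7],  # topic 1: health words dominate
]
top = top_words_per_topic(beta, vocab, n=3)
```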

Evaluate: Quantitative diagnostics

coherence_df = model.compute_topic_coherence()
diversity = model.compute_topic_diversity()
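One common definition of topic diversity is the share of distinct words among the top-n words of all topics: a value near 1 means topics rarely share top words, a value near 0 means they largely overlap. The sketch below implements that definition; whether `compute_topic_diversity` uses exactly this formula is an assumption, and the function name `topic_diversity` is hypothetical.

```python
def topic_diversity(top_words):
    """Share of distinct words among all topics' top-word lists."""
    all_words = [word for words in top_words for word in words]
    return len(set(all_words)) / len(all_words)


top_words = [
    ["tax", "budget", "reform"],
    ["health", "care", "reform"],
]
# "reform" appears in both topics, so 5 of the 6 listed words are distinct.
diversity = topic_diversity(top_words)
```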

Visualize: Built-in publication-ready plots

model.plot_model_loss()
model.plot_topic_prevalence()
model.plot_topic_correlation()
model.plot_document_topic_heatmap()
model.plot_topic_wordclouds()

Probabilistic Background

All models in this package are built on Poisson Factorization, a probabilistic framework for count data. Here’s the core idea:

Poisson Factorization Model

For each document d and word w:

  • Observed: word count $C_{dw}$ (how many times word w appears in document d)

  • Latent: topic z (which topic generated this word)

  • Model: $C_{dw} \sim \text{Poisson}\left(\sum_z \beta_z^w \theta_d^z\right)$

Where:

  • $\beta_z^w$ = weight of word w in topic z (a non-negative rate, not necessarily normalized to a probability)

  • $\theta_d^z$ = intensity of topic z in document d
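The Poisson rate for a document-word pair is just a sum over topics of word weight times topic intensity. The following sketch computes those rates and the resulting Poisson log-likelihood for a single toy document. All numbers and the function names `poisson_rate` and `poisson_log_pmf` are made up for illustration; this is not how the package stores or evaluates its parameters.

```python
import math


def poisson_rate(theta_d, beta, w):
    """Rate for word w in one document: sum over z of beta[z][w] * theta_d[z]."""
    return sum(theta_d[z] * beta[z][w] for z in range(len(theta_d)))


def poisson_log_pmf(count, rate):
    """log Poisson(count | rate) = count * log(rate) - rate - log(count!)."""
    return count * math.log(rate) - rate - math.lgamma(count + 1)


theta_d = [2.0, 0.5]  # topic intensities for one document (2 topics)
beta = [
    [0.6, 0.3, 0.1],  # topic 0 weights over a 3-word vocabulary
    [0.1, 0.2, 0.7],  # topic 1 weights
]
c_d = [1, 0, 2]  # observed word counts for the document

rates = [poisson_rate(theta_d, beta, w) for w in range(len(c_d))]
log_lik = sum(poisson_log_pmf(c, r) for c, r in zip(c_d, rates))
```

A document with a high intensity for a topic gets higher expected counts for the words that topic weights heavily, which is what lets the model explain observed counts through a small number of topics.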

Why Poisson?

  • Natural for count data

  • Member of the exponential family, which keeps inference mathematically tractable

  • Scales well when fitted with Stochastic Variational Inference (SVI)

  • Flexible foundation for extensions

Bayesian Inference

We use Stochastic Variational Inference (SVI) to estimate posterior distributions:

  • Place prior distributions on $\beta$ and $\theta$

  • Learn approximate posterior through variational optimization

  • Mini-batch training for scalability
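The mini-batch step can be illustrated in isolation. In stochastic variational inference, the data term of the objective is estimated from a random subset of documents and rescaled by corpus size over batch size, so that the estimate is unbiased. The sketch below shows only that rescaling for the Poisson log-likelihood with fixed toy parameters. It leaves out the prior terms, the variational distribution, and the gradient updates, and it makes no claim about the package's internals; all names and numbers are hypothetical.

```python
import math
import random


def doc_log_lik(counts_d, theta_d, beta):
    """Poisson log-likelihood of one document's word counts."""
    total = 0.0
    for w, count in enumerate(counts_d):
        rate = sum(theta_d[z] * beta[z][w] for z in range(len(theta_d)))
        total += count * math.log(rate) - rate - math.lgamma(count + 1)
    return total


def minibatch_log_lik(counts, theta, beta, batch):
    """Estimate the full-corpus log-likelihood from the documents in `batch`.

    Scaling by D / B keeps the estimate unbiased under uniform sampling.
    """
    scale = len(counts) / len(batch)
    return scale * sum(doc_log_lik(counts[d], theta[d], beta) for d in batch)


counts = [[1, 0, 2], [0, 3, 1], [2, 2, 0], [0, 0, 4]]  # 4 documents, 3 words
theta = [[2.0, 0.5], [0.3, 1.5], [1.0, 1.0], [0.2, 2.5]]  # per-document intensities
beta = [[0.6, 0.3, 0.1], [0.1, 0.2, 0.7]]  # 2 topics over 3 words

full = sum(doc_log_lik(counts[d], theta[d], beta) for d in range(len(counts)))

# One random mini-batch of two documents gives a noisy estimate of `full`.
rng = random.Random(42)
batch = rng.sample(range(len(counts)), 2)
estimate = minibatch_log_lik(counts, theta, beta, batch)

# Averaging over disjoint batches that cover the corpus recovers `full`.
batches = [[0, 1], [2, 3]]
average = sum(minibatch_log_lik(counts, theta, beta, b) for b in batches) / len(batches)
```

Because each step touches only a batch of documents, the cost per update does not grow with corpus size, which is the scalability benefit the list above refers to.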

Learn More

Each model variant has dedicated documentation.

Ready to dive deeper? Start with Core Concepts.