poisson-topicmodels: Probabilistic Topic Modeling with Bayesian Inference


poisson-topicmodels is a modern Python package for probabilistic topic modeling using Bayesian inference, built on JAX and NumPyro.

It enables researchers and practitioners to extract interpretable semantic structure from text data through advanced topic modeling techniques with transparent GPU acceleration and reproducible results.

Key Features

Modern Probabilistic Inference

Built on NumPyro, which provides probabilistic programming and variational inference on top of JAX's automatic differentiation.

Advanced Topic Models

Beyond LDA: guided topic discovery, covariate effects, ideal point estimation, and word embeddings, all with principled Bayesian inference.

GPU Acceleration

Leverages JAX for transparent GPU computation, essential for large-scale corpus analysis.

Reproducible & Scalable

Mini-batch SVI training with built-in seed control for exact reproducibility.

Research-Friendly API

Purpose-built for computational social science and NLP researchers.
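Since GPU acceleration depends on how JAX is installed, it can be useful to check which backend JAX has picked up before training. The snippet below is a minimal sketch that uses only standard JAX calls (`jax.default_backend` and `jax.devices`); it does not use poisson-topicmodels itself.

```python
import jax

# Name of the backend JAX selected, typically "cpu", "gpu", or "tpu".
backend = jax.default_backend()

# All devices visible to JAX on that backend.
devices = jax.devices()

print(f"JAX backend: {backend}")
print(f"Visible devices: {devices}")
```

If the backend is reported as "cpu" on a machine with a GPU, the installed JAX build most likely lacks GPU support.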

The Package at a Glance

The poisson-topicmodels library provides multiple topic modeling approaches:

| Model | Use Case | Key Feature |
| --- | --- | --- |
| Poisson Factorization (PF) | Unsupervised baseline | Fast, interpretable word-topic associations |
| Seeded PF (SPF) | Guided discovery | Incorporate domain knowledge via keyword priors |
| Covariate PF (CPF) | Covariate effects | Model topics influenced by document metadata |
| Covariate Seeded PF (CSPF) | Guided + covariates | Combine keyword guidance with external factors |
| Text-Based Ideal Points (TBIP) | Ideal point estimation | Estimate author positions from legislative/social text |
| Structured Text-Based Scaling (STBS) | Topic-specific ideal points + covariates | Topic-specific ideal points with author-level covariates |
| Embedded Topic Models (ETM) | Modern embeddings | Integrate pre-trained word embeddings |
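For orientation, the baseline PF model in the table is commonly written as a Gamma-Poisson factorization of the document-term matrix. The formulation below is the standard textbook form, given as a sketch. The symbols and the Gamma hyperparameters (a_θ, b_θ, a_β, b_β) are notation chosen here for illustration, and the priors the package actually uses may differ.

```latex
\begin{aligned}
\theta_{dk} &\sim \mathrm{Gamma}(a_\theta, b_\theta), \\
\beta_{kv} &\sim \mathrm{Gamma}(a_\beta, b_\beta), \\
y_{dv} \mid \theta, \beta &\sim \mathrm{Poisson}\Big(\sum_{k=1}^{K} \theta_{dk}\,\beta_{kv}\Big),
\end{aligned}
```

Here y_{dv} is the count of word v in document d, θ_{dk} is the intensity of topic k in document d, and β_{kv} is the intensity of word v in topic k. The other models in the table can be read as variations on this structure.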

Core Capabilities:

  • ✓ Stochastic Variational Inference (SVI) with mini-batch training

  • ✓ Transparent GPU acceleration via JAX

  • ✓ Reproducible results with seed control

  • ✓ Type hints and comprehensive API documentation

  • ✓ >70% test coverage with continuous integration

  • ✓ Clear error messages and input validation

Quick Start Example

import numpy as np
from scipy.sparse import csr_matrix
from poisson_topicmodels import PF

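# Prepare data: a synthetic document-term matrix (100 docs x 500 terms) and vocabulary
rng = np.random.default_rng(0)  # seed the synthetic data so the example is repeatable
counts = csr_matrix(rng.poisson(2, (100, 500)).astype(np.float32))
vocab = np.array([f'word_{i}' for i in range(500)])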

# Initialize and train model
model = PF(counts, vocab, num_topics=10, batch_size=32)
params = model.train_step(num_steps=200, lr=0.01, random_seed=42)

# Summarize and inspect
model.summary()
top_words = model.return_top_words_per_topic(n=10)
for topic_id, words in top_words.items():
    print(f"Topic {topic_id}: {', '.join(words)}")

# Evaluate and visualize
print(f"Topic diversity: {model.compute_topic_diversity():.3f}")
model.plot_model_loss()
model.plot_topic_prevalence()

See Getting Started for a detailed walkthrough.

Community & Contributing

We welcome contributions! For guidelines, see the Contributing Guide.

License

This project is licensed under the MIT License. See the LICENSE file for details.

Citation

If you use poisson-topicmodels in your research, please cite:

@software{prostmaier2026poisson,
  title={poisson-topicmodels: Probabilistic Topic Modeling with Bayesian Inference},
  author={Prostmaier, Bernd and Grün, Bettina and Hofmarcher, Paul},
  year={2026},
  url={https://github.com/BPro2410/topicmodels_package}
}