About poisson-topicmodels

What is poisson-topicmodels?

poisson-topicmodels is a Python package for probabilistic topic modeling with Bayesian inference. Built on JAX and NumPyro, it gives researchers and practitioners tools for extracting interpretable semantic structure from text data.

Statement of Need

Established topic modeling packages (e.g., Gensim, scikit-learn’s LDA) implement a fixed set of models with model-specific inference code, which makes them hard to extend for research that needs keyword guidance, covariates, or author-level structure. poisson-topicmodels addresses key gaps:

Modern Probabilistic Inference

Built on NumPyro, the package expresses its models as probabilistic programs, so gradients come from automatic differentiation rather than hand-derived updates, and new inference techniques can be adopted without rewriting the models.

Advanced Topic Models

Goes beyond LDA with guided topic discovery (keyword priors), covariate effects, ideal point estimation, and word embeddings, all estimated within the same Bayesian framework.

GPU Acceleration

Uses JAX to run the same code on CPU or GPU without changes, which makes large corpora that would be prohibitively slow to fit on CPU tractable.

Scalability & Reproducibility

Trains with mini-batch stochastic variational inference (SVI) and provides built-in seed control so that runs can be reproduced exactly, which is critical for research validation and publication.

Research-Friendly API

Purpose-built for computational social science and NLP researchers who need interpretable, flexible models beyond black-box approaches.

Use Cases

poisson-topicmodels is ideal for:

Computational Social Science: Analyze legislative texts, social media discourse, and policy documents to understand political positions and debate evolution.
Computational Linguistics: Extract semantic structures from linguistic corpora, study language variation, and uncover latent themes in text collections.
Digital Humanities: Analyze large text archives, track conceptual change over time, and understand thematic evolution in literature, historical documents, and cultural texts.
Market Research: Understand customer sentiment, topic distribution in reviews, and brand perception from unstructured text data.
Academic Research: Efficiently analyze large corpora of academic papers, identify research trends, and discover connections between topics and fields.

Core Philosophy

The design of poisson-topicmodels is guided by these principles:

Interpretability First: Topic models should produce human-interpretable results. The package emphasizes clear semantics and provides tools to understand and validate discovered topics.
Flexibility: Different research questions require different models. The package provides a suite of related models with shared APIs, allowing researchers to choose the right tool for their problem.
Reproducibility: Research results must be reproducible. Every component supports deterministic execution through seed control and careful API design.
Modern Stack: Building on JAX and NumPyro allows the package to leverage modern automatic differentiation and probabilistic programming capabilities.
Performance: The package is designed for large-scale analysis through GPU acceleration and efficient mini-batch training procedures.
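
The reproducibility and mini-batch principles above can be illustrated with a small, dependency-light sketch. It is not the package's training loop: the helper name minibatches and its parameters are hypothetical, and NumPy's seeded generator stands in for the explicit PRNG keys that JAX-based code would normally thread through training. The point is only that a fixed seed fixes the order in which documents are visited.

```python
import numpy as np


def minibatches(n_docs, batch_size, seed, n_epochs=1):
    """Yield arrays of document indices; the order depends only on `seed`.

    Hypothetical helper for illustration, not part of poisson-topicmodels.
    """
    rng = np.random.default_rng(seed)
    for _ in range(n_epochs):
        order = rng.permutation(n_docs)  # reshuffle documents each epoch
        for start in range(0, n_docs, batch_size):
            yield order[start:start + batch_size]


# Two runs with the same seed visit documents in the same order.
run_a = [batch.tolist() for batch in minibatches(10, 4, seed=42)]
run_b = [batch.tolist() for batch in minibatches(10, 4, seed=42)]
print(run_a == run_b)
```

Each epoch covers every document once, with the last batch possibly smaller than batch_size.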

What Sets poisson-topicmodels Apart

JAX-based: Modern automatic differentiation backend with transparent GPU support
NumPyro Integration: Direct access to probabilistic programming tools
Scalable SVI: Efficient stochastic variational inference with mini-batch training
Multiple Models: Comprehensive suite of related models (PF, SPF, CPF, CSPF, TBIP, STBS, ETM)
Research-Oriented: Designed for researchers who need flexibility and interpretability
Type Hints: Comprehensive type hints for better IDE support and code clarity
Active Development: Continuously improved with modern inference techniques

Theory Behind Topic Models

Topic models are statistical models that discover abstract “topics” in a collection of documents. Each topic is a distribution over words, and each document is a mixture of topics.

Poisson Factorization (PF) is the foundational model in this package. Unlike LDA (which uses a multinomial distribution), PF uses a Poisson distribution to model document word counts. This has several advantages:

More natural for count data
Computational efficiency
Extension flexibility for covariates and side information
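
As a rough illustration of the generative process described above, the following NumPy sketch simulates a toy corpus under Poisson factorization. It is not the package's implementation: the Gamma priors, the hyperparameter values, and the variable names (theta, beta, counts) are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the simulation is repeatable

n_docs, n_topics, n_vocab = 5, 3, 8  # toy sizes (assumed)

# Document-topic intensities: how strongly each document expresses each topic.
# Gamma(shape=0.3, rate=0.3) is an assumed sparse-ish prior with mean 1.
theta = rng.gamma(shape=0.3, scale=1.0 / 0.3, size=(n_docs, n_topics))

# Topic-word intensities: how strongly each topic favours each word.
beta = rng.gamma(shape=0.3, scale=1.0 / 0.3, size=(n_topics, n_vocab))

# Each word count is Poisson with rate sum_k theta[d, k] * beta[k, v].
rate = theta @ beta
counts = rng.poisson(rate)

print(counts.shape)  # one row per document, one column per vocabulary word
```

Because the intensities are unnormalized, document length is modeled jointly with topic content rather than being conditioned on, which is one way to read "more natural for count data".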

Guided variants (SPF, CPF, CSPF) extend the basic model to incorporate:

Domain knowledge through keyword priors
Auxiliary information through document-level covariates
Keyword priors and covariates together in a single model
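
One way to picture how keyword priors and document-level covariates can enter a Poisson factorization model is sketched below. The specific forms used here (an additive intensity boost at seed words and a log-linear covariate shift on document-topic intensities) are assumptions for illustration and may differ from how SPF, CPF, and CSPF are parameterized in the package. The vocabulary, seed words, and all variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)

vocab = ["tax", "budget", "school", "teacher", "border", "visa"]
keywords = {0: ["tax", "budget"], 1: ["school", "teacher"]}  # hypothetical seed words
n_docs, n_topics, n_vocab = 4, 2, len(vocab)

# Baseline topic-word intensities, as in plain Poisson factorization.
beta = rng.gamma(0.3, 1.0 / 0.3, size=(n_topics, n_vocab))

# Keyword prior (assumed form): extra positive intensity only at seed words.
boost = np.zeros((n_topics, n_vocab))
for k, words in keywords.items():
    for w in words:
        boost[k, vocab.index(w)] = rng.gamma(2.0, 1.0)
beta_guided = beta + boost

# Covariate effect (assumed form): log-linear shift of document-topic intensities.
x = rng.normal(size=(n_docs, 1))                 # one document-level covariate
lam = rng.normal(scale=0.5, size=(1, n_topics))  # per-topic covariate coefficients
theta = rng.gamma(0.3, 1.0 / 0.3, size=(n_docs, n_topics)) * np.exp(x @ lam)

# Counts drawn with both the keyword guidance and the covariate shift applied.
counts = rng.poisson(theta @ beta_guided)
```

Dropping the boost term leaves a covariate-only model, and dropping the exp(x @ lam) factor leaves a keyword-only model, which mirrors the three items in the list above.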

Ideal point models (TBIP, STBS) extend the framework to estimate positions of authors on latent dimensions based on their language use, with STBS adding topic-specific ideal points and author-level covariate effects.
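
The idea of estimating author positions from language use can be sketched as follows. The rate follows the general shape of text-based ideal point models, where each topic-word intensity is tilted by the author's position times a word-level loading. The function name ideal_point_rate and all variable names are hypothetical, the priors are assumed, and the topic-specific ideal points and author-level covariates that STBS adds are not shown.

```python
import numpy as np


def ideal_point_rate(theta, beta, eta, x_doc):
    """Poisson rate with an author-position tilt (illustrative helper).

    rate[d, v] = sum_k theta[d, k] * beta[k, v] * exp(x_doc[d] * eta[k, v])
    """
    tilt = beta[None, :, :] * np.exp(x_doc[:, None, None] * eta[None, :, :])
    return np.einsum("dk,dkv->dv", theta, tilt)


rng = np.random.default_rng(2)
n_docs, n_topics, n_vocab, n_authors = 6, 3, 8, 2
author_of_doc = np.array([0, 0, 0, 1, 1, 1])  # which author wrote each document

theta = rng.gamma(0.3, 1.0 / 0.3, size=(n_docs, n_topics))  # document-topic intensities
beta = rng.gamma(0.3, 1.0 / 0.3, size=(n_topics, n_vocab))  # neutral topic-word intensities
eta = rng.normal(size=(n_topics, n_vocab))                  # how word use shifts with position
ideal_point = rng.normal(size=n_authors)                    # one latent position per author

x_doc = ideal_point[author_of_doc]  # position of each document's author
counts = rng.poisson(ideal_point_rate(theta, beta, eta, x_doc))
```

When an author's position is zero the tilt vanishes and the rate reduces to plain Poisson factorization, so the ideal point captures only how an author's wording departs from the neutral topics.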

Embedded models (ETM) incorporate pre-trained word embeddings to improve semantic coherence.
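
To make the role of embeddings concrete, the sketch below builds topic-word distributions the way the standard Embedded Topic Model formulation does: each topic has its own embedding vector, and its word distribution is a softmax over inner products with the word embeddings. The random matrices are stand-ins for real pre-trained embeddings, the function and variable names are hypothetical, and the package may parameterize its ETM differently.

```python
import numpy as np


def topic_word_distributions(word_emb, topic_emb):
    """Softmax over word/topic inner products, one distribution per topic."""
    logits = topic_emb @ word_emb.T                      # shape (n_topics, n_vocab)
    logits = logits - logits.max(axis=1, keepdims=True)  # subtract max for numerical stability
    weights = np.exp(logits)
    return weights / weights.sum(axis=1, keepdims=True)


rng = np.random.default_rng(3)
n_vocab, n_topics, emb_dim = 8, 3, 4

word_emb = rng.normal(size=(n_vocab, emb_dim))    # stand-in for pre-trained word embeddings
topic_emb = rng.normal(size=(n_topics, emb_dim))  # topic embeddings, learned in a real model

beta = topic_word_distributions(word_emb, topic_emb)
print(beta.sum(axis=1))  # each topic's word probabilities sum to 1
```

Words with similar embeddings receive similar probability under every topic, which is the mechanism behind the improved semantic coherence mentioned above.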

Getting Help

First time? Start with Getting Started
Want details? Check the Fundamentals
Need examples? See the Examples & Applications
API questions? Refer to API Reference
Contributing? Read Contributing Guide