About poisson-topicmodels
What is poisson-topicmodels?
poisson-topicmodels is a Python package for probabilistic topic modeling with Bayesian inference. Built on JAX and NumPyro, it gives researchers and practitioners tools for extracting interpretable semantic structure from text data.
Statement of Need
Traditional topic modeling packages (e.g., Gensim, scikit-learn’s LDA) rely on inference routines tailored to a fixed model and lack the flexibility needed to extend that model for new research questions. poisson-topicmodels addresses key gaps:
Modern Probabilistic Inference
Built on NumPyro, the package gets automatic differentiation and probabilistic programming primitives out of the box, so new model variants can be specified and fitted without hand-derived update equations.
Advanced Topic Models
The package goes beyond LDA with guided topic discovery (keyword priors), covariate effects, ideal point estimation, and word embeddings, all fitted with principled Bayesian inference.
GPU Acceleration
The package uses JAX to run the same code on CPU or GPU without modification. GPU execution is essential for large-scale corpus analysis and enables research that would be prohibitively slow on CPU.
Scalability & Reproducibility
Training uses mini-batch stochastic variational inference (SVI) with built-in seed control for exact reproducibility, which is critical for research validation and publication.
Research-Friendly API
Purpose-built for computational social science and NLP researchers who need interpretable, flexible models beyond black-box approaches.
Use Cases
poisson-topicmodels is ideal for:
Computational Social Science: Analyze legislative texts, social media discourse, and policy documents to understand political positions and debate evolution.
Computational Linguistics: Extract semantic structures from linguistic corpora, study language variation, and uncover latent themes in text collections.
Digital Humanities: Analyze large text archives, track conceptual change over time, and understand thematic evolution in literature, historical documents, and cultural texts.
Market Research: Understand customer sentiment, topic distribution in reviews, and brand perception from unstructured text data.
Academic Research: Efficiently analyze large corpora of academic papers, identify research trends, and discover connections between topics and fields.
Core Philosophy
The design of poisson-topicmodels is guided by these principles:
- Interpretability First
Topic models should produce human-interpretable results. The package emphasizes clear semantics and provides tools to understand and validate discovered topics.
- Flexibility
Different research questions require different models. The package provides a suite of related models with shared APIs, allowing researchers to choose the right tool for their problem.
- Reproducibility
Research results must be reproducible. Every component supports deterministic execution through seed control and careful API design.
- Modern Stack
Building on JAX and NumPyro allows the package to leverage modern automatic differentiation and probabilistic programming capabilities.
- Performance
The package is designed for large-scale analysis through GPU acceleration and efficient mini-batch training procedures.
What Sets poisson-topicmodels Apart
JAX-based: Modern automatic differentiation backend with transparent GPU support
NumPyro Integration: Direct access to probabilistic programming tools
Scalable SVI: Efficient stochastic variational inference with mini-batch training
Multiple Models: Comprehensive suite of related models (PF, SPF, CPF, CSPF, ETM, TBIP)
Research-Oriented: Designed for researchers who need flexibility and interpretability
Type Hints: Comprehensive type hints for better IDE support and code clarity
Active Development: Continuously improved with modern inference techniques
Theory Behind Topic Models
Topic models are statistical models that discover abstract “topics” in a collection of documents. Each topic is a distribution over words, and each document is a mixture of topics.
Poisson Factorization (PF) is the foundational model in this package. Unlike LDA (which uses a multinomial distribution), PF models the word counts of each document with a Poisson distribution whose rate is built from document-topic and topic-word intensities. This has several advantages:
A natural fit for count data
Computational efficiency
Flexibility to extend the rate with covariates and side information
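To make the Poisson view concrete, here is a minimal NumPy sketch of the standard Poisson factorization generative process, in which the count of word v in document d is drawn as y[d, v] ~ Poisson(sum_k theta[d, k] * beta[k, v]). The dimensions, variable names, and Gamma hyperparameters are illustrative assumptions, not values taken from the package.

```python
import numpy as np

rng = np.random.default_rng(0)
num_docs, num_topics, vocab_size = 5, 3, 8

# Positive intensities drawn from Gamma priors (shape and scale are illustrative).
theta = rng.gamma(shape=0.5, scale=2.0, size=(num_docs, num_topics))   # document-topic
beta = rng.gamma(shape=0.5, scale=2.0, size=(num_topics, vocab_size))  # topic-word

# Expected count of each word in each document, then the observed counts.
rate = theta @ beta
counts = rng.poisson(rate)

# Normalizing each topic's intensities gives the familiar
# "topic as a distribution over words" reading.
topic_word_probs = beta / beta.sum(axis=1, keepdims=True)

print(counts)
```

The rate matrix is where extensions plug in: anything that modifies `rate` before the Poisson draw yields a related model with the same likelihood.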
Guided variants (SPF, CPF, CSPF) extend the basic model to incorporate:
Domain knowledge through keyword priors
Auxiliary information through document-level covariates
Both kinds of information combined in a single model
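One common way to encode keyword priors, sketched below under stated assumptions, is to add an extra positive term to the topic-word intensity of each keyword in its assigned topic. This illustrates the general idea only and is not necessarily how SPF parameterizes it; the vocabulary, the `seed_words` mapping, and the Gamma hyperparameters are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
vocab = ["tax", "budget", "school", "teacher", "court", "judge"]
seed_words = {0: ["tax", "budget"], 1: ["school", "teacher"]}  # hypothetical keyword sets
num_topics = 3  # topic 2 has no keywords and stays unguided

word_index = {word: i for i, word in enumerate(vocab)}

# seed_mask[k, v] is True when word v is a keyword for topic k.
seed_mask = np.zeros((num_topics, len(vocab)), dtype=bool)
for topic, words in seed_words.items():
    seed_mask[topic, [word_index[w] for w in words]] = True

# Baseline intensities for every topic-word pair.
beta = rng.gamma(0.3, 1 / 0.3, size=seed_mask.shape)
# Extra intensity that is non-zero only where a keyword is assigned.
boost = rng.gamma(2.0, 1.0, size=seed_mask.shape) * seed_mask
beta_guided = beta + boost

print(np.round(beta_guided, 2))
```

Unseeded entries are left untouched, so the topics remain free to pick up words that were not listed as keywords.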
Ideal point models (TBIP) extend the framework to estimate positions of authors on latent dimensions based on their language use.
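In the text-based ideal point formulation of Vafa, Naidu, and Blei, the Poisson rate for word v in document d is sum_k theta[d, k] * beta[k, v] * exp(x[a_d] * eta[k, v]), where x[a_d] is the ideal point of the document's author and eta captures how word use within a topic shifts with that position. The NumPy sketch below computes this rate for made-up values; all names and numbers are illustrative and may differ from the package's implementation.

```python
import numpy as np

rng = np.random.default_rng(2)
num_docs, num_topics, vocab_size = 6, 2, 5

theta = rng.gamma(0.5, 2.0, size=(num_docs, num_topics))   # document-topic intensities
beta = rng.gamma(0.5, 2.0, size=(num_topics, vocab_size))  # neutral topic-word intensities
eta = rng.normal(size=(num_topics, vocab_size))            # word shifts per topic
ideal_point = np.array([-1.0, 0.0, 1.0])                   # one scalar position per author
author_of_doc = np.array([0, 0, 1, 1, 2, 2])               # hypothetical author of each document

# Ideal point of the author of each document, shape (D,).
x_d = ideal_point[author_of_doc]

# Author-specific topic-word intensities, shape (D, K, V).
shifted_beta = beta[None, :, :] * np.exp(x_d[:, None, None] * eta[None, :, :])

# rate[d, v] = sum_k theta[d, k] * shifted_beta[d, k, v]
rate = np.einsum("dk,dkv->dv", theta, shifted_beta)

print(np.round(rate, 2))
```

For an author whose ideal point is zero the exponential term equals one, so the rate reduces to plain Poisson factorization.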
Embedded topic models (ETM) incorporate pre-trained word embeddings to improve semantic coherence.
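In the embedded topic model of Dieng, Ruiz, and Blei, each topic is itself a vector in the word-embedding space, and its distribution over words is the softmax of the inner products between that vector and every word embedding. The sketch below shows that step with random stand-ins for the embeddings; the names and dimensions are assumptions, and a real model would learn `topic_embeddings` during training.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


rng = np.random.default_rng(3)
vocab_size, embed_dim, num_topics = 10, 4, 3

word_embeddings = rng.normal(size=(vocab_size, embed_dim))   # stand-in for pre-trained vectors
topic_embeddings = rng.normal(size=(num_topics, embed_dim))  # learned in a real model

# Similarity of every topic to every word, then one distribution per topic.
scores = topic_embeddings @ word_embeddings.T
topic_word_probs = softmax(scores, axis=1)

print(np.round(topic_word_probs, 3))
```

Words whose embeddings lie close to a topic vector receive high probability together, which is the mechanism behind the improved coherence.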
Getting Help
First time? Start with Getting Started
Want details? Check the Fundamentals
Need examples? See the Examples & Applications
API questions? Refer to API Reference
Contributing? Read Contributing Guide