.. _about:

================================================================================
About poisson-topicmodels
================================================================================

What is poisson-topicmodels?
=============================

**poisson-topicmodels** is a modern Python package for probabilistic topic modeling
using Bayesian inference. Built on the foundation of JAX and NumPyro, it provides
researchers and practitioners with powerful tools for extracting interpretable
semantic structure from text data.

Statement of Need
=================

Traditional topic modeling packages (e.g., Gensim, scikit-learn's LDA) rely on older
inference methods and lack the flexibility needed for modern research.
**poisson-topicmodels** addresses key gaps:

1. **Modern Probabilistic Inference**
   Built on NumPyro, the package enables automatic differentiation, probabilistic
   programming, and integration with cutting-edge Bayesian methods. This provides a
   solid foundation for advanced inference techniques.

2. **Advanced Topic Models**
   Goes beyond LDA with guided topic discovery (keyword priors), covariate effects,
   ideal point estimation, and embeddings, all with principled Bayesian inference.

3. **GPU Acceleration**
   Leverages JAX for transparent GPU computation, essential for large-scale corpus
   analysis and enabling research that would be prohibitively slow on CPU.

4. **Scalability & Reproducibility**
   Optimized for mini-batch SVI training with built-in seed control for exact
   reproducibility, which is critical for research validation and publication.

5. **Research-Friendly API**
   Purpose-built for computational social science and NLP researchers who need
   interpretable, flexible models beyond black-box approaches.

Use Cases
=========

**poisson-topicmodels** is ideal for:

- **Computational Social Science**: Analyze legislative texts, social media
  discourse, and policy documents to understand political positions and debate
  evolution.
- **Computational Linguistics**: Extract semantic structures from linguistic
  corpora, study language variation, and uncover latent themes in text collections.
- **Digital Humanities**: Analyze large text archives, track conceptual change over
  time, and understand thematic evolution in literature, historical documents, and
  cultural texts.
- **Market Research**: Understand customer sentiment, topic distribution in reviews,
  and brand perception from unstructured text data.
- **Academic Research**: Efficiently analyze large corpora of academic papers,
  identify research trends, and discover connections between topics and fields.

Core Philosophy
===============

The design of **poisson-topicmodels** is guided by these principles:

**Interpretability First**
   Topic models should produce human-interpretable results. The package emphasizes
   clear semantics and provides tools to understand and validate discovered topics.

**Flexibility**
   Different research questions require different models. The package provides a
   suite of related models with shared APIs, allowing researchers to choose the
   right tool for their problem.

**Reproducibility**
   Research results must be reproducible. Every component supports deterministic
   execution through seed control and careful API design.

**Modern Stack**
   Building on JAX and NumPyro allows the package to leverage modern automatic
   differentiation and probabilistic programming capabilities.

**Performance**
   The package is designed for large-scale analysis through GPU acceleration and
   efficient mini-batch training procedures.

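
The reproducibility and mini-batch principles above can be illustrated with a
minimal, library-independent sketch. It uses only the Python standard library and is
not the poisson-topicmodels API; the helper ``minibatch_indices`` and its parameters
are hypothetical names chosen for illustration. It shows that drawing mini-batches
from a generator with a fixed seed yields the same batch sequence on every run.

.. code-block:: python

   import random


   def minibatch_indices(n_docs, batch_size, n_steps, seed):
       """Yield one list of document indices per training step."""
       # A private generator keeps the sequence independent of global random state.
       rng = random.Random(seed)
       for _ in range(n_steps):
           # Sample document indices without replacement within a batch.
           yield rng.sample(range(n_docs), batch_size)


   # Two runs with the same seed, and one run with a different seed.
   run_a = list(minibatch_indices(n_docs=1000, batch_size=4, n_steps=3, seed=42))
   run_b = list(minibatch_indices(n_docs=1000, batch_size=4, n_steps=3, seed=42))
   run_c = list(minibatch_indices(n_docs=1000, batch_size=4, n_steps=3, seed=7))

   same_seed_matches = run_a == run_b   # identical batch sequences
   other_seed_matches = run_a == run_c  # a different seed changes the sequence

JAX takes a similar approach with explicit random keys that are passed to each
sampling call, which is one reason a JAX-based stack lends itself to seed-controlled
experiments.
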

Related Packages
================

For context, here are some related packages in the topic modeling and Bayesian
inference ecosystem:

- **Gensim**: Classic topic modeling library with LDA, LSI, and word embeddings
- **scikit-learn**: Machine learning toolkit that includes an LDA implementation
- **PyMC**: Probabilistic programming framework for Bayesian modeling
- **Stan**: Probabilistic programming language, used from Python via the pystan
  interface
- **PyTorch and TensorFlow**: Deep learning frameworks with probabilistic extensions

What Sets poisson-topicmodels Apart
====================================

1. **JAX-based**: Modern automatic differentiation backend with transparent GPU
   support
2. **NumPyro Integration**: Direct access to probabilistic programming tools
3. **Scalable SVI**: Efficient stochastic variational inference with mini-batch
   training
4. **Multiple Models**: Comprehensive suite of related models (PF, SPF, CPF, CSPF,
   TBIP, STBS, ETM)
5. **Research-Oriented**: Designed for researchers who need flexibility and
   interpretability
6. **Type Hints**: Comprehensive type hints for better IDE support and code clarity
7. **Active Development**: Continuously improved with modern inference techniques

Theory Behind Topic Models
===========================

Topic models are statistical models that discover abstract "topics" in a collection
of documents. Each topic is a distribution over words, and each document is a
mixture of topics.

**Poisson Factorization (PF)** is the foundational model in this package. Unlike LDA
(which uses a multinomial distribution), PF uses a Poisson distribution to model
document word counts.
This has several advantages:

- More natural for count data
- Computational efficiency
- Extension flexibility for covariates and side information

**Guided variants** (SPF, CPF, CSPF) extend the basic model to incorporate:

- **Domain knowledge** through keyword priors
- **Auxiliary information** through document-level covariates
- **Combined constraints** for sophisticated analyses

**Ideal point models** (TBIP, STBS) extend the framework to estimate positions of
authors on latent dimensions based on their language use, with STBS adding
topic-specific ideal points and author-level covariate effects.

**Embedded models** (ETM) incorporate pre-trained word embeddings to improve
semantic coherence.

Getting Help
============

- **First time?** Start with :doc:`../getting_started/index`
- **Want details?** Check the :doc:`../fundamentals/index`
- **Need examples?** See the :doc:`../examples_guide/index`
- **API questions?** Refer to :doc:`../api/index`
- **Contributing?** Read :doc:`../contributing_guide/index`
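
For readers who want a concrete picture of the Poisson factorization idea described
under *Theory Behind Topic Models*, the sketch below may help. It uses only the
Python standard library and does not use or reproduce the poisson-topicmodels API;
the function names ``poisson_rates`` and ``poisson_log_likelihood`` and all numeric
values are hypothetical. It assumes the standard formulation in which the Poisson
rate for a word in a document is the sum over topics of a document-topic intensity
times a topic-word intensity; the package's own models may differ in detail.

.. code-block:: python

   import math


   def poisson_rates(theta, beta):
       """Rate for each (document, word) cell: sum_k theta[d][k] * beta[k][v]."""
       n_topics = len(beta)
       vocab_size = len(beta[0])
       return [
           [sum(theta_d[k] * beta[k][v] for k in range(n_topics))
            for v in range(vocab_size)]
           for theta_d in theta
       ]


   def poisson_log_likelihood(counts, rates):
       """Sum of log Poisson(y | rate) over every cell of the count matrix."""
       total = 0.0
       for count_row, rate_row in zip(counts, rates):
           for y, rate in zip(count_row, rate_row):
               # log Poisson pmf: y * log(rate) - rate - log(y!)
               total += y * math.log(rate) - rate - math.lgamma(y + 1)
       return total


   # Hypothetical toy values: 2 documents, 2 topics, 3 vocabulary words.
   theta = [[2.0, 0.5], [0.5, 3.0]]            # document-topic intensities
   beta = [[1.0, 0.5, 0.1], [0.1, 0.5, 1.0]]   # topic-word intensities
   counts = [[2, 1, 0], [0, 2, 3]]             # observed word counts

   rates = poisson_rates(theta, beta)
   log_lik = poisson_log_likelihood(counts, rates)

In this toy setup the first document leans on the first topic and the second
document on the second, so the rates are highest for the words each document
actually uses. Fitting a model amounts to finding non-negative intensities that make
such a log-likelihood large, combined with priors in the Bayesian setting.
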