.. _fundamentals:

================================================================================
Fundamentals
================================================================================

This section covers the core concepts and models in poisson-topicmodels, providing
a deeper understanding of topic modeling and the different model variants available.

.. toctree::
   :maxdepth: 2
   :caption: Fundamentals

   core_concepts
   poisson_factorization
   seeded_models
   covariate_models
   ideal_points
   embedded_models

Overview of Available Models
=============================

The **poisson-topicmodels** package provides several related models addressing different
use cases:

**Unsupervised Baseline**

- **Poisson Factorization (PF)**: Discover topics without guidance or external information

**Guided Discovery**

- **Seeded PF (SPF)**: Incorporate domain knowledge through keyword priors

**Covariate Modeling**

- **Covariate PF (CPF)**: Model how topics are influenced by document-level covariates
- **Covariate Seeded PF (CSPF)**: Combine covariate effects with keyword guidance

**Advanced Models**

- **Text-Based Ideal Points (TBIP)**: Estimate author positions on latent dimensions
- **Embedded Topic Models (ETM)**: Integrate pre-trained word embeddings

Which Model Should I Use?
==========================

**Your question**: "I want to discover topics in my text data"

→ **Use**: Poisson Factorization (PF) or Seeded PF (SPF)

- **PF** if you have no prior knowledge about topics
- **SPF** if you can define some keyword seeds for expected topics

**Your question**: "I want to understand how topics vary by document attributes"

→ **Use**: Covariate PF (CPF) or Covariate Seeded PF (CSPF)

- Use when you have metadata (author, date, category) and want to model how topics
  are affected by these attributes

**Your question**: "I want to estimate author or speaker positions"

→ **Use**: Text-Based Ideal Points (TBIP)

- Model ideal points (positions on a latent scale) based on language use
- Useful for political polarization, sentiment analysis across authors

**Your question**: "I want to use pre-trained word embeddings"

→ **Use**: Embedded Topic Models (ETM)

- Incorporates semantic information from embeddings like Word2Vec or FastText
- Often produces more semantically coherent topics

Model Comparison Table
======================

.. list-table:: Model Comparison
   :widths: 20 15 10 15 15
   :header-rows: 1

   * - Model
     - Unsupervised?
     - Guides?
     - Covariates?
     - Embeddings?
   * - **PF**
     - ✓
     -
     -
     -
   * - **SPF**
     - ✓ (guided)
     - ✓
     -
     -
   * - **CPF**
     - ✓
     -
     - ✓
     -
   * - **CSPF**
     - ✓ (guided)
     - ✓
     - ✓
     -
   * - **TBIP**
     - ✓
     -
     -
     -
   * - **ETM**
     - ✓
     -
     -
     - ✓

Common Patterns
===============

All models in poisson-topicmodels follow a consistent API:

**Create**: Initialize with data and parameters

.. code-block:: python

   model = PF(counts=counts, vocab=vocab, num_topics=10, batch_size=32)

**Train**: Fit the model to data

.. code-block:: python

   params = model.train_step(num_steps=200, lr=0.01, random_seed=42)

**Summarize**: Get a quick overview of the fitted model

.. code-block:: python

   model.summary()

**Extract**: Get interpretable results

.. code-block:: python

   categories, e_theta = model.return_topics()
   beta = model.return_beta()
   top_words = model.return_top_words_per_topic(n=10)

**Evaluate**: Quantitative diagnostics

.. code-block:: python

   coherence_df = model.compute_topic_coherence()
   diversity = model.compute_topic_diversity()

**Visualize**: Built-in publication-ready plots

.. code-block:: python

   model.plot_model_loss()
   model.plot_topic_prevalence()
   model.plot_topic_correlation()
   model.plot_document_topic_heatmap()
   model.plot_topic_wordclouds()

Probabilistic Background
=========================

All models in this package are built on **Poisson Factorization**, a probabilistic
framework for count data. Here's the core idea:

**Poisson Factorization Model**

For each document d and word w:

- **Observed**: word count $C_{dw}$ (how many times word w appears in document d)
- **Latent**: topic z (which topic generated this word)
- **Model**: $C_{dw} \sim \text{Poisson}(\sum_z \beta_z^w \theta_d^z)$

Where:
- $\beta_z^w$ = word w probability in topic z
- $\theta_d^z$ = topic z intensity in document d

**Why Poisson?**

- Natural for count data
- Mathematically elegant with exponential family
- Computationally efficient with SVI
- Flexible foundation for extensions

**Bayesian Inference**

We use **Stochastic Variational Inference (SVI)** to estimate posterior distributions:

- Place prior distributions on $\beta$ and $\theta$
- Learn approximate posterior through variational optimization
- Mini-batch training for scalability

Learn More
==========

Each model variant has dedicated documentation:

- :doc:`core_concepts` - Detailed statistical background
- :doc:`poisson_factorization` - PF and understanding results
- :doc:`seeded_models` - SPF for incorporating domain knowledge
- :doc:`covariate_models` - CPF and CSPF for structured data
- :doc:`ideal_points` - TBIP for position estimation
- :doc:`embedded_models` - ETM for embedding integration

Ready to dive deeper? Start with :doc:`core_concepts`.