.. _core_concepts:

================================================================================
Core Concepts
================================================================================

This page introduces the fundamental concepts underlying probabilistic topic modeling and poisson-topicmodels.

What is Topic Modeling?
=======================

**Topic modeling** is a statistical technique for discovering abstract "topics" that occur in a collection of documents.

Key Idea:

- Each **topic** is a distribution over words (some words more likely than others)
- Each **document** is a mixture of topics (can contain multiple topics)
- We observe word counts in documents and infer the hidden topic structure

Example:

Imagine 3 documents about science, politics, and cuisine. The model might discover:

- **Topic 1** (Science): "research", "experiment", "data", "variable"...
- **Topic 2** (Politics): "government", "vote", "policy", "election"...
- **Topic 3** (Cuisine): "recipe", "cook", "ingredient", "flavor"...

- Document 1 (Science paper): 80% Topic 1, 10% Topic 2, 10% Topic 3
- Document 2 (Political cookbook): 30% Topic 1, 50% Topic 2, 20% Topic 3
- Document 3 (Cooking blog): 5% Topic 1, 5% Topic 2, 90% Topic 3

Document-Term Matrix
====================

The fundamental input to all topic models is a **document-term matrix** (DTM):

- **Rows**: Documents
- **Columns**: Vocabulary terms (words)
- **Values**: Word counts in each document

Example (5 documents × 10 vocabulary):

.. code-block:: text

   Document  | word_1 | word_2 | word_3 | ... | word_10
   ----------+--------+--------+--------+-----+---------
   doc_1     |   3    |   0    |   5    | ... |    1
   doc_2     |   1    |   2    |   0    | ... |    4
   doc_3     |   0    |   7    |   2    | ... |    0
   ...       |   ..   |   ..   |   ..   | ... |    ..
   doc_5     |   2    |   1    |   3    | ... |    6

**Sparse Format**: In practice, DTMs are **very sparse** (mostly zeros) because documents use only a small fraction of vocabulary. We use sparse matrix formats (CSR, CSC) for efficiency.
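
As a minimal sketch of why sparse storage helps, the snippet below builds a tiny document-term matrix from a hypothetical, already-tokenized corpus using only the Python standard library. It stores just the nonzero counts as (document, term, count) triplets. The corpus and the names ``docs``, ``term_index``, and ``triplets`` are illustrative assumptions, not part of poisson-topicmodels.

.. code-block:: python

   from collections import Counter

   # Toy corpus, already tokenized and preprocessed (hypothetical data).
   docs = [
       ["research", "data", "research", "experiment"],
       ["vote", "policy", "vote", "election", "data"],
       ["recipe", "cook", "flavor", "recipe"],
   ]

   # Map each unique term to a column index (sorted for reproducibility).
   vocab = sorted({term for doc in docs for term in doc})
   term_index = {term: j for j, term in enumerate(vocab)}

   # Keep only the nonzero counts as (row, column, count) triplets.
   triplets = []
   for i, doc in enumerate(docs):
       for term, count in sorted(Counter(doc).items()):
           triplets.append((i, term_index[term], count))

   # Compare stored entries with the size of the full dense matrix.
   n_cells = len(docs) * len(vocab)
   density = len(triplets) / n_cells
   print(f"shape: ({len(docs)}, {len(vocab)})")
   print(f"nonzero entries: {len(triplets)} of {n_cells} ({density:.0%})")

Triplets in this form correspond to the coordinate (COO) layout. In practice they would typically be handed to a sparse-matrix constructor such as ``scipy.sparse.coo_matrix`` and converted to CSR before training.
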

Vocabulary
==========

The **vocabulary** is the complete list of unique words (terms) in your corpus.

- Size depends on preprocessing: 500 - 100,000+ words
- Typically built after preprocessing steps such as:

  - Lowercasing ("Hello" → "hello")
  - Removing punctuation
  - Stopword removal ("the", "a", "is")
  - Stemming/lemmatization ("running" → "run")

Example vocabulary:

.. code-block:: python

   import numpy as np

   vocab = np.array([
       'research',    # word_0
       'data',        # word_1
       'experiment',  # word_2
       'science',     # word_3
       # ... (terms word_4 through word_998 elided)
       'cooking',     # word_999
   ])

Topics and Word Distributions
==============================

A **topic** is a probability distribution over vocabulary terms.

Example topic (topic_0):

.. code-block:: python

   # P(word | topic_0), shown for a few vocabulary terms
   p_word_given_topic_0 = {
       'research': 0.08,
       'data': 0.07,
       'experiment': 0.06,
       'cooking': 0.001,
       'science': 0.05,
       # ...
   }

Top words for this topic: 'research', 'data', 'experiment', 'science', ...

**Interpretation**: High-probability words characterize the topic; low-probability words (such as 'cooking' here) contribute little to it.

Topics are represented as vectors indexed by vocabulary position:

.. code-block:: python

   # First four entries shown; the full vector has shape (vocab_size,)
   topic_0 = np.array([0.08, 0.07, 0.06, 0.05])

Document-Topic Mixtures
=======================

A **document** is a mixture of topics, that is, a probability distribution over topics.

Example document:

.. code-block:: python

   # P(topic | doc_1)
   p_topic_given_doc_1 = {
       'topic_0': 0.60,  # Science
       'topic_1': 0.25,  # Commerce/Business
       'topic_2': 0.15,  # Politics
   }

Interpretation: This document is primarily about science (60%), with some business (25%) and a little politics (15%).

Represented as a vector:

.. code-block:: python

   doc_1_topics = np.array([0.60, 0.25, 0.15])  # shape: (num_topics,)

The Complete Picture
====================

Combined view: the model explains the observed word counts as approximately the product of document-topic weights and topic-word weights. Inference runs this in reverse, recovering θ and β from the counts. The values below are schematic, with K = 3 topics.

.. code-block:: text

   Document-Topic (θ)            Topics (β)                        Documents (C)
   [topic mixture]         ×     [topic-word weights]        ≈     [word counts]

   θ_d1: [0.60, 0.25, 0.15]      β_0: [0.1, 0.05, 0.02, ...]       d1: [5, 2, 3, ...]
   θ_d2: [0.25, 0.50, 0.25]  ×   β_1: [0.02, 0.08, 0.06, ...]  ≈   d2: [0, 8, 1, ...]
   θ_d3: [0.15, 0.10, 0.75]      β_2: [0.05, 0.02, 0.09, ...]      d3: [2, 1, 7, ...]

Poisson Factorization Model
============================

The core model in poisson-topicmodels is **Poisson Factorization** (PF).

**Generative Process** (how data is created):

1. For each document d:

   - Draw document-topic intensities: $\theta_d \sim \text{Gamma}(\alpha, \alpha)^K$

2. For each topic k:

   - Draw topic-word intensities: $\beta_k \sim \text{Gamma}(\eta, \eta)^V$

3. For each document-word pair (d, w):

   - Draw count: $C_{dw} \sim \text{Poisson}(\sum_k \theta_d^k \beta_k^w)$

Where:

- $C_{dw}$ = observed count of word w in document d
- $\theta_d^k$ = intensity of topic k in document d
- $\beta_k^w$ = intensity of word w in topic k
- K = number of topics
- V = vocabulary size

**Why Poisson?**

Traditional LDA uses:

- Multinomial likelihood: topic proportions in each document are constrained to sum to 1
- Hierarchical Dirichlet priors: inference typically relies on more involved sampling or variational updates

Poisson factorization:

- Natural for count data
- Models each count's rate as a linear combination of topic-word factors
- Efficient SVI training
- Flexible for extensions

Inference: Learning from Data
================================

We observe:

- Document-term matrix: $\{C_{dw}\}$ for all d, w

We want to learn:

- Topics: $\{\beta_k\}$ - what each topic is about
- Document-topics: $\{\theta_d\}$ - topic mixtures per document

**Bayesian Approach**:

1. Place priors: $\theta_d, \beta_k \sim \text{prior}$
2. Target the posterior $P(\theta, \beta | C)$ given the observed data
3. Approximate it by maximizing the evidence lower bound (ELBO)

This is done via **Stochastic Variational Inference (SVI)**:

- Approximate the posterior with a mean-field variational family
- Update with mini-batches of documents
- Converge to a local optimum

Hyperparameters
================

Each model has hyperparameters controlling model size and the inference process:

**Learning Rate** (lr)
  Controls step size in optimization. Typical range: 0.001 - 0.1

  - Higher → faster learning but less stable
  - Lower → slower but more stable

**Number of Topics** (K)
  How many topics to discover. There is no universal "right" answer.
  Start with 10-20 topics, then evaluate using coherence, perplexity, or domain knowledge.

**Batch Size**
  Documents per training iteration. Typical: 32, 64, 128, 256

  - Larger → more stable gradients, slower iterations
  - Smaller → noisier but faster iterations

**Iterations/Epochs**
  How long to train. Usually 100-1000 iterations until convergence.

Convergence and Loss
====================

Training is monitored through **loss** (negative ELBO):

- **Early iterations**: Loss decreases rapidly (large changes)
- **Late iterations**: Loss decreases slowly (fine-tuning)
- **Convergence**: Loss plateaus (further training unlikely to help)

Example learning curve:

.. code-block:: text

   Loss
     ^
     |'.
     |  '.   (Steep learning)
     |    '-.
     |       '--.________  (Learning plateau)
     |___________________> Iteration

**Early stopping**: Stop when the loss plateaus rather than training for a fixed number of iterations.

Interpreting Topics
====================

After training, topics are interpreted through their top words:

**High-quality topic**: Top words form a coherent theme

.. code-block:: text

   Topic 5: research, experiment, data, variable, analysis, hypothesis
   Topic 12: president, congress, vote, senator, legislation, party

**Low-quality topic**: Top words are generic, scattered across themes, or nearly identical to those of other topics

.. code-block:: text

   Topic 3: the, of, and, to, a, in  (mostly stopwords - bad!)
   Topic 7: research, data, president, cooking, vote  (incoherent)

Quality assessment:

- Manual inspection of top words
- Coherence metrics
- Domain expert evaluation
- Downstream task performance

Model Assumptions
=================

Topic models make several assumptions. Understanding them helps you use the models effectively:

**Bag-of-Words**: Word order doesn't matter

- "cat chases mouse" = "mouse chases cat"
- Topic models only see word counts, not sequences
- Good for thematic analysis, not for capturing narrative structure

**Mixture Assumption**: Documents are mixtures of topics

- Not all documents follow this (some are single-topic)
- More relevant for long documents (100+ words) than short snippets

**Topic Independence**: Topics are assumed independent of one another

- In reality, some topics co-occur
- The model captures co-occurrence only indirectly, through the inferred document-topic mixtures

**Homogeneous Vocabulary**: Same vocabulary across all documents

- Strong assumption but rarely problematic in practice

Common Pitfalls
================

**Too many topics**: Topics become redundant and incoherent

**Too few topics**: Topics are vague, combining distinct themes

**Poor preprocessing**: Stopwords or junk data create low-quality topics

**Short documents**: Sparse word counts → unreliable inference (need >50 words typically)

**Wrong batch size**: Too large → slow iterations; too small → noisy updates

**Insufficient training**: Model hasn't converged

**No validation**: Blindly trusting discovered topics without inspection

Next Steps
==========

- :doc:`poisson_factorization` - Deep dive into Poisson Factorization model
- :doc:`seeded_models` - How to guide discovery with keywords
- :doc:`covariate_models` - Modeling topic structure with metadata
- :doc:`../api/index` - API reference for implementation details