Core Concepts

This page introduces the fundamental concepts underlying probabilistic topic modeling and poisson-topicmodels.

What is Topic Modeling?

Topic modeling is a statistical technique for discovering abstract “topics” that occur in a collection of documents.

Key Idea:

  • Each topic is a distribution over words (some words more likely than others)

  • Each document is a mixture of topics (can contain multiple topics)

  • We observe word counts in documents and infer the hidden topic structure

Example:

Imagine 3 documents about science, politics, and cuisine. The model might discover:

  • Topic 1 (Science): “research”, “experiment”, “data”, “variable”…

  • Topic 2 (Politics): “government”, “vote”, “policy”, “election”…

  • Topic 3 (Cuisine): “recipe”, “cook”, “ingredient”, “flavor”…

Document 1 (Science paper): 80% Topic 1, 10% Topic 2, 10% Topic 3 Document 2 (Political cookbook): 30% Topic 1, 50% Topic 2, 20% Topic 3 Document 3 (Cooking blog): 5% Topic 1, 5% Topic 2, 90% Topic 3

Document-Term Matrix

The fundamental input to all topic models is a document-term matrix (DTM):

  • Rows: Documents

  • Columns: Vocabulary terms (words)

  • Values: Word counts in each document

Example (5 documents × 10 vocabulary):

Document  | word_1 | word_2 | word_3 | ... | word_10
----------+--------+--------+--------+-----+---------
doc_1     |   3    |   0    |   5    | ... |    1
doc_2     |   1    |   2    |   0    | ... |    4
doc_3     |   0    |   7    |   2    | ... |    0
...       |   ..   |   ..   |   ..   | ... |   ..
doc_5     |   2    |   1    |   3    | ... |    6

Sparse Format: In practice, DTMs are very sparse (mostly zeros) because documents use only a small fraction of vocabulary. We use sparse matrix formats (CSR, CSC) for efficiency.

Vocabulary

The vocabulary is the complete list of unique words (terms) in your corpus.

  • Size depends on preprocessing: 500 - 100,000+ words

  • Typically includes preprocessing: - Lowercasing (“Hello” → “hello”) - Removing punctuation - Stopword removal (“the”, “a”, “is”) - Stemming/lemmatization (“running” → “run”)

Example vocabulary:

vocab = np.array([
    'research',     # word_0
    'data',         # word_1
    'experiment',   # word_2
    'science',      # word_3
    ...
    'cooking'       # word_999
])

Topics and Word Distributions

A topic is a probability distribution over vocabulary terms.

Example topic (topic_2):

P(word | topic_2) = {
    'research': 0.08,
    'data': 0.07,
    'experiment': 0.06,
    'cooking': 0.001,
    'science': 0.05,
    ...
}

Top words for this topic: ‘research’, ‘data’, ‘experiment’, ‘science’, …

Interpretation: High-probability words characterize the topic; low-probability words are just noise.

Topics are represented as vectors:

topic_2 = np.array([0.08, 0.07, 0.06, 0.05, ...])  # shape: (vocab_size,)

Document-Topic Mixtures

A document is a mixture of topics - a probability distribution over topics.

Example document:

P(topic | doc_1) = {
    'topic_0': 0.60,  # Science
    'topic_1': 0.25,  # Commerce/Business
    'topic_2': 0.15,  # Politics
}

Interpretation: This document is primarily about science (60%), some business (25%), and a bit about politics (15%).

Represented as a vector:

doc_1_topics = np.array([0.60, 0.25, 0.15])  # shape: (num_topics,)

The Complete Picture

Combined view:

Documents              Topics (β)        Document-Topic (θ)

[word counts]  →  [matrix mult]  →   [topic mixture]

d1: [5, 2, 3, ...]     β_0: [0.1, 0.05, 0.02, ...]     θ_d1: [0.60, 0.25, 0.15]
d2: [0, 8, 1, ...]  ×  β_1: [0.02, 0.08, 0.06, ...]  =  θ_d2: [0.25, 0.50, 0.25]
d3: [2, 1, 7, ...]     β_2: [0.05, 0.02, 0.09, ...]     θ_d3: [0.15, 0.10, 0.75]
                       β_K: [0.01, 0.03, 0.02, ...]

Poisson Factorization Model

The core model in poisson-topicmodels is Poisson Factorization (PF).

Generative Process (how data is created):

  1. For each document d:

    • Draw document-topic distribution: $theta_d sim text{Gamma}(alpha, alpha)^K$

  2. For each topic k:

    • Draw topic-word distribution: $beta_k sim text{Gamma}(eta, eta)^V$

  3. For each document-word pair (d, w):

    • Draw count: $C_{dw} sim text{Poisson}(sum_k theta_d^k beta_k^w)$

Where: - $C_{dw}$ = observed word count - $theta_d^k$ = intensity of topic k in document d - $beta_k^w$ = intensity of word w in topic k - K = number of topics - V = vocabulary size

Why Poisson?

Traditional LDA uses: - Multinomial: Exactly K topics per document - Hierarchical Dirichlet: Complex sampling

Poisson factorization: - Natural for count data - Linear combination of topic-word factors - Efficient SVI training - Flexible for extensions

Inference: Learning from Data

We observe:

  • Document-term matrix: ${C_{dw}}$ for all d, w

We want to learn:

  • Topics: ${beta_k}$ - what each topic is about

  • Document-topics: ${theta_d}$ - topic mixtures per document

Bayesian Approach:

  1. Place priors: $theta_d, beta_k sim text{prior}$

  2. Compute posterior: $P(theta, beta | C)$ given observed data

  3. Optimize: Maximize evidence lower bound (ELBO) with SVI

This is done via Stochastic Variational Inference (SVI):

  • Approximate posterior with mean-field variational family

  • Update with mini-batches of documents

  • Converge to local optimum

Hyperparameters

Each model has hyperparameters controlling the inference process:

Learning Rate (lr)

Controls step size in optimization. Typical range: 0.001 - 0.1

  • Higher → faster learning but less stable

  • Lower → slower but more stable

Number of Topics (K)

How many topics to discover. No universal “right” answer.

  • Start with 10-20

  • Evaluate using coherence, perplexity, or domain knowledge

Batch Size

Documents per training iteration. Typical: 32, 64, 128, 256

  • Larger → more stable gradients, slower iterations

  • Smaller → noisier but faster iterations

Iterations/Epochs

How long to train. Usually 100-1000 iterations until convergence.

Convergence and Loss

Training is monitored through loss (negative ELBO):

  • Early iterations: Loss decreases rapidly (large changes)

  • Late iterations: Loss decreases slowly (fine-tuning)

  • Convergence: Loss plateaus (further training unlikely to help)

Example learning curve:

Loss
^
|     .-'       (Learning plateau)
|    /'
|  .'           (Steep learning)
|.'
|___________________> Iteration

Early stopping: Stop when loss plateus rather than training to fixed number of iterations.

Interpreting Topics

After training, topics are interpretable through their top words:

High-quality topic: Top words form coherent theme

Topic 5: research, experiment, data, variable, analysis, hypothesis
Topic 12: president, congress, vote, senator, legislation, party

Low-quality topic: Top words scattered or similar across topics

Topic 3: the, of, and, to, a, in  (mostly stopwords - bad!)
Topic 7: research, data, president, cooking, vote  (incoherent)

Quality assessment: - Manual inspection of top words - Coherence metrics - Domain expert evaluation - Downstream task performance

Model Assumptions

Topic models make several assumptions—understanding these helps you use them effectively:

Bag-of-Words: Word order doesn’t matter

  • “cat chases mouse” = “mouse chases cat”

  • Topic models only see word counts, not sequences

  • Good for thematic analysis, not for capturing narrative structure

Mixture Assumption: Documents are mixtures of topics

  • Not all documents follow this (some are single-topic)

  • More relevant for long documents (100+ words) than short snippets

Topic Independence: Topics are independent

  • In reality, some topics co-occur

  • Model learns this through document-topic mixtures

Homogeneous Vocabulary: Same vocabulary across all documents

  • Strong assumption but rarely problematic in practice

Common Pitfalls

Too many topics: Topics become redundant and incoherent

Too few topics: Topics are vague, combining distinct themes

Poor preprocessing: Stopwords or junk data create low-quality topics

Short documents: Sparse word counts → unreliable inference (need >50 words typically)

Wrong batch size: Too large → slow iterations; too small → noisy updates

Insufficient training: Model hasn’t converged

No validation: Blindly trusting discovered topics without inspection

Next Steps