Getting Started

Welcome to poisson-topicmodels! This guide will get you up and running in about 5 minutes.

Quickstart: Your First Topic Model

Let’s walk through a complete example from data preparation to model interpretation.

Step 1: Import Required Libraries

import numpy as np
from scipy.sparse import csr_matrix
from poisson_topicmodels import PF

Step 2: Prepare Your Data

Topic models work with a document-term matrix (documents × vocabulary terms) and a vocabulary list.

# Create sample data: 100 documents, 500 vocabulary terms
# In practice, you would load your own text data
np.random.seed(42)
counts = csr_matrix(np.random.poisson(2, (100, 500)).astype(np.float32))
vocab = np.array([f'word_{i}' for i in range(500)])

print(f"Document-term matrix shape: {counts.shape}")
print(f"Vocabulary size: {len(vocab)}")

Data format: The document-term matrix should be a sparse matrix (rows = documents, columns = vocabulary terms) holding non-negative integer counts. The examples in this guide store those counts as float32.
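If you are starting from raw text rather than random counts, the matrix and vocabulary can be assembled by hand. The following is a minimal sketch, assuming a tiny made-up corpus and naive whitespace tokenization; the names docs, tokenized, and word_to_col are illustrative and not part of poisson-topicmodels.

```python
from collections import Counter

import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical toy corpus; replace with your own documents.
docs = [
    "taxes budget economy taxes",
    "health care hospital budget",
    "economy jobs economy growth",
]

# Naive whitespace tokenization. Real pipelines usually also lowercase,
# strip punctuation, and drop stop words.
tokenized = [doc.split() for doc in docs]

# Sort the vocabulary so that column indices are deterministic.
vocab_list = sorted({tok for toks in tokenized for tok in toks})
word_to_col = {word: col for col, word in enumerate(vocab_list)}
vocab = np.array(vocab_list)

# Collect (row, column, count) triplets for the sparse matrix.
rows, cols, vals = [], [], []
for row, toks in enumerate(tokenized):
    for word, n in Counter(toks).items():
        rows.append(row)
        cols.append(word_to_col[word])
        vals.append(n)

counts = csr_matrix(
    (vals, (rows, cols)),
    shape=(len(docs), len(vocab_list)),
    dtype=np.float32,
)

print(f"Document-term matrix shape: {counts.shape}")
print(f"Vocabulary: {vocab_list}")
```

The resulting counts and vocab have the same types as the random example above (a float32 CSR matrix and a NumPy array of strings), so they can be passed to the model in the same way.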

Step 3: Initialize and Train the Model

# Create a Poisson Factorization model with 10 topics
model = PF(
    counts=counts,
    vocab=vocab,
    num_topics=10,
    batch_size=32,
    random_seed=42
)

# Train for 200 steps with learning rate 0.01
params = model.train_step(
    num_steps=200,
    lr=0.01
)

Step 4: Extract and Interpret Results

# Quick summary of the fitted model
model.summary()

# Get word-topic associations
beta = model.return_beta()  # DataFrame: words × topics
print(f"Beta shape: {beta.shape}")

# Get top words for each topic
top_words = model.return_top_words_per_topic(n=10)
print("\nTop 10 words per topic:")
for topic_id, words in top_words.items():
    print(f"Topic {topic_id}: {', '.join(words)}")

# Get document-topic distributions
categories, e_theta = model.return_topics()
print(f"\nDocument-topic matrix shape: {e_theta.shape}")
print(f"Dominant topic for first doc: {categories[0]}")

Understanding the Output

Beta (return_beta())

DataFrame of shape (vocabulary_size, num_topics).

Each column is a topic: word-level association weights.

Topics (return_topics())

Returns (categories, E_theta) — dominant topic per document and the full document-topic proportions matrix.

Top Words (return_top_words_per_topic(n))

A dict mapping topic identifiers to their top-n words, useful for interpretation.

Summary (summary())

Prints a formatted overview of the model: loss, top words, and model-specific details.
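To make these outputs concrete, here is a small NumPy sketch of what "top words per topic" and "dominant topic" mean for a words × topics weight matrix and a documents × topics matrix. All numbers are made up for illustration, and the helper top_words_per_topic is hypothetical: it shows the idea, not how the library computes its results or how it labels topics.

```python
import numpy as np

# Made-up weights for illustration only: 6 words x 2 topics.
vocab = np.array(["tax", "budget", "nurse", "clinic", "vote", "law"])
beta = np.array([
    [0.9, 0.1],
    [0.7, 0.2],
    [0.1, 0.8],
    [0.2, 0.6],
    [0.5, 0.3],
    [0.4, 0.4],
])

def top_words_per_topic(beta, vocab, n):
    """Return {topic index: the n words with the largest weight in that column}."""
    top = {}
    for k in range(beta.shape[1]):
        # argsort is ascending, so reverse it and keep the first n indices.
        order = np.argsort(beta[:, k])[::-1][:n]
        top[k] = vocab[order].tolist()
    return top

print(top_words_per_topic(beta, vocab, n=3))
# {0: ['tax', 'budget', 'vote'], 1: ['nurse', 'clinic', 'law']}

# Made-up document-topic matrix: 3 documents x 2 topics.
e_theta = np.array([[2.0, 0.5], [0.3, 1.7], [1.1, 1.0]])

# The dominant topic of a document is the column with the largest value.
dominant = e_theta.argmax(axis=1)
print(dominant.tolist())
# [0, 1, 0]
```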

Next Steps

Now that you have a working model, explore:

  1. Model Variants (see Fundamentals): Explore all models, such as seeded PF (SPF) for guided discovery and STBS for topic-specific ideology.

  2. Training & Configuration (see Tutorials): Understand training options, hyperparameters, and GPU acceleration.

  3. Practical Recipes (see How-To Guides): Common tasks and advanced workflows.

  4. Advanced Usage (see How-To Guides): Extract results, customize inference, and integrate with your pipeline.

Complete Example with Real-ish Data

Here’s a more realistic example with synthetic documents that have meaningful structure:

import numpy as np
from scipy.sparse import csr_matrix
from poisson_topicmodels import PF

# Create synthetic documents with 3 underlying topics
np.random.seed(42)
num_docs = 200
num_words = 1000
num_topics = 3

# Define some "topic-specific" words
topic_words = {
    0: list(range(0, 100)),      # Words 0-99 for topic 0
    1: list(range(100, 200)),    # Words 100-199 for topic 1
    2: list(range(200, 300)),    # Words 200-299 for topic 2
}

# Generate documents with topic structure
counts_list = []
for doc_id in range(num_docs):
    # Each document is a mixture of topics
    topic_dist = np.random.dirichlet([1] * num_topics)
    word_counts = np.zeros(num_words)

    for topic_id in range(num_topics):
        topic_weight = topic_dist[topic_id]
        words_in_topic = topic_words[topic_id]
        for _ in range(int(50 * topic_weight)):
            word_id = np.random.choice(words_in_topic)
            word_counts[word_id] += 1

    counts_list.append(word_counts)

counts = csr_matrix(np.array(counts_list).astype(np.float32))
vocab = np.array([f'word_{i}' for i in range(num_words)])

# Train model with matching number of topics
model = PF(counts, vocab, num_topics=3, batch_size=32)
params = model.train_step(num_steps=200, lr=0.01, random_seed=42)

# The model should discover the 3 underlying topics
model.summary()
# Request 20 words per topic, but print only the first 10 to keep the output short
top_words = model.return_top_words_per_topic(n=20)
for topic_id, words in top_words.items():
    print(f"\nTopic {topic_id}:")
    print(f"  {', '.join(words[:10])}")
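Because the synthetic data plants three blocks of 100 words each (ids 0-99, 100-199, 200-299), one way to check recovery is to measure how many of a topic's top words fall into each block. The sketch below assumes word names of the form word_<id>, as generated above; planted_topic_share is a hypothetical helper, and example_words is a hand-written list that only shows the expected input format, not actual model output.

```python
def planted_topic_share(words, block_size=100, num_planted=3):
    """Fraction of `words` that fall into each planted block of word ids."""
    hits = [0] * num_planted
    for word in words:
        word_id = int(word.split("_")[1])  # 'word_137' -> 137
        block = word_id // block_size
        if block < num_planted:
            hits[block] += 1
    return [h / len(words) for h in hits]

# Hand-written placeholder list, not real model output.
example_words = ["word_105", "word_112", "word_187", "word_42", "word_150"]
print(planted_topic_share(example_words))
# [0.2, 0.8, 0.0]
```

In practice you would call the helper on each list in top_words. A well-recovered topic should place most of its share in a single block.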

Key Concepts

Document-Term Matrix

The core input format: a sparse matrix where rows are documents and columns are vocabulary terms, containing word counts.

Topics

Latent variables representing abstract themes. Each topic is a distribution over words.

Topic Modeling

Statistical technique to discover and analyze latent topics in text data.
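As a rough illustration of the idea behind Poisson factorization, the sketch below simulates counts from non-negative document-topic and topic-word factors. It is the generic textbook form under assumed Gamma priors with arbitrary parameters; the priors and parameterization used inside poisson-topicmodels may differ. Note that beta here is topics × words, the transpose of the DataFrame returned by return_beta().

```python
import numpy as np

rng = np.random.default_rng(0)
num_docs, num_words, num_topics = 5, 8, 2

# Non-negative document-topic intensities and topic-word weights.
# Gamma draws are a common choice; the shape and scale values are arbitrary.
theta = rng.gamma(shape=0.5, scale=1.0, size=(num_docs, num_topics))
beta = rng.gamma(shape=0.5, scale=1.0, size=(num_topics, num_words))

# Expected count of each word in each document.
rate = theta @ beta

# Observed counts are Poisson draws around those expected counts.
counts = rng.poisson(rate)

print(rate.shape, counts.shape)
```

Fitting a topic model runs this process in reverse: given only counts, it estimates factors theta and beta that explain them.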

Stochastic Variational Inference (SVI)

Efficient training method that processes documents in small batches (mini-batch training).
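The mini-batch idea can be sketched in a few lines of NumPy: draw a random subset of documents at each step and rescale its contribution so that it stands in for the full corpus. This is a generic illustration of mini-batch subsampling, with a hypothetical helper sample_minibatch; it does not show the package's actual batching code.

```python
import numpy as np

rng = np.random.default_rng(0)
num_docs, num_words, batch_size = 100, 20, 32
counts = rng.poisson(2.0, size=(num_docs, num_words))

def sample_minibatch(counts, batch_size, rng):
    """Draw a random subset of documents and the factor that rescales it."""
    idx = rng.choice(counts.shape[0], size=batch_size, replace=False)
    scale = counts.shape[0] / batch_size
    return counts[idx], scale

batch, scale = sample_minibatch(counts, batch_size, rng)

# Rescaling a batch statistic gives an unbiased estimate of the full-data one.
estimated_total = scale * batch.sum()
print(f"batch shape: {batch.shape}, scale: {scale}")
print(f"estimated total: {estimated_total:.0f}, true total: {counts.sum()}")
```

Each step touches only batch_size documents, which is why the cost per step stays roughly constant as the corpus grows.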

GPU Acceleration

Computations run on GPU (if available) for significant speedups on large datasets.

Common Parameters

num_topics: Number of topics to discover

batch_size: Documents processed per training step

num_steps: Number of training steps (the num_steps argument of train_step)

lr: Learning rate, the step size for optimization (the lr argument of train_step)

random_seed: For reproducibility

Tips for Best Results

  1. Tune the Number of Topics: Start with 10-20 topics, adjust based on coherence

  2. Use a Good Batch Size: Larger batches (256+) for stability, smaller (32) for faster iterations

  3. Monitor Training: Check that loss decreases smoothly

  4. Validate Topics: Read top words to verify topics make sense

  5. Use GPU: If available, GPU acceleration can provide a 10-100x speedup on large datasets; the actual gain depends on data size and hardware

  6. Set Random Seed: For reproducibility in research
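For the "Monitor Training" tip, a moving average makes the overall trend of a noisy loss curve easier to judge. The sketch below uses a synthetic loss sequence as a stand-in, since this guide does not state how the package exposes recorded losses; moving_average is a hypothetical helper.

```python
import numpy as np

def moving_average(values, window):
    """Smooth a 1-D sequence with a simple moving average."""
    kernel = np.ones(window) / window
    return np.convolve(values, kernel, mode="valid")

# Synthetic stand-in for a recorded loss curve: exponential decay plus noise.
rng = np.random.default_rng(0)
steps = np.arange(200)
losses = 100.0 * np.exp(-0.02 * steps) + rng.normal(0.0, 2.0, size=200)

smoothed = moving_average(losses, window=20)

# Compare the first and last smoothed values as a coarse trend check.
print(f"start: {smoothed[0]:.1f}, end: {smoothed[-1]:.1f}")
print("loss trending down:", smoothed[-1] < smoothed[0])
```

If the smoothed curve plateaus early or oscillates, lowering the learning rate or increasing the batch size is a reasonable first thing to try.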
