Getting Started

Welcome to poisson-topicmodels! This guide will get you up and running in about 5 minutes.

Quickstart: Your First Topic Model

Let’s walk through a complete example from data preparation to model interpretation.

Step 1: Import Required Libraries

import numpy as np
from scipy.sparse import csr_matrix
from poisson_topicmodels import PF

Step 2: Prepare Your Data

Topic models work with a document-term matrix (documents × vocabulary terms) and a vocabulary list.

# Create sample data: 100 documents, 500 vocabulary terms
# In practice, you would load your own text data
np.random.seed(42)
counts = csr_matrix(np.random.poisson(2, (100, 500)).astype(np.float32))
vocab = np.array([f'word_{i}' for i in range(500)])

print(f"Document-term matrix shape: {counts.shape}")
print(f"Vocabulary size: {len(vocab)}")

Data format: The document-term matrix should be a sparse matrix (rows = documents, columns = vocabulary terms) holding non-negative integer counts. The example above stores those counts as float32 values.
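If you are starting from raw text rather than a ready-made matrix, the counts can be assembled with the standard library alone. The sketch below is one minimal way to do it and is not part of poisson-topicmodels: the whitespace tokenizer and the helper name build_document_term_matrix are illustrative assumptions.

```python
from collections import Counter


def build_document_term_matrix(docs):
    """Return (rows, vocab): dense count rows and a sorted vocabulary.

    Hypothetical helper for illustration; not part of poisson-topicmodels.
    """
    # Naive tokenizer (an assumption): lowercase and split on whitespace.
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({token for tokens in tokenized for token in tokens})
    column = {word: j for j, word in enumerate(vocab)}

    rows = []
    for tokens in tokenized:
        row = [0] * len(vocab)
        for word, count in Counter(tokens).items():
            row[column[word]] = count
        rows.append(row)
    return rows, vocab


docs = ["the cat sat", "the dog sat on the mat"]
rows, vocab = build_document_term_matrix(docs)
print(vocab)  # ['cat', 'dog', 'mat', 'on', 'sat', 'the']
print(rows)   # [[1, 0, 0, 0, 1, 1], [0, 1, 1, 1, 1, 2]]
```

The rows can then be converted as in Step 2, for example with csr_matrix(np.array(rows, dtype=np.float32)), using np.array(vocab) as the vocabulary.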

Step 3: Initialize and Train the Model

# Create a Poisson Factorization model with 10 topics
model = PF(
    counts=counts,
    vocab=vocab,
    num_topics=10,
    batch_size=32,
)

# Train for 200 steps with learning rate 0.01
params = model.train_step(
    num_steps=200,
    lr=0.01,
    random_seed=42,
)

Step 4: Extract and Interpret Results

# Quick summary of the fitted model
model.summary()

# Get word-topic associations
beta = model.return_beta()  # DataFrame: words × topics
print(f"Beta shape: {beta.shape}")

# Get top words for each topic
top_words = model.return_top_words_per_topic(n=10)
print("\nTop 10 words per topic:")
for topic_id, words in top_words.items():
    print(f"Topic {topic_id}: {', '.join(words)}")

# Get document-topic distributions
categories, e_theta = model.return_topics()
print(f"\nDocument-topic matrix shape: {e_theta.shape}")
print(f"Dominant topic for first doc: {categories[0]}")

Understanding the Output

Beta (return_beta())

DataFrame of shape (vocabulary_size, num_topics).

Each column is a topic; its entries are word-level association weights.

Topics (return_topics())

Returns (categories, E_theta) — dominant topic per document and the full document-topic proportions matrix.

Top Words (return_top_words_per_topic(n))

A dict mapping topic identifiers to their top-n words, useful for interpretation.

Summary (summary())

Prints a formatted overview of the model: loss, top words, and model-specific details.
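To make the relationship between these outputs concrete, the sketch below derives top words and dominant topics from plain NumPy arrays. It assumes beta is a (vocabulary_size, num_topics) array of weights and e_theta is a (num_docs, num_topics) array. The small matrices are made up for illustration, and the library's own methods may compute these quantities differently.

```python
import numpy as np

vocab = np.array(["tax", "vote", "goal", "match"])

# Hypothetical word-topic weights: rows = words, columns = topics.
beta = np.array([
    [0.9, 0.1],
    [0.7, 0.2],
    [0.1, 0.8],
    [0.2, 0.6],
])

# Hypothetical document-topic weights: rows = documents, columns = topics.
e_theta = np.array([
    [0.8, 0.2],
    [0.3, 0.7],
    [0.5, 0.4],
])

n = 2
top_words = {}
for k in range(beta.shape[1]):
    # Indices of the n largest weights in column k, largest first.
    order = np.argsort(beta[:, k])[::-1][:n]
    top_words[k] = vocab[order].tolist()

# Dominant topic = column with the largest weight in each row.
dominant = e_theta.argmax(axis=1)

print(top_words)          # {0: ['tax', 'vote'], 1: ['goal', 'match']}
print(dominant.tolist())  # [0, 1, 0]
```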

Configuring Priors and Initial Values

Recent versions let you control inference more directly without subclassing a model. Pass optional dictionaries when creating a model:

model = PF(
    counts=counts,
    vocab=vocab,
    num_topics=10,
    batch_size=32,
    hyperparams={"a_beta": 0.5, "b_beta": 1.0},
    initparams={
        "beta_shape": np.ones((10, counts.shape[1]), dtype=np.float32),
        "beta_rate": np.ones((10, counts.shape[1]), dtype=np.float32),
    },
    constantparams={
        # Example: "beta": fixed_topic_word_matrix
    },
)

hyperparams overrides model priors, initparams provides starting values for variational parameters, and constantparams fixes latent variables so they are not updated by SVI. Shapes are validated against the model’s expected latent variables and variational parameters.
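Because shapes are validated, it can help to check initial values before constructing the model. The sketch below is an illustration under stated assumptions: check_init_shapes is a hypothetical helper, and the expected shape (num_topics, vocabulary_size) is inferred from the beta_shape and beta_rate arrays in the example above, not from the library's documentation. Other variational parameters may expect different shapes.

```python
import numpy as np


def check_init_shapes(initparams, num_topics, vocab_size):
    """Raise ValueError if any array is not (num_topics, vocab_size).

    Hypothetical helper; the library performs its own validation.
    """
    expected = (num_topics, vocab_size)
    for name, value in initparams.items():
        actual = np.shape(value)
        if actual != expected:
            raise ValueError(f"{name}: expected shape {expected}, got {actual}")


num_topics, vocab_size = 10, 500
initparams = {
    "beta_shape": np.ones((num_topics, vocab_size), dtype=np.float32),
    "beta_rate": np.ones((num_topics, vocab_size), dtype=np.float32),
}
check_init_shapes(initparams, num_topics, vocab_size)  # passes silently

# A transposed array is a common mistake and is caught by the check.
transposed = {"beta_shape": np.ones((vocab_size, num_topics))}
try:
    check_init_shapes(transposed, num_topics, vocab_size)
except ValueError as err:
    print(err)
```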

Use input_params() to inspect what a model has registered:

params = model.train_step(num_steps=200, lr=0.01, random_seed=42)
print(model.input_params()["initialized_variables"].keys())
print(model.input_params()["latent_constant_variables"].keys())
print(model.input_params()["hyperparameters"].keys())

Constructor-provided settings appear immediately. Default keys are populated when the model and guide execute, usually during training.

Next Steps

Now that you have a working model, explore:

Model Variants (Fundamentals): Explore all models, such as seeded PF (SPF) for guided discovery and STBS for topic-specific ideology.
Training & Configuration (Tutorials): Understand training options, hyperparameters, and GPU acceleration.
Practical Recipes (How-To Guides): Common tasks and advanced workflows.
Advanced Usage (How-To Guides): Extract results, customize inference, and integrate with your pipeline.

Complete Example with Real-ish Data

Here’s a more realistic example with synthetic documents that have meaningful structure:

import numpy as np
from scipy.sparse import csr_matrix
from poisson_topicmodels import PF

# Create synthetic documents with 3 underlying topics
np.random.seed(42)
num_docs = 200
num_words = 1000
num_topics = 3

# Define some "topic-specific" words
topic_words = {
    0: list(range(0, 100)),      # Words 0-99 for topic 0
    1: list(range(100, 200)),    # Words 100-199 for topic 1
    2: list(range(200, 300)),    # Words 200-299 for topic 2
}

# Generate documents with topic structure
counts_list = []
for doc_id in range(num_docs):
    # Each document is a mixture of topics
    topic_dist = np.random.dirichlet([1] * num_topics)
    word_counts = np.zeros(num_words)

    for topic_id in range(num_topics):
        topic_weight = topic_dist[topic_id]
        words_in_topic = topic_words[topic_id]
        for _ in range(int(50 * topic_weight)):
            word_id = np.random.choice(words_in_topic)
            word_counts[word_id] += 1

    counts_list.append(word_counts)

counts = csr_matrix(np.array(counts_list).astype(np.float32))
vocab = np.array([f'word_{i}' for i in range(num_words)])

# Train model with matching number of topics
model = PF(counts, vocab, num_topics=3, batch_size=32)
params = model.train_step(num_steps=200, lr=0.01, random_seed=42)

# The model should discover the 3 underlying topics
model.summary()
top_words = model.return_top_words_per_topic(n=20)
for topic_id, words in top_words.items():
    print(f"\nTopic {topic_id}:")
    print(f"  {', '.join(words[:10])}")
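Because the synthetic data has a known structure, recovery can be checked directly: each topic's top words should fall mostly within one of the three 100-word blocks. The sketch below assumes top_words is a dict of lists of 'word_<index>' strings, as in the example above. block_purity is a hypothetical helper, and the small dict is a made-up stand-in for real model output.

```python
from collections import Counter


def block_purity(words, block_size=100):
    """Return (block, share): the most common word block and its share.

    Hypothetical helper: 'word_137' belongs to block 137 // 100 = 1.
    """
    blocks = Counter(int(w.split("_")[1]) // block_size for w in words)
    block, hits = blocks.most_common(1)[0]
    return block, hits / len(words)


# Made-up stand-in for the dict returned by return_top_words_per_topic.
top_words = {
    0: ["word_112", "word_150", "word_199", "word_101"],
    1: ["word_5", "word_42", "word_250", "word_77"],
}

for topic_id, words in top_words.items():
    block, share = block_purity(words)
    print(f"Topic {topic_id}: block {block}, purity {share:.2f}")
```

Note that the order of learned topics is arbitrary, so learned topic 0 need not correspond to the block labeled 0 in the generator.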

Key Concepts

Document-Term Matrix: The core input format: a sparse matrix where rows are documents and columns are vocabulary terms, containing word counts.
Topics: Latent variables representing abstract themes. Each topic is a distribution over words.
Topic Modeling: Statistical technique to discover and analyze latent topics in text data.
Stochastic Variational Inference (SVI): Efficient training method that processes documents in small batches (mini-batch training).
GPU Acceleration: Computations run on GPU (if available) for significant speedups on large datasets.
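The mini-batch idea behind SVI can be sketched in a few lines: a random subset of documents is drawn at each step, and any per-document sum computed on the batch is rescaled by num_docs / batch_size so that it estimates the full-data quantity. This is a generic NumPy illustration, not the library's training loop; the per-document totals stand in for a per-document loss term.

```python
import numpy as np

rng = np.random.default_rng(42)

num_docs, batch_size = 100, 32
# Stand-in for a per-document loss term (here: total word count per document).
per_doc = rng.poisson(2, size=(num_docs, 500)).sum(axis=1).astype(float)

full_total = per_doc.sum()

estimates = []
for step in range(2000):
    # Draw a mini-batch of document indices without replacement.
    batch = rng.choice(num_docs, size=batch_size, replace=False)
    # Rescale the batch sum so it is an unbiased estimate of the full sum.
    estimates.append(per_doc[batch].sum() * num_docs / batch_size)

print(f"Full-data total:         {full_total:.1f}")
print(f"Mean of batch estimates: {np.mean(estimates):.1f}")
```

Each individual estimate is noisy, but the estimates average out to the full-data value, which is what lets the optimizer make progress from small batches.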

Common Parameters

num_topics: Number of topics to discover

batch_size: Documents processed per training step

num_steps: Number of training steps (passed to train_step)

lr: Learning rate, the step size for optimization (passed to train_step)

random_seed: For reproducibility
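How batch size and step count interact is simple arithmetic. Assuming each training step processes one mini-batch of batch_size documents, as the batch_size description above suggests, the number of passes over the corpus can be estimated as follows. The helper name is illustrative.

```python
import math


def passes_over_corpus(num_docs, batch_size, num_steps):
    """Approximate number of full passes (epochs) over the corpus."""
    return num_steps * batch_size / num_docs


# Values from the quickstart: 100 documents, batches of 32, 200 steps.
num_docs, batch_size, num_steps = 100, 32, 200

steps_per_pass = math.ceil(num_docs / batch_size)
print(steps_per_pass)                                       # 4
print(passes_over_corpus(num_docs, batch_size, num_steps))  # 64.0
```

With a much larger corpus, the same 200 steps may not cover every document even once, so the step count usually needs to grow with the data.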

Tips for Best Results

Tune the Number of Topics: Start with 10-20 topics, adjust based on coherence
Use a Good Batch Size: Larger batches (256+) for stability, smaller (32) for faster iterations
Monitor Training: Check that loss decreases smoothly
Validate Topics: Read top words to verify topics make sense
Use GPU: If available, GPU acceleration can provide 10-100x speedups on large datasets
Set Random Seed: For reproducibility in research
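Monitoring training can be as simple as smoothing the recorded loss values and checking that the smoothed curve trends downward. The sketch below is generic: how loss values are read from a fitted model is not covered here, so the list of losses is a made-up example.

```python
def moving_average(values, window):
    """Simple trailing moving average over a list of numbers."""
    return [
        sum(values[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(values))
    ]


# Made-up noisy loss values, for illustration only.
losses = [100.0, 90.0, 95.0, 80.0, 82.0, 70.0, 73.0, 65.0]

smoothed = moving_average(losses, window=4)
is_decreasing = all(b < a for a, b in zip(smoothed, smoothed[1:]))

print(smoothed)       # [91.25, 86.75, 81.75, 76.25, 72.5]
print(is_decreasing)  # True
```

The raw losses go up and down from step to step, which is expected with mini-batches; the smoothed curve is the better signal of whether training is still improving.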

What’s Next?

Read more examples: See Examples & Applications
Explore other models: Check Fundamentals for seeded, covariate, and embedded variants
Learn advanced techniques: Visit Tutorials
Check the API: Refer to API Reference for detailed documentation

Having Issues?

Check Installation for installation troubleshooting
Read How-To Guides for common tasks
Explore the full API Reference