API Reference

Complete API documentation for poisson-topicmodels.

Note

For auto-generated class and method documentation from docstrings, see API Reference (Auto-generated).

Module Organization

from poisson_topicmodels import (
    # Models
    PF,       # Poisson Factorization (unsupervised)
    SPF,      # Seeded Poisson Factorization (guided)
    CPF,      # Covariate Poisson Factorization (with metadata)
    CSPF,     # Covariate Seeded Poisson Factorization (both)
    ETM,      # Embedded Topic Model (with embeddings)
    TBIP,     # Text-Based Ideal Points (author positions)
    STBS,     # Structured Text-Based Scaling (topic-specific ideal points)
    # Base classes
    NumpyroModel,
    Metrics,
)

Model API Pattern

All models follow the same workflow, shown here with PF (TBIP and STBS take no random_seed argument in train_step):

1. Initialize

model = PF(counts=dtm, vocab=vocab, num_topics=10, batch_size=32)

2. Train

params = model.train_step(num_steps=200, lr=0.01, random_seed=42)

3. Summarize

model.summary()              # Formatted text summary of the fitted model

4. Extract

topics, e_theta = model.return_topics()         # Dominant topic per doc + proportions
beta = model.return_beta()                       # Word–topic DataFrame (words × topics)
top_words = model.return_top_words_per_topic(n=10)  # dict {topic_id: [words]}

5. Evaluate

coherence_df = model.compute_topic_coherence()   # NPMI coherence per topic
diversity = model.compute_topic_diversity()       # Fraction of unique top words (0–1)

6. Visualize

model.plot_model_loss()               # Training loss curve
model.plot_topic_prevalence()         # Mean topic prevalence bar chart
model.plot_topic_correlation()        # Cosine-similarity heatmap
model.plot_document_topic_heatmap()   # Document × topic heatmap
model.plot_topic_wordclouds()         # Wordcloud per topic

Common Parameters

Data

counts (csr_matrix): Document-term matrix (documents × terms)
vocab (ndarray): Vocabulary terms, shape (num_words,)
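
As a minimal sketch of preparing these two inputs, the snippet below builds a tiny document-term matrix by hand with NumPy and SciPy. The toy corpus and the whitespace tokenizer are illustrative assumptions, not part of poisson-topicmodels; a real pipeline would more likely rely on a dedicated vectorizer.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Toy corpus (illustrative only).
docs = ["tax cut bill", "health care bill", "tax health reform"]

# Vocabulary: sorted unique tokens, stored as an ndarray of shape (num_words,).
vocab = np.array(sorted({tok for doc in docs for tok in doc.split()}))
word_to_col = {word: j for j, word in enumerate(vocab)}

# Collect (row, col) pairs; csr_matrix sums duplicate pairs into counts.
rows, cols = [], []
for i, doc in enumerate(docs):
    for tok in doc.split():
        rows.append(i)
        cols.append(word_to_col[tok])

data = np.ones(len(rows), dtype=np.float32)
counts = csr_matrix((data, (rows, cols)), shape=(len(docs), len(vocab)))

print(counts.shape)  # (documents, terms)
```

The resulting counts and vocab have the shapes described above: one row per document, one column per vocabulary term.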

Model configuration

num_topics (int): Number of topics to discover (PF, CPF, TBIP, STBS, ETM)
keywords (dict): Seed words per topic (SPF, CSPF)
residual_topics (int): Extra unsupervised topics (SPF, CSPF)
X_design_matrix (ndarray | DataFrame): Covariates (document-level for CPF/CSPF, author-level for STBS)
authors (ndarray): Author labels per document (TBIP, STBS)
embeddings_mapping (dict): Word → embedding vector (ETM)

Training

num_steps (int): Training iterations
lr (float): Learning rate (step size for optimizer)
batch_size (int): Documents per training step
random_seed (int): Reproducibility seed (supported by PF/SPF/CPF/CSPF/ETM)

Flexible inference inputs

hyperparams (dict): Override prior hyperparameters used inside a model, such as a_beta, b_beta, a_theta, or b_theta.
initparams (dict): Provide custom starting values for variational parameters in the guide, such as beta_shape, beta_rate, theta_shape, or theta_rate.
constantparams (dict): Fix latent variables, such as beta or theta, instead of estimating them with SVI.

Flexible Model Inputs

PF, SPF, CPF, CSPF, TBIP, and STBS accept hyperparams, initparams, and constantparams at initialization. These dictionaries are useful when you need stronger priors, warm starts, ablation studies, or partially fixed model components.

import numpy as np

model = PF(
    counts=dtm,
    vocab=vocab,
    num_topics=10,
    batch_size=32,
    hyperparams={"a_beta": 0.5, "b_beta": 1.0},
    initparams={
        "beta_shape": np.ones((10, dtm.shape[1]), dtype=np.float32),
        "beta_rate": np.ones((10, dtm.shape[1]), dtype=np.float32),
    },
    constantparams={
        # "beta": fixed_topic_word_matrix,
    },
)

params = model.train_step(num_steps=200, lr=0.01, random_seed=42)
configured = model.input_params()
print(configured["initialized_variables"].keys())
print(configured["latent_constant_variables"].keys())
print(configured["hyperparameters"].keys())

Validation rules:

Initial values and constants must be concrete array-like values, not callables.
Initial values must match the expected variational parameter shape.
Constant latent variables must match the expected latent-variable shape.
Positive priors and positive latent variables must be strictly greater than zero.
input_params() returns copies of the registered dictionaries. Values provided in the constructor appear immediately; defaults are registered when model and guide code runs, usually during training.
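
The validation rules above can be pictured with a small stand-alone checker. This is only a sketch: check_init_value is a hypothetical helper written for illustration, and the library's actual checks and error messages may differ.

```python
import numpy as np

def check_init_value(name, value, expected_shape, positive=True):
    """Hypothetical validator mirroring the rules above (not the library's code)."""
    # Rule: values must be concrete array-likes, not callables.
    if callable(value):
        raise ValueError(f"{name} must be array-like, not a callable")
    arr = np.asarray(value, dtype=np.float32)
    # Rule: the shape must match what the model expects.
    if arr.shape != tuple(expected_shape):
        raise ValueError(
            f"{name} has shape {arr.shape}, expected {tuple(expected_shape)}"
        )
    # Rule: positive quantities must be strictly greater than zero.
    if positive and not np.all(arr > 0):
        raise ValueError(f"{name} must be strictly positive")
    return arr

num_topics, num_words = 10, 50
shape = (num_topics, num_words)

# A valid initial value passes and comes back as a float32 array.
ok = check_init_value("beta_shape", np.ones(shape), shape)

# Zeros violate the strict-positivity rule.
try:
    check_init_value("beta_rate", np.zeros(shape), shape)
except ValueError as err:
    message = str(err)
```

Checking inputs before training starts, as in this sketch, surfaces shape and sign mistakes early rather than as numerical failures during optimization.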

Common Methods (all models)

``train_step(num_steps, lr, random_seed=None, …)``

Train the model via Stochastic Variational Inference (SVI).

Returns: dict of estimated parameters.

Note: TBIP and STBS currently expose train_step(num_steps, lr) without a random_seed argument.

``return_topics()``

Returns (categories, E_theta) — dominant topic per document and document-topic proportions.
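
To illustrate how the two return values relate, the sketch below derives a dominant topic per document from a document-topic matrix. The array values are made up, and both the argmax rule and the row normalization are assumptions about the convention; the reference above does not spell out how the library computes them.

```python
import numpy as np

# Hypothetical document-topic intensities for 3 documents and 4 topics.
e_theta = np.array([
    [0.1, 2.0, 0.3, 0.2],
    [1.5, 0.2, 0.1, 0.4],
    [0.3, 0.3, 0.2, 1.1],
])

# Dominant topic: the column with the largest value in each row.
dominant = e_theta.argmax(axis=1)

# Row-normalized proportions (assumed convention; each row sums to 1).
proportions = e_theta / e_theta.sum(axis=1, keepdims=True)

print(dominant)
```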

``return_beta()``

Returns a pd.DataFrame of word–topic associations (words × topics).

``return_top_words_per_topic(n=10)``

Returns a dict mapping topic identifiers to their top-n words.

``input_params()``

Returns a dict with initialized_variables, latent_constant_variables, and hyperparameters.

``summary(n_top_words=5)``

Prints a formatted summary of the fitted model, including loss, top words, and model-specific details.

``compute_topic_coherence(texts=None, metric='c_npmi', top_n=10)``

Computes per-topic coherence scores (NPMI or UMass).

Returns: pd.DataFrame with topic and coherence columns.
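
For intuition about what an NPMI score measures, here is a simplified sketch that averages pairwise NPMI over a topic's top words, using document-level co-occurrence. It is not the library's implementation: the c_npmi metric may estimate probabilities differently (for example with sliding windows), and the function name and toy documents are assumptions.

```python
import math
from itertools import combinations

def npmi_coherence(top_words, documents, eps=1e-12):
    """Mean pairwise NPMI of top_words using document-level co-occurrence."""
    doc_sets = [set(doc) for doc in documents]
    n_docs = len(doc_sets)

    def prob(*words):
        # Fraction of documents containing all of the given words.
        return sum(all(w in d for w in words) for d in doc_sets) / n_docs

    scores = []
    for w1, w2 in combinations(top_words, 2):
        p1, p2, p12 = prob(w1), prob(w2), prob(w1, w2)
        if p12 == 0:
            scores.append(-1.0)  # Words never co-occur: minimum NPMI.
            continue
        pmi = math.log(p12 / (p1 * p2))
        # eps guards against division by zero when p12 equals 1.
        scores.append(pmi / (-math.log(p12) + eps))
    return sum(scores) / len(scores)

documents = [
    ["tax", "cut", "bill"],
    ["tax", "bill"],
    ["health", "care"],
    ["health", "care", "bill"],
]

related = npmi_coherence(["tax", "bill"], documents)
unrelated = npmi_coherence(["tax", "health"], documents)
```

Scores lie between -1 and 1: words that tend to appear together score above zero, and words that never share a document score -1.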

``compute_topic_diversity(top_n=25)``

Fraction of unique words across all topics’ top-n lists. Range 0–1.
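
The definition can be written out in a few lines. The sketch below assumes a dict shaped like the one returned by return_top_words_per_topic(); the helper name topic_diversity and the example words are illustrative only.

```python
def topic_diversity(top_words_per_topic, top_n=25):
    """Unique words divided by total words across all topics' top-n lists."""
    all_words = [
        word
        for words in top_words_per_topic.values()
        for word in words[:top_n]
    ]
    return len(set(all_words)) / len(all_words)

# Two topics sharing one word ("budget") out of six listed words.
top_words = {
    0: ["tax", "budget", "deficit"],
    1: ["health", "care", "budget"],
}

diversity = topic_diversity(top_words, top_n=3)
print(diversity)
```

A value of 1 means no word appears in more than one topic's list; values near 0 mean the topics largely repeat the same words.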

``plot_model_loss(window=10, save_path=None)``

Line chart of training loss (raw + smoothed). Returns (fig, ax).
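
The smoothed curve can be thought of as a moving average over the raw loss. The sketch below assumes a simple unweighted average over window steps; the exact smoothing used by plot_model_loss is not specified here, so treat this as an approximation.

```python
import numpy as np

def moving_average(losses, window=10):
    """Unweighted moving average over fully covered windows."""
    losses = np.asarray(losses, dtype=float)
    kernel = np.ones(window) / window
    # mode="valid" keeps only positions where the window fits entirely.
    return np.convolve(losses, kernel, mode="valid")

# Hypothetical loss values decreasing from 20 to 1.
raw = np.arange(20, 0, -1, dtype=float)
smoothed = moving_average(raw, window=10)

print(len(raw), len(smoothed))
```

A larger window gives a smoother curve but shortens it, since each output point needs window raw values.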

``plot_topic_prevalence(save_path=None)``

Horizontal bar chart of mean topic prevalence. Returns (fig, ax).

``plot_topic_correlation(save_path=None)``

Cosine-similarity heatmap between topics. Returns (fig, ax).
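
As a sketch of the quantity behind such a heatmap, the snippet below computes pairwise cosine similarity between topic vectors. It assumes one row per topic (so a words × topics table such as the one from return_beta() would need transposing first); which vectors the library actually compares is an assumption, and the numbers are made up.

```python
import numpy as np

def cosine_similarity_matrix(topic_vectors):
    """Pairwise cosine similarity between the rows of topic_vectors."""
    topic_vectors = np.asarray(topic_vectors, dtype=float)
    norms = np.linalg.norm(topic_vectors, axis=1, keepdims=True)
    # Clip norms to avoid dividing by zero for all-zero rows.
    unit = topic_vectors / np.clip(norms, 1e-12, None)
    return unit @ unit.T

# Three hypothetical topics over a three-word vocabulary.
topic_word = np.array([
    [2.0, 0.0, 0.0],
    [1.0, 1.0, 0.0],
    [0.0, 0.0, 3.0],
])

similarity = cosine_similarity_matrix(topic_word)
print(similarity.round(3))
```

High off-diagonal values indicate topics that weight the same words and may be candidates for merging.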

``plot_document_topic_heatmap(n_docs=50, sort_by_topic=False, save_path=None)``

Document × topic heatmap. Returns (fig, ax).

``plot_topic_wordclouds(n_words=50, figsize=(16,12), save_path=None)``

One wordcloud per topic. Returns (fig, axes).

SPF-specific Methods

``plot_seed_effectiveness(save_path=None)``

Grouped bar chart comparing mean seed vs. non-seed word weights per topic.

Returns: (fig, axes).

CPF-specific Methods

``return_covariate_effects()``

Point estimates of covariate effect matrix λ (covariates × topics).

Returns: pd.DataFrame.

``return_covariate_effects_ci(ci=0.95)``

Covariate effects with Bayesian credible intervals.

Returns: pd.DataFrame with columns covariate, topic, mean, lower, upper.
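
To make the mean, lower, and upper columns concrete, the sketch below computes an equal-tailed credible interval from posterior samples. The simulated samples and the quantile-based interval are assumptions for illustration; the library may derive its intervals differently.

```python
import numpy as np

def credible_interval(samples, ci=0.95):
    """Equal-tailed credible interval from posterior samples along axis 0."""
    alpha = (1.0 - ci) / 2.0
    lower, upper = np.quantile(samples, [alpha, 1.0 - alpha], axis=0)
    return samples.mean(axis=0), lower, upper

# Simulated posterior draws for a single covariate effect (illustrative).
rng = np.random.default_rng(0)
samples = rng.normal(loc=0.5, scale=0.1, size=(5000,))

mean, lower, upper = credible_interval(samples, ci=0.95)
print(mean, lower, upper)
```

An interval that excludes zero suggests the covariate shifts the topic's prevalence in a consistent direction under the fitted model.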

``plot_cov_effects(ci=0.95, topics=None, save_path=None)``

Forest plot of covariate effects with credible intervals.

Returns: (fig, axes).

CSPF-specific Methods

Inherits all SPF methods (seeded topics) plus all CPF methods (covariate effects):

return_covariate_effects()
return_covariate_effects_ci(ci=0.95)
plot_cov_effects(ci=0.95, ...)

TBIP-specific Methods

``return_ideal_points()``

Returns a pd.DataFrame with columns author, ideal_point, std, sorted by ideal point.

``return_ideological_words(topic, n=10)``

Top-n words with the strongest ideological loading (η) for a given topic.

Returns: pd.DataFrame with columns word, eta, direction.

``plot_ideal_points(selected_authors=None, show_ci=False, ci=0.95, save_path=None)``

1-D scatter of author ideal points with optional credible intervals.

Returns: (fig, ax).

STBS-specific Methods

``return_ideal_points()``

Returns a pd.DataFrame with columns author, topic, ideal_point, std for topic-specific author positions.

``return_ideal_covariates()``

Returns a pd.DataFrame with columns covariate, topic, iota, std for covariate effects on ideological positions.

``plot_author_topic_heatmap(…)``

Heatmap of mean normalized author-topic intensities.

Returns: (fig, ax).

``plot_ideol_points(…)``

Dot plot of author ideology by topic, with optional grouping overlays.

Returns: (fig, ax).

``plot_iota_credible_intervals(ci=0.95, …)``

Credible-interval plot for covariate-topic ideology coefficients.

Returns: (fig, ax).

ETM-specific Methods

ETM overrides return_topics() and return_beta() to use its neural encoder and embedding-based topic–word computation. It adds no public methods beyond the common set.

Metrics Dataclass

Metrics tracks training diagnostics per model instance:

loss (list): ELBO loss per training step
coherence_scores (pd.DataFrame | None): Per-topic coherence if computed
diversity (float | None): Topic diversity if computed
reset(): Clear all stored metrics
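
The structure described above can be mirrored with a small dataclass. MetricsSketch below is a hypothetical stand-in written for illustration, not the library's Metrics class; in particular, the field defaults are assumptions.

```python
from dataclasses import dataclass, field
from typing import Any, List, Optional

@dataclass
class MetricsSketch:
    """Hypothetical stand-in mirroring the fields listed above."""
    loss: List[float] = field(default_factory=list)
    coherence_scores: Optional[Any] = None  # a pd.DataFrame in the real class
    diversity: Optional[float] = None

    def reset(self) -> None:
        # Clear all stored metrics back to their initial state.
        self.loss = []
        self.coherence_scores = None
        self.diversity = None

metrics = MetricsSketch()
metrics.loss.extend([120.5, 98.2, 91.7])  # one value per training step
metrics.diversity = 0.8
recorded_steps = len(metrics.loss)

metrics.reset()
```

Keeping diagnostics on a per-instance object like this means retraining or resetting one model does not disturb the metrics of another.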

Error Handling

Models validate their inputs at initialization and raise a ValueError with a descriptive message when a check fails:

try:
    model = PF(counts=dtm, vocab=vocab, num_topics=10, batch_size=32)
except ValueError as e:
    print(f"Invalid input: {e}")

Type Hints

All functions include type hints for IDE support and static analysis.

Performance Notes

Use sparse matrices (CSR format) for large vocabularies
GPU acceleration requires a GPU-enabled JAX installation and JAX_PLATFORMS=gpu
Batch size affects both memory usage and speed: larger batches need more memory per training step
See Tutorial: GPU Acceleration for optimization

API Stability

Public API (what you import) is stable
Internal implementation may change
Breaking changes documented in release notes

Next Steps

Auto-generated docs: API Reference (Auto-generated)
Learn models: Fundamentals
Train models: Tutorials
Solve tasks: How-To Guides