API Reference (Auto-generated)

class NumpyroModel(initparams=None, constantparams=None, hyperparams=None)[source]

Bases: ABC

Abstract base class for all probabilistic models in this package. Each subclass must implement at least its own model and guide.

Parameters:

initparams (Dict[str, Any] | None)
constantparams (Dict[str, Any] | None)
hyperparams (Dict[str, float] | None)

Metrics

Instance metrics tracker (per instance, not shared).

Type:: Metrics

estimated_params

Estimated parameters after training.

Type:: dict

D

Number of documents.

Type:: int

V

Vocabulary size.

Type:: int

batch_size

Mini-batch size for stochastic variational inference.

Type:: int

counts

Document-term matrix.

Type:: scipy.sparse.csr_matrix

vocab

Vocabulary array.

Type:: np.ndarray

K

Number of topics.

Type:: int

Initialize base model with per-instance metrics.

train_step(num_steps, lr, random_seed=None, jit_compile=True, cache_dense_counts=None, dense_cache_max_gb=0.75)[source]

Train the model using Stochastic Variational Inference (SVI).

Parameters:

num_steps (int) – Number of training iterations. Must be > 0.
lr (float) – Learning rate for the optimizer. Must be > 0.
random_seed (Optional[int]) – Seed for JAX random number generator. If provided, ensures reproducible results. Default is None (random initialization).
jit_compile (bool) – Whether to JIT compile SVI updates. Keep enabled for long runs; disable to avoid compile overhead in very short runs.
cache_dense_counts (Optional[bool]) – If True, cache the sparse counts as a dense array for faster batching. If None, auto-enable when the estimated dense matrix size fits within dense_cache_max_gb.
dense_cache_max_gb (float) – Maximum dense cache size in GB used by auto mode.

Returns:

Estimated parameters after training.

Return type:

Dict[str, Any]

Raises:

ValueError – If num_steps <= 0 or lr <= 0.
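
The auto mode of cache_dense_counts can be pictured as a simple size check. The following is a minimal sketch under stated assumptions, not the library's code: the helper names, the 4-byte (float32) entry size, and the convention 1 GB = 1024**3 bytes are all hypothetical.

```python
def dense_size_gb(num_docs, vocab_size, bytes_per_entry=4):
    """Estimated size of a dense (D, V) matrix in GB.

    Assumes 1 GB = 1024**3 bytes and float32 entries; both are
    illustrative choices, not confirmed library behaviour.
    """
    return num_docs * vocab_size * bytes_per_entry / 1024**3


def fits_in_dense_cache(num_docs, vocab_size, dense_cache_max_gb=0.75,
                        bytes_per_entry=4):
    """Hypothetical auto-mode rule: densify only if the estimate fits the budget."""
    size = dense_size_gb(num_docs, vocab_size, bytes_per_entry)
    return size <= dense_cache_max_gb
```

Under these assumptions a 100 x 500 matrix fits easily, while a 1,000,000 x 50,000 matrix (about 186 GB) does not.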

input_params()[source]

Return user-defined variable settings.

Return type:: dict

return_topics()[source]

Return the topics for each document.

Return type:

Tuple[ndarray, ndarray]

Returns:

categories (np.ndarray) – Array of topic indices for each document (shape: D,).
E_theta (np.ndarray) – Estimated topic proportions for each document (shape: D, K).

Raises:

ValueError – If model has not been trained yet (no estimated parameters).

return_beta()[source]

Return the beta matrix (word-topic associations) for the model.

Returns:: DataFrame with words as index and topics as columns, containing word-topic probability estimates.
Return type:: DataFrame
Raises:: ValueError – If model has not been trained yet (no estimated parameters).

return_top_words_per_topic(n=10)[source]

plot_model_loss(window=10, save_path=None)[source]

Plot the training loss over time with full and smoothed curves.

Parameters:

window (int) – Window size for moving average smoothing. Default is 10.
save_path (Optional[str]) – Path to save the figure.

Return type:

Tuple[Figure, Axes]
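
The smoothed curve is a moving average of the loss with the given window. The sketch below shows one plausible form, a simple average over consecutive windows; whether the library pads the edges or uses a different smoothing variant is not documented here, and the function name is hypothetical.

```python
def moving_average(values, window=10):
    """Simple moving average; returns len(values) - window + 1 points."""
    if window <= 0:
        raise ValueError("window must be > 0")
    if len(values) < window:
        return []
    running = sum(values[:window])
    smoothed = [running / window]
    for i in range(window, len(values)):
        # Slide the window: add the newest value, drop the oldest.
        running += values[i] - values[i - window]
        smoothed.append(running / window)
    return smoothed
```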

plot_topic_wordclouds(n_words=50, figsize=(16, 12), save_path=None)[source]

Plot wordclouds for each topic based on beta values.

Parameters:

n_words (int) – Maximum number of words per wordcloud (default 50).
figsize (Tuple[int, int]) – Figure size (width, height) (default (16, 12)).
save_path (Optional[str]) – Path to save the figure.

Return type:

Tuple[Figure, ndarray]

summary(n_top_words=5)[source]

Return a formatted text summary of the fitted model.

Includes model class name, dimensions, loss trajectory, and top words per topic. Subclasses can extend the output by overriding _summary_extra().

Parameters:: n_top_words (int) – Number of top words to show per topic (default 5).
Returns:: Multi-line summary string.
Return type:: str

compute_topic_coherence(texts=None, metric='c_npmi', top_n=10)[source]

Compute topic coherence scores (NPMI or UMass).

Parameters:

texts (Optional[List[List[str]]]) – Tokenised reference corpus. If None, word co-occurrence is estimated from self.counts and self.vocab.
metric (str) – Coherence measure (default 'c_npmi').
top_n (int) – Number of top words per topic used for the calculation (default 10).

Returns:

DataFrame with columns ['topic', 'coherence'].

Return type:

DataFrame
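
For orientation, NPMI for a word pair is log(P(a, b) / (P(a) P(b))) / -log P(a, b), and a topic's score is commonly the average over pairs of its top words. The sketch below estimates the probabilities from document-level co-occurrence. This is an assumption: the estimator the library actually uses (for example, sliding windows) is not specified here, and the function names are hypothetical.

```python
import math
from itertools import combinations


def npmi(word_a, word_b, documents):
    """NPMI of two words; `documents` is a list of sets of tokens."""
    n = len(documents)
    p_a = sum(word_a in d for d in documents) / n
    p_b = sum(word_b in d for d in documents) / n
    p_ab = sum(word_a in d and word_b in d for d in documents) / n
    if p_ab == 0.0:
        return -1.0  # convention: the words never co-occur
    if p_ab == 1.0:
        return 1.0   # both words occur in every document; avoids 0 / 0
    return math.log(p_ab / (p_a * p_b)) / -math.log(p_ab)


def topic_coherence(top_words, documents):
    """Average NPMI over all pairs of a topic's top words."""
    pairs = list(combinations(top_words, 2))
    if not pairs:
        raise ValueError("need at least two top words")
    return sum(npmi(a, b, documents) for a, b in pairs) / len(pairs)
```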

compute_topic_diversity(top_n=25)[source]

Compute topic diversity (Dieng et al., 2020).

Measures the fraction of unique words across all topics’ top-n lists. Values near 1.0 indicate diverse topics; values near 0 indicate redundancy.

Parameters:: top_n (int) – Number of top words per topic (default 25).
Returns:: Topic diversity score in [0, 1].
Return type:: float
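
The definition above amounts to counting distinct words among all top-n lists. A minimal sketch, with a hypothetical function name and plain lists in place of the model's beta matrix:

```python
def topic_diversity(top_words_per_topic):
    """Fraction of unique words among all topics' top-n words."""
    all_words = [w for words in top_words_per_topic for w in words]
    if not all_words:
        raise ValueError("need at least one word")
    return len(set(all_words)) / len(all_words)
```

Two topics with disjoint top words score 1.0; two topics with identical top words score 0.5, so the lower bound for K topics is 1/K rather than exactly 0.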

plot_topic_prevalence(save_path=None)[source]

Horizontal bar chart of mean topic prevalence across documents.

Parameters:: save_path (Optional[str]) – Path to save the figure.
Return type:: Tuple[Figure, Axes]

plot_topic_correlation(save_path=None)[source]

Heatmap of pairwise cosine similarity between topic-word vectors.

Parameters:: save_path (Optional[str]) – Path to save the figure.
Return type:: Tuple[Figure, Axes]
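
The quantity shown in the heatmap can be sketched as follows. This is an illustration in plain Python with hypothetical function names; it assumes each topic is represented by its vector of word weights (one column of the beta matrix).

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def topic_similarity_matrix(topic_word):
    """K x K matrix of pairwise similarities; topic_word holds K vectors."""
    return [[cosine_similarity(u, v) for v in topic_word] for u in topic_word]
```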

plot_document_topic_heatmap(n_docs=50, sort_by_topic=False, save_path=None)[source]

Heatmap of document-topic proportions for a subset of documents.

Parameters:

n_docs (int) – Number of documents to display (default 50).
sort_by_topic (bool) – If True, sort documents by their dominant topic (default False).
save_path (Optional[str]) – Path to save the figure.

Return type:

Tuple[Figure, Axes]

class PF(counts, vocab, num_topics, batch_size, initparams=None, constantparams=None, hyperparams=None)[source]

Bases: NumpyroModel

Poisson Factorization (PF) topic model.

Unsupervised baseline topic model using Poisson likelihood for word counts. Suitable for exploratory topic discovery in document collections.

This model learns low-rank representations of documents and words, enabling interpretable topic extraction and downstream analysis.

Parameters:

counts (csr_matrix) – Document-term matrix of shape (D, V) with word counts.
vocab (ndarray) – Vocabulary array of shape (V,) containing word terms.
num_topics (int) – Number of topics K. Must be > 0.
batch_size (int) – Mini-batch size for stochastic variational inference. Must satisfy 0 < batch_size <= D.
initparams (Optional[Dict[str, Any]]) – User-specified initial values for variational parameters in the guide.
constantparams (Optional[Dict[str, Any]]) – User-specified constant values for latent variables (not updated by SVI).
hyperparams (Optional[Dict[str, float]]) – User-specified hyperparameters overriding default prior settings.

D

Number of documents.

Type:: int

V

Vocabulary size.

Type:: int

K

Number of topics.

Type:: int

counts

Document-term matrix.

Type:: scipy.sparse.csr_matrix

vocab

Vocabulary array.

Type:: np.ndarray

Examples

>>> from scipy.sparse import random
>>> import numpy as np
>>> from topicmodels import PF
>>> counts = random(100, 500, density=0.01, format='csr')
>>> vocab = np.array([f'word_{i}' for i in range(500)])
>>> model = PF(counts, vocab, num_topics=10, batch_size=32)
>>> params = model.train_step(num_steps=100, lr=0.01, random_seed=42)
>>> topics, proportions = model.return_topics()

Initialize the PF model with input validation.

Parameters:

counts (csr_matrix) – Document-term matrix.
vocab (ndarray) – Vocabulary array.
num_topics (int) – Number of topics.
batch_size (int) – Mini-batch size.
initparams (Optional[Dict[str, Any]]) – Initial values for variational parameters.
constantparams (Optional[Dict[str, Any]]) – Fixed values for latent variables.
hyperparams (Optional[Dict[str, float]]) – Hyperparameters overriding default priors.

Raises:

TypeError – If counts is not a sparse matrix or vocab is not array-like.
ValueError – If dimensions are invalid or inconsistent.

class SPF(counts, vocab, keywords, residual_topics, batch_size, initparams=None, constantparams=None, hyperparams=None)[source]

Bases: NumpyroModel

Seeded Poisson Factorization (SPF) topic model.

Guided topic modeling with keyword priors. SPF allows researchers to incorporate domain knowledge by specifying seed words for each topic, which increases the topical prevalence of those words in the model.

Parameters:

counts (csr_matrix) – Document-term matrix of shape (D, V) with word counts.
vocab (ndarray) – Vocabulary array of shape (V,) containing word terms.
keywords (Dict[Any, List[str]]) – Dictionary mapping topic identifiers to lists of seed words. Keys can be strings or integers. Example: {0: ['climate', 'environment'], 1: ['economy', 'trade']} or {'climate': ['climate', 'environment'], 'economy': ['economy', 'trade']}
residual_topics (int) – Number of residual (unsupervised) topics. Must be >= 0.
batch_size (int) – Mini-batch size for stochastic variational inference. Must satisfy 0 < batch_size <= D.
initparams (Optional[Dict[str, Any]]) – User-specified initial values for variational parameters in the guide.
constantparams (Optional[Dict[str, Any]]) – User-specified constant values for latent variables (not updated by SVI).
hyperparams (Optional[Dict[str, float]]) – User-specified hyperparameters overriding default prior settings.

D

Number of documents.

Type:: int

V

Vocabulary size.

Type:: int

K

Total number of topics (seeded + residual).

Type:: int

counts

Document-term matrix.

Type:: scipy.sparse.csr_matrix

vocab

Vocabulary array.

Type:: np.ndarray

keywords

Seed words for guided topics.

Type:: Dict[int, List[str]]

residual_topics

Number of unsupervised topics.

Type:: int

Examples

>>> from scipy.sparse import random
>>> import numpy as np
>>> from topicmodels import SPF
>>> counts = random(100, 500, density=0.01, format='csr')
>>> vocab = np.array([f'word_{i}' for i in range(500)])
>>> keywords = {
...     0: ['word_1', 'word_2', 'word_3'],
...     1: ['word_10', 'word_11', 'word_12'],
... }
>>> model = SPF(counts, vocab, keywords, residual_topics=5, batch_size=32)
>>> params = model.train_step(num_steps=100, lr=0.01, random_seed=42)

Initialize the SPF model with input validation.

Parameters:

counts (csr_matrix) – Document-term matrix.
vocab (ndarray) – Vocabulary array.
keywords (Dict[Any, List[str]]) – Seed words for each seeded topic.
residual_topics (int) – Number of unsupervised topics.
batch_size (int) – Mini-batch size.
initparams (Optional[Dict[str, Any]]) – Initial values for variational parameters.
constantparams (Optional[Dict[str, Any]]) – Fixed values for latent variables.
hyperparams (Optional[Dict[str, float]]) – Hyperparameters overriding default priors.

Raises:

TypeError – If counts is not sparse or keywords is not dict.
ValueError – If dimensions are invalid or keywords contain unknown terms.
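
The checks described above can be sketched as follows. The helper name and error messages are hypothetical; the sketch also assumes the total topic count K is the number of seeded topics plus residual_topics, as the K attribute description states.

```python
def check_keywords(keywords, vocab, residual_topics):
    """Validate seed words and return the total number of topics K."""
    if not isinstance(keywords, dict):
        raise TypeError("keywords must be a dict")
    if residual_topics < 0:
        raise ValueError("residual_topics must be >= 0")
    known = set(vocab)
    unknown = sorted(
        {w for words in keywords.values() for w in words if w not in known}
    )
    if unknown:
        raise ValueError(f"keywords not found in vocab: {unknown}")
    # Seeded topics plus residual (unsupervised) topics.
    return len(keywords) + residual_topics
```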

return_topics()[source]

Return the topics for each document. Reimplemented from the base class due to the guided topic modeling approach, where topics are not fully unsupervised.

Returns:

topics (numpy.ndarray) – Array of recoded topics.
E_theta (numpy.ndarray) – Estimated topic proportions for each document.

Return type:

tuple

return_beta()[source]

Return the beta matrix for the model, i.e. the topic-word intensities. Reimplemented from the base class because seed words are modeled with higher rates.

Returns:: DataFrame containing the beta matrix with words as rows and topics as columns.
Return type:: pandas.DataFrame

plot_seed_effectiveness(save_path=None)[source]

Grouped bar chart comparing seed vs. non-seed word weights per topic.

For every seeded topic, shows the mean beta weight of seed words alongside the mean beta weight of all other words. Helps assess whether seed words actually dominate their intended topics.

Parameters:: save_path (Optional[str]) – Path to save the figure.
Return type:: Tuple[Figure, ndarray]
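
The comparison behind each pair of bars can be sketched for a single seeded topic. The function name is hypothetical, and the sketch assumes beta_column holds that topic's word weights in the same order as vocab.

```python
from statistics import fmean


def seed_vs_other_means(beta_column, vocab, seed_words):
    """Mean weight of seed words and of all other words for one topic."""
    seeds = set(seed_words)
    seed_vals = [b for w, b in zip(vocab, beta_column) if w in seeds]
    other_vals = [b for w, b in zip(vocab, beta_column) if w not in seeds]
    # fmean raises StatisticsError if either group is empty.
    return fmean(seed_vals), fmean(other_vals)
```

A seed mean well above the non-seed mean suggests the seed words dominate the topic as intended.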

class CPF(counts, vocab, num_topics, batch_size, X_design_matrix=None, initparams=None, constantparams=None, hyperparams=None, link_function='softplus')[source]

Bases: NumpyroModel

Covariate Poisson Factorization (CPF) topic model.

Topic model that incorporates document-level covariates to capture how topics vary with external variables (e.g., author attributes, temporal features).

Parameters:

counts (csr_matrix) – Document-term matrix of shape (D, V) with word counts.
vocab (ndarray) – Vocabulary array of shape (V,) containing word terms.
num_topics (int) – Number of topics K. Must be > 0.
batch_size (int) – Mini-batch size for stochastic variational inference. Must satisfy 0 < batch_size <= D.
X_design_matrix (Optional[ndarray]) – Document-level covariates of shape (D, C), where C is the number of features.
initparams (Optional[Dict[str, Any]]) – User-specified initial values for variational parameters in the guide.
constantparams (Optional[Dict[str, Any]]) – User-specified constant values for latent variables (not updated by SVI).
hyperparams (Optional[Dict[str, float]]) – User-specified hyperparameters overriding default prior settings.
link_function (str) – Positive link function used to map the linear predictor for document-topic intensities to their Gamma prior mean. Default is 'softplus'.

D

Number of documents.

Type:: int

V

Vocabulary size.

Type:: int

K

Number of topics.

Type:: int

C

Number of covariate features.

Type:: int

counts

Document-term matrix.

Type:: scipy.sparse.csr_matrix

vocab

Vocabulary array.

Type:: np.ndarray

X_design_matrix

Design matrix of covariates.

Type:: jnp.ndarray

G

Number of covariate groups.

Type:: int

group_scaling_diag

Per-covariate scaling derived from (X_g^T X_g)^{-1}.

Type:: jnp.ndarray

Examples

>>> from scipy.sparse import random
>>> import numpy as np
>>> from topicmodels import CPF
>>> counts = random(100, 500, density=0.01, format='csr')
>>> vocab = np.array([f'word_{i}' for i in range(500)])
>>> covariates = np.random.randn(100, 3)  # 3 covariate features
>>> model = CPF(counts, vocab, num_topics=10, batch_size=32, X_design_matrix=covariates)
>>> params = model.train_step(num_steps=100, lr=0.01, random_seed=42)

Initialize the CPF model with input validation.

Parameters:

counts (csr_matrix) – Document-term matrix.
vocab (ndarray) – Vocabulary array.
num_topics (int) – Number of topics.
batch_size (int) – Mini-batch size.
X_design_matrix (Optional[ndarray]) – Document-level covariates.
initparams (Optional[Dict[str, Any]]) – Initial values for variational parameters.
constantparams (Optional[Dict[str, Any]]) – Fixed values for latent variables.
hyperparams (Optional[Dict[str, float]]) – Hyperparameters overriding default priors.
link_function (str)

Raises:

TypeError – If counts is not sparse or covariates have wrong type.
ValueError – If dimensions are invalid or inconsistent.

return_covariate_effects()[source]

Return point estimates of covariate effects (lambda).

Return type:: DataFrame

return_covariate_effects_ci(ci=0.95)[source]

Return covariate effects with credible intervals.

Uses the Normal variational posterior for lambda: mean = lambda_location, CI = mean +/- z * lambda_scale.

Parameters:: ci (float) – Credible-interval level (default 0.95).
Returns:: DataFrame with columns ['covariate', 'topic', 'mean', 'lower', 'upper'].
Return type:: DataFrame
Raises:: ValueError – If model has not been trained yet.
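
The interval formula above can be reproduced with the standard library. In this sketch the function name is hypothetical, and location and scale stand in for lambda_location and lambda_scale of a single coefficient.

```python
from statistics import NormalDist


def normal_credible_interval(location, scale, ci=0.95):
    """Central credible interval of a Normal(location, scale) posterior."""
    if not 0.0 < ci < 1.0:
        raise ValueError("ci must lie strictly between 0 and 1")
    # Two-sided interval: put (1 - ci) / 2 of the mass in each tail.
    z = NormalDist().inv_cdf(0.5 + ci / 2.0)
    return location - z * scale, location + z * scale
```

For ci=0.95 the multiplier z is approximately 1.96.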

plot_cov_effects(ci=0.95, include_shrinkage=False, topics=None, group_colors=None, figsize_per_topic=(5.0, 0.28), save_path=None)[source]

Plot covariate effects as forest plots.

Parameters:

ci (float) – Credible-interval level (default 0.95 for 95 % CI).
include_shrinkage (bool) – If True, additionally produce forest plots for \(\lambda_0\) (intercept), \(\tau^2_k\) (global shrinkage), and \(\delta^2_{gk}\) (group shrinkage).
topics (Optional[List[str]]) – Subset of topic names to plot. If None (default), all topics are plotted.
group_colors (Optional[Dict[str, str]]) – Mapping {group_name: colour} used to colour the covariate labels on the y-axis. Groups are inferred from the :: separator in covariate names. If None a default qualitative palette is used.
figsize_per_topic (Tuple[float, float]) – (width, height_per_covariate) used to auto-size the lambda panels. Default (5.0, 0.28).
save_path (Optional[str]) – Directory (or file path) where figures are saved. When a directory is given, individual PNGs are written; when a file path is given, only the lambda figure is saved there. If None, figures are not saved.

Returns:

{"lambda": (fig, axes), ...} and, when include_shrinkage is True, additional entries "lambda_intercept", "tau2", "delta2".

Return type:

Dict[str, Tuple[Figure, ndarray]]
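
Group inference from covariate names might look like the sketch below. It assumes, without confirmation from the source, that the text before the first :: is the group name and that a name without the separator forms its own group; both helper names are hypothetical.

```python
def group_of(covariate_name, sep="::"):
    """Assumed rule: the group is the part before the first separator."""
    return covariate_name.split(sep, 1)[0]


def group_covariates(names, sep="::"):
    """Map each inferred group name to its covariates, preserving order."""
    groups = {}
    for name in names:
        groups.setdefault(group_of(name, sep), []).append(name)
    return groups
```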

class CSPF(counts, vocab, keywords, residual_topics, batch_size, X_design_matrix=None, initparams=None, constantparams=None, hyperparams=None, link_function='softplus')[source]

Bases: NumpyroModel

Covariate Seeded Poisson Factorization (CSPF) topic model with grouped design-adaptive shrinkage.

Topic model that extends covariate Poisson factorization by combining document-level covariates with seeded topic structure and grouped shrinkage on covariate effects.

Parameters:

counts (csr_matrix) – Document-term matrix of shape (D, V) with word counts.
vocab (ndarray) – Vocabulary array of shape (V,) containing word terms.
keywords (Dict[Any, List[str]]) – Dictionary mapping topic identifiers to lists of seed words. Keys can be strings or integers. Example: {0: ['climate', 'environment'], 1: ['economy', 'trade']} or {'climate': ['climate', 'environment'], 'economy': ['economy', 'trade']}
residual_topics (int) – Number of residual (unsupervised) topics. Must be >= 0.
batch_size (int) – Mini-batch size for stochastic variational inference. Must satisfy 0 < batch_size <= D.
X_design_matrix (Optional[ndarray]) – Document-level covariates of shape (D, C) where C is the number of features.
initparams (Optional[Dict[str, Any]]) – User-specified initial values for variational parameters in the guide.
constantparams (Optional[Dict[str, Any]]) – User-specified constant values for latent variables (not updated by SVI).
hyperparams (Optional[Dict[str, float]]) – User-specified hyperparameters overriding default prior settings.
link_function (str) – Positive link function used to map the linear predictor for document-topic intensities to their Gamma prior mean. Default is 'softplus'.

D

Number of documents.

Type:: int

V

Vocabulary size.

Type:: int

K

Total number of topics (seeded + residual).

Type:: int

C

Number of covariate features.

Type:: int

counts

Document-term matrix.

Type:: scipy.sparse.csr_matrix

vocab

Vocabulary array.

Type:: np.ndarray

keywords

Seed words for guided topics.

Type:: Dict[Any, List[str]]

residual_topics

Number of unsupervised topics.

Type:: int

X_design_matrix

Design matrix of covariates.

Type:: jnp.ndarray

G

Number of covariate groups.

Type:: int

group_scaling_diag

Per-covariate scaling derived from (X_g^T X_g)^{-1}.

Type:: jnp.ndarray

link_function

Name of the positive link function ('softplus' or 'exp').

Type:: str
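
Both link options map a real-valued linear predictor to a positive number. The sketch below illustrates the two functions in plain Python; the dispatch helper apply_link is hypothetical and is not the library's internal code.

```python
import math


def softplus(x):
    """log(1 + exp(x)), written to avoid overflow for large |x|."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


LINKS = {"softplus": softplus, "exp": math.exp}


def apply_link(linear_predictor, link_function="softplus"):
    """Map a linear predictor to a positive value via the named link."""
    try:
        link = LINKS[link_function]
    except KeyError:
        raise ValueError(f"unknown link_function: {link_function!r}") from None
    return link(linear_predictor)
```

Softplus grows linearly for large inputs, whereas exp grows exponentially, so softplus is the more conservative choice when covariate effects are large.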

Initialize base model with per-instance metrics.

return_topics()[source]

Return the topics for each document.

Returns:

categories (np.ndarray) – Array of topic indices for each document (shape: D,).
E_theta (np.ndarray) – Estimated topic proportions for each document (shape: D, K).

Raises:

ValueError – If model has not been trained yet (no estimated parameters).

return_beta()[source]

Return the beta matrix (word-topic associations) for the model.

Returns:: DataFrame with words as index and topics as columns, containing word-topic probability estimates.
Return type:: pd.DataFrame
Raises:: ValueError – If model has not been trained yet (no estimated parameters).

return_covariate_effects()[source]

Return point estimates of covariate effects (lambda).

Returns:: DataFrame with covariates as rows and topics as columns.
Return type:: DataFrame

return_covariate_effects_ci(ci=0.95)[source]

Return covariate effects with credible intervals.

Uses the Normal variational posterior for lambda: mean = lambda_location, CI = mean +/- z * lambda_scale.

Parameters:: ci (float) – Credible-interval level (default 0.95).
Returns:: DataFrame with columns ['covariate', 'topic', 'mean', 'lower', 'upper'].
Return type:: DataFrame
Raises:: ValueError – If model has not been trained yet.

plot_cov_effects(ci=0.95, include_shrinkage=False, topics=None, group_colors=None, figsize_per_topic=(5.0, 0.28), save_path=None)[source]

Plot covariate effects as forest plots.

Parameters:

ci (float) – Credible-interval level (default 0.95 for 95 % CI).
include_shrinkage (bool) – If True, additionally produce forest plots for \(\lambda_0\) (intercept), \(\tau^2_k\) (global shrinkage), and \(\delta^2_{gk}\) (group shrinkage).
topics (Optional[List[str]]) – Subset of topic names to plot. If None (default), all topics are plotted.
group_colors (Optional[Dict[str, str]]) – Mapping {group_name: colour} used to colour the covariate labels on the y-axis. Groups are inferred from the :: separator in covariate names. If None a default qualitative palette is used.
figsize_per_topic (Tuple[float, float]) – (width, height_per_covariate) used to auto-size the lambda panels. Default (5.0, 0.28).
save_path (Optional[str]) – Directory (or file path) where figures are saved. When a directory is given, individual PNGs are written; when a file path is given, only the lambda figure is saved there. If None, figures are not saved.

Returns:

{"lambda": (fig, axes), ...} and, when include_shrinkage is True, additional entries "lambda_intercept", "tau2", "delta2".

Return type:

Dict[str, Tuple[Figure, ndarray]]

class TBIP(counts, vocab, num_topics, authors, batch_size, time_varying=False, initparams=None, constantparams=None, hyperparams=None)[source]

Bases: NumpyroModel

TBIP Model

This class models topic-based ideal points (TBIP) in a set of documents authored by multiple individuals.

Initialize the TBIP model.

Parameters:

counts (csr_matrix) – A 2D sparse array of shape (D, V) representing the word counts in each document, where D is the number of documents and V is the vocabulary size.
vocab (ndarray) – A vocabulary array of shape (V,) containing word terms.
num_topics (int) – The number of topics (K). Must be > 0.
authors (ndarray) – An array of authors for each document.
batch_size (int) – The number of documents to be processed in each batch. Must satisfy 0 < batch_size <= D.
time_varying (bool) – Whether to model time-varying ideal points (default is False).
initparams (Optional[Dict[str, Any]]) – User-specified initial values for variational parameters in the guide.
constantparams (Optional[Dict[str, Any]]) – User-specified constant values for latent variables (not updated by SVI).
hyperparams (Optional[Dict[str, float]]) – User-specified hyperparameters overriding default prior settings.

Raises:

TypeError – If counts is not a sparse matrix.
ValueError – If dimensions are invalid or time_varying parameters have wrong shape.

train_step(num_steps, lr)[source]

Train the TBIP model using stochastic variational inference.

Custom train function specified exclusively for TBIP objects.

Parameters:

num_steps (int) – Number of training steps. Must be > 0.
lr (float) – Learning rate for the optimizer. Must be > 0.

Returns:

A dictionary containing the estimated parameter values after training.

Return type:

dict

Raises:

ValueError – If num_steps or lr are invalid.

return_topics()[source]

Return the dominant topic for each document.

Uses the LogNormal variational posterior for theta: E[theta] = exp(mu + sigma^2 / 2).

Return type:

Tuple[ndarray, ndarray]

Returns:

categories (np.ndarray) – Array of topic indices for each document (shape: D,).
E_theta (np.ndarray) – Estimated topic proportions for each document (shape: D, K).

Raises:

ValueError – If model has not been trained yet.
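
The LogNormal mean above, followed by a per-document argmax, can be sketched as follows. The function names are hypothetical; mu and sigma stand in for the variational location and scale of theta, given as D x K nested lists.

```python
import math


def lognormal_mean(mu, sigma):
    """E[X] for X ~ LogNormal(mu, sigma): exp(mu + sigma^2 / 2)."""
    return math.exp(mu + sigma ** 2 / 2.0)


def dominant_topics(mu, sigma):
    """Return (categories, E_theta) from D x K location and scale lists."""
    e_theta = [
        [lognormal_mean(m, s) for m, s in zip(mu_row, sigma_row)]
        for mu_row, sigma_row in zip(mu, sigma)
    ]
    # The dominant topic is the index of the largest expected intensity.
    categories = [max(range(len(row)), key=row.__getitem__) for row in e_theta]
    return categories, e_theta
```

Because the mean depends on sigma as well as mu, a topic with a smaller location but a larger scale can still be the dominant one.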

return_beta()[source]

Return the topic-word association matrix.

Uses the LogNormal variational posterior for beta: E[beta] = exp(mu + sigma^2 / 2).

Returns:: DataFrame with words as index and topics as columns.
Return type:: DataFrame
Raises:: ValueError – If model has not been trained yet.

return_ideal_points()[source]

Return ideal point estimates for all authors.

Returns:: DataFrame with columns ['author', 'ideal_point', 'std'] sorted by ideal point.
Return type:: DataFrame
Raises:: ValueError – If model has not been trained yet.

return_ideological_words(topic, n=10)[source]

Return words with the strongest ideological loading for a topic.

For a given topic k, ranks words by the magnitude of their ideological coefficient eta[k, :]. Words with large positive eta are associated with higher ideal-point values, and vice versa.

Parameters:

topic (int) – Topic index (0-based).
n (int) – Number of top words per direction (default 10).

Returns:

DataFrame with columns ['word', 'eta', 'direction'] where direction is 'positive' or 'negative'.

Return type:

DataFrame

Raises:

ValueError – If model has not been trained or topic index is invalid.
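
The ranking described above can be sketched for one topic. The function name is hypothetical, eta_row stands in for eta[k, :], and details such as tie handling and the ordering of the returned rows are assumptions.

```python
def ideological_words(eta_row, vocab, n=10):
    """Top-n words in each direction as (word, eta, direction) tuples."""
    if n <= 0:
        raise ValueError("n must be > 0")
    ranked = sorted(zip(vocab, eta_row), key=lambda pair: pair[1])
    # Most negative coefficients first.
    negative = [(w, e, "negative") for w, e in ranked[:n] if e < 0]
    # Most positive coefficients first.
    positive = [(w, e, "positive") for w, e in reversed(ranked[-n:]) if e > 0]
    return positive + negative
```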

plot_ideal_points(selected_authors=None, show_ci=False, ci=0.95, figsize=(12, 2), save_path=None)[source]

Plot the ideal points of authors on a 1-D axis.

Parameters:

selected_authors (Optional[list]) – Authors to label (default: all authors).
show_ci (bool) – If True, display horizontal error bars showing the credible interval derived from sigma_x.
ci (float) – Credible-interval level when show_ci is True (default 0.95).
figsize (Tuple[float, float]) – Figure size (default (12, 2)).
save_path (Optional[str]) – Path to save the figure.

Return type:

Tuple[Figure, Axes]

class STBS(counts, vocab, num_topics, authors, batch_size, X_design_matrix=None, initparams=None, constantparams=None, hyperparams=None)[source]

Bases: NumpyroModel

STBS Model

This class models structural text-based scaling (STBS), including topic-specific ideal points and author-specific covariates for documents authored by different individuals. The model aims to capture how ideology can vary by topic and with external variables.

Initialize the STBS model.

Parameters:

counts (csr_matrix) – A 2D sparse array of shape (D, V) representing the word counts in each document, where D is the number of documents and V is the vocabulary size.
vocab (ndarray) – A vocabulary array of shape (V,) containing word terms.
num_topics (int) – The number of topics (K). Must be > 0.
authors (ndarray) – An array of authors for each document.
X_design_matrix (Optional[ndarray]) – Author-level covariates of shape (N, L). Row i must correspond to the i-th element of the sorted unique authors from authors (i.e., np.unique(authors)).
batch_size (int) – The number of documents to be processed in each batch. Must satisfy 0 < batch_size <= D.
initparams (Optional[Dict[str, Any]]) – User-specified initial values for variational parameters in the guide.
constantparams (Optional[Dict[str, Any]]) – User-specified constant values for latent variables (not updated by SVI).
hyperparams (Optional[Dict[str, float]]) – User-specified hyperparameters overriding default prior settings.
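
The row-alignment rule for X_design_matrix can be checked with a small helper before constructing the model. The sketch below mirrors the sorted-unique ordering of np.unique(authors) using the standard library; the function name is hypothetical.

```python
def author_row_index(authors):
    """Map each author to its expected row in X_design_matrix."""
    unique_sorted = sorted(set(authors))  # same order as np.unique(authors)
    return {author: row for row, author in enumerate(unique_sorted)}
```

For example, documents by "smith", "jones", "smith", and "adams" imply three covariate rows in the order adams, jones, smith.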

Raises:

TypeError – If counts is not a sparse matrix.
ValueError – If dimensions are invalid.

train_step(num_steps, lr)[source]

Train the STBS model using stochastic variational inference.

Custom train function specified exclusively for TBIP and STBS objects.

Parameters:

num_steps (int) – Number of training steps. Must be > 0.
lr (float) – Learning rate for the optimizer. Must be > 0.

Returns:

A dictionary containing the estimated parameter values after training.

Return type:

dict

Raises:

ValueError – If num_steps or lr are invalid.

plot_topic_wordclouds(n_words=50, figsize=(16, 12), ideology_values=(-1, 0, 1), topics=None, log_corrected=True, save_path=None)[source]

Plot wordclouds for each topic, optionally at multiple ideology positions. When ideology_values is None, delegates to the base class and plots one wordcloud per topic using raw beta values. When ideology_values is set (default: (-1, 0, 1)), produces a grid of shape (n_topics, len(ideology_values)).

Parameters:

n_words (int) – Maximum number of words per wordcloud (default 50).
figsize (Tuple[int, int]) – Figure size (width, height) (default (16, 12)).
save_path (Optional[str]) – Path to save the figure.
ideology_values (Optional[Tuple[float, ...]]) – Ideal point values for which to draw wordclouds. Default values are (-1, 0, 1). Pass None to fall back to base class behaviour (raw beta, no ideology).
topics (Optional[List[int]]) – Subset of topic indices to plot. If None, all K topics are shown.
log_corrected (bool) – If True (default), uses log-scale ideology-corrected intensities. If False, uses the linear approximation beta * exp(eta * i) instead. Ignored when ideology_values is None.

Return type:

Tuple[Figure, ndarray]
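
The linear-scale approximation beta * exp(eta * i) used when log_corrected is False can be sketched for a single topic. The function name is hypothetical; beta and eta are that topic's word weights and ideological coefficients, and ideal_point is one of the ideology_values.

```python
import math


def ideology_adjusted_intensity(beta, eta, ideal_point):
    """Word intensities for one topic at a given ideal point."""
    return [b * math.exp(e * ideal_point) for b, e in zip(beta, eta)]
```

At an ideal point of 0 the result equals beta; words with positive eta gain weight as the ideal point increases and lose weight as it decreases.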

plot_topic_prevalence(topic_labels=None, selected_topics=None, sort=True, figsize=(8, 4), save_path=None)[source]

Bar chart of mean normalised topic prevalence across the corpus.

Return type:

Tuple[Figure, Axes]

Parameters:

topic_labels (dict | None)
selected_topics (list | None)
sort (bool)
figsize (tuple)
save_path (str | None)
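
One reading of "mean normalised topic prevalence" is sketched below: each document's topic intensities are rescaled to sum to 1 and then averaged over documents. That reading is an assumption, and the function name is hypothetical.

```python
def mean_topic_prevalence(theta):
    """Average per-document topic shares; theta is a D x K nested list."""
    normalised = [[value / sum(row) for value in row] for row in theta]
    num_docs = len(normalised)
    # zip(*rows) iterates over topics (columns).
    return [sum(column) / num_docs for column in zip(*normalised)]
```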

plot_author_topic_heatmap(topic_labels=None, author_labels=None, selected_topics=None, figsize=(16, 12), save_path=None)[source]

Heatmap of mean normalised topic proportions per author (topics x authors).

Authors are sorted by their dominant topic so similar authors cluster together.

Parameters:

topic_labels (Optional[dict]) – {topic_index: "label"}
author_labels (Optional[dict]) – {author_index: "label"}. If None, raw author indices are used.
selected_topics (Optional[list]) – Integer topic indices to restrict the plot. If None, all topics shown.
figsize (tuple)
save_path (str | None)

Return type:

Tuple[Figure, Axes]

plot_ideol_points(group=True, group_var=None, group_labels=None, group_palette=None, topic_labels=None, figsize=(16, 12), save_path=None)[source]

Dot plot of topic-specific ideological positions of all authors.

Topics are ordered by the absolute difference between group-weighted average positions (most polarising topic at the top). Group-weighted averages are shown as black ‘X’ markers connected by a horizontal line.

Parameters:

group (bool) – If True (default), colours dots by group. Falls back to self.i_mu_init if group_var is not provided. If False, all dots are plotted in a single colour.
group_var (Optional[ndarray]) – Author-level grouping variable. Overrides self.i_mu_init when provided. Unique values are treated as group identifiers.
group_labels (Optional[dict]) – Mapping {value: "label"}, e.g. {-1: "D", 0: "I", 1: "R"}. If None, groups are labelled by their raw value.
group_palette (Optional[dict]) – Mapping {label: colour}. If None, uses a default tab10 palette.
topic_labels (Optional[dict]) – Optional {topic_index: "label"} for y-axis tick labels.
selected_topics (list or None) – Integer topic indices to restrict the plot. If None, all topics shown.
figsize (Tuple[float, float]) – Figure size (default (16, 12)).
save_path (Optional[str]) – Path to save the figure.

Return type:

Tuple[Figure, Axes]

plot_iota_credible_intervals(topic_labels=None, covariate_labels=None, selected_topics=None, selected_covariates=None, ci=0.95, figsize=(16, 12), save_path=None)[source]

Single CI plot with selected covariates on y-axis and topics as hue.

Return type:

Tuple[Figure, Axes]

Parameters:

topic_labels (dict | None)
covariate_labels (dict | None)
selected_topics (list | None)
selected_covariates (list | None)
ci (float)
figsize (tuple)
save_path (str | None)

return_ideal_points()[source]

Return ideal point estimates for all authors and topics.

Returns:: DataFrame with columns ['author', 'topic', 'ideal_point', 'std'] sorted by topic then ideal point.
Return type:: DataFrame
Raises:: ValueError – If model has not been trained yet.

return_ideal_covariates()[source]

Return covariate regression coefficient estimates (iota).

Returns:: DataFrame with columns ['covariate', 'topic', 'iota', 'std'] sorted by topic then covariate.
Return type:: DataFrame
Raises:: ValueError – If model has not been trained yet.

class FlaxEncoder(num_topics, hidden, parent=<flax.linen.module._Sentinel object>, name=None)[source]

Bases: Module

Neural network encoder for variational inference.

Parameters:

num_topics (int)
hidden (int)
parent (Type[Module] | Scope | Type[_Sentinel] | None)
name (str | None)

num_topics

Number of topics K.

Type:: int

hidden

Hidden layer dimension.

Type:: int

num_topics: int

hidden: int

name: str | None = None

parent: Type[Module] | Scope | Type[_Sentinel] | None = None

scope: Scope | None = None

class ETM(counts, vocab, num_topics, batch_size, embeddings_mapping, embed_size=300)[source]

Bases: NumpyroModel

Embedded Topic Model (ETM).

Learns topic representations in word embedding space using neural variational inference. Combines neural networks with Bayesian topic modeling for improved interpretability.

Parameters:

counts (csr_matrix) – Document-term matrix of shape (D, V) with word counts.
vocab (ndarray) – Vocabulary array of shape (V,) containing word terms.
num_topics (int) – Number of topics K. Must be > 0.
batch_size (int) – Mini-batch size for stochastic variational inference. Must satisfy 0 < batch_size <= D.
embeddings_mapping (Dict) – Mapping from words to embedding vectors.
embed_size (int) – Embedding dimension (default is 300).

D

Number of documents.

Type:: int

V

Vocabulary size.

Type:: int

K

Number of topics.

Type:: int

rho

Word embedding matrix of shape (V, embed_size).

Type:: np.ndarray

encoder

Neural encoder for variational inference.

Type:: FlaxEncoder

Initialize the ETM model.

Parameters:

counts (csr_matrix) – Document-term matrix.
vocab (ndarray) – Vocabulary array.
num_topics (int) – Number of topics.
batch_size (int) – Mini-batch size.
embeddings_mapping (Dict) – Word to embedding mapping.
embed_size (int) – Embedding dimension (default is 300).

Raises:

TypeError – If counts is not a sparse matrix.
ValueError – If dimensions are invalid or embeddings_mapping is empty.

return_topics()[source]

Extract dominant topic per document and topic proportions.

The topic proportions theta are obtained by passing the normalised bag-of-words through the trained neural encoder and applying softmax.

Return type:

Tuple[ndarray, ndarray]

Returns:

categories (np.ndarray) – Dominant topic index per document (shape: D,).
E_theta (np.ndarray) – Document-topic proportions (shape: D, K).

Raises:

ValueError – If model has not been trained yet.

return_beta()[source]

Extract the topic-word distribution matrix.

Computes beta = softmax(rho @ alpha) where rho are the word embeddings and alpha is the learned embedding-to-topic projection matrix.

Returns:: DataFrame of shape (V, K) with words as index and topics as columns. Each column sums to 1.
Return type:: DataFrame
Raises:: ValueError – If model has not been trained yet.
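
The computation beta = softmax(rho @ alpha) can be traced in plain Python. In this sketch the helper names are hypothetical, rho has shape (V, embed_size), alpha has shape (embed_size, K), and the softmax runs over the vocabulary axis so that each topic column sums to 1, as stated above.

```python
import math


def matmul(a, b):
    """Product of two matrices given as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def column_softmax(matrix):
    """Softmax over rows within each column, so every column sums to 1."""
    columns = []
    for col in zip(*matrix):
        peak = max(col)  # subtract the maximum for numerical stability
        exps = [math.exp(v - peak) for v in col]
        total = sum(exps)
        columns.append([e / total for e in exps])
    return [list(row) for row in zip(*columns)]


def etm_beta(rho, alpha):
    """Topic-word matrix of shape (V, K)."""
    return column_softmax(matmul(rho, alpha))
```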

class Metrics(loss=<factory>, coherence_scores=None, diversity=None)[source]

Bases: object

Data class for storing training and evaluation metrics.

Tracks model performance during training by recording loss values at each iteration, and stores topic-quality metrics computed post-fitting.

Parameters:

loss (List[Any])
coherence_scores (DataFrame | None)
diversity (float | None)

loss

List of loss values for each training iteration.

Type:: List[float]

coherence_scores

Per-topic coherence scores computed by NumpyroModel.compute_topic_coherence().

Type:: pd.DataFrame or None

diversity

Topic diversity score computed by NumpyroModel.compute_topic_diversity().

Type:: float or None

Examples

>>> metrics = Metrics(loss=[])
>>> metrics.loss.append(0.5)
>>> len(metrics.loss)
1

loss: List[Any]

coherence_scores: DataFrame | None = None

diversity: float | None = None

reset()[source]

Reset all metrics to empty state.

Return type:: None

last_loss()[source]

Get the most recent loss value.

Return type:: Any
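
A minimal stand-in for this data class is sketched below to show why the per-instance loss list needs a default factory. MetricsSketch is a hypothetical name, and returning None from last_loss() when no loss has been recorded is an assumption; the library's behaviour in that case is not documented here.

```python
from dataclasses import dataclass, field
from typing import Any, List, Optional


@dataclass
class MetricsSketch:
    # default_factory gives every instance its own list (not shared).
    loss: List[Any] = field(default_factory=list)
    coherence_scores: Optional[Any] = None
    diversity: Optional[float] = None

    def reset(self) -> None:
        """Reset all metrics to their empty state."""
        self.loss = []
        self.coherence_scores = None
        self.diversity = None

    def last_loss(self) -> Any:
        """Most recent loss value, or None if nothing was recorded (assumed)."""
        return self.loss[-1] if self.loss else None
```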