API Reference (Auto-generated)
- class NumpyroModel[source]
Bases:
ABC
Abstract base class for all probabilistic models. Each model must implement at least its own model and guide.
- estimated_params
Estimated parameters after training.
- Type:
dict
- D
Number of documents.
- Type:
int
- V
Vocabulary size.
- Type:
int
- batch_size
Mini-batch size for stochastic variational inference.
- Type:
int
- counts
Document-term matrix.
- Type:
scipy.sparse.csr_matrix
- vocab
Vocabulary array.
- Type:
np.ndarray
- K
Number of topics.
- Type:
int
Initialize base model with per-instance metrics.
- train_step(num_steps, lr, random_seed=None, jit_compile=True, cache_dense_counts=None, dense_cache_max_gb=0.75)[source]
Train the model using Stochastic Variational Inference (SVI).
- Parameters:
num_steps (int) – Number of training iterations. Must be > 0.
lr (float) – Learning rate for the optimizer. Must be > 0.
random_seed (Optional[int]) – Seed for JAX random number generator. If provided, ensures reproducible results. Default is None (random initialization).
jit_compile (bool) – Whether to JIT compile SVI updates. Keep enabled for long runs; disable to avoid compile overhead in very short runs.
cache_dense_counts (Optional[bool]) – If True, cache sparse counts as dense array for faster batching. If None, auto-enable when the estimated dense matrix size fits in dense_cache_max_gb.
dense_cache_max_gb (float) – Maximum dense cache size in GB used by auto mode.
- Returns:
Estimated parameters after training.
- Return type:
Dict[str, Any]
- Raises:
ValueError – If num_steps <= 0 or lr <= 0.
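The auto mode of cache_dense_counts can be pictured as a simple size check against dense_cache_max_gb. The sketch below is not the library's implementation: the helper name `should_cache_dense`, the 4-byte (float32) element size, and the convention 1 GB = 1024**3 bytes are all assumptions made for illustration.

```python
def should_cache_dense(n_docs, vocab_size, max_gb=0.75, bytes_per_value=4):
    """Return True if a dense (n_docs, vocab_size) array fits within max_gb.

    Hypothetical helper: assumes float32 storage and 1 GB = 1024**3 bytes.
    """
    estimated_bytes = n_docs * vocab_size * bytes_per_value
    estimated_gb = estimated_bytes / 1024**3
    return estimated_gb <= max_gb


# A small corpus fits easily; a very large one does not.
small = should_cache_dense(100, 500)
large = should_cache_dense(1_000_000, 50_000)
```

Under these assumptions, passing cache_dense_counts=True or False explicitly would bypass the check, while None would defer to it.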
- return_topics()[source]
Return the topics for each document.
- Return type:
Tuple[ndarray, ndarray]
- Returns:
categories (np.ndarray) – Array of topic indices for each document (shape: D,).
E_theta (np.ndarray) – Estimated topic proportions for each document (shape: D, K).
- Raises:
ValueError – If model has not been trained yet (no estimated parameters).
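The relationship between the two return values can be illustrated with plain NumPy. This is a sketch that assumes categories is the index of the largest entry in each row of E_theta; the reference above does not state how the base class derives it, and the matrix below is made-up data.

```python
import numpy as np

# Hypothetical document-topic proportions for D=3 documents and K=2 topics.
E_theta = np.array([
    [0.7, 0.3],
    [0.2, 0.8],
    [0.5, 0.5],
])

# Dominant topic per document; ties resolve to the lowest topic index.
categories = E_theta.argmax(axis=1)
```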
- return_beta()[source]
Return the beta matrix (word-topic associations) for the model.
- Returns:
DataFrame with words as index and topics as columns, containing word-topic probability estimates.
- Return type:
DataFrame
- Raises:
ValueError – If model has not been trained yet (no estimated parameters).
- plot_model_loss(window=10, save_path=None)[source]
Plot the training loss over time with full and smoothed curves.
- Parameters:
window (int) – Window size for moving average smoothing. Default is 10.
save_path (Optional[str]) – Path to save the figure.
- Return type:
Tuple[Figure,Axes]
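The smoothed curve is a moving average of the loss values over window steps. A minimal sketch of such smoothing follows; the function name `moving_average` and the choice to drop the first window - 1 points (NumPy's "valid" mode) are assumptions, and the library may handle the edges differently.

```python
import numpy as np


def moving_average(values, window=10):
    """Simple moving average; output has len(values) - window + 1 points."""
    values = np.asarray(values, dtype=float)
    if window < 1 or window > values.size:
        raise ValueError("window must satisfy 1 <= window <= len(values)")
    kernel = np.ones(window) / window
    # 'valid' keeps only positions where the window fully overlaps the data.
    return np.convolve(values, kernel, mode="valid")


# Hypothetical loss trajectory: 50 steps of steadily decreasing loss.
losses = np.linspace(100.0, 10.0, 50)
smoothed = moving_average(losses, window=10)
```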
- plot_topic_wordclouds(n_words=50, figsize=(16, 12), save_path=None)[source]
Plot wordclouds for each topic based on beta values.
- Parameters:
n_words (int) – Maximum number of words per wordcloud (default 50).
figsize (Tuple[int, int]) – Figure size (width, height) (default (16, 12)).
save_path (Optional[str]) – Path to save the figure.
- Return type:
Tuple[Figure,ndarray]
- summary(n_top_words=5)[source]
Return a formatted text summary of the fitted model.
Includes model class name, dimensions, loss trajectory, and top words per topic. Subclasses can extend the output by overriding _summary_extra().
- Parameters:
n_top_words (int) – Number of top words to show per topic (default 5).
- Returns:
Multi-line summary string.
- Return type:
str
- compute_topic_coherence(texts=None, metric='c_npmi', top_n=10)[source]
Compute topic coherence scores (NPMI or UMass).
- Parameters:
texts (Optional[List[List[str]]]) – Tokenised reference corpus. If None, word co-occurrence is estimated from self.counts and self.vocab.
metric (str) – Coherence measure (default 'c_npmi').
top_n (int) – Number of top words per topic used for the calculation (default 10).
- Returns:
DataFrame with columns ['topic', 'coherence'].
- Return type:
DataFrame
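To make the NPMI measure concrete, the sketch below scores one word pair from document-level co-occurrence in a dense document-term array. It uses one common definition, NPMI = log(P(i,j) / (P(i) P(j))) / (-log P(i,j)); whether compute_topic_coherence uses exactly this estimator is not stated in this reference, and the helper name `npmi_pair` and the toy matrix are hypothetical.

```python
import numpy as np


def npmi_pair(doc_term, i, j, eps=1e-12):
    """NPMI of words i and j based on which documents contain them."""
    present = np.asarray(doc_term) > 0
    p_i = present[:, i].mean()
    p_j = present[:, j].mean()
    p_ij = (present[:, i] & present[:, j]).mean()
    if p_ij == 0:
        # Words never co-occur: NPMI takes its minimum value by convention.
        return -1.0
    pmi = np.log(p_ij / (p_i * p_j))
    # eps guards against division by zero when p_ij == 1.
    return float(pmi / (-np.log(p_ij) + eps))


# Toy corpus: 4 documents, 3 words. Words 0 and 1 always appear together.
X = np.array([
    [1, 1, 0],
    [2, 1, 0],
    [0, 0, 3],
    [0, 0, 1],
])
together = npmi_pair(X, 0, 1)
apart = npmi_pair(X, 0, 2)
```

A per-topic score would then average such pair values over the top_n words of the topic.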
- compute_topic_diversity(top_n=25)[source]
Compute topic diversity (Dieng et al., 2020).
Measures the fraction of unique words across all topics’ top-n lists. Values near 1 indicate diverse topics; values near 0 indicate redundancy.
- Parameters:
top_n (int) – Number of top words per topic (default 25).
- Returns:
Topic diversity score in [0, 1].
- Return type:
float
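The definition above (unique words divided by the total number of top-n slots) can be written in a few lines. This is an illustrative sketch, not the library's code; the function name `topic_diversity` and the word lists are made up.

```python
def topic_diversity(top_words_per_topic):
    """Fraction of unique words across all topics' top-n word lists."""
    all_words = [word for topic in top_words_per_topic for word in topic]
    if not all_words:
        raise ValueError("at least one top word is required")
    return len(set(all_words)) / len(all_words)


# Two topics with top-3 lists that share the word "tax".
topics = [
    ["tax", "budget", "trade"],
    ["tax", "health", "care"],
]
score = topic_diversity(topics)  # 5 unique words out of 6 slots
```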
- plot_topic_prevalence(save_path=None)[source]
Horizontal bar chart of mean topic prevalence across documents.
- Parameters:
save_path (Optional[str]) – Path to save the figure.
- Return type:
Tuple[Figure,Axes]
- plot_topic_correlation(save_path=None)[source]
Heatmap of pairwise cosine similarity between topic-word vectors.
- Parameters:
save_path (Optional[str]) – Path to save the figure.
- Return type:
Tuple[Figure,Axes]
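The quantity shown in the heatmap, pairwise cosine similarity between topic-word vectors, can be computed as below. The sketch assumes a beta matrix of shape (V, K) with words as rows, as returned by return_beta(); the helper name `topic_cosine_similarity` and the example values are hypothetical.

```python
import numpy as np


def topic_cosine_similarity(beta):
    """Pairwise cosine similarity between the K columns of a (V, K) matrix."""
    beta = np.asarray(beta, dtype=float)
    norms = np.linalg.norm(beta, axis=0, keepdims=True)
    # Clip to avoid dividing by zero for an all-zero topic column.
    unit = beta / np.clip(norms, 1e-12, None)
    return unit.T @ unit


# V=2 words, K=3 topics: topics 0 and 2 point in the same direction.
beta = np.array([
    [1.0, 0.0, 2.0],
    [0.0, 1.0, 0.0],
])
sim = topic_cosine_similarity(beta)
```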
- plot_document_topic_heatmap(n_docs=50, sort_by_topic=False, save_path=None)[source]
Heatmap of document-topic proportions for a subset of documents.
- Parameters:
n_docs (int) – Number of documents to display (default 50).
sort_by_topic (bool) – If True, sort documents by their dominant topic (default False).
save_path (Optional[str]) – Path to save the figure.
- Return type:
Tuple[Figure,Axes]
- class PF(counts, vocab, num_topics, batch_size)[source]
Bases:
NumpyroModel
Poisson Factorization (PF) topic model.
Unsupervised baseline topic model using Poisson likelihood for word counts. Suitable for exploratory topic discovery in document collections.
This model learns low-rank representations of documents and words, enabling interpretable topic extraction and downstream analysis.
- Parameters:
counts (csr_matrix) – Document-term matrix of shape (D, V) with word counts.
vocab (ndarray) – Vocabulary array of shape (V,) containing word terms.
num_topics (int) – Number of topics K. Must be > 0.
batch_size (int) – Mini-batch size for stochastic variational inference. Must satisfy 0 < batch_size <= D.
- D
Number of documents.
- Type:
int
- V
Vocabulary size.
- Type:
int
- K
Number of topics.
- Type:
int
- counts
Document-term matrix.
- Type:
scipy.sparse.csr_matrix
- vocab
Vocabulary array.
- Type:
np.ndarray
Examples
>>> from scipy.sparse import random
>>> import numpy as np
>>> from topicmodels import PF
>>> counts = random(100, 500, density=0.01, format='csr')
>>> vocab = np.array([f'word_{i}' for i in range(500)])
>>> model = PF(counts, vocab, num_topics=10, batch_size=32)
>>> params = model.train_step(num_steps=100, lr=0.01, random_seed=42)
>>> topics, proportions = model.return_topics()
Initialize the PF model with input validation.
- Parameters:
counts (csr_matrix) – Document-term matrix.
vocab (ndarray) – Vocabulary array.
num_topics (int) – Number of topics.
batch_size (int) – Mini-batch size.
- Raises:
TypeError – If counts is not a sparse matrix or vocab is not array-like.
ValueError – If dimensions are invalid or inconsistent.
- class SPF(counts, vocab, keywords, residual_topics, batch_size)[source]
Bases:
NumpyroModel
Seeded Poisson Factorization (SPF) topic model.
Guided topic modeling with keyword priors. SPF allows researchers to incorporate domain knowledge by specifying seed words for each topic, which increases the topical prevalence of those words in the model.
- Parameters:
counts (csr_matrix) – Document-term matrix of shape (D, V) with word counts.
vocab (ndarray) – Vocabulary array of shape (V,) containing word terms.
keywords (Dict[Any, List[str]]) – Dictionary mapping topic identifiers to lists of seed words. Keys can be strings or integers. Example: {0: ['climate', 'environment'], 1: ['economy', 'trade']} or {'climate': ['climate', 'environment'], 'economy': ['economy', 'trade']}.
residual_topics (int) – Number of residual (unsupervised) topics. Must be >= 0.
batch_size (int) – Mini-batch size for stochastic variational inference. Must satisfy 0 < batch_size <= D.
- D
Number of documents.
- Type:
int
- V
Vocabulary size.
- Type:
int
- K
Total number of topics (seeded + residual).
- Type:
int
- counts
Document-term matrix.
- Type:
scipy.sparse.csr_matrix
- vocab
Vocabulary array.
- Type:
np.ndarray
- keywords
Seed words for guided topics.
- Type:
Dict[int, List[str]]
- residual_topics
Number of unsupervised topics.
- Type:
int
Examples
>>> from scipy.sparse import random
>>> import numpy as np
>>> from topicmodels import SPF
>>> counts = random(100, 500, density=0.01, format='csr')
>>> vocab = np.array([f'word_{i}' for i in range(500)])
>>> keywords = {
...     0: ['word_1', 'word_2', 'word_3'],
...     1: ['word_10', 'word_11', 'word_12'],
... }
>>> model = SPF(counts, vocab, keywords, residual_topics=5, batch_size=32)
>>> params = model.train_step(num_steps=100, lr=0.01, random_seed=42)
Initialize the SPF model with input validation.
- Parameters:
counts (csr_matrix) – Document-term matrix.
vocab (ndarray) – Vocabulary array.
keywords (Dict[Any, List[str]]) – Seed words for each seeded topic.
residual_topics (int) – Number of unsupervised topics.
batch_size (int) – Mini-batch size.
- Raises:
TypeError – If counts is not sparse or keywords is not dict.
ValueError – If dimensions are invalid or keywords contain unknown terms.
- return_topics()[source]
Return the topics for each document. Reimplemented from the base class because of the guided topic modeling approach, in which topics are not fully unsupervised.
- Returns:
topics (numpy.ndarray) – Array of recoded topics.
E_theta (numpy.ndarray) – Estimated topic proportions for each document.
- Return type:
tuple
- return_beta()[source]
Return the beta matrix for the model, i.e. the topic-word intensities. Reimplemented from the base class to account for the higher rates assigned to seed words.
- Returns:
DataFrame containing the beta matrix with words as rows and topics as columns.
- Return type:
pandas.DataFrame
- plot_seed_effectiveness(save_path=None)[source]
Grouped bar chart comparing seed vs. non-seed word weights per topic.
For every seeded topic, shows the mean beta weight of seed words alongside the mean beta weight of all other words. Helps assess whether seed words actually dominate their intended topics.
- Parameters:
save_path (Optional[str]) – Path to save the figure.
- Return type:
Tuple[Figure,ndarray]
- class CPF(counts, vocab, num_topics, batch_size, X_design_matrix=None)[source]
Bases:
NumpyroModel
Covariate Poisson Factorization (CPF) topic model.
Topic model that incorporates document-level covariates to capture how topics vary with external variables (e.g., author attributes, temporal features).
- Parameters:
counts (csr_matrix) – Document-term matrix of shape (D, V) with word counts.
vocab (ndarray) – Vocabulary array of shape (V,) containing word terms.
covariates (np.ndarray or pd.DataFrame) – Document-level covariates of shape (D, C), where C is the number of features.
num_topics (int) – Number of topics K. Must be > 0.
batch_size (int) – Mini-batch size for stochastic variational inference. Must satisfy 0 < batch_size <= D.
X_design_matrix (ndarray | None)
- D
Number of documents.
- Type:
int
- V
Vocabulary size.
- Type:
int
- K
Number of topics.
- Type:
int
- C
Number of covariate features.
- Type:
int
- counts
Document-term matrix.
- Type:
scipy.sparse.csr_matrix
- vocab
Vocabulary array.
- Type:
np.ndarray
- X_design_matrix
Design matrix of covariates.
- Type:
jnp.ndarray
Examples
>>> from scipy.sparse import random
>>> import numpy as np
>>> from topicmodels import CPF
>>> counts = random(100, 500, density=0.01, format='csr')
>>> vocab = np.array([f'word_{i}' for i in range(500)])
>>> covariates = np.random.randn(100, 3)  # 3 covariate features
>>> model = CPF(counts, vocab, num_topics=10, batch_size=32, X_design_matrix=covariates)
>>> params = model.train_step(num_steps=100, lr=0.01, random_seed=42)
Initialize the CPF model with input validation.
- Parameters:
counts (csr_matrix) – Document-term matrix.
vocab (ndarray) – Vocabulary array.
num_topics (int) – Number of topics.
batch_size (int) – Mini-batch size.
X_design_matrix (Optional[ndarray]) – Document-level covariates.
- Raises:
TypeError – If counts is not sparse or covariates have wrong type.
ValueError – If dimensions are invalid or inconsistent.
- return_covariate_effects()[source]
Return point estimates of covariate effects (lambda).
- Returns:
DataFrame with covariates as rows and topics as columns.
- Return type:
DataFrame
- Raises:
ValueError – If model has not been trained yet.
- return_covariate_effects_ci(ci=0.95)[source]
Return covariate effects with credible intervals.
Uses the Normal variational posterior for lambda: mean = lambda_location, CI = mean +/- z * lambda_scale.
- Parameters:
ci (float) – Credible-interval level (default 0.95).
- Returns:
DataFrame with columns ['covariate', 'topic', 'mean', 'lower', 'upper'].
- Return type:
DataFrame
- Raises:
ValueError – If model has not been trained yet.
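The interval formula mean +/- z * lambda_scale can be sketched with the standard library alone. The sketch assumes z is the two-sided standard Normal quantile for the requested level (about 1.96 for ci=0.95), which the reference implies but does not spell out; the function name and the numbers are illustrative.

```python
from statistics import NormalDist


def normal_credible_interval(mean, scale, ci=0.95):
    """Symmetric interval mean +/- z * scale for a Normal posterior."""
    if not 0.0 < ci < 1.0:
        raise ValueError("ci must lie strictly between 0 and 1")
    # Two-sided quantile: ci=0.95 leaves 2.5% in each tail.
    z = NormalDist().inv_cdf(0.5 + ci / 2.0)
    return mean - z * scale, mean + z * scale


# Hypothetical posterior for one covariate-topic pair.
lower, upper = normal_credible_interval(mean=0.4, scale=0.1, ci=0.95)
```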
- plot_cov_effects(ci=0.95, topics=None, figsize_per_topic=(5.0, 0.28), save_path=None)[source]
Forest plot of covariate effects (lambda) with credible intervals.
- Parameters:
ci (float) – Credible-interval level (default 0.95).
topics (Optional[List[str]]) – Subset of topic names to plot. If None, all topics are plotted.
figsize_per_topic (Tuple[float, float]) – (width, height_per_covariate) for panel sizing.
save_path (Optional[str]) – Path to save the figure.
- Return type:
Tuple[Figure,ndarray]
- class CSPF(counts, vocab, keywords, residual_topics, batch_size, X_design_matrix=None)[source]
Bases:
NumpyroModel
Covariate Seeded Poisson Factorization with grouped design-adaptive shrinkage.
This implementation preserves the CSPF interface while replacing the internal covariate-effect specification with the model in CSPF_model_new.tex.
Initialize base model with per-instance metrics.
- Parameters:
counts (csr_matrix)
vocab (ndarray)
keywords (Dict[Any, List[str]])
residual_topics (int)
batch_size (int)
X_design_matrix (ndarray | None)
- return_topics()[source]
Return the topics for each document.
- Returns:
categories (np.ndarray) – Array of topic indices for each document (shape: D,).
E_theta (np.ndarray) – Estimated topic proportions for each document (shape: D, K).
- Raises:
ValueError – If model has not been trained yet (no estimated parameters).
- return_beta()[source]
Return the beta matrix (word-topic associations) for the model.
- Returns:
DataFrame with words as index and topics as columns, containing word-topic probability estimates.
- Return type:
pd.DataFrame
- Raises:
ValueError – If model has not been trained yet (no estimated parameters).
- return_covariate_effects()[source]
Return point estimates of covariate effects (lambda).
- Returns:
DataFrame with covariates as rows and topics as columns.
- Return type:
DataFrame
- return_covariate_effects_ci(ci=0.95)[source]
Return covariate effects with credible intervals.
Uses the Normal variational posterior for lambda: mean = lambda_location, CI = mean +/- z * lambda_scale.
- Parameters:
ci (float) – Credible-interval level (default 0.95).
- Returns:
DataFrame with columns ['covariate', 'topic', 'mean', 'lower', 'upper'].
- Return type:
DataFrame
- Raises:
ValueError – If model has not been trained yet.
- plot_cov_effects(ci=0.95, include_shrinkage=False, topics=None, group_colors=None, figsize_per_topic=(5.0, 0.28), save_path=None)[source]
Plot covariate effects as forest plots.
- Parameters:
ci (float) – Credible-interval level (default 0.95 for a 95% CI).
include_shrinkage (bool) – If True, additionally produce forest plots for \(\lambda_0\) (intercept), \(\tau^2_k\) (global shrinkage), and \(\delta^2_{gk}\) (group shrinkage).
topics (Optional[List[str]]) – Subset of topic names to plot. If None (default), all topics are plotted.
group_colors (Optional[Dict[str, str]]) – Mapping {group_name: colour} used to colour the covariate labels on the y-axis. Groups are inferred from the :: separator in covariate names. If None, a default qualitative palette is used.
figsize_per_topic (Tuple[float, float]) – (width, height_per_covariate) used to auto-size the lambda panels. Default (5.0, 0.28).
save_path (Optional[str]) – Directory (or file path) where figures are saved. When a directory is given, individual PNGs are written; when a file path is given, only the lambda figure is saved there. If None, figures are not saved.
- Returns:
{"lambda": (fig, axes), ...} and, when include_shrinkage is True, additional entries "lambda_intercept", "tau2", "delta2".
- Return type:
Dict[str,Tuple[Figure,ndarray]]
- class TBIP(counts, vocab, num_topics, authors, batch_size, time_varying=False, beta_shape_init=None, beta_rate_init=None)[source]
Bases:
NumpyroModel
TBIP Model
This class models topic-based ideal points (TBIP) in a set of documents authored by multiple individuals.
Initialize the TBIP model.
- Parameters:
counts (csr_matrix) – A 2D sparse array of shape (D, V) representing the word counts in each document, where D is the number of documents and V is the vocabulary size.
vocab (ndarray) – A vocabulary array of shape (V,) containing word terms.
num_topics (int) – The number of topics (K). Must be > 0.
authors (ndarray) – An array of authors for each document.
batch_size (int) – The number of documents to be processed in each batch. Must satisfy 0 < batch_size <= D.
time_varying (bool) – Whether to model time-varying ideal points (default is False).
beta_shape_init (ndarray) – Initial shape parameters for the topic-word distributions (default is None). Must have shape (K, V) if provided.
beta_rate_init (ndarray) – Initial rate parameters for the topic-word distributions (default is None). Must have shape (K, V) if provided.
- Raises:
TypeError – If counts is not a sparse matrix.
ValueError – If dimensions are invalid or time_varying parameters have wrong shape.
- train_step(num_steps, lr)[source]
Train the TBIP model using stochastic variational inference.
Custom train function specified exclusively for TBIP objects.
- Parameters:
num_steps (int) – Number of training steps. Must be > 0.
lr (float) – Learning rate for the optimizer. Must be > 0.
- Returns:
A dictionary containing the estimated parameter values after training.
- Return type:
dict
- Raises:
ValueError – If num_steps or lr are invalid.
- return_topics()[source]
Return the dominant topic for each document.
Uses the LogNormal variational posterior for theta: E[theta] = exp(mu + sigma^2 / 2).
- Return type:
Tuple[ndarray, ndarray]
- Returns:
categories (np.ndarray) – Array of topic indices for each document (shape: D,).
E_theta (np.ndarray) – Estimated topic proportions for each document (shape: D, K).
- Raises:
ValueError – If model has not been trained yet.
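The LogNormal expectation E[theta] = exp(mu + sigma^2 / 2) is a one-line computation once the variational location and scale are available. The sketch below uses made-up mu and sigma arrays; the names under which TBIP stores these parameters are not given in this reference, so the variable names here are placeholders.

```python
import numpy as np


def lognormal_mean(mu, sigma):
    """Mean of LogNormal(mu, sigma): exp(mu + sigma**2 / 2), elementwise."""
    mu = np.asarray(mu, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    return np.exp(mu + sigma**2 / 2.0)


# Hypothetical variational parameters for D=1 document and K=2 topics.
mu = np.array([[0.0, 1.0]])
sigma = np.array([[1.0, 0.5]])
E_theta = lognormal_mean(mu, sigma)

# Index of the topic with the largest expected intensity in each document.
dominant = E_theta.argmax(axis=1)
```

Note that the mean exceeds exp(mu) whenever sigma > 0, so ignoring the variance term would understate the expected intensity.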
- return_beta()[source]
Return the topic-word association matrix.
Uses the LogNormal variational posterior for beta: E[beta] = exp(mu + sigma^2 / 2).
- Returns:
DataFrame with words as index and topics as columns.
- Return type:
DataFrame
- Raises:
ValueError – If model has not been trained yet.
- return_ideal_points()[source]
Return ideal point estimates for all authors.
- Returns:
DataFrame with columns ['author', 'ideal_point', 'std'] sorted by ideal point.
- Return type:
DataFrame
- Raises:
ValueError – If model has not been trained yet.
- return_ideological_words(topic, n=10)[source]
Return words with the strongest ideological loading for a topic.
For a given topic k, ranks words by the magnitude of their ideological coefficient eta[k, :]. Words with large positive eta are associated with higher ideal-point values, and vice versa.
- Parameters:
topic (int) – Topic index (0-based).
n (int) – Number of top words per direction (default 10).
- Returns:
DataFrame with columns ['word', 'eta', 'direction'] where direction is 'positive' or 'negative'.
- Return type:
DataFrame
- Raises:
ValueError – If model has not been trained or topic index is invalid.
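The ranking described above can be approximated by sorting one row of eta and taking the n largest positive and n most negative entries. This is only a sketch of that idea: the function name `top_ideological_words`, the list-of-tuples output (instead of a DataFrame), and the sample values are assumptions, and the library's exact tie-breaking and ordering may differ.

```python
import numpy as np


def top_ideological_words(eta_k, vocab, n=10):
    """Return up to n (word, eta, direction) rows for each direction."""
    eta_k = np.asarray(eta_k, dtype=float)
    order = np.argsort(eta_k)  # ascending: most negative first
    rows = []
    for idx in order[::-1][:n]:
        if eta_k[idx] > 0:
            rows.append((str(vocab[idx]), float(eta_k[idx]), "positive"))
    for idx in order[:n]:
        if eta_k[idx] < 0:
            rows.append((str(vocab[idx]), float(eta_k[idx]), "negative"))
    return rows


# Hypothetical coefficients for one topic over a 4-word vocabulary.
vocab = np.array(["tax", "war", "care", "jobs"])
eta_k = np.array([0.5, -1.2, 0.1, 2.0])
rows = top_ideological_words(eta_k, vocab, n=1)
```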
- plot_ideal_points(selected_authors=None, show_ci=False, ci=0.95, figsize=(12, 2), save_path=None)[source]
Plot the ideal points of authors on a 1-D axis.
- Parameters:
selected_authors (Optional[list]) – Authors to label (default: all authors).
show_ci (bool) – If True, display horizontal error bars showing the credible interval derived from sigma_x.
ci (float) – Credible-interval level when show_ci is True (default 0.95).
figsize (Tuple[float, float]) – Figure size (default (12, 2)).
save_path (Optional[str]) – Path to save the figure.
- Return type:
Tuple[Figure,Axes]
- class FlaxEncoder(num_topics, hidden, parent=<flax.linen.module._Sentinel object>, name=None)[source]
Bases:
Module
Neural network encoder for variational inference.
- Parameters:
num_topics (int)
hidden (int)
parent (Type[Module] | Scope | Type[_Sentinel] | None)
name (str | None)
- num_topics
Number of topics K.
- Type:
int
- hidden
Hidden layer dimension.
- Type:
int
- num_topics: int
- hidden: int
- name: str | None = None
- parent: Type[Module] | Scope | Type[_Sentinel] | None = None
- scope: Scope | None = None
- class ETM(counts, vocab, num_topics, batch_size, embeddings_mapping, embed_size=300)[source]
Bases:
NumpyroModel
Embedded Topic Model (ETM).
Learns topic representations in word embedding space using neural variational inference. Combines neural networks with Bayesian topic modeling for improved interpretability.
- Parameters:
counts (csr_matrix) – Document-term matrix of shape (D, V) with word counts.
vocab (ndarray) – Vocabulary array of shape (V,) containing word terms.
num_topics (int) – Number of topics K. Must be > 0.
batch_size (int) – Mini-batch size for stochastic variational inference. Must satisfy 0 < batch_size <= D.
embeddings_mapping (Dict) – Mapping from words to embedding vectors.
embed_size (int) – Embedding dimension (default is 300).
- D
Number of documents.
- Type:
int
- V
Vocabulary size.
- Type:
int
- K
Number of topics.
- Type:
int
- rho
Word embedding matrix of shape (V, embed_size).
- Type:
np.ndarray
- encoder
Neural encoder for variational inference.
- Type:
FlaxEncoder
Initialize the ETM model.
- Parameters:
counts (csr_matrix) – Document-term matrix.
vocab (ndarray) – Vocabulary array.
num_topics (int) – Number of topics.
batch_size (int) – Mini-batch size.
embeddings_mapping (Dict) – Word to embedding mapping.
embed_size (int) – Embedding dimension (default is 300).
- Raises:
TypeError – If counts is not a sparse matrix.
ValueError – If dimensions are invalid or embeddings_mapping is empty.
- return_topics()[source]
Extract dominant topic per document and topic proportions.
The topic proportions theta are obtained by passing the normalised bag-of-words through the trained neural encoder and applying softmax.
- Return type:
Tuple[ndarray, ndarray]
- Returns:
categories (np.ndarray) – Dominant topic index per document (shape: D,).
E_theta (np.ndarray) – Document-topic proportions (shape: D, K).
- Raises:
ValueError – If model has not been trained yet.
- return_beta()[source]
Extract the topic-word distribution matrix.
Computes beta = softmax(rho @ alpha) where rho are the word embeddings and alpha is the learned embedding-to-topic projection matrix.
- Returns:
DataFrame of shape (V, K) with words as index and topics as columns. Each column sums to 1.
- Return type:
DataFrame
- Raises:
ValueError – If model has not been trained yet.
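The computation beta = softmax(rho @ alpha) can be reproduced in NumPy as follows. The sketch assumes rho has shape (V, embed_size), alpha has shape (embed_size, K), and the softmax runs over the vocabulary axis, which is consistent with the statement that each column sums to 1; the random matrices stand in for trained parameters.

```python
import numpy as np


def topic_word_distribution(rho, alpha):
    """Column-wise softmax of rho @ alpha, giving a (V, K) matrix."""
    logits = np.asarray(rho) @ np.asarray(alpha)
    # Subtract the per-topic maximum for numerical stability.
    logits = logits - logits.max(axis=0, keepdims=True)
    weights = np.exp(logits)
    return weights / weights.sum(axis=0, keepdims=True)


rng = np.random.default_rng(0)
rho = rng.normal(size=(6, 4))    # V=6 words, embedding dimension 4
alpha = rng.normal(size=(4, 3))  # K=3 topics
beta = topic_word_distribution(rho, alpha)
```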
- class Metrics(loss=<factory>, coherence_scores=None, diversity=None)[source]
Bases:
object
Data class for storing training and evaluation metrics.
Tracks model performance during training by recording loss values at each iteration, and stores topic-quality metrics computed post-fitting.
- Parameters:
loss (List[Any])
coherence_scores (DataFrame | None)
diversity (float | None)
- loss
List of loss values for each training iteration.
- Type:
List[float]
- coherence_scores
Per-topic coherence scores computed by NumpyroModel.compute_topic_coherence().
- Type:
pd.DataFrame or None
- diversity
Topic diversity score computed by NumpyroModel.compute_topic_diversity().
- Type:
float or None
Examples
>>> metrics = Metrics(loss=[])
>>> metrics.loss.append(0.5)
>>> len(metrics.loss)
1
- loss: List[Any]
- coherence_scores: DataFrame | None = None
- diversity: float | None = None
- reset()[source]
Reset all metrics to empty state.
- Return type:
None
- last_loss()[source]
Get the most recent loss value.
- Return type:
Any