Covariate Models (CPF & CSPF)
Covariate-augmented models extend basic topic modeling to account for document-level metadata. Use them when you want to understand how topics vary across groups or conditions.
Covariate Poisson Factorization (CPF)
CPF models how external factors (covariates) influence topic distributions.
When to use:
✓ Document metadata available (author, date, category)
✓ Want to understand topic variation across groups
✓ Interested in covariate effects on topics
Example use cases:
Topic proportions across different authors
How topics evolve over time
Topic differences between datasets/corpora
Topic structure by document category
The Model
CPF extends PF by making topic distributions depend on covariates:
Standard PF:
θ_d ~ Gamma(α, α) [does not depend on document covariates]
CPF:
θ_d ~ Gamma(exp(γ + x_d * β), α) [depends on document covariates x_d]
Where:
- x_d = covariate values for document d
- β = covariate effects (regression coefficients)
- γ = baseline (intercept)
Interpretation:
If β_k > 0: Higher covariate value → higher topic k intensity
If β_k < 0: Higher covariate value → lower topic k intensity
If β_k ≈ 0: Covariate has little effect on topic k
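To make the sign rule concrete, the sketch below evaluates the prior mean of θ_d under the CPF prior shown above. It assumes Gamma(a, b) is parameterized by shape a and rate b, so the mean is a / b; that parameterization is an assumption here, not something this page confirms. The values of γ, β, and α are illustrative, not library defaults.

```python
import numpy as np

def prior_mean_theta(x, gamma, beta, alpha):
    """Prior mean of theta under Gamma(shape=exp(gamma + x @ beta), rate=alpha)."""
    return np.exp(gamma + x @ beta) / alpha

gamma = 0.0                            # baseline (illustrative)
beta = np.array([[0.5, -0.5, 0.0]])    # one covariate, three topics (illustrative)
alpha = 1.0                            # rate (assumed shape-rate parameterization)

x_low = np.array([0.0])    # covariate value 0
x_high = np.array([1.0])   # covariate value 1

# Ratio of prior means when the covariate increases by one unit
ratio = prior_mean_theta(x_high, gamma, beta, alpha) / prior_mean_theta(x_low, gamma, beta, alpha)
print(ratio)
```

Under these assumptions, a one-unit increase in the covariate multiplies the prior mean of topic k by exp(β_k): above 1 for positive β_k, below 1 for negative β_k, and exactly 1 when β_k is zero.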
Usage Example
from poisson_topicmodels import CPF
import numpy as np
import pandas as pd
# Document covariates (e.g., author type scores)
# Shape: (num_documents, num_covariates)
covariates = np.random.randn(100, 2) # 2 covariates
X = pd.DataFrame(covariates, columns=['covariate_0', 'covariate_1'])
model = CPF(
counts=counts,
vocab=vocab,
num_topics=10,
X_design_matrix=X,
batch_size=32,
)
params = model.train_step(num_steps=200, lr=0.01, random_seed=42)
# Extract covariate effects
effects = model.return_covariate_effects() # DataFrame (covariates × topics)
Interpreting Covariate Effects
effects = model.return_covariate_effects() # DataFrame
print(effects)
# Get credible intervals
effects_ci = model.return_covariate_effects_ci(ci=0.90)
print(effects_ci.head(10))
# Columns: covariate, topic, mean, lower, upper
# An interval that excludes zero suggests the effect is distinguishable from zero at that credibility level
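For intuition about what such an interval means, here is a minimal sketch that computes an equal-tailed interval from posterior draws and flags whether it excludes zero. The draws are simulated for illustration, and the helper names are hypothetical; how `return_covariate_effects_ci` computes its intervals internally is not documented on this page and may differ.

```python
import numpy as np

def equal_tailed_interval(samples, ci=0.90):
    """Return (lower, upper) quantiles leaving (1 - ci) / 2 mass in each tail."""
    tail = (1.0 - ci) / 2.0
    lower, upper = np.quantile(samples, [tail, 1.0 - tail])
    return float(lower), float(upper)

def excludes_zero(lower, upper):
    """True when the whole interval lies on one side of zero."""
    return lower > 0.0 or upper < 0.0

rng = np.random.default_rng(0)
# Hypothetical posterior draws for two coefficients (not produced by the library)
strong_effect = rng.normal(loc=1.0, scale=0.1, size=5000)
weak_effect = rng.normal(loc=0.0, scale=1.0, size=5000)

strong_ci = equal_tailed_interval(strong_effect)
weak_ci = equal_tailed_interval(weak_effect)
print(strong_ci, excludes_zero(*strong_ci))
print(weak_ci, excludes_zero(*weak_ci))
```

The same test applies row by row to the `lower` and `upper` columns of the interval table: a row counts as excluding zero when `lower > 0` or `upper < 0`.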
Visualization (built-in forest plot):
# Forest plot of covariate effects with credible intervals
fig, axes = model.plot_cov_effects(ci=0.90)
# Or manually:
import matplotlib.pyplot as plt
plt.imshow(effects.values, cmap='RdBu_r', vmin=-3, vmax=3)
plt.xlabel('Topics')
plt.ylabel('Covariates')
plt.colorbar(label='Effect size')
plt.title('Covariate Effects on Topics')
plt.show()
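The fixed color limits of -3 and 3 in the manual plot may clip or wash out the effects, depending on their scale. One option, sketched here with a hypothetical helper, is to derive symmetric limits from the data so that zero stays at the center of the diverging colormap.

```python
import numpy as np

def symmetric_limits(values, floor=1e-8):
    """Return (-m, m) where m is the largest absolute finite value in `values`."""
    arr = np.asarray(values, dtype=float)
    finite = arr[np.isfinite(arr)]
    m = float(np.abs(finite).max()) if finite.size else floor
    m = max(m, floor)  # avoid a zero-width color range
    return -m, m

# Illustrative effects matrix (covariates x topics), not library output
effects_values = np.array([[0.2, -1.4, 0.7],
                           [0.0, 0.9, -0.3]])
vmin, vmax = symmetric_limits(effects_values)
print(vmin, vmax)
```

The resulting pair can be passed as `vmin` and `vmax` to `plt.imshow` in place of the hard-coded values.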
Practical Example: Time Evolution
Analyze how topics change across time periods:
# Create time-based covariate
time_covariate = np.repeat(np.arange(10), 10) # 10 decades, 10 docs each
covariates = time_covariate.reshape(-1, 1) / 10 # Normalize
model = CPF(counts, vocab, num_topics=5, X_design_matrix=covariates, batch_size=32)
model.train_step(num_steps=200, lr=0.01)
# Topic 0's time effect
effects = model.return_covariate_effects()
time_effect = effects.iloc[0, 0]  # DataFrame: row 0 = time covariate, column 0 = topic 0
# If positive: topic 0 increases over time
# If negative: topic 0 decreases over time
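Because the time covariate was divided by 10, the coefficient describes a change of one full unit on the scaled axis, which corresponds to ten decades. Under the prior in "The Model", a change of Δx in the covariate multiplies exp(γ + x_d * β) by exp(β · Δx). The sketch below converts a coefficient into a per-decade factor and a first-to-last-decade factor; the coefficient value is illustrative, not a fitted result.

```python
import math

def multiplicative_change(beta, delta_x):
    """Factor by which exp(gamma + x * beta) changes when x increases by delta_x."""
    return math.exp(beta * delta_x)

time_effect = 0.8   # illustrative coefficient, not a fitted value
scale = 10          # the covariate was computed as decade_index / 10

per_decade = multiplicative_change(time_effect, 1 / scale)
first_to_last = multiplicative_change(time_effect, 9 / scale)  # decade 0 -> decade 9
print(round(per_decade, 3), round(first_to_last, 3))
```

Reporting effects on the original time scale this way keeps the interpretation independent of how the covariate happened to be normalized.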
Covariate Seeded PF (CSPF)
CSPF combines seeded guidance (from SPF) with covariate modeling (from CPF).
Use when:
✓ You have prior knowledge about topics (seeds)
✓ You have document metadata (covariates)
✓ You want guided discovery with metadata effects
Usage
from poisson_topicmodels import CSPF
# Seeds for guided discovery
seeds = {
0: ['election', 'vote', 'candidate'],
1: ['market', 'economy', 'trade'],
}
# Metadata effects
covariates = np.random.randn(100, 1)
model = CSPF(
counts=counts,
vocab=vocab,
keywords=seeds,
residual_topics=0,
X_design_matrix=covariates,
batch_size=32,
)
params = model.train_step(num_steps=200, lr=0.01, random_seed=42)
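A seed word can only guide a topic if it occurs in the vocabulary. How the library treats seed words that are missing from `vocab` is not stated here, so a quick check before constructing the model is a cheap safeguard. The helper below is a hypothetical, plain-Python sketch and is not part of `poisson_topicmodels`.

```python
def missing_seed_words(seeds, vocab):
    """Map each topic ID to the seed words that do not occur in `vocab`."""
    vocab_set = set(vocab)
    missing = {}
    for topic_id, words in seeds.items():
        absent = [w for w in words if w not in vocab_set]
        if absent:
            missing[topic_id] = absent
    return missing

# Toy vocabulary and seeds for illustration only
vocab_example = ['election', 'vote', 'market', 'economy', 'trade', 'weather']
seeds_example = {
    0: ['election', 'vote', 'candidate'],
    1: ['market', 'economy', 'trade'],
}
print(missing_seed_words(seeds_example, vocab_example))  # {0: ['candidate']}
```

An empty result means every seed word was found; otherwise the listed words are candidates for removal or for a different spelling that matches the preprocessing.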
Practical Example: Geographic Topic Analysis
# Documents with geographic metadata
# Analyze how topics differ by region
regions = np.array([0, 0, 1, 1, 2, 2, ...]) # Region IDs
# Convert to one-hot encoding
num_regions = 3
covariates = np.zeros((len(regions), num_regions))
for i, region in enumerate(regions):
    covariates[i, region] = 1
model = CSPF(
counts=counts,
vocab=vocab,
keywords={},
residual_topics=5,
X_design_matrix=covariates,
batch_size=32,
)
model.train_step(num_steps=200, lr=0.01)
# Analyze regional topic differences
effects = model.return_covariate_effects()
effects_ci = model.return_covariate_effects_ci(ci=0.90)
model.plot_cov_effects(ci=0.90)
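With one-hot region indicators, each row of the effects table belongs to one region, so comparing two regions amounts to subtracting their rows. Under the prior in "The Model", the difference is a log ratio of exp(γ + x_d * β) between the two regions. The sketch below uses an illustrative matrix and a hypothetical helper; with a fitted model, the matrix would come from `effects.values`.

```python
import numpy as np

def pairwise_contrast(effects, row_a, row_b):
    """Per-topic difference effects[row_a] - effects[row_b] on the log scale."""
    effects = np.asarray(effects, dtype=float)
    return effects[row_a] - effects[row_b]

# Illustrative effects (3 regions x 4 topics), not fitted values
effects_values = np.array([
    [0.1, -0.2, 0.0, 0.5],
    [0.4, -0.2, 0.3, 0.1],
    [-0.3, 0.6, 0.0, 0.5],
])

log_ratio = pairwise_contrast(effects_values, 1, 0)  # region 1 vs region 0
ratio = np.exp(log_ratio)  # multiplicative difference between the two regions
print(np.round(log_ratio, 2), np.round(ratio, 2))
```

These are point contrasts only. They carry no uncertainty, so they complement rather than replace the credible intervals.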
Tips for Covariate Modeling
Centering: Center continuous covariates (the line below also scales each column to unit variance)
covariates = (covariates - covariates.mean(axis=0)) / covariates.std(axis=0)
Scaling: Normalize to [0,1] or standardize
from sklearn.preprocessing import StandardScaler
covariates = StandardScaler().fit_transform(covariates)
Categorical: Convert to dummy variables
import pandas as pd
df = pd.DataFrame({'category': ['A', 'B', 'A', 'B', ...]})
covariates = pd.get_dummies(df, drop_first=False, dtype=float).values  # float 0/1 instead of booleans
Interpretation: Document what covariates represent
# Label your covariates
covariate_names = ['time_period', 'political_lean', 'media_type']
# Use when reporting effects
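The tips above can be combined into one small helper that standardizes continuous covariates, one-hot encodes categorical ones, and keeps a column name for every entry. This is a NumPy-only sketch with hypothetical function names and toy metadata, not part of the library.

```python
import numpy as np

def standardize(column):
    """Center to mean 0 and scale to unit variance; constant columns become zeros."""
    column = np.asarray(column, dtype=float)
    std = column.std()
    centered = column - column.mean()
    return centered / std if std > 0 else centered

def one_hot(labels):
    """Return (matrix, level_names) with one indicator column per distinct label."""
    levels, codes = np.unique(np.asarray(labels), return_inverse=True)
    return np.eye(len(levels))[codes], [str(level) for level in levels]

def build_design_matrix(continuous, categorical):
    """Combine named continuous and categorical covariates into (X, column_names)."""
    blocks, names = [], []
    for name, values in continuous.items():
        blocks.append(standardize(values).reshape(-1, 1))
        names.append(name)
    for name, labels in categorical.items():
        matrix, levels = one_hot(labels)
        blocks.append(matrix)
        names.extend(f"{name}={level}" for level in levels)
    return np.hstack(blocks), names

# Toy metadata for five documents (illustrative values)
X, names = build_design_matrix(
    continuous={'time_period': [2010, 2011, 2015, 2018, 2020]},
    categorical={'media_type': ['news', 'opinion', 'news', 'news', 'opinion']},
)
print(names)
print(X.shape)
```

Wrapping the result as `pd.DataFrame(X, columns=names)` before passing it as `X_design_matrix` keeps the labels attached to the matrix, as in the usage example earlier on this page.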
Common Patterns
Author Effects: How different authors use topics
author_ids = np.array([0, 1, 0, 2, 1, ...])
num_authors = int(author_ids.max()) + 1  # assumes IDs run from 0 to num_authors - 1
author_dummies = np.eye(num_authors)[author_ids]
model = CPF(counts, vocab, num_topics=10, X_design_matrix=author_dummies, batch_size=32)
Category Effects: How topics differ by category
categories = ['news', 'opinion', 'news', 'opinion', ...]
category_ids = [1 if c == 'opinion' else 0 for c in categories]
covariates = np.array(category_ids).reshape(-1, 1)
model = CPF(counts, vocab, num_topics=10, X_design_matrix=covariates, batch_size=32)
Temporal Effects: How topics evolve over time
timestamps = np.array([2010, 2011, 2015, ...])
covariates = ((timestamps - timestamps.min()) / (timestamps.max() - timestamps.min())).reshape(-1, 1)
model = CPF(counts, vocab, num_topics=10, X_design_matrix=covariates, batch_size=32)
Visualization Examples
Effect Heatmap:
import matplotlib.pyplot as plt
import seaborn as sns
effects = model.return_covariate_effects()
sns.heatmap(effects, cmap='RdBu_r', center=0, annot=True)
plt.title('Covariate Effects by Topic')
plt.ylabel('Covariate')
plt.xlabel('Topic')
Built-in Forest Plot (recommended):
# Forest plot with credible intervals — publication-ready
fig, axes = model.plot_cov_effects(ci=0.90)
Document-Topic by Category:
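categories_arr, e_theta = model.return_topics()
# Average topic intensities within each metadata category.
# category_ids is assumed to be your own per-document array of category IDs.
category_ids = np.asarray(category_ids)
for category_id in range(num_categories):
    mask = category_ids == category_id
    avg_topics = e_theta[mask].mean(axis=0)
    plt.plot(avg_topics, marker='o', label=f'Category {category_id}')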
plt.xlabel('Topic')
plt.ylabel('Average Intensity')
plt.legend()
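Raw intensities also reflect document length, so it can help to compare topic proportions instead. The sketch below row-normalizes a document-topic matrix and then averages within each group. The helper name and the toy values are illustrative, and it assumes every document has a positive total intensity.

```python
import numpy as np

def mean_topic_proportions(e_theta, group_ids):
    """Row-normalize intensities, then average the proportions within each group."""
    e_theta = np.asarray(e_theta, dtype=float)
    group_ids = np.asarray(group_ids)
    proportions = e_theta / e_theta.sum(axis=1, keepdims=True)
    return {
        group: proportions[group_ids == group].mean(axis=0)
        for group in np.unique(group_ids)
    }

# Illustrative intensities for four documents and three topics
e_theta_example = np.array([
    [2.0, 1.0, 1.0],
    [4.0, 2.0, 2.0],
    [1.0, 1.0, 2.0],
    [1.0, 3.0, 4.0],
])
group_ids_example = np.array([0, 0, 1, 1])

for group, avg in mean_topic_proportions(e_theta_example, group_ids_example).items():
    print(group, np.round(avg, 3))
```

In the toy data the first two documents differ only in overall scale, so their proportions coincide even though their raw intensities do not.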
Troubleshooting
Problem: Covariate effects are near zero
Solution:
- The covariates may genuinely have little or no effect on the topics
- Increase the number of training steps
- Check that the covariates actually vary (a constant column carries no information)
Problem: Training diverges or NaNs
Solution:
- Normalize covariates to a reasonable scale
- Reduce the learning rate
- Check that covariates don't contain extreme values
Problem: Over-reliance on covariates (ignores data)
Solution:
- Reduce covariate weights (if supported)
- Use fewer covariates
- Increase training data
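Several of the checks above can be run before training. The following sketch inspects a covariate matrix for a row-count mismatch, non-finite entries, constant columns, and large magnitudes. The function is hypothetical, and the threshold of 10 is an arbitrary illustrative choice rather than a library recommendation.

```python
import numpy as np

def check_design_matrix(X, num_documents, max_abs=10.0):
    """Return a list of human-readable warnings about a covariate matrix."""
    X = np.asarray(X, dtype=float)
    if X.ndim != 2:
        return [f"expected a 2-D matrix, got {X.ndim} dimension(s)"]
    problems = []
    if X.shape[0] != num_documents:
        problems.append(f"{X.shape[0]} rows but {num_documents} documents")
    if not np.isfinite(X).all():
        problems.append("contains NaN or infinite values")
    finite = np.where(np.isfinite(X), X, 0.0)  # ignore non-finite entries below
    constant = np.flatnonzero(finite.std(axis=0) == 0)
    if constant.size:
        problems.append(f"constant columns: {constant.tolist()}")
    if finite.size and np.abs(finite).max() > max_abs:
        problems.append(f"values exceed |{max_abs}|; consider rescaling")
    return problems

# Toy matrix: column 1 is constant and column 2 holds raw years
X_example = np.array([
    [0.1, 1.0, 2010.0],
    [-0.4, 1.0, 2015.0],
    [0.3, 1.0, 2020.0],
])
for problem in check_design_matrix(X_example, num_documents=3):
    print(problem)
```

An empty list means none of these particular issues were found; it does not guarantee that training will converge.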
Next Steps
Ideal Points Models (TBIP) - Estimate author positions with TBIP
Tutorials - Advanced modeling techniques
API Reference - Full API reference for CPF/CSPF