Covariate Models (CPF & CSPF)

Covariate-augmented models extend basic topic modeling to account for document-level metadata. Use them when you want to understand how topics vary across groups or conditions.

Covariate Poisson Factorization (CPF)

CPF models how external factors (covariates) influence topic distributions.

When to use:

✓ Document metadata available (author, date, category) ✓ Want to understand topic variation across groups ✓ Interested in covariate effects on topics

Example use cases:

  • Topic proportions across different authors

  • How topics evolve over time

  • Topic differences between datasets/corpora

  • Topic structure by document category

The Model

CPF extends PF by making topic distributions depend on covariates:

Standard PF:
θ_d ~ Gamma(α, α)  [independent from anything]

CPF:
θ_d ~ Gamma(exp(γ + x_d * β), α)  [depends on document covariates x_d]

Where: - x_d = covariate values for document d - β = covariate effects (regression coefficients) - γ = baseline (intercept)

Interpretation:

  • If β_k > 0: Higher covariate value → higher topic k intensity

  • If β_k < 0: Higher covariate value → lower topic k intensity

  • β_k ≈ 0: Covariate has little effect on topic k

Usage Example

from poisson_topicmodels import CPF
import numpy as np

import pandas as pd

# Document covariates (e.g., author type scores)
# Shape: (num_documents, num_covariates)
covariates = np.random.randn(100, 2)  # 2 covariates
X = pd.DataFrame(covariates, columns=['covariate_0', 'covariate_1'])

model = CPF(
    counts=counts,
    vocab=vocab,
    num_topics=10,
    X_design_matrix=X,
    batch_size=32,
)

params = model.train_step(num_steps=200, lr=0.01, random_seed=42)

# Extract covariate effects
effects = model.return_covariate_effects()  # DataFrame (covariates × topics)

Interpreting Covariate Effects

effects = model.return_covariate_effects()  # DataFrame
print(effects)

# Get credible intervals
effects_ci = model.return_covariate_effects_ci(ci=0.90)
print(effects_ci.head(10))
# Columns: covariate, topic, mean, lower, upper
# Intervals excluding zero suggest significant effects

Visualization (built-in forest plot):

# Forest plot of covariate effects with credible intervals
fig, axes = model.plot_cov_effects(ci=0.90)

# Or manually:
import matplotlib.pyplot as plt
plt.imshow(effects.values, cmap='RdBu_r', vmin=-3, vmax=3)
plt.xlabel('Topics')
plt.ylabel('Covariates')
plt.colorbar(label='Effect size')
plt.title('Covariate Effects on Topics')
plt.show()

Practical Example: Time Evolution

Analyze how topics change across time periods:

# Create time-based covariate
time_covariate = np.repeat(np.arange(10), 10)  # 10 decades, 10 docs each
covariates = time_covariate.reshape(-1, 1) / 10  # Normalize

model = CPF(counts, vocab, num_topics=5, X_design_matrix=covariates, batch_size=32)
model.train_step(num_steps=200, lr=0.01)

# Topic 0's time effect
effects = model.return_covariate_effects()
time_effect = effects[0, 0]
# If positive: topic 0 increases over time
# If negative: topic 0 decreases over time

Covariate Seeded PF (CSPF)

CSPF combines seeded guidance (from SPF) with covariate modeling (from CPF).

Use when:

✓ You have prior knowledge about topics (seeds) ✓ You have document metadata (covariates) ✓ You want guided discovery with metadata effects

Usage

from poisson_topicmodels import CSPF

# Seeds for guided discovery
seeds = {
    0: ['election', 'vote', 'candidate'],
    1: ['market', 'economy', 'trade'],
}

# Metadata effects
covariates = np.random.randn(100, 1)

model = CSPF(
    counts=counts,
    vocab=vocab,
    keywords=seeds,
    residual_topics=0,
    X_design_matrix=covariates,
    batch_size=32,
)

params = model.train_step(num_steps=200, lr=0.01, random_seed=42)

Practical Example: Geographic Topic Analysis

# Documents with geographic metadata
# Analyze how topics differ by region

regions = np.array([0, 0, 1, 1, 2, 2, ...])  # Region IDs

# Convert to one-hot encoding
num_regions = 3
covariates = np.zeros((len(regions), num_regions))
for i, region in enumerate(regions):
    covariates[i, region] = 1

model = CSPF(
    counts=counts,
    vocab=vocab,
    keywords={},
    residual_topics=5,
    X_design_matrix=covariates,
    batch_size=32,
)

model.train_step(num_steps=200, lr=0.01)

# Analyze regional topic differences
effects = model.return_covariate_effects()
effects_ci = model.return_covariate_effects_ci(ci=0.90)
model.plot_cov_effects(ci=0.90)

Tips for Covariate Modeling

Centering: Center continuous covariates

covariates = (covariates - covariates.mean(axis=0)) / covariates.std(axis=0)

Scaling: Normalize to [0,1] or standardize

from sklearn.preprocessing import StandardScaler
covariates = StandardScaler().fit_transform(covariates)

Categorical: Convert to dummy variables

import pandas as pd
df = pd.DataFrame({'category': ['A', 'B', 'A', 'B', ...]})
covariates = pd.get_dummies(df, drop_first=False).values

Interpretation: Document what covariates represent

# Label your covariates
covariate_names = ['time_period', 'political_lean', 'media_type']
# Use when reporting effects

Common Patterns

Author Effects: How different authors use topics

author_ids = np.array([0, 1, 0, 2, 1, ...])
author_dummies = np.eye(num_authors)[author_ids]

model = CPF(counts, vocab, num_topics=10, X_design_matrix=author_dummies, batch_size=32)

Category Effects: How topics differ by category

categories = ['news', 'opinion', 'news', 'opinion', ...]
category_ids = [1 if c == 'opinion' else 0 for c in categories]
covariates = np.array(category_ids).reshape(-1, 1)

model = CPF(counts, vocab, num_topics=10, X_design_matrix=covariates, batch_size=32)

Temporal Effects: How topics evolve over time

timestamps = np.array([2010, 2011, 2015, ...])
covariates = ((timestamps - timestamps.min()) / (timestamps.max() - timestamps.min())).reshape(-1, 1)

model = CPF(counts, vocab, num_topics=10, X_design_matrix=covariates, batch_size=32)

Visualization Examples

Effect Heatmap:

effects = model.return_covariate_effects()
import seaborn as sns
sns.heatmap(effects, cmap='RdBu_r', center=0, annot=True)
plt.title('Covariate Effects by Topic')
plt.ylabel('Covariate')
plt.xlabel('Topic')

Built-in Forest Plot (recommended):

# Forest plot with credible intervals — publication-ready
fig, axes = model.plot_cov_effects(ci=0.90)

Document-Topic by Category:

categories_arr, e_theta = model.return_topics()

# Average topics by category
for category_id in range(num_categories):
    mask = categories == category_id
    avg_topics = doc_topics[mask].mean(axis=0)
    plt.plot(avg_topics, marker='o', label=f'Category {category_id}')

plt.xlabel('Topic')
plt.ylabel('Average Intensity')
plt.legend()

Troubleshooting

Problem: Covariate effects are near zero

Solution: - Covariates may not influence topics meaningfully - Increase number of iterations - Check covariate variation (are they constant?) - Covariates might truly have no effect

Problem: Training diverges or NaNs

Solution: - Normalize covariates to reasonable scale - Reduce learning rate - Check covariates don’t have extreme values

Problem: Over-reliance on covariates (ignores data)

Solution: - Reduce covariate weights (if supported) - Use fewer covariates - Increase training data

Next Steps