Embedded Topic Models (ETM)
Embedded Topic Models (ETM) integrate pre-trained word embeddings (e.g., Word2Vec, FastText, GloVe) into topic modeling. This often produces more semantically coherent topics.
Why Use Embeddings?
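Traditional topic models such as Poisson factorization (PF) and LDA treat each word as an unrelated vocabulary entry, so they can learn that two words are related only from co-occurrence in your corpus. ETM uses embeddings to bring in semantic relationships learned elsewhere: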
Without embeddings (PF):
Words “car” and “automobile” are completely separate
No semantic similarity captured
The two words may end up in different topics
With embeddings (ETM):
“car” and “automobile” are similar in embedding space
The model encourages similar words to appear in the same topics
Topics are more coherent and interpretable
Benefits:
✓ More coherent topics
✓ Better handling of synonyms
✓ Leverages external semantic knowledge
✓ Often faster convergence
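To make "similar in embedding space" concrete, the sketch below computes cosine similarity between a few toy vectors. The three-dimensional vectors and their values are invented for illustration; they do not come from any real embedding model, and `cosine_similarity` is a helper defined here, not part of `poisson_topicmodels`.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy vectors, made up for illustration only
toy_embeddings = {
    "car": np.array([0.9, 0.1, 0.0]),
    "automobile": np.array([0.8, 0.2, 0.1]),
    "banana": np.array([0.0, 0.1, 0.9]),
}

sim_synonyms = cosine_similarity(toy_embeddings["car"], toy_embeddings["automobile"])
sim_unrelated = cosine_similarity(toy_embeddings["car"], toy_embeddings["banana"])
print(f"car vs automobile: {sim_synonyms:.2f}")
print(f"car vs banana:     {sim_unrelated:.2f}")
```

With real pre-trained vectors the same comparison is a quick way to see whether synonyms in your vocabulary actually sit close together.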
When to Use ETM
Use ETM when:
✓ You have pre-trained embeddings available
✓ You want highly coherent topics
✓ Semantically similar words should group together
✓ You have sufficient computing resources
Consider basic PF if:
✗ No suitable pre-trained embeddings are available for your corpus or domain
✗ You want complete control over topic formation
✗ Computational resources are limited
✗ Embedding artifacts would introduce bias
Model Overview
ETM extends PF by constraining topics in embedding space:
Key idea: Topics (distributions over words) are located in the word embedding space.
Mechanism:
Each word has an embedding vector (pre-trained)
Each topic is located at a point in embedding space
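Word probability in a topic depends on the similarity between the word embedding and the topic embedding
Formally:
P(word w | topic z) ∝ exp(ρ_w · α_z)
where ρ_w is the embedding of word w and α_z is the embedding of topic z. Normalizing over the vocabulary gives β = softmax(ρ @ α), which is the quantity return_beta() is described as using. Words whose embeddings align with a topic embedding receive higher probability in that topic, so words that are close in embedding space tend to share topics.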
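As a rough illustration of how embeddings turn into topic-word probabilities, the sketch below uses the softmax(ρ @ α) form mentioned alongside return_beta() in this guide. It is a minimal NumPy sketch, not the library's implementation: the function name, the array shapes, and the (num_topics, vocab_size) orientation of the result are assumptions made for the example.

```python
import numpy as np

def topic_word_distribution(rho, alpha):
    """Return beta with shape (num_topics, vocab_size); each row sums to 1.

    rho:   (vocab_size, embedding_dim) word embeddings
    alpha: (embedding_dim, num_topics) topic embeddings
    """
    logits = (rho @ alpha).T                              # (num_topics, vocab_size)
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    weights = np.exp(logits)
    return weights / weights.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
rho = rng.normal(size=(6, 4))     # 6 toy words, 4-dimensional embeddings
rho[1] = rho[0]                   # give word 1 the same embedding as word 0
alpha = rng.normal(size=(4, 3))   # 3 toy topics
beta = topic_word_distribution(rho, alpha)
print(beta.shape)
print(beta.sum(axis=1))
```

Because words 0 and 1 share an embedding in this toy setup, they receive identical probability in every topic, which is the mechanism that pulls synonyms into the same topics.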
Basic Usage
from poisson_topicmodels import ETM
import numpy as np
# Pre-trained embeddings with shape (vocab_size, embedding_dim), e.g. (500, 300).
# load_pretrained_embeddings is a placeholder; see "Loading Pre-trained Embeddings".
embeddings = load_pretrained_embeddings('glove')
model = ETM(
    counts=counts,
    vocab=vocab,
    embeddings_mapping=embeddings,
    num_topics=10,
    batch_size=32,
)
params = model.train_step(num_steps=200, lr=0.01, random_seed=42)
# Results use the same API as other models
model.summary()
top_words = model.return_top_words_per_topic(n=10)
beta = model.return_beta() # uses softmax(ρ @ α)
categories, e_theta = model.return_topics() # neural encoder inference
Loading Pre-trained Embeddings
From file (GloVe format):
import numpy as np
def load_glove_embeddings(filepath, vocab, embedding_dim=300):
    """Load GloVe embeddings for the given vocabulary."""
    vocab_set = set(vocab)  # constant-time membership checks
    embeddings = {}
    with open(filepath, 'r', encoding='utf-8') as f:
        for line in f:
            parts = line.split()
            word = parts[0]
            if word in vocab_set:
                embeddings[word] = np.array(parts[1:], dtype=np.float32)
    # Create a matrix whose rows follow the order of vocab
    embedding_matrix = np.zeros((len(vocab), embedding_dim), dtype=np.float32)
    for i, word in enumerate(vocab):
        if word in embeddings:
            embedding_matrix[i] = embeddings[word]
        else:
            # Random embedding for OOV words
            embedding_matrix[i] = np.random.randn(embedding_dim) * 0.1
    return embedding_matrix
embeddings = load_glove_embeddings('glove.6B.300d.txt', vocab)
From gensim:
import numpy as np
from gensim.models import Word2Vec
# Load a Word2Vec model previously saved with gensim's model.save()
w2v_model = Word2Vec.load('word2vec_model.bin')
# Extract embeddings for vocabulary
embedding_dim = w2v_model.vector_size
embeddings = np.zeros((len(vocab), embedding_dim), dtype=np.float32)
for i, word in enumerate(vocab):
    if word in w2v_model.wv:
        embeddings[i] = w2v_model.wv[word]
    else:
        # Random embedding for OOV words
        embeddings[i] = np.random.randn(embedding_dim) * 0.1
From fastText:
import numpy as np
import fasttext
# Models trained with subword information can produce a vector for any word
ft_model = fasttext.load_model('fasttext_model.bin')
embeddings = np.array([
    ft_model.get_word_vector(word) for word in vocab
]).astype(np.float32)
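The GloVe and gensim loaders above share one pattern: align vectors with the vocabulary order and fill missing words with small random vectors. A minimal sketch of that pattern as a single helper follows. The function `build_embedding_matrix`, its parameters, and the returned OOV mask are assumptions introduced for this example, not part of `poisson_topicmodels`; it uses a seeded generator so the random rows are reproducible.

```python
import numpy as np

def build_embedding_matrix(vocab, lookup, embedding_dim, seed=0, oov_scale=0.1):
    """Align vectors from `lookup` (a word -> vector mapping) with `vocab` order.

    Returns the matrix and a boolean mask marking out-of-vocabulary rows.
    """
    rng = np.random.default_rng(seed)   # seeded so OOV rows are reproducible
    matrix = np.empty((len(vocab), embedding_dim), dtype=np.float32)
    oov_mask = np.zeros(len(vocab), dtype=bool)
    for i, word in enumerate(vocab):
        if word in lookup:
            matrix[i] = lookup[word]
        else:
            # Small random vector for words without a pre-trained embedding
            matrix[i] = rng.normal(scale=oov_scale, size=embedding_dim)
            oov_mask[i] = True
    return matrix, oov_mask

# Toy data, made up for illustration
toy_lookup = {"car": np.ones(4), "road": np.zeros(4)}
toy_vocab = ["car", "road", "zzyzx"]
matrix, oov_mask = build_embedding_matrix(toy_vocab, toy_lookup, embedding_dim=4)
print(matrix.shape, oov_mask)
```

Keeping the mask around makes it easy to report how many rows were filled randomly before passing the matrix to ETM.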
Practical Example: News Classification
import numpy as np
from gensim.models import Word2Vec
from poisson_topicmodels import ETM
# Train Word2Vec on the news corpus (or load an existing model)
w2v = Word2Vec(sentences=tokenized_documents, vector_size=300, window=5)
# Create an embedding matrix whose rows follow the order of vocab
embeddings = np.array([
    w2v.wv[word] if word in w2v.wv else np.random.randn(300) * 0.1
    for word in vocab
], dtype=np.float32)
# Train ETM
model = ETM(
    counts=counts,
    vocab=vocab,
    embeddings_mapping=embeddings,
    num_topics=15,
    batch_size=64,
)
model.train_step(num_steps=200, lr=0.01)
# Inspect topics
model.summary()
top_words = model.return_top_words_per_topic(n=15)
for topic_id, words in top_words.items():
    print(f"Topic {topic_id}: {', '.join(words)}")
Comparing ETM vs Standard Models
Quality comparison:
from poisson_topicmodels import PF, ETM
# Train multiple models
pf_model = PF(counts, vocab, num_topics=10, batch_size=32)
pf_model.train_step(num_steps=200, lr=0.01)
etm_model = ETM(counts, vocab, embeddings_mapping=embeddings, num_topics=10, batch_size=32)
etm_model.train_step(num_steps=200, lr=0.01)
# Calculate coherence
pf_coherence = pf_model.compute_topic_coherence()
etm_coherence = etm_model.compute_topic_coherence()
print(f"PF Coherence: {pf_coherence['coherence'].mean():.3f}")
print(f"ETM Coherence: {etm_coherence['coherence'].mean():.3f}")
# ETM often scores higher, but this is not guaranteed on every dataset
Visual comparison:
# Show the top words of the first five topics side by side
import matplotlib.pyplot as plt
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
# Compute the top words once per model instead of once per topic
top_words_pf = pf_model.return_top_words_per_topic(n=5)
top_words_etm = etm_model.return_top_words_per_topic(n=5)
# PF topics
for topic_id in range(5):
    axes[0].text(0.1, 0.9 - topic_id * 0.15,
                 f"T{topic_id}: {', '.join(top_words_pf[topic_id])}")
axes[0].set_title('PF Topics')
# ETM topics
for topic_id in range(5):
    axes[1].text(0.1, 0.9 - topic_id * 0.15,
                 f"T{topic_id}: {', '.join(top_words_etm[topic_id])}")
axes[1].set_title('ETM Topics')
plt.tight_layout()
plt.show()
Advanced: Custom ETM Variants
Combine with seeding:
# ETM with keyword guidance
seeds = {
    0: ['climate', 'carbon', 'greenhouse'],
    1: ['economy', 'market', 'trade'],
}
# Hypothetical: valid only if ETM supports seeding (check documentation)
model = ETM(
    counts=counts,
    vocab=vocab,
    embeddings_mapping=embeddings,
    num_topics=10,
    seeds=seeds,
    seed_strength=10.0,
)
Combine with covariates:
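# ETM with covariate effects
num_docs = counts.shape[0]
covariates = np.random.randn(num_docs, 2)  # placeholder covariates
# Hypothetical: valid only if ETM supports covariates (check documentation)
model = ETM(
    counts=counts,
    vocab=vocab,
    embeddings_mapping=embeddings,
    num_topics=10,
    covariates=covariates,
)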
Embedding Quality Matters
Good embeddings:
Trained on large, relevant corpus
Capture domain-specific semantics
Adequate dimensionality (pre-trained sets commonly use 100 to 300 dimensions)
Word coverage matches your vocabulary
Issues with bad embeddings:
Random or poorly trained embeddings don’t help
May actually hurt performance
Out-of-vocabulary words hurt coverage
Mismatch between corpus domain and embedding domain
Best practices:
Use embeddings trained on similar corpus
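Check coverage (here embeddings is a word-to-vector mapping, such as the dict built while reading a GloVe file):
coverage = sum(word in embeddings for word in vocab) / len(vocab)
Verify quality: do related words have similar embeddings?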
Compare ETM vs PF on your data
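The coverage and quality checks in the best practices above can be sketched as two small helpers. Both functions and the toy data are assumptions made for this example and are not part of `poisson_topicmodels`; `nearest_neighbors` assumes no row of the matrix is all zeros, since it normalizes each row.

```python
import numpy as np

def embedding_coverage(vocab, lookup):
    """Fraction of vocabulary words that have a vector in `lookup`."""
    return sum(word in lookup for word in vocab) / len(vocab)

def nearest_neighbors(word, vocab, matrix, k=3):
    """Return the k vocabulary words with highest cosine similarity to `word`."""
    unit = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    sims = unit @ unit[vocab.index(word)]
    order = np.argsort(-sims)           # most similar first
    return [vocab[i] for i in order if vocab[i] != word][:k]

# Toy data, made up for illustration
toy_vocab = ["car", "automobile", "banana", "apple"]
toy_matrix = np.array([
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.1],
    [0.0, 0.1, 0.9],
    [0.1, 0.0, 0.8],
])
toy_lookup = {"car": toy_matrix[0], "banana": toy_matrix[2], "apple": toy_matrix[3]}

print(embedding_coverage(toy_vocab, toy_lookup))
print(nearest_neighbors("car", toy_vocab, toy_matrix, k=1))
```

If the neighbors of a handful of domain terms look unrelated, that is a sign the embeddings do not match your corpus.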
Troubleshooting ETM
Problem: ETM doesn’t improve over basic PF
Solution:
- Check embedding quality (is coverage good?)
- Try a different embedding model
- Ensure preprocessing matches the embedding vocabulary
- ETM might not help for all datasets
Problem: Training is slow
Solution:
- Embeddings add computational cost
- Reduce num_topics
- Increase batch_size
- Reduce vocabulary size
- Use GPU
Problem: Topics look worse than PF
Solution:
- Check embedding quality
- Check for domain mismatch (embeddings trained on a different kind of corpus)
- Try higher-dimensional embeddings if the dimensionality is too low
- Train for more iterations if training has not converged
Problem: Many OOV (out-of-vocabulary) words
Solution:
- Check that the embedding file covers the words in vocab
- Preprocess to match the embedding vocabulary
- Use subword embeddings (fastText) instead of word-level embeddings
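One way to match the vocabulary to the embeddings is to drop words that have no pre-trained vector before training. The sketch below assumes `counts` is a dense NumPy document-term matrix whose columns follow the order of `vocab`; the helper `restrict_to_covered` and the toy data are introduced here for illustration only.

```python
import numpy as np

def restrict_to_covered(counts, vocab, lookup):
    """Keep only vocabulary columns whose word has a pre-trained vector."""
    keep = np.array([word in lookup for word in vocab])
    kept_vocab = [word for word, flag in zip(vocab, keep) if flag]
    return counts[:, keep], kept_vocab

# Toy data, made up for illustration
toy_counts = np.array([[2, 0, 1],
                       [0, 3, 1]])
toy_vocab = ["car", "zzyzx", "road"]
toy_lookup = {"car": None, "road": None}   # only the keys matter here
small_counts, small_vocab = restrict_to_covered(toy_counts, toy_vocab, toy_lookup)
print(small_counts.shape, small_vocab)
```

Dropping columns also shrinks the vocabulary, which may help with training time, but documents that lose all their words may need to be removed as well.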
Evaluation
Metrics for ETM:
# Standard metrics still apply
beta = etm_model.return_beta()
categories, e_theta = etm_model.return_topics()
# Coherence
coherence_df = etm_model.compute_topic_coherence()
# Topic diversity
diversity = etm_model.compute_topic_diversity()
print(f"Topic diversity: {diversity:.3f}")
# Compare with baseline
pf_coherence = pf_model.compute_topic_coherence()
improvement = (coherence_df['coherence'].mean() - pf_coherence['coherence'].mean()) / abs(pf_coherence['coherence'].mean())
print(f"ETM improves coherence by {improvement:.1%}")
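For intuition about the diversity number, a common definition of topic diversity is the fraction of unique words among the top words of all topics. The sketch below implements that definition on a toy dictionary shaped like the output of return_top_words_per_topic. Whether compute_topic_diversity() uses exactly this definition is an assumption; check the API reference.

```python
def topic_diversity(top_words_per_topic):
    """Fraction of unique words among all top words across topics."""
    all_words = [w for words in top_words_per_topic.values() for w in words]
    return len(set(all_words)) / len(all_words)

# Toy top words, made up for illustration; "energy" appears in both topics
toy_top_words = {
    0: ["climate", "carbon", "energy"],
    1: ["market", "trade", "energy"],
}
print(f"{topic_diversity(toy_top_words):.3f}")
```

A value near 1 means topics share few top words, while a low value suggests redundant topics.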
Relationship to Other Models
ETM vs PF: Adds semantic constraints through embeddings
ETM + SPF = Guided ETM: Combine embedding quality with domain guidance
ETM + CPF: Use embeddings + metadata (if supported)
ETM + TBIP: Ideal points with better topic discovery
When to Stack Models:
Use basic PF first to understand topics
Add embeddings (ETM) if coherence is a concern
Add seeds (SPF) if you have domain knowledge
Add TBIP/CPF if you have additional structure
Next Steps
Tutorials - Advanced training techniques
How-To Guides - Practical recipes
API Reference - Complete ETM API documentation
Explore different embedding sources for your domain