Embedded Topic Models (ETM)

Embedded Topic Models (ETM) integrate pre-trained word embeddings (e.g., Word2Vec, FastText, GloVe) into topic modeling. This often produces more semantically coherent topics.

Why Use Embeddings?

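Traditional topic models such as Poisson factorization (PF) and LDA treat each vocabulary word as a separate symbol, with no notion that two words can mean similar things. ETM uses embeddings to exploit semantic relationships: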

Without embeddings (PF):

Words “car” and “automobile” are completely separate
No semantic similarity captured
The model may split the two words across different topics

With embeddings (ETM):

“car” and “automobile” are similar in embedding space
Model encourages co-occurrence in topics
Topics are more coherent and interpretable
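
To make "similar in embedding space" concrete, here is a minimal sketch using cosine similarity. The three-dimensional vectors are invented for illustration; real embeddings come from a trained model and have far more dimensions.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy vectors, invented for illustration (not real embeddings).
toy_embeddings = {
    "car": np.array([0.9, 0.1, 0.0]),
    "automobile": np.array([0.8, 0.2, 0.1]),
    "banana": np.array([0.0, 0.1, 0.9]),
}

sim_synonyms = cosine_similarity(toy_embeddings["car"], toy_embeddings["automobile"])
sim_unrelated = cosine_similarity(toy_embeddings["car"], toy_embeddings["banana"])
print(f"car vs automobile: {sim_synonyms:.2f}")
print(f"car vs banana:     {sim_unrelated:.2f}")
```

With these toy vectors the synonym pair scores close to 1 and the unrelated pair close to 0, which is the kind of signal an embedding-based topic model can use.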

Benefits:

✓ More coherent topics
✓ Better handling of synonyms
✓ Leverages external semantic knowledge
✓ Often faster convergence

When to Use ETM

Use ETM when:

✓ You have pre-trained embeddings available
✓ You want highly coherent topics
✓ Semantically similar words should group together
✓ You have sufficient computing resources

Consider basic PF if:

✗ No pre-trained (or corpus-specific) embeddings are available
✗ You want complete control over topic formation
✗ Computational resources are limited
✗ Embedding artifacts would introduce bias

Model Overview

ETM extends PF by constraining topics in embedding space:

Key idea: Topics (distributions over words) are located in the word embedding space.

Mechanism:

Each word has an embedding vector (pre-trained)
Each topic is located at a point in embedding space
Word probability in topic depends on distance/similarity

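Formally:

P(word w | topic z) ∝ exp(ρ_w · α_z)

Here ρ_w is the embedding of word w and α_z is the embedding of topic z. Normalizing over the vocabulary gives softmax(ρ @ α), the form that return_beta() uses. Words whose embeddings align with a topic's embedding receive higher probability in that topic, so words that lie close together in embedding space tend to share topics.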
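
One common way to turn embedding similarity into word probabilities is a softmax over inner products, which is what the softmax(ρ @ α) comment under Basic Usage refers to. The sketch below is a NumPy illustration, not the library's implementation. It assumes ρ has shape (vocab_size, embedding_dim) and α has shape (embedding_dim, num_topics); both shapes and the function name are assumptions made for this example.

```python
import numpy as np

def topic_word_distribution(rho, alpha):
    """Return beta with shape (num_topics, vocab_size); each row sums to 1.

    rho:   word embeddings, assumed shape (vocab_size, embedding_dim)
    alpha: topic embeddings, assumed shape (embedding_dim, num_topics)
    """
    logits = rho @ alpha                                   # (vocab_size, num_topics)
    logits = logits - logits.max(axis=0, keepdims=True)    # avoid overflow in exp
    weights = np.exp(logits)
    beta = weights / weights.sum(axis=0, keepdims=True)    # softmax over the vocabulary
    return beta.T

rng = np.random.default_rng(0)
rho = rng.normal(size=(6, 4))    # toy word embeddings: 6 words, 4 dimensions
alpha = rng.normal(size=(4, 2))  # toy topic embeddings: 2 topics
beta = topic_word_distribution(rho, alpha)
print(beta.shape)
print(beta.sum(axis=1))
```

Because the softmax runs over the vocabulary axis, each topic is a proper probability distribution over words, and the word with the largest inner product with a topic embedding is that topic's most probable word.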

Basic Usage

from poisson_topicmodels import ETM
import numpy as np

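# counts (document-term matrix) and vocab are assumed to be prepared already.
# Pre-trained embeddings: array of shape (vocab_size, embedding_dim), e.g. (500, 300).
# load_pretrained_embeddings is a placeholder for one of the loaders shown
# under "Loading Pre-trained Embeddings".
embeddings = load_pretrained_embeddings('glove')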

model = ETM(
    counts=counts,
    vocab=vocab,
    embeddings_mapping=embeddings,
    num_topics=10,
    batch_size=32,
)

params = model.train_step(num_steps=200, lr=0.01, random_seed=42)

# Results use the same API as other models
model.summary()
top_words = model.return_top_words_per_topic(n=10)
beta = model.return_beta()                # uses softmax(ρ @ α)
categories, e_theta = model.return_topics()  # neural encoder inference

Loading Pre-trained Embeddings

From file (GloVe format):

import numpy as np

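def load_glove_embeddings(filepath, vocab, embedding_dim=300, seed=0):
    """Load GloVe vectors for the words in vocab.

    Returns an array of shape (len(vocab), embedding_dim); row i belongs to vocab[i].
    """
    vocab_set = set(vocab)  # O(1) membership tests instead of scanning a list
    embeddings = {}
    with open(filepath, 'r', encoding='utf-8') as f:
        for line in f:
            parts = line.rstrip().split(' ')
            word = parts[0]
            if word in vocab_set:
                embeddings[word] = np.asarray(parts[1:], dtype=np.float32)

    # Build the matrix in vocab order
    rng = np.random.default_rng(seed)  # seeded so OOV vectors are reproducible
    embedding_matrix = np.zeros((len(vocab), embedding_dim), dtype=np.float32)
    for i, word in enumerate(vocab):
        if word in embeddings:
            embedding_matrix[i] = embeddings[word]
        else:
            # Small random vector for out-of-vocabulary (OOV) words
            embedding_matrix[i] = rng.normal(scale=0.1, size=embedding_dim)

    return embedding_matrix

# embedding_dim must match the file (300 for glove.6B.300d.txt)
embeddings = load_glove_embeddings('glove.6B.300d.txt', vocab)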

From gensim:

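import numpy as np
from gensim.models import Word2Vec

# Load a model previously saved with gensim's model.save().
# For vectors stored in the original word2vec binary format, use
# KeyedVectors.load_word2vec_format(path, binary=True) instead.
w2v_model = Word2Vec.load('word2vec_model.bin')

# Extract embeddings for vocabulary
embedding_dim = w2v_model.vector_size
embeddings = np.zeros((len(vocab), embedding_dim), dtype=np.float32)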

for i, word in enumerate(vocab):
    if word in w2v_model.wv:
        embeddings[i] = w2v_model.wv[word]
    else:
        embeddings[i] = np.random.randn(embedding_dim) * 0.1

From fastText:

import fasttext
import numpy as np

# fastText composes vectors from subword n-grams, so every word gets a vector
ft_model = fasttext.load_model('fasttext_model.bin')

embeddings = np.array([
    ft_model.get_word_vector(word) for word in vocab
]).astype(np.float32)
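
Whichever loader you use, it can help to confirm that the matrix lines up with the vocabulary before training. The helper below is a hypothetical sketch, not part of poisson_topicmodels. It assumes the model expects one row per vocabulary word, and it casts to float32 only to match the loaders above.

```python
import numpy as np

def check_embedding_matrix(embedding_matrix, vocab):
    """Validate an embedding matrix against vocab and return it as float32.

    Hypothetical helper for illustration; raises ValueError on any mismatch.
    """
    embedding_matrix = np.asarray(embedding_matrix)
    if embedding_matrix.ndim != 2:
        raise ValueError("expected a 2-D array of shape (vocab_size, embedding_dim)")
    if embedding_matrix.shape[0] != len(vocab):
        raise ValueError(
            f"{embedding_matrix.shape[0]} rows but {len(vocab)} vocabulary words"
        )
    if not np.isfinite(embedding_matrix).all():
        raise ValueError("matrix contains NaN or infinite values")
    return embedding_matrix.astype(np.float32)

# Toy usage: 3 words, 4 dimensions
toy_vocab = ["car", "automobile", "banana"]
toy_matrix = check_embedding_matrix(np.ones((3, 4)), toy_vocab)
print(toy_matrix.shape, toy_matrix.dtype)
```

A row-count mismatch usually means the vocabulary was filtered after the embeddings were built, so the check is cheap insurance against silently misaligned rows.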

Practical Example: News Classification

import numpy as np
from poisson_topicmodels import ETM
from gensim.models import Word2Vec

# Train Word2Vec on the news corpus (tokenized_documents: list of token lists)
w2v = Word2Vec(sentences=tokenized_documents, vector_size=300, window=5)

# Create embedding matrix; words Word2Vec dropped get a small random vector
embeddings = np.array([
    w2v.wv[word] if word in w2v.wv else np.random.randn(300) * 0.1
    for word in vocab
], dtype=np.float32)

# Train ETM
model = ETM(
    counts=counts,
    vocab=vocab,
    embeddings_mapping=embeddings,
    num_topics=15,
    batch_size=64,
)

model.train_step(num_steps=200, lr=0.01)

# Inspect topics
model.summary()
top_words = model.return_top_words_per_topic(n=15)
for topic_id, words in top_words.items():
    print(f"Topic {topic_id}: {', '.join(words)}")

Comparing ETM vs Standard Models

Quality comparison:

from poisson_topicmodels import PF, ETM

# Train both models with the same settings
pf_model = PF(counts, vocab, num_topics=10, batch_size=32)
pf_model.train_step(num_steps=200, lr=0.01)

etm_model = ETM(counts, vocab, embeddings_mapping=embeddings, num_topics=10, batch_size=32)
etm_model.train_step(num_steps=200, lr=0.01)

# Calculate coherence
pf_coherence = pf_model.compute_topic_coherence()
etm_coherence = etm_model.compute_topic_coherence()

print(f"PF Coherence: {pf_coherence['coherence'].mean():.3f}")
print(f"ETM Coherence: {etm_coherence['coherence'].mean():.3f}")
# ETM often scores higher, but verify this on your own data

Visual comparison:

# Show the top words of the first five topics side by side
import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 2, figsize=(12, 5))
panels = [(pf_model, 'PF Topics'), (etm_model, 'ETM Topics')]

for ax, (fitted_model, title) in zip(axes, panels):
    # Fetch the top words once per model instead of once per topic
    top_words = fitted_model.return_top_words_per_topic(n=5)
    for topic_id in range(5):
        ax.text(0.1, 0.9 - topic_id * 0.15,
                f"T{topic_id}: {', '.join(top_words[topic_id])}")
    ax.set_title(title)
    ax.axis('off')  # text-only panels need no ticks or frame

plt.tight_layout()
plt.show()

Advanced: Custom ETM Variants

Combine with seeding:

# ETM with keyword guidance
seeds = {
    0: ['climate', 'carbon', 'greenhouse'],
    1: ['economy', 'market', 'trade'],
}

# Hypothetical: seeds and seed_strength are shown for illustration only.
# Check the API reference to confirm that ETM accepts them.
model = ETM(
    counts=counts,
    vocab=vocab,
    embeddings_mapping=embeddings,
    num_topics=10,
    seeds=seeds,
    seed_strength=10.0
)

Combine with covariates:

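# ETM with covariate effects (random placeholder covariates, one row per document)
covariates = np.random.randn(counts.shape[0], 2)

# Hypothetical: check the API reference to confirm that ETM accepts covariates
model = ETM(
    counts=counts,
    vocab=vocab,
    embeddings_mapping=embeddings,
    num_topics=10,
    covariates=covariates
)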

Embedding Quality Matters

Good embeddings:

Trained on large, relevant corpus
Capture domain-specific semantics
Adequate dimensionality (commonly 100 to 300)
Word coverage matches your vocabulary

Issues with bad embeddings:

Random or poorly trained embeddings don’t help
May actually hurt performance
Out-of-vocabulary words hurt coverage
Mismatch between corpus domain and embedding domain

Best practices:

Use embeddings trained on similar corpus
Check coverage: coverage = sum(word in pretrained for word in vocab) / len(vocab), where pretrained is the word-to-vector lookup (not the final matrix)
Verify quality: do related words have similar embeddings?
Compare ETM vs PF on your data
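
The coverage and quality checks above can be sketched in a few lines of NumPy. The function names, the known_words argument (the set of words that have a pre-trained vector), and the toy vectors are all assumptions made for this illustration; vocab is assumed to be a list.

```python
import numpy as np

def embedding_coverage(vocab, known_words):
    """Return (share of vocab with a pre-trained vector, list of missing words)."""
    known = set(known_words)
    missing = [word for word in vocab if word not in known]
    return 1.0 - len(missing) / len(vocab), missing

def nearest_neighbors(word, vocab, embedding_matrix, k=3):
    """Return the k vocab words with the highest cosine similarity to word."""
    idx = vocab.index(word)
    norms = np.linalg.norm(embedding_matrix, axis=1, keepdims=True)
    unit = embedding_matrix / np.maximum(norms, 1e-12)  # guard against zero rows
    sims = unit @ unit[idx]
    order = np.argsort(-sims)                            # most similar first
    return [vocab[i] for i in order if i != idx][:k]

# Toy data, invented for illustration
toy_vocab = ["car", "automobile", "banana", "apple"]
toy_matrix = np.array([
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.1],
    [0.0, 0.1, 0.9],
    [0.1, 0.0, 0.8],
])

coverage, missing = embedding_coverage(toy_vocab, {"car", "automobile", "banana"})
print(f"coverage: {coverage:.0%}, missing: {missing}")
print(nearest_neighbors("car", toy_vocab, toy_matrix, k=1))
```

If the neighbors of a few domain-specific words look unrelated, that is a hint the embeddings come from a mismatched domain.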

Troubleshooting ETM

Problem: ETM doesn’t improve over basic PF

Solution:

Check embedding quality (is coverage good?)
Try a different embedding model
Ensure preprocessing matches the embedding vocabulary
ETM might not help for all datasets

Problem: Training is slow

Solution: Embeddings add computational cost. To offset it:

Reduce num_topics
Increase batch_size
Reduce vocabulary size
Use a GPU
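
Reducing the vocabulary usually means dropping very rare and very common words. The sketch below filters by document frequency. It assumes counts is a dense NumPy array of shape (num_docs, vocab_size); the function name and thresholds are hypothetical, and a sparse matrix would need an equivalent column filter.

```python
import numpy as np

def prune_vocabulary(counts, vocab, min_doc_freq=2, max_doc_share=0.9):
    """Keep words that appear in at least min_doc_freq documents and in at
    most max_doc_share of all documents. Returns (counts, vocab, keep_mask)."""
    counts = np.asarray(counts)
    doc_freq = (counts > 0).sum(axis=0)  # number of documents containing each word
    keep = (doc_freq >= min_doc_freq) & (doc_freq <= max_doc_share * counts.shape[0])
    kept_vocab = [word for word, flag in zip(vocab, keep) if flag]
    return counts[:, keep], kept_vocab, keep

# Toy data: 4 documents, 4 words
toy_counts = np.array([
    [1, 0, 2, 1],
    [0, 0, 1, 1],
    [3, 1, 1, 0],
    [0, 0, 4, 2],
])
toy_vocab = ["car", "rare", "the", "road"]

pruned_counts, pruned_vocab, keep = prune_vocabulary(toy_counts, toy_vocab)
print(pruned_vocab, pruned_counts.shape)
```

Apply the same mask to the embedding matrix (embeddings[keep]) so its rows stay aligned with the pruned vocabulary, and check for documents left empty after pruning.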

Problem: Topics look worse than PF

Possible causes:

Poor embedding quality
Domain mismatch (embeddings trained on a different kind of corpus)
Embedding dimensionality too low (try higher-dimensional embeddings)
Training has not converged (run more iterations)

Problem: Many OOV (out-of-vocabulary) words

Solution:

Check that the embedding file covers the words in vocab
Preprocess to match the embedding vocabulary
Use subword embeddings (fastText) instead of word-level embeddings

Evaluation

Metrics for ETM:

# Standard metrics still apply
beta = etm_model.return_beta()
categories, e_theta = etm_model.return_topics()

# Coherence
coherence_df = etm_model.compute_topic_coherence()

# Topic diversity
diversity = etm_model.compute_topic_diversity()
print(f"Topic diversity: {diversity:.3f}")

# Compare with baseline
pf_coherence = pf_model.compute_topic_coherence()
etm_mean = coherence_df['coherence'].mean()
pf_mean = pf_coherence['coherence'].mean()
improvement = (etm_mean - pf_mean) / abs(pf_mean)  # negative if ETM scores lower
print(f"Relative coherence change vs PF: {improvement:+.1%}")
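
For intuition about the diversity score, one widely used definition is the share of unique words among the top-n words of all topics. The sketch below implements that definition in NumPy. It is not necessarily how compute_topic_diversity() is implemented, and it assumes beta is an array of shape (num_topics, vocab_size).

```python
import numpy as np

def topic_diversity(beta, top_n=25):
    """Share of unique words among the top_n words of every topic.

    1.0 means no topic shares a top word with another; values near
    1 / num_topics mean the topics are nearly identical.
    """
    beta = np.asarray(beta)
    top_n = min(top_n, beta.shape[1])
    top_ids = np.argsort(-beta, axis=1)[:, :top_n]  # highest-weight words per topic
    return len(np.unique(top_ids)) / top_ids.size

# Toy topic-word matrices: 2 topics, 4 words
distinct_topics = np.array([[0.5, 0.3, 0.1, 0.1],
                            [0.1, 0.1, 0.3, 0.5]])
overlapping_topics = np.array([[0.5, 0.3, 0.1, 0.1],
                               [0.4, 0.4, 0.1, 0.1]])

diversity_distinct = topic_diversity(distinct_topics, top_n=2)
diversity_overlap = topic_diversity(overlapping_topics, top_n=2)
print(diversity_distinct, diversity_overlap)
```

Diversity and coherence pull in different directions, so it is worth reporting both: a model can reach high coherence by repeating the same frequent words in every topic.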

Relationship to Other Models

ETM vs PF: Adds semantic constraints through embeddings

ETM + SPF = Guided ETM: Combine embedding quality with domain guidance

ETM + CPF: Use embeddings + metadata (if supported)

ETM + TBIP: Ideal points with better topic discovery

When to Stack Models:

Use basic PF first to understand topics
Add embeddings (ETM) if coherence is a concern
Add seeds (SPF) if you have domain knowledge
Add TBIP/CPF if you have additional structure

Next Steps

Tutorials - Advanced training techniques
How-To Guides - Practical recipes
API Reference - Complete ETM API documentation
Explore different embedding sources for your domain