.. _embedded_models:

================================================================================
Embedded Topic Models (ETM)
================================================================================

**Embedded Topic Models (ETM)** integrate pre-trained word embeddings (e.g., Word2Vec, FastText, GloVe) into topic modeling. This often produces more semantically coherent topics.

Why Use Embeddings?
===================

Traditional topic models (PF, LDA) treat words as unrelated symbols. ETM uses embeddings to exploit semantic relationships between words:

**Without embeddings (PF)**:

- Words "car" and "automobile" are completely separate
- No semantic similarity is captured
- The two words may end up in different topics

**With embeddings (ETM)**:

- "car" and "automobile" are similar in embedding space
- The model encourages them to appear in the same topics
- Topics are more coherent and interpretable

Benefits:

- ✓ More coherent topics
- ✓ Better handling of synonyms
- ✓ Leverages external semantic knowledge
- ✓ Often faster convergence

When to Use ETM
===============

Use ETM when:

- ✓ You have pre-trained embeddings available
- ✓ You want highly coherent topics
- ✓ Semantically similar words should group together
- ✓ You have sufficient computing resources

Consider basic PF if:

- ✗ No pre-trained or corpus-specific embeddings are available
- ✗ You want complete control over topic formation
- ✗ Computational resources are limited
- ✗ Embedding artifacts would introduce bias

Model Overview
==============

ETM extends PF by constraining topics in embedding space:

**Key idea**: Topics (distributions over words) are located in the word embedding space.

**Mechanism**:

1. Each word has a pre-trained embedding vector
2. Each topic is represented by a point (a topic embedding) in the same space
3. The probability of a word in a topic depends on the similarity between the word embedding and the topic embedding

**Formally**:

.. code-block:: text

    P(word w | topic z) ∝ exp(ρ_w · α_z)

Here ``ρ_w`` is the embedding of word ``w`` and ``α_z`` is the embedding of topic ``z``. Normalizing over the vocabulary gives the ``softmax(ρ @ α)`` form returned by ``return_beta()``. Words whose embeddings are similar to a topic's embedding receive higher probability in that topic, so words that are close in embedding space tend to share topics.

Basic Usage
===========

.. code-block:: python

    from poisson_topicmodels import ETM
    import numpy as np

    # counts and vocab are assumed to be prepared as for the other models.
    # load_pretrained_embeddings is a placeholder; see the loaders below.
    # Pre-trained embeddings: (vocab_size, embedding_dim)
    embeddings = load_pretrained_embeddings('glove')  # Shape: (500, 300)

    model = ETM(
        counts=counts,
        vocab=vocab,
        embeddings_mapping=embeddings,
        num_topics=10,
        batch_size=32,
    )

    params = model.train_step(num_steps=200, lr=0.01, random_seed=42)

    # Results use the same API as other models
    model.summary()
    top_words = model.return_top_words_per_topic(n=10)
    beta = model.return_beta()  # uses softmax(ρ @ α)
    categories, e_theta = model.return_topics()  # neural encoder inference

Loading Pre-trained Embeddings
==============================

**From file (GloVe format)**:

.. code-block:: python

    import numpy as np

    def load_glove_embeddings(filepath, vocab, embedding_dim=300):
        """Load GloVe embeddings for given vocabulary."""
        vocab_set = set(vocab)  # constant-time membership checks
        embeddings = {}
        with open(filepath, 'r', encoding='utf-8') as f:
            for line in f:
                parts = line.split()
                word = parts[0]
                if word in vocab_set:
                    embeddings[word] = np.array(parts[1:], dtype=np.float32)

        # Create matrix for vocab, one row per word in vocab order
        embedding_matrix = np.zeros((len(vocab), embedding_dim), dtype=np.float32)
        for i, word in enumerate(vocab):
            if word in embeddings:
                embedding_matrix[i] = embeddings[word]
            else:
                # Random embedding for OOV words
                embedding_matrix[i] = np.random.randn(embedding_dim) * 0.1

        return embedding_matrix

    embeddings = load_glove_embeddings('glove.6B.300d.txt', vocab)

**From gensim**:

.. code-block:: python

    import numpy as np
    from gensim.models import Word2Vec

    # Load a model saved with gensim's own save(); for files in the original
    # word2vec binary format, use KeyedVectors.load_word2vec_format(path, binary=True)
    w2v_model = Word2Vec.load('word2vec_model.bin')

    # Extract embeddings for vocabulary
    embedding_dim = w2v_model.vector_size
    embeddings = np.zeros((len(vocab), embedding_dim), dtype=np.float32)

    for i, word in enumerate(vocab):
        if word in w2v_model.wv:
            embeddings[i] = w2v_model.wv[word]
        else:
            # Random embedding for OOV words
            embeddings[i] = np.random.randn(embedding_dim) * 0.1

**From fastText**:

.. code-block:: python

    import numpy as np
    import fasttext  # official Python bindings (pip install fasttext)

    model = fasttext.load_model('fasttext_model.bin')

    # Subword information lets fastText return a vector even for unseen words
    embeddings = np.array([
        model.get_word_vector(word) for word in vocab
    ]).astype(np.float32)

Practical Example: News Classification
======================================

.. code-block:: python

    import numpy as np
    from poisson_topicmodels import ETM
    from gensim.models import Word2Vec

    # Train or load Word2Vec on news corpus
    w2v = Word2Vec(sentences=tokenized_documents, vector_size=300, window=5)

    # Create embedding matrix (random vectors for words Word2Vec dropped)
    embeddings = np.array([
        w2v.wv[word] if word in w2v.wv else np.random.randn(300) * 0.1
        for word in vocab
    ], dtype=np.float32)

    # Train ETM
    model = ETM(
        counts=counts,
        vocab=vocab,
        embeddings_mapping=embeddings,
        num_topics=15,
        batch_size=64,
    )

    model.train_step(num_steps=200, lr=0.01)

    # Inspect topics
    model.summary()
    top_words = model.return_top_words_per_topic(n=15)
    for topic_id, words in top_words.items():
        print(f"Topic {topic_id}: {', '.join(words)}")

Comparing ETM vs Standard Models
=================================

**Quality comparison**:

.. code-block:: python

    from poisson_topicmodels import ETM, PF

    # Train both models with matching settings
    pf_model = PF(counts, vocab, num_topics=10, batch_size=32)
    pf_model.train_step(num_steps=200, lr=0.01)

    etm_model = ETM(counts, vocab, embeddings_mapping=embeddings, num_topics=10, batch_size=32)
    etm_model.train_step(num_steps=200, lr=0.01)

    # Calculate coherence
    pf_coherence = pf_model.compute_topic_coherence()
    etm_coherence = etm_model.compute_topic_coherence()

    print(f"PF Coherence: {pf_coherence['coherence'].mean():.3f}")
    print(f"ETM Coherence: {etm_coherence['coherence'].mean():.3f}")
    # ETM often scores higher, but verify this on your own data

**Visual comparison**:

.. code-block:: python

    # Show the top words of the first five topics side by side
    import matplotlib.pyplot as plt

    fig, axes = plt.subplots(1, 2, figsize=(12, 5))

    # Fetch top words once per model instead of once per topic
    top_words_pf = pf_model.return_top_words_per_topic(n=5)
    top_words_etm = etm_model.return_top_words_per_topic(n=5)

    # PF topics
    for topic_id in range(5):
        axes[0].text(0.1, 0.9 - topic_id * 0.15,
                     f"T{topic_id}: {', '.join(top_words_pf[topic_id])}")
    axes[0].set_title('PF Topics')
    axes[0].axis('off')

    # ETM topics
    for topic_id in range(5):
        axes[1].text(0.1, 0.9 - topic_id * 0.15,
                     f"T{topic_id}: {', '.join(top_words_etm[topic_id])}")
    axes[1].set_title('ETM Topics')
    axes[1].axis('off')

    plt.tight_layout()
    plt.show()

Advanced: Custom ETM Variants
=============================

**Combine with seeding**:

.. code-block:: python

    # ETM with keyword guidance
    seeds = {
        0: ['climate', 'carbon', 'greenhouse'],
        1: ['economy', 'market', 'trade'],
    }

    # Only valid if ETM supports seeding (check the API documentation)
    model = ETM(
        counts=counts,
        vocab=vocab,
        embeddings_mapping=embeddings,
        num_topics=10,
        batch_size=32,
        seeds=seeds,
        seed_strength=10.0
    )

**Combine with covariates**:

.. code-block:: python

    import numpy as np

    # ETM with covariate effects (random placeholder covariates, one row per document)
    num_docs = counts.shape[0]
    covariates = np.random.randn(num_docs, 2)

    # Only valid if ETM supports covariates (check the API documentation)
    model = ETM(
        counts=counts,
        vocab=vocab,
        embeddings_mapping=embeddings,
        num_topics=10,
        batch_size=32,
        covariates=covariates
    )

Embedding Quality Matters
==========================

**Good embeddings**:

- Trained on a large, relevant corpus
- Capture domain-specific semantics
- Adequate dimensionality (300 dimensions is a common choice)
- Word coverage matches your vocabulary

**Issues with bad embeddings**:

- Random or poorly trained embeddings don't help
- They may actually hurt performance
- Out-of-vocabulary words reduce coverage
- The corpus domain and the embedding domain may not match

**Best practices**:

1. Use embeddings trained on a similar corpus
2. Check coverage: ``coverage = sum(word in pretrained for word in vocab) / len(vocab)``, where ``pretrained`` is the word-to-vector lookup (not the final embedding matrix)
3. Verify quality: do related words have similar embeddings?
4. Compare ETM vs PF on your data

Troubleshooting ETM
===================

**Problem**: ETM doesn't improve over basic PF

*Solution*:

- Check embedding quality (is coverage good?)
- Try a different embedding model
- Ensure preprocessing matches the embedding vocabulary
- Accept that ETM does not help on every dataset

**Problem**: Training is slow

*Solution* (embeddings add computational cost):

- Reduce ``num_topics``
- Increase ``batch_size``
- Reduce vocabulary size
- Use a GPU

**Problem**: Topics look worse than PF

*Solution*:

- Check embedding quality
- Check for domain mismatch (embeddings trained on a different kind of corpus)
- Try higher-dimensional embeddings if the dimensionality is too low
- Train for more iterations if training has not converged

**Problem**: Many OOV (out-of-vocabulary) words

*Solution*:

- Check that the embedding file covers the words in ``vocab``
- Preprocess to match the embedding vocabulary
- Use subword embeddings (fastText) instead of word-level embeddings

Evaluation
==========

Metrics for ETM:

.. code-block:: python

    # Standard metrics still apply
    beta = etm_model.return_beta()
    categories, e_theta = etm_model.return_topics()

    # Coherence
    coherence_df = etm_model.compute_topic_coherence()

    # Topic diversity
    diversity = etm_model.compute_topic_diversity()
    print(f"Topic diversity: {diversity:.3f}")

    # Compare with baseline (relative change in mean coherence)
    pf_coherence = pf_model.compute_topic_coherence()
    pf_mean = pf_coherence['coherence'].mean()
    improvement = (coherence_df['coherence'].mean() - pf_mean) / abs(pf_mean)
    print(f"ETM changes coherence by {improvement:+.1%} relative to PF")

Relationship to Other Models
=============================

- **ETM vs PF**: Adds semantic constraints through embeddings
- **ETM + SPF = Guided ETM**: Combine embedding quality with domain guidance
- **ETM + CPF**: Use embeddings + metadata (if supported)
- **ETM + TBIP**: Ideal points with better topic discovery

When to Stack Models:

- Use basic PF first to understand topics
- Add embeddings (ETM) if coherence is a concern
- Add seeds (SPF) if you have domain knowledge
- Add TBIP/CPF if you have additional structure

Next Steps
==========

- :doc:`../tutorials/index` - Advanced training techniques
- :doc:`../how_to_guides/index` - Practical recipes
- :doc:`../api/index` - Complete ETM API documentation
- Explore different embedding sources for your domain
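
When exploring embedding sources, the coverage check recommended under the best practices above can be sketched as a small helper. This is a minimal illustration, not part of ``poisson_topicmodels``: the function name ``embedding_coverage`` and the ``pretrained`` word-to-vector dictionary are hypothetical, and the toy vectors stand in for real embeddings.

.. code-block:: python

    import numpy as np

    def embedding_coverage(vocab, pretrained):
        """Return the fraction of vocab found in pretrained, plus the missing words.

        ``pretrained`` is assumed to be any word -> vector lookup that supports
        ``in`` (a dict here; gensim KeyedVectors would also work).
        """
        missing = [word for word in vocab if word not in pretrained]
        covered = 1.0 - len(missing) / len(vocab)
        return covered, missing

    # Toy data: three of the four vocabulary words have a vector
    vocab = ["car", "automobile", "carbon", "zzzq"]
    pretrained = {
        "car": np.ones(3),
        "automobile": np.ones(3),
        "carbon": np.zeros(3),
    }

    coverage, missing = embedding_coverage(vocab, pretrained)
    print(f"Coverage: {coverage:.0%}; missing: {missing}")

Running the check before training shows how many rows of the embedding matrix would be filled with random vectors, which is one quick way to compare candidate embedding sources for a given vocabulary.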