.. _poisson_factorization: ================================================================================ Poisson Factorization (PF) ================================================================================ The **Poisson Factorization (PF)** model is the foundational, unsupervised topic model in poisson-topicmodels. It automatically discovers topics without any prior guidance. When to Use PF ============== Use Poisson Factorization when: ✓ You want to discover topics without prior knowledge ✓ You have document-term matrices as input ✓ You need interpretable word-topic associations ✓ You want a fast baseline model ✓ You're exploring new document collections Consider other models if: ✗ You have prior knowledge about expected topics (→ use SPF) ✗ You have document-level metadata (→ use CPF/CSPF) ✗ You need to estimate author positions (→ use TBIP/STBS) ✗ You want to leverage pre-trained embeddings (→ use ETM) The Model ========= **Generative Process**: For a corpus with D documents and V vocabulary terms: 1. For each document d: - Draw document-topic intensity: $\theta_d \sim \text{Gamma}(\alpha, \alpha)^K$ - For each word position in document: - Draw topic: $z_n \sim \text{Discrete}(\text{softmax}(\theta_d))$ - Draw word: $w_n \sim \text{Discrete}(\beta_{z_n})$ 2. For each topic k: - Draw topic-word distribution: $\beta_k \sim \text{Dirichlet}(\eta)$ **Key Properties**: - **Unsupervised**: No labels or guidance needed - **Flexible**: Works with any document collection - **Interpretable**: Topics are directly interpretable as word distributions - **Scalable**: Mini-batch SVI enables large-scale inference Example: Basic Usage ==================== .. code-block:: python import numpy as np from scipy.sparse import csr_matrix from poisson_topicmodels import PF # Prepare data counts = csr_matrix(np.random.poisson(2, (100, 500)).astype(np.float32)) vocab = np.array([f'word_{i}' for i in range(500)]) # Create model model = PF( counts=counts, vocab=vocab, num_topics=10, batch_size=32, random_seed=42 ) # Train params = model.train_step( num_steps=200, lr=0.01 ) # Extract results categories, e_theta = model.return_topics() # dominant topic + proportions beta = model.return_beta() # word-topic DataFrame top_words = model.return_top_words_per_topic(n=10) # top words per topic Interpreting Results ==================== **Word-Topic Matrix (β)**: Access via ``model.return_beta()`` — a ``pd.DataFrame`` (vocab_size × num_topics) Each column k represents topic k: .. code-block:: python beta = model.return_beta() topic_5 = beta.iloc[:, 5] # Topic 5 weights over vocabulary Interpretation: - Higher weight means the word is more associated with the topic - Top words with highest weights characterize the topic **Document Topics (θ)**: Access via ``model.return_topics()`` — returns ``(categories, E_theta)`` .. code-block:: python categories, e_theta = model.return_topics() # categories: dominant topic per document # e_theta: full document-topic matrix (num_docs × num_topics) doc_3_topics = e_theta[3, :] # Topic mixture for document 3 dominant = categories[3] # Dominant topic for document 3 Interpretation: - ``categories`` gives the argmax topic for each document - ``e_theta[d, k]`` is the intensity of topic k in document d **Top Words**: Access via ``model.return_top_words_per_topic(n=10)`` .. code-block:: python top_words = model.return_top_words_per_topic(n=10) # dict: {topic_id: [word1, word2, ...]} print(top_words[2]) # ['research', 'data', 'experiment', 'analysis', ...] Human Interpretation: - Read top ~10-20 words for each topic - Does the topic make sense thematically? - Can you give it a meaningful label? Example Interpretation Workflow =============================== .. code-block:: python model = PF(counts, vocab, num_topics=5, batch_size=32) model.train_step(num_steps=200, lr=0.01) # Examine each topic top_words = model.return_top_words_per_topic(n=15) for topic_id, words in top_words.items(): print(f"\n=== Topic {topic_id} ===") print(f"Top words: {', '.join(words)}") # Find documents dominated by this topic categories, e_theta = model.return_topics() top_docs = np.argsort(e_theta[:, topic_id])[-3:] print(f"Top documents: {top_docs}") Hyperparameter Selection ======================== **Number of Topics (K)** Start with 10-20 topics. Adjust based on: - **Coherence**: Do top words form meaningful themes? - **Interpretability**: Can you label each topic? - **Downstream task**: Does it improve your application? .. code-block:: python # Try different numbers of topics for k in [5, 10, 20, 50]: model = PF(counts, vocab, num_topics=k, batch_size=32) model.train_step(num_steps=200, lr=0.01) # Evaluate quality (e.g., via coherence or manual inspection) **Learning Rate (lr)** Controls optimization step size. Default: 0.01 - **0.001**: Very conservative, slow convergence - **0.01**: Standard, good for most cases - **0.1**: Aggressive, may overshoot - **1.0+**: Usually too large, diverges Recommended: Start with 0.01, adjust if needed **Batch Size** Controls documents per iteration. Default: 32 - **16/32**: Small, noisier gradients, fast iterations - **64/128**: Medium, balanced gradients, standard - **256/512**: Large, stable gradients, slower iterations Recommended: 32-128 for balance **Iterations** How long to train. Default: 100 Monitor loss: .. code-block:: python params = model.train_step( num_steps=200, lr=0.01, ) # Then inspect: model.plot_model_loss() Suggested: Train until loss plateaus (visual inspection) Training Tips ============= **Use GPU**: Set JAX to use GPU for 10-100x speedup .. code-block:: bash # In Python, set before importing JAX export JAX_PLATFORMS=gpu python script.py **Reproducibility**: Set random seed .. code-block:: python params = model.train_step(num_steps=200, lr=0.01, random_seed=42) # Same seed → same results (good for research) **Progress Monitoring**: Check loss trajectory .. code-block:: python model.train_step(num_steps=200, lr=0.01) model.plot_model_loss() # visualize loss curve **Early stopping**: Stop if loss plateaus .. code-block:: python # Check loss after training model.train_step(num_steps=200, lr=0.01) model.plot_model_loss() # visually inspect convergence # If loss hasn't plateaued, train more steps Common Issues and Solutions ============================ **Problem**: Topics look similar or contain generic words *Solution*: - Could be too many topics - reduce K - Improve preprocessing (better stopword removal) - Look at more top words to find differences **Problem**: Some topics are all garbage words *Solution*: - Preprocess better (remove URLs, unicode artifacts, numbers) - Reduce number of topics - Check document-term matrix for data issues **Problem**: Training is slow *Solution*: - Use GPU: ``export JAX_PLATFORMS=gpu`` - Increase batch size (more docs per iteration) - Reduce vocabulary size (remove rare words) **Problem**: Loss not decreasing *Solution*: - Increase learning rate (try 0.05-0.1) - Check data: ensure proper document-term matrix format - Try different random seed Evaluation Metrics ================== See :doc:`../api/index` for available metrics: - **Coherence**: Do top words of a topic correlate? - **Topic diversity**: Are topics distinct? Example: .. code-block:: python coherence_df = model.compute_topic_coherence() print(f"Average coherence: {coherence_df['coherence'].mean():.3f}") diversity = model.compute_topic_diversity() print(f"Topic diversity: {diversity:.3f}") # Built-in visualizations model.plot_model_loss() # Training loss curve model.plot_topic_prevalence() # Topic prevalence bar chart model.plot_topic_correlation() # Topic similarity heatmap Next Steps ========== - **Add guidance**: Use :doc:`seeded_models` to incorporate domain knowledge - **Model metadata**: Try :doc:`covariate_models` if you have document attributes - **Advanced**: Explore :doc:`../tutorials/index` for advanced topics - **API details**: See :doc:`../api/index` for full documentation