.. _tutorial_training:

================================================================================
Tutorial: Training Your First Topic Model
================================================================================

This tutorial covers the complete workflow for training a topic model and interpreting results.

**Duration**: ~15 minutes
**Level**: Beginner
**Prerequisites**: :doc:`../getting_started/index`

Step 1: Prepare Your Data
==========================

First, organize your text data into a document-term matrix.

.. code-block:: python

   import numpy as np
   from scipy.sparse import csr_matrix
   from poisson_topicmodels import PF

   # Option A: Load pre-processed data
   # For this tutorial, create synthetic data
   np.random.seed(42)

   # Document-term matrix: 500 documents, 1000 vocabulary
   num_docs, num_words = 500, 1000
   # Synthetic: ~10 words per document on average
   counts = csr_matrix(np.random.poisson(1, (num_docs, num_words)).astype(np.float32))

   # Vocabulary is list of words
   vocab = np.array([f'word_{i}' for i in range(num_words)])

   print(f"Dataset shape: {counts.shape}")
   print(f"Sparsity: {1 - counts.nnz / (counts.shape[0] * counts.shape[1]):.1%}")

Step 2: Initialize the Model
=============================

Create a topic model with initial configuration.

.. code-block:: python

   # Start with a reasonable number of topics
   num_topics = 15  # Adjust based on your data and goals

   model = PF(
       counts=counts,
       vocab=vocab,
       num_topics=num_topics,
       batch_size=64,  # 64 documents per training batch
       random_seed=42  # for reproducibility
   )

   print(f"Initialized PF model with {num_topics} topics")

Step 3: Train the Model
=======================

Run inference to learn topic and document-topic distributions.

.. code-block:: python

   # Train for 200 steps with moderate learning rate
   params = model.train_step(
       num_steps=200,
       lr=0.01,
   )

   print("Training complete!")

**Monitor training**: Watch the loss values. Should decrease steadily then plateau.

**Time expectations**:
- 500 docs × 1000 words: ~30 seconds on CPU, ~5 seconds on GPU
- Larger datasets: scale accordingly

Step 4: Extract Results
=======================

Get the learned topics and document-topic distributions.

.. code-block:: python

   # 1. Get word-topic associations
   beta = model.return_beta()  # DataFrame: vocab_size × num_topics
   print(f"Beta shape: {beta.shape}")

   # 2. Get document-topic distributions
   categories, e_theta = model.return_topics()
   print(f"Document-topic shape: {e_theta.shape}")

   # 3. Get top words per topic
   top_words = model.return_top_words_per_topic(n=15)
   print(f"Number of topics: {len(top_words)}"))

Step 5: Interpret Topics
=========================

Human interpretation is crucial for evaluating topic quality.

.. code-block:: python

   # Print the model summary first
   model.summary()

   # Display top words for each topic
   print("=" * 60)
   print("DISCOVERED TOPICS")
   print("=" * 60)

   top_words = model.return_top_words_per_topic(n=15)
   for topic_id, words in top_words.items():
       print(f"\nTopic {topic_id}:")
       print(f"  Top words: {', '.join(words[:10])}")

   # Ask yourself:
   # - Do the top words form a coherent theme?
   # - Can you give each topic a human-readable label?
   # - Do any topics look like garbage?
   # - Are any topics too similar?

Step 6: Analyze Document-Topic Distribution
=============================================

Understand how topics are distributed across documents.

.. code-block:: python

   import matplotlib.pyplot as plt

   # 1. Topic distribution in specific documents
   categories, e_theta = model.return_topics()
   doc_0_topics = e_theta[0]
   print(f"\nDocument 0 topic distribution:")
   sorted_topics = np.argsort(doc_0_topics)[::-1]
   for i, topic_id in enumerate(sorted_topics[:5]):
       intensity = doc_0_topics[topic_id]
       print(f"  Topic {topic_id}: {intensity:.3f}")

   # 2. Overall topic prevalence in corpus (built-in plot)
   model.plot_topic_prevalence()

   # 3. Find most interesting documents
   doc_entropy = -np.sum(e_theta * np.log(e_theta + 1e-10), axis=1)
   focused_docs = np.argsort(doc_entropy)[:5]
   scattered_docs = np.argsort(doc_entropy)[-5:]

   print(f"\nMost focused documents (highest topic concentration): {focused_docs}")
   print(f"Most scattered documents (most mixed): {scattered_docs}")

Step 7: Advanced Analysis
=========================

Deeper exploration of results:

.. code-block:: python

   # 1. Topic similarity (built-in heatmap)
   model.plot_topic_correlation()

   # Or compute manually:
   from sklearn.metrics.pairwise import cosine_similarity
   beta = model.return_beta()
   topic_similarity = cosine_similarity(beta.values.T)
   np.fill_diagonal(topic_similarity, 0)  # Remove self-similarity

   # Find most similar topic pairs
   for _ in range(3):
       i, j = np.unravel_index(topic_similarity.argmax(), topic_similarity.shape)
       if topic_similarity[i, j] > 0:
           print(f"Topic {i} and {j} are similar (sim={topic_similarity[i, j]:.3f})")
       topic_similarity[i, j] = 0

   # 2. Topic specialization
   # How many topics does each document use mainly?
   _, e_theta = model.return_topics()
   doc_dominance = (e_theta.max(axis=1) / e_theta.sum(axis=1))
   print(f"\nAverage document topic dominance: {doc_dominance.mean():.3f}")
   print(f"  → Values close to 1: documents focus on few topics")
   print(f"  → Values close to {1/num_topics:.3f}: documents spread across topics")

Step 8: Quality Metrics
=======================

Evaluate model quality programmatically.

.. code-block:: python

   # Coherence: do top words of a topic correlate?
   coherence_df = model.compute_topic_coherence()
   coherence = coherence_df['coherence'].values
   print(f"Topic coherence (per topic):")
   print(f"  Mean: {coherence.mean():.3f}")
   print(f"  Std: {coherence.std():.3f}")
   print(f"  Range: [{coherence.min():.3f}, {coherence.max():.3f}]")

   # Topic diversity
   diversity = model.compute_topic_diversity()
   print(f"\nTopic diversity: {diversity:.3f}")

   # Which topics are most coherent?
   best_topics = np.argsort(coherence)[-5:]
   worst_topics = np.argsort(coherence)[:5]
   print(f"\nMost coherent topics: {best_topics}")
   print(f"Least coherent topics: {worst_topics}")

Next: Validation and Optimization
==================================

Your trained model is done! Now consider:

1. **Validate model quality** → :doc:`tutorial_validation`
2. **Optimize hyperparameters** → :doc:`tutorial_hyperparameters`
3. **Scale to bigger data** → :doc:`tutorial_gpu`
4. **Solve specific problems** → :doc:`../how_to_guides/index`

Quick Checklist
===============

✓ Data loaded and formatted as document-term matrix
✓ Model initialized with reasonable parameters
✓ Training completed and loss decreased
✓ Topics extracted and interpreted
✓ Document-topic distributions explored
✓ Quality metrics computed

What's Next?

- **Improve results**: Try :doc:`tutorial_validation` to assess quality
- **Lots of data?**: See :doc:`tutorial_gpu` for GPU acceleration
- **Fine-tune model**: Read :doc:`tutorial_hyperparameters`
- **Specific task?**: Browse :doc:`../how_to_guides/index`

Common Issues
=============

**Q: Loss isn't decreasing**
A: Try higher learning rate (0.05-0.1) or reduce batch size

**Q: Topics look random**
A: You may need more topics or more training iterations

**Q: Training is really slow**
A: Use GPU (see :doc:`tutorial_gpu`) or reduce vocabulary size

**Q: Memory error**
A: Reduce batch_size or use sparse matrix format

See :doc:`../fundamentals/index` for more details on each model.