.. _tutorial_validation:

================================================================================
Tutorial: Model Validation & Evaluation
================================================================================

This tutorial shows how to assess the quality of your trained topic models.

| **Duration**: ~10 minutes
| **Prerequisites**: :doc:`tutorial_training`

Validation Approaches
=====================

Three complementary approaches to validate models:

1. **Qualitative**: Manual inspection of topics
2. **Quantitative**: Metrics (coherence, perplexity)
3. **Downstream**: Performance on actual tasks

The Coherence Metric
====================

Coherence measures whether the top words of a topic are semantically related.

.. code-block:: python

   coherence_df = model.compute_topic_coherence()
   coherence = coherence_df['coherence'].values
   print(f"Coherence per topic: {coherence}")
   print(f"Average: {coherence.mean():.3f}")

   # Topic diversity: are topics distinct?
   diversity = model.compute_topic_diversity()
   print(f"Topic diversity: {diversity:.3f}")
   # 1.0 = all unique words, 0.0 = identical top words

Interpreting average coherence values (rough guidelines, since scales differ between coherence measures):

- **0.6+**: Excellent coherence (words form clear themes)
- **0.4-0.6**: Good (topics are interpretable)
- **0.2-0.4**: Fair (some coherence but noisy)
- **<0.2**: Poor (topics are incoherent)
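
To make the idea concrete, here is a minimal, dependency-free sketch of one common coherence measure: normalized pointwise mutual information (NPMI) over document-level co-occurrence. It is not necessarily what ``compute_topic_coherence()`` computes. The function name ``npmi_coherence`` and the toy corpus are illustrative assumptions, and NPMI ranges from -1 to 1, so the thresholds above do not transfer to it directly.

.. code-block:: python

   import math
   from itertools import combinations

   def npmi_coherence(top_words, documents):
       """Average NPMI over all pairs of top words (hypothetical helper)."""
       doc_sets = [set(doc) for doc in documents]
       n_docs = len(doc_sets)

       def prob(*words):
           # Fraction of documents that contain every word in `words`
           return sum(all(w in d for w in words) for d in doc_sets) / n_docs

       scores = []
       for w1, w2 in combinations(top_words, 2):
           p1, p2, p12 = prob(w1), prob(w2), prob(w1, w2)
           if p12 == 0:
               scores.append(-1.0)  # never co-occur: lowest possible NPMI
           elif p12 == 1:
               scores.append(0.0)   # both words in every document: uninformative
           else:
               pmi = math.log(p12 / (p1 * p2))
               scores.append(pmi / -math.log(p12))
       return sum(scores) / len(scores)

   # Toy corpus: three "pet" documents and three "finance" documents
   toy_docs = [
       ["cat", "dog", "pet"],
       ["cat", "dog", "vet"],
       ["dog", "pet", "vet"],
       ["stock", "bond", "market"],
       ["stock", "market", "trade"],
       ["bond", "market", "trade"],
   ]

   coherent_score = npmi_coherence(["cat", "dog", "pet"], toy_docs)
   mixed_score = npmi_coherence(["cat", "stock", "trade"], toy_docs)
   print(f"coherent topic: {coherent_score:.3f}")
   print(f"mixed topic:    {mixed_score:.3f}")

Words that tend to appear in the same documents score high; a topic that mixes unrelated vocabularies scores low.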

Finding low-coherence topics:

.. code-block:: python

   import numpy as np

   # Indices of the five topics with the lowest coherence
   worst_topics = np.argsort(coherence)[:5]
   top_words = model.return_top_words_per_topic(n=10)
   for topic_id in worst_topics:
       words = top_words[topic_id]
       print(f"Topic {topic_id} (coherence={coherence[topic_id]:.3f}):")
       print(f"  {', '.join(words)}")

Qualitative Inspection
======================

Manual topic interpretation:

.. code-block:: python

   def evaluate_topics_manually(model, num_to_show=10):
       """Show each topic's top words and collect a 1-5 rating for it."""
       top_words = model.return_top_words_per_topic(n=20)

       ratings = {}

       for topic_id, words in top_words.items():
           print(f"\n=== Topic {topic_id} ===")
           print(f"Top words: {', '.join(words[:num_to_show])}")

           # Rate quality: 1=bad, 2=poor, 3=fair, 4=good, 5=excellent
           rating = input("Rate this topic (1-5, q=quit): ")
           if rating.lower() == 'q':
               break
           # Ignore anything outside the 1-5 scale
           if rating in {'1', '2', '3', '4', '5'}:
               ratings[topic_id] = int(rating)

       return ratings

   # Use it
   ratings = evaluate_topics_manually(model)
   if ratings:  # avoid averaging an empty list if you quit immediately
       avg_rating = np.mean(list(ratings.values()))
       print(f"\nAverage rating: {avg_rating:.1f} / 5")

Checklist for topic quality:

.. code-block:: text

   ✓ Top words form coherent theme
   ✓ You can give topic a meaningful label
   ✓ Topic isn't all stopwords or common terms
   ✓ Topic doesn't duplicate another topic
   ✓ Topic isn't a garbage catch-all
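
The stopword item on this checklist can be partly automated. The sketch below flags topics whose top words are mostly stopwords. It assumes the top words are available as a dict mapping topic ID to a list of words; ``STOPWORDS``, ``stopword_fraction``, ``flag_garbage_topics`` and the 0.5 threshold are illustrative choices, not part of the library.

.. code-block:: python

   # Tiny illustrative stopword list; substitute the one used in preprocessing
   STOPWORDS = {"the", "and", "of", "to", "in", "a", "is", "that", "for", "it"}

   def stopword_fraction(words, stopwords=STOPWORDS):
       """Fraction of a topic's top words that are stopwords."""
       if not words:
           return 0.0
       return sum(w.lower() in stopwords for w in words) / len(words)

   def flag_garbage_topics(top_words, threshold=0.5):
       """Return IDs of topics whose top words are mostly stopwords."""
       return [
           topic_id
           for topic_id, words in top_words.items()
           if stopword_fraction(words) >= threshold
       ]

   # Hand-written example topics
   example_top_words = {
       0: ["market", "stock", "price", "trade"],
       1: ["the", "and", "of", "said"],
   }
   flagged = flag_garbage_topics(example_top_words)
   print(f"Topics to review: {flagged}")

A flagged topic still deserves a manual look; the check only narrows down where to start.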

Comparative Evaluation
======================

Compare multiple model configurations:

.. code-block:: python

   results = {}

   # Try different numbers of topics
   for num_topics in [5, 10, 20, 50]:
       model = PF(counts, vocab, num_topics=num_topics, batch_size=32)
       model.train_step(num_steps=200, lr=0.01)

       coherence_df = model.compute_topic_coherence()
       coherence = coherence_df['coherence'].values
       results[num_topics] = {
           'coherence_mean': coherence.mean(),
           'coherence_std': coherence.std(),
           'diversity': model.compute_topic_diversity(),
           'model': model
       }

   # Display results
   print("Performance by number of topics:")
   for k, v in results.items():
       print(f"  K={k}: coherence={v['coherence_mean']:.3f} ± {v['coherence_std']:.3f}")

   # Pick the configuration with the highest mean coherence
   best_k = max(results, key=lambda x: results[x]['coherence_mean'])
   print(f"\nBest configuration: {best_k} topics")
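
Selecting on mean coherence alone can favor a configuration whose topics repeat each other. One possible heuristic, sketched below, is to rank by coherence only among runs that clear a diversity floor. The function ``rank_configurations``, the 0.5 floor and the numbers in ``toy_results`` are made up for illustration.

.. code-block:: python

   def rank_configurations(results, min_diversity=0.5):
       """Rank K by mean coherence, skipping runs with low topic diversity."""
       eligible = {
           k: v for k, v in results.items() if v["diversity"] >= min_diversity
       }
       return sorted(
           eligible, key=lambda k: eligible[k]["coherence_mean"], reverse=True
       )

   # Made-up scores in the same shape as the `results` dict above
   toy_results = {
       5: {"coherence_mean": 0.52, "diversity": 0.40},
       10: {"coherence_mean": 0.48, "diversity": 0.85},
       20: {"coherence_mean": 0.45, "diversity": 0.80},
   }
   ranking = rank_configurations(toy_results)
   print(f"Ranking after the diversity filter: {ranking}")

Here K=5 has the highest coherence but is filtered out because its topics overlap too much.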

Downstream Task Evaluation
===========================

If you have a downstream task, evaluate model performance there:

.. code-block:: python

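   # Example: use document-topic weights as features for classification
   from sklearn.ensemble import RandomForestClassifier
   from sklearn.model_selection import train_test_split

   # Get document-topic representations
   _, e_theta = model.return_topics()

   # Hold out a test set so accuracy is not measured on the training data
   # (labels = ground truth, one per document)
   X_train, X_test, y_train, y_test = train_test_split(
       e_theta, labels, test_size=0.2, random_state=0
   )

   # Train classifier on topic features
   clf = RandomForestClassifier(random_state=0)
   clf.fit(X_train, y_train)

   # Evaluate on the held-out documents
   accuracy = clf.score(X_test, y_test)
   print(f"Classification accuracy: {accuracy:.3f}")

   # Keep the score so it can be compared with other models
   downstream_results = {'accuracy': accuracy}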
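
A classification accuracy only means something relative to a baseline. A minimal sketch of the simplest one, always predicting the most frequent label, is shown below; ``majority_baseline_accuracy`` and the toy labels are illustrative.

.. code-block:: python

   from collections import Counter

   def majority_baseline_accuracy(labels):
       """Accuracy obtained by always predicting the most frequent label."""
       counts = Counter(labels)
       return counts.most_common(1)[0][1] / len(labels)

   # Toy labels; in practice pass the labels of the evaluation documents
   toy_labels = ["sports", "sports", "sports", "politics", "tech"]
   baseline = majority_baseline_accuracy(toy_labels)
   print(f"Majority-class baseline: {baseline:.3f}")

If the topic-based classifier does not clearly beat this number, the topics are adding little for the task.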

Topic Similarity Analysis
==========================

Are topics overlapping? Check similarity:

.. code-block:: python

   from sklearn.metrics.pairwise import cosine_similarity

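   # Topic-word weights; the transpose below assumes one column per topic
   beta = model.return_beta()
   similarity = cosine_similarity(beta.values.T)
   # Zero the diagonal so a topic is never reported as similar to itself
   np.fill_diagonal(similarity, 0)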

   # Find similar pairs
   similar = np.where(similarity > 0.7)
   top_words = model.return_top_words_per_topic(n=5)
   for i, j in zip(similar[0], similar[1]):
       if i < j:
           print(f"Topic {i} and {j} are similar (sim={similarity[i, j]:.3f})")
           print(f"  Topic {i}: {', '.join(top_words[i])}")
           print(f"  Topic {j}: {', '.join(top_words[j])}")
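
A complementary check needs only the top-word lists: the Jaccard overlap between pairs of topics, plus an overall diversity score computed as the fraction of unique words among all top words. This is a sketch under the assumption that top words come as a dict of topic ID to word list; the helper names and the 0.5 threshold are illustrative, and the library's own ``compute_topic_diversity()`` may use a different definition.

.. code-block:: python

   from itertools import combinations

   def jaccard(words_a, words_b):
       """Size of the intersection divided by size of the union."""
       a, b = set(words_a), set(words_b)
       return len(a & b) / len(a | b) if a | b else 0.0

   def overlapping_pairs(top_words, threshold=0.5):
       """Return (topic_i, topic_j, overlap) for pairs at or above threshold."""
       pairs = []
       for (i, words_i), (j, words_j) in combinations(sorted(top_words.items()), 2):
           score = jaccard(words_i, words_j)
           if score >= threshold:
               pairs.append((i, j, score))
       return pairs

   def topic_diversity(top_words):
       """Fraction of unique words among all topics' top words."""
       all_words = [w for words in top_words.values() for w in words]
       return len(set(all_words)) / len(all_words)

   # Hand-written example: topics 0 and 1 are near duplicates
   toy_top_words = {
       0: ["cat", "dog", "pet", "vet"],
       1: ["dog", "cat", "pet", "fur"],
       2: ["stock", "bond", "market", "trade"],
   }
   pairs = overlapping_pairs(toy_top_words)
   diversity_score = topic_diversity(toy_top_words)
   print(pairs)
   print(f"diversity = {diversity_score:.2f}")

Cosine similarity compares the full word distributions, while Jaccard overlap looks only at what a reader sees in the top-word lists, so the two can disagree.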

Document Coverage
=================

Do all documents get meaningful topic assignments?

.. code-block:: python

   _, e_theta = model.return_topics()

   # Normalize each row to sum to 1 so that entropy and the maximum share are
   # well defined (a no-op if the rows are already probabilities)
   theta = np.asarray(e_theta, dtype=float)
   theta = theta / theta.sum(axis=1, keepdims=True)

   # Topic concentration per document
   doc_entropy = -np.sum(theta * np.log(theta + 1e-10), axis=1)
   max_probability = theta.max(axis=1)

   print("Document topic concentration:")
   print(f"  Max topic probability: {max_probability.mean():.3f} ± {max_probability.std():.3f}")
   print(f"  Entropy: {doc_entropy.mean():.3f} ± {doc_entropy.std():.3f}")

   # Low entropy = document in few topics (concentrated)
   # High entropy = document spread across topics (diffuse)

   # Are we getting good coverage?
   if max_probability.mean() < 0.3:
       print("Warning: Documents don't concentrate on topics")
       print("  → Consider increasing num_topics or training longer")
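
Raw entropy grows with the number of topics, which makes it hard to compare models with different ``num_topics``. A minimal sketch of a normalized variant, scaled to the range 0 to 1 by dividing by ``log(K)``, is shown below; ``normalized_entropy`` is an illustrative helper and the weights are made up.

.. code-block:: python

   import math

   def normalized_entropy(weights):
       """Entropy of one document's topic weights, scaled to [0, 1]."""
       total = sum(weights)
       if total <= 0 or len(weights) < 2:
           return 0.0
       # Convert raw weights to shares that sum to 1
       probs = [w / total for w in weights]
       entropy = -sum(p * math.log(p) for p in probs if p > 0)
       # log(K) is the entropy of a perfectly flat distribution over K topics
       return entropy / math.log(len(weights))

   concentrated = normalized_entropy([9.0, 0.5, 0.5])  # one dominant topic
   diffuse = normalized_entropy([2.0, 2.0, 2.0])       # flat distribution
   print(f"concentrated: {concentrated:.3f}")
   print(f"diffuse:      {diffuse:.3f}")

Values near 1 indicate a flat distribution over topics; values near 0 indicate a document dominated by a single topic.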

Visualization for Validation
=============================

.. code-block:: python

   import matplotlib.pyplot as plt
   import numpy as np

   fig, axes = plt.subplots(2, 2, figsize=(12, 10))

   # 1. Coherence distribution
   coherence_df = model.compute_topic_coherence()
   coherence = coherence_df['coherence'].values
   axes[0, 0].hist(coherence, bins=20, edgecolor='black')
   axes[0, 0].set_xlabel('Coherence')
   axes[0, 0].set_ylabel('Number of topics')
   axes[0, 0].set_title('Topic Coherence Distribution')
   axes[0, 0].axvline(coherence.mean(), color='red', linestyle='--', label='Mean')
   axes[0, 0].legend()

   # 2. Topic prevalence (or use built-in: model.plot_topic_prevalence())
   _, e_theta = model.return_topics()
   avg_topics = e_theta.mean(axis=0)
   axes[0, 1].bar(range(len(avg_topics)), avg_topics)
   axes[0, 1].set_xlabel('Topic ID')
   axes[0, 1].set_ylabel('Average intensity')
   axes[0, 1].set_title('Topic Prevalence')

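   # 3. Document entropy (rows normalized to sum to 1 before taking entropy)
   theta = np.asarray(e_theta, dtype=float)
   theta = theta / theta.sum(axis=1, keepdims=True)
   doc_entropy = -np.sum(theta * np.log(theta + 1e-10), axis=1)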
   axes[1, 0].hist(doc_entropy, bins=30, edgecolor='black')
   axes[1, 0].set_xlabel('Entropy')
   axes[1, 0].set_ylabel('Number of documents')
   axes[1, 0].set_title('Document Topic Dispersion')

   # 4. Best vs worst topics by coherence
   top_topics = np.argsort(coherence)[-5:]
   bottom_topics = np.argsort(coherence)[:5]
   axes[1, 1].barh(range(5), coherence[bottom_topics], alpha=0.5, label='Worst')
   axes[1, 1].barh(range(5, 10), coherence[top_topics], alpha=0.5, label='Best')
   axes[1, 1].set_yticks(range(10))
   axes[1, 1].set_yticklabels(list(bottom_topics) + list(top_topics))
   axes[1, 1].set_xlabel('Coherence')
   axes[1, 1].set_title('Best vs Worst Topics')
   axes[1, 1].legend()

   plt.tight_layout()
   plt.show()

Validation Checklist
====================

Before deploying a model:

- ✓ Average coherence > 0.4
- ✓ No garbage topics (all stopwords)
- ✓ Topics aren't highly overlapping
- ✓ Manual inspection: topics make sense
- ✓ Downstream task performance acceptable
- ✓ Coverage: documents get meaningful topics
- ✓ Reproducibility: same seed → same results
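
The measurable items on this checklist can be bundled into one function so every candidate model is held to the same bar. The sketch below reuses the thresholds mentioned in this tutorial (0.4 for coherence, 0.7 for topic similarity, 0.3 for the mean maximum topic share). The function name, the metric keys and the example values are hypothetical, and manual inspection and downstream performance still need human judgment.

.. code-block:: python

   def run_validation_checklist(metrics):
       """Map each automatable checklist item to True (pass) or False (fail)."""
       return {
           "average coherence > 0.4": metrics["coherence_mean"] > 0.4,
           "no garbage topics": metrics["num_garbage_topics"] == 0,
           "no highly overlapping topics": metrics["max_topic_similarity"] <= 0.7,
           "documents concentrate on topics": metrics["mean_max_topic_share"] >= 0.3,
       }

   # Made-up metrics for illustration
   toy_metrics = {
       "coherence_mean": 0.46,
       "num_garbage_topics": 0,
       "max_topic_similarity": 0.82,
       "mean_max_topic_share": 0.55,
   }
   checks = run_validation_checklist(toy_metrics)
   for name, passed in checks.items():
       print(f"{'PASS' if passed else 'FAIL'}  {name}")

   failed = [name for name, passed in checks.items() if not passed]
   all_passed = not failed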

Red Flags
=========

**Model probably needs improvement if**:

- ❌ Most topics have low coherence (<0.3)
- ❌ Can't label most topics meaningfully
- ❌ Many topics are duplicates
- ❌ Some topics are all stopwords/garbage
- ❌ Downstream task performance is poor
- ❌ Many documents have flat topic distribution

**Next steps when validation fails**:

1. Try more training iterations
2. Adjust learning rate
3. Change number of topics
4. Improve data preprocessing
5. Try guided/seeded variant (SPF)
6. Add covariates if available (CPF)

See :doc:`tutorial_hyperparameters` for optimization strategies.

Validation Workflow
===================

.. code-block:: text

   1. Train model with initial config
   2. Compute coherence
   3. Visualize and inspect topics
   4. Check for duplicates
   5. Evaluate downstream performance

   If quality acceptable: ✓ Done
   If not:
   6. Adjust configuration
   7. Retrain and repeat from 2
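
The loop in this workflow can be sketched as a small driver function. Training and scoring are passed in as a callable, so nothing here depends on a particular model class; ``validate_until_acceptable``, the 0.4 cutoff and the stand-in scores are illustrative assumptions.

.. code-block:: python

   def validate_until_acceptable(configs, train_and_score, min_coherence=0.4):
       """Try configurations in order and stop at the first acceptable one."""
       history = []
       for config in configs:
           score = train_and_score(config)   # steps 1-2: train, compute coherence
           history.append((config, score))
           if score >= min_coherence:        # quality acceptable: done
               return config, history
       return None, history                  # nothing passed: rethink the setup

   # Stand-in for real training and evaluation, with made-up scores
   fake_scores = {5: 0.31, 10: 0.38, 20: 0.44, 50: 0.41}
   chosen, history = validate_until_acceptable(
       [{"num_topics": k} for k in (5, 10, 20, 50)],
       train_and_score=lambda cfg: fake_scores[cfg["num_topics"]],
   )
   print(f"Chosen config: {chosen} after {len(history)} attempts")

Keeping the full ``history`` makes it easy to record every attempt, not only the one that passed.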

Version Tracking
================

Keep records of model evaluations:

.. code-block:: python

   import json
   from datetime import datetime

   def save_evaluation(model_name, config, results):
       """Save model evaluation results."""
       eval_record = {
           'timestamp': datetime.now().isoformat(),
           'model_name': model_name,
           'config': config,
           'results': {
               'mean_coherence': float(results['coherence'].mean()),
               'std_coherence': float(results['coherence'].std()),
               'num_low_quality': results.get('low_quality_count', 0),
           }
       }

       # Append one JSON object per line (JSON Lines format)
       with open('evaluations.json', 'a') as f:
           f.write(json.dumps(eval_record) + '\n')

   # Use it
   config = {'num_topics': 20, 'learning_rate': 0.01}
   save_evaluation('pf_model_v1', config, {'coherence': coherence})
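
Because each record is written on its own line, the log can be read back line by line. The sketch below loads such a file and picks the record with the highest mean coherence. ``load_evaluations`` and ``best_evaluation`` are illustrative helpers, and the demo writes made-up records to a temporary directory rather than touching a real log.

.. code-block:: python

   import json
   import os
   import tempfile

   def load_evaluations(path):
       """Read one JSON record per line, skipping blank lines."""
       with open(path) as f:
           return [json.loads(line) for line in f if line.strip()]

   def best_evaluation(records):
       """Return the record with the highest mean coherence."""
       return max(records, key=lambda r: r["results"]["mean_coherence"])

   # Demo with made-up records in a temporary file
   demo_records = [
       {"model_name": "pf_model_v1", "results": {"mean_coherence": 0.41}},
       {"model_name": "pf_model_v2", "results": {"mean_coherence": 0.47}},
   ]
   with tempfile.TemporaryDirectory() as tmp_dir:
       path = os.path.join(tmp_dir, "evaluations.json")
       with open(path, "a") as f:
           for record in demo_records:
               f.write(json.dumps(record) + "\n")
       records = load_evaluations(path)

   best = best_evaluation(records)
   print(f"Best so far: {best['model_name']}")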

Next Steps
==========

- Satisfied with the model? Move on to :doc:`tutorial_hyperparameters` for fine-tuning
- Need to optimize further? See :doc:`../how_to_guides/index`
- Preparing for production? Check :doc:`../contributing_guide/index` for best practices