.. _getting_started:

================================================================================
Getting Started
================================================================================

Welcome to poisson-topicmodels! This guide will get you up and running in about 5 minutes.

Quickstart: Your First Topic Model
===================================

Let's walk through a complete example from data preparation to model interpretation.

Step 1: Import Required Libraries
----------------------------------

.. code-block:: python

   import numpy as np
   from scipy.sparse import csr_matrix
   from poisson_topicmodels import PF

Step 2: Prepare Your Data
--------------------------

Topic models work with a **document-term matrix** (documents × vocabulary terms) and
a **vocabulary** list.

.. code-block:: python

   # Create sample data: 100 documents, 500 vocabulary terms
   # In practice, you would load your own text data
   np.random.seed(42)
   counts = csr_matrix(np.random.poisson(2, (100, 500)).astype(np.float32))
   vocab = np.array([f'word_{i}' for i in range(500)])

   print(f"Document-term matrix shape: {counts.shape}")
   print(f"Vocabulary size: {len(vocab)}")

**Data format**: The document-term matrix should be a SciPy sparse matrix (rows = documents,
columns = vocabulary terms) holding non-negative, integer-valued word counts. The examples in
this guide store those counts as ``float32``.
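
The synthetic matrix above skips the step of turning raw text into counts. The following is a
minimal sketch of that step, assuming naive whitespace tokenization and a small hypothetical
corpus (``docs``); it is not part of poisson-topicmodels, and real pipelines usually add
punctuation stripping and stop-word removal.

.. code-block:: python

   from collections import Counter

   import numpy as np
   from scipy.sparse import csr_matrix

   # Hypothetical toy corpus; replace with your own documents.
   docs = [
       "the cat sat on the mat",
       "the dog chased the cat",
       "dogs and cats make good pets",
   ]

   # Naive tokenization (an assumption): lowercase, split on whitespace.
   tokenized = [doc.lower().split() for doc in docs]

   # Sort the vocabulary so that the column order is deterministic.
   vocab_list = sorted({tok for tokens in tokenized for tok in tokens})
   word_to_col = {word: col for col, word in enumerate(vocab_list)}
   vocab = np.array(vocab_list)

   # Collect one (row, col, count) triplet per distinct word in a document.
   rows, cols, vals = [], [], []
   for row, tokens in enumerate(tokenized):
       for word, count in Counter(tokens).items():
           rows.append(row)
           cols.append(word_to_col[word])
           vals.append(count)

   counts = csr_matrix(
       (vals, (rows, cols)), shape=(len(docs), len(vocab)), dtype=np.float32
   )

   print(f"Document-term matrix shape: {counts.shape}")

The resulting ``counts`` and ``vocab`` have the same layout as the synthetic data in this step,
so they can be passed to the model in the same way.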

Step 3: Initialize and Train the Model
---------------------------------------

.. code-block:: python

   # Create a Poisson Factorization model with 10 topics
   model = PF(
       counts=counts,
       vocab=vocab,
       num_topics=10,
       batch_size=32,
       random_seed=42
   )

   # Train for 200 steps with learning rate 0.01
   params = model.train_step(
       num_steps=200,
       lr=0.01
   )

Step 4: Extract and Interpret Results
--------------------------------------

.. code-block:: python

   # Quick summary of the fitted model
   model.summary()

   # Get word-topic associations
   beta = model.return_beta()  # DataFrame: words × topics
   print(f"Beta shape: {beta.shape}")

   # Get top words for each topic
   top_words = model.return_top_words_per_topic(n=10)
   print("\nTop 10 words per topic:")
   for topic_id, words in top_words.items():
       print(f"Topic {topic_id}: {', '.join(words)}")

   # Get document-topic distributions
   categories, e_theta = model.return_topics()
   print(f"\nDocument-topic matrix shape: {e_theta.shape}")
   print(f"Dominant topic for first doc: {categories[0]}")

Understanding the Output
=========================

**Beta** (``return_beta()``)
   DataFrame of shape (vocabulary_size, num_topics).

   Each column is a topic and holds that topic's word-level association weights.

**Topics** (``return_topics()``)
   Returns ``(categories, E_theta)``: the dominant topic for each document and
   the full document-topic proportions matrix.

**Top Words** (``return_top_words_per_topic(n)``)
   A dict mapping topic identifiers to their top-n words, useful for interpretation.

**Summary** (``summary()``)
   Prints a formatted overview of the model: loss, top words, and model-specific details.
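
To make the relationship between these outputs concrete, here is a small sketch using
hand-written stand-in arrays rather than a fitted model. It assumes that top words are the
highest-weighted rows of each beta column and that the dominant topic is the largest entry in
each document's row; how the library computes and labels these internally may differ.

.. code-block:: python

   import numpy as np

   # Stand-in values, NOT produced by the library.
   vocab = np.array(["apple", "banana", "cherry", "goal", "match", "team"])

   # Word-topic weights, shape (vocabulary_size, num_topics).
   beta = np.array([
       [3.0, 0.1],
       [2.0, 0.2],
       [1.0, 0.3],
       [0.1, 2.5],
       [0.2, 1.5],
       [0.3, 3.5],
   ])

   # Document-topic weights, shape (num_documents, num_topics).
   e_theta = np.array([
       [0.8, 0.2],
       [0.1, 0.9],
       [0.6, 0.4],
       [0.3, 0.7],
   ])

   # Top-n words per topic: sort each column in descending order of weight.
   n = 3
   top_words = {
       k: vocab[np.argsort(beta[:, k])[::-1][:n]].tolist()
       for k in range(beta.shape[1])
   }

   # Dominant topic per document: index of the largest weight in each row.
   dominant = e_theta.argmax(axis=1)

   print(top_words)
   print(dominant)

Here topic 0 collects the fruit words and topic 1 the sports words, and each document is
assigned to whichever topic carries more of its weight.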

Next Steps
==========

Now that you have a working model, explore:

1. **Model Variants**: :doc:`../fundamentals/index`
   Explore all models like seeded PF (SPF) for guided discovery.

2. **Training & Configuration**: :doc:`../tutorials/index`
   Understand training options, hyperparameters, and GPU acceleration.

3. **Practical Recipes**: :doc:`../how_to_guides/index`
   Common tasks and advanced workflows.

4. **Advanced Usage**: :doc:`../how_to_guides/index`
   Extract results, customize inference, and integrate with your pipeline.

Complete Example with Real-ish Data
====================================

Here's a more realistic example with synthetic documents that have meaningful structure:

.. code-block:: python

   import numpy as np
   from scipy.sparse import csr_matrix
   from poisson_topicmodels import PF

   # Create synthetic documents with 3 underlying topics
   np.random.seed(42)
   num_docs = 200
   num_words = 1000
   num_topics = 3

   # Define some "topic-specific" words
   topic_words = {
       0: list(range(0, 100)),      # Words 0-99 for topic 0
       1: list(range(100, 200)),    # Words 100-199 for topic 1
       2: list(range(200, 300)),    # Words 200-299 for topic 2
   }

   # Generate documents with topic structure
   counts_list = []
   for doc_id in range(num_docs):
       # Each document is a mixture of topics
       topic_dist = np.random.dirichlet([1] * num_topics)
       word_counts = np.zeros(num_words)

       for topic_id in range(num_topics):
           topic_weight = topic_dist[topic_id]
           words_in_topic = topic_words[topic_id]
           for _ in range(int(50 * topic_weight)):
               word_id = np.random.choice(words_in_topic)
               word_counts[word_id] += 1

       counts_list.append(word_counts)

   counts = csr_matrix(np.array(counts_list).astype(np.float32))
   vocab = np.array([f'word_{i}' for i in range(num_words)])

   # Train model with matching number of topics
   model = PF(counts, vocab, num_topics=3, batch_size=32)
   params = model.train_step(num_steps=200, lr=0.01, random_seed=42)

   # The model should discover the 3 underlying topics
   model.summary()
   top_words = model.return_top_words_per_topic(n=10)
   for topic_id, words in top_words.items():
       print(f"\nTopic {topic_id}:")
       print(f"  {', '.join(words)}")
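
Because the synthetic vocabulary plants each topic in a known block of 100 word indices, you can
check recovery numerically instead of by eye. The sketch below assumes the top-words result is a
dict of ``word_<i>`` strings, as in the loop above; ``planted_block``, ``block_purity``, and the
stand-in ``example_top_words`` are hypothetical helpers for illustration, not library functions.

.. code-block:: python

   from collections import Counter


   def planted_block(word, block_size=100, num_blocks=3):
       """Return the planted topic of a 'word_<i>' token, or None for filler words."""
       idx = int(word.split("_")[1])
       block = idx // block_size
       return block if block < num_blocks else None


   def block_purity(words):
       """Return (most common planted block, share of words in that block)."""
       blocks = Counter(planted_block(w) for w in words)
       block, hits = blocks.most_common(1)[0]
       return block, hits / len(words)


   # Stand-in for the dict returned by the model; replace with `top_words`.
   example_top_words = {
       0: ["word_12", "word_87", "word_3", "word_150"],
       1: ["word_230", "word_201", "word_299", "word_250"],
   }

   for topic_id, words in example_top_words.items():
       block, purity = block_purity(words)
       print(f"Topic {topic_id}: planted block {block}, purity {purity:.2f}")

A purity close to 1.0 for every topic, with each topic mapping to a different block, suggests the
planted structure was recovered. The topic numbering itself is arbitrary, so only the grouping
matters.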

Key Concepts
============

**Document-Term Matrix**
   The core input format: a sparse matrix where rows are documents and columns are
   vocabulary terms, containing word counts.

**Topics**
   Latent variables representing abstract themes. Each topic is a distribution over words.

**Topic Modeling**
   Statistical technique to discover and analyze latent topics in text data.

**Stochastic Variational Inference (SVI)**
   Efficient training method that processes documents in small batches (mini-batch training).

**GPU Acceleration**
   Computations run on GPU (if available) for significant speedups on large datasets.
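
As a rough illustration of how these concepts fit together, the sketch below simulates data from
a generic Poisson factorization model in NumPy and draws one mini-batch of documents. The Gamma
shape and scale values are arbitrary assumptions, and the priors and batching logic used inside
the library may differ.

.. code-block:: python

   import numpy as np

   rng = np.random.default_rng(0)
   num_docs, num_words, num_topics = 50, 30, 3

   # Non-negative latent factors drawn from Gamma distributions
   # (shape and scale are arbitrary choices for this sketch).
   theta = rng.gamma(shape=0.3, scale=1.0, size=(num_docs, num_topics))
   beta = rng.gamma(shape=0.3, scale=1.0, size=(num_words, num_topics))

   # The expected count of word v in document d is the dot product of the
   # document's topic intensities and the word's topic weights.
   rate = theta @ beta.T

   # Document-term matrix: one Poisson draw per (document, word) pair.
   counts = rng.poisson(rate)

   # Mini-batching: each training step only looks at a random subset of documents.
   batch_size = 8
   batch_idx = rng.choice(num_docs, size=batch_size, replace=False)
   batch = counts[batch_idx]

   print(counts.shape, batch.shape)

Fitting the model reverses this process: given only ``counts``, inference estimates document-topic
and word-topic factors that explain the observed counts.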

Common Parameters
=================

**num_topics**: Number of topics to discover

**batch_size**: Documents processed per training step

**num_steps**: Number of training steps (passed to ``train_step``)

**lr**: Learning rate, i.e. the step size for optimization (passed to ``train_step``)

**random_seed**: For reproducibility

Tips for Best Results
=====================

1. **Tune the Number of Topics**: Start with 10-20 topics, adjust based on coherence
2. **Use a Good Batch Size**: Larger batches (256+) for stability, smaller (32) for faster iterations
3. **Monitor Training**: Check that loss decreases smoothly
4. **Validate Topics**: Read top words to verify topics make sense
5. **Use GPU**: If available, GPU acceleration can provide speedups of up to 10-100x, mainly on large datasets
6. **Set Random Seed**: For reproducibility in research
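
For the "Monitor Training" tip, per-step losses are usually noisy, so it helps to look at a
smoothed trend rather than raw values. The helpers below are a hypothetical sketch that works on
any sequence of loss values you have recorded; how to obtain those values from a fitted model is
not covered here, and the loss trace in the example is simulated.

.. code-block:: python

   import numpy as np


   def smoothed(losses, window=10):
       """Moving average of a loss trace over a fixed-size window."""
       losses = np.asarray(losses, dtype=float)
       kernel = np.ones(window) / window
       return np.convolve(losses, kernel, mode="valid")


   def is_decreasing(losses, window=10):
       """True if the smoothed loss ends lower than it starts."""
       trend = smoothed(losses, window)
       return bool(trend[-1] < trend[0])


   # Simulated noisy but decreasing loss trace (not real training output).
   rng = np.random.default_rng(0)
   steps = np.arange(200)
   losses = 1000.0 * np.exp(-steps / 50.0) + rng.normal(0.0, 5.0, size=200)

   print(f"Loss trending down: {is_decreasing(losses)}")

If the smoothed loss plateaus early or oscillates, lowering the learning rate or increasing the
batch size is a reasonable first thing to try.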

What's Next?
============

- **Read more examples**: See :doc:`../examples_guide/index`
- **Explore other models**: Check :doc:`../fundamentals/index` for seeded, covariate, and embedded variants
- **Learn advanced techniques**: Visit :doc:`../tutorials/index`
- **Check the API**: Refer to :doc:`../api/index` for detailed documentation

Having Issues?
==============

- Check :doc:`../installation/index` for installation troubleshooting
- Read :doc:`../how_to_guides/index` for common tasks
- Explore the full :doc:`../api/index` reference