How-To Guides

Practical recipes for common topic modeling tasks.

Note

Individual how-to guides for specific tasks are coming soon. For now, please refer to:

Common Topics

Data & Input

  • Loading text files and creating document-term matrices

  • Tokenization, cleaning, and preprocessing

  • Working with sparse matrix formats

  • Creating and managing vocabularies

Training & Configuration

  • Mini-batch vs full-batch training

  • GPU acceleration and multi-GPU setups

  • Hyperparameter tuning and validation

  • Reproducibility with seeds

Results & Analysis

  • Extracting topic distributions and top words

  • Interpreting and visualizing results

  • Model evaluation and coherence metrics

  • Exporting results for downstream analysis

Troubleshooting

  • Handling data issues and edge cases

  • GPU memory problems

  • Training failures and convergence issues

  • Improving topic quality

Tips & Best Practices

General Workflow

  1. Prepare Data: Load and clean text, create document-term matrix

  2. Train Model: Start with basic PF before trying advanced variants

  3. Evaluate: Check topic quality with metrics and manual inspection

  4. Extract & Analyze: Get top words, distributions, and visualizations

  5. Improve: Adjust hyperparameters or model type if needed

Performance Optimization

  • Use GPU for datasets with >100k documents (see Tutorial: GPU Acceleration)

  • Filter rare words to reduce vocabulary size

  • Use sparse matrices for large inputs

  • Start with fewer topics for testing, then scale up

Model Selection

  • PF: Start here - simple, unsupervised baseline

  • SPF: Use if you have domain knowledge (keywords/seeds)

  • CPF/CSPF: Use if documents have metadata (authors, dates, etc.)

  • ETM: Use if you have pre-trained word embeddings

  • TBIP: Use for discovering ideological positions (political text)

  • STBS: Use for topic-specific ideological positions with author-level covariates

Learn More