What is Topic Modeling?
Topic modeling is an analytic method for identifying the topics that best describe the information in a collection of documents.
What is a Topic?
Topic modeling provides us with methods that allow us to:
What are mixture models?
Mixture models are probabilistic models for representing the presence of sub-populations within an overall population, without requiring that the observed dataset identify the sub-population to which each individual observation belongs.
Topic models are mixture models.
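As a minimal sketch of the mixture idea, assume two hypothetical topics and one document with made-up topic proportions (every name and number below is illustrative, not taken from a real corpus). The probability of seeing a word in the document is the weighted sum of its probability under each topic, and no word carries a label saying which topic produced it.

```python
# Hypothetical topic-word distributions (illustrative numbers only).
topics = {
    "sports": {"game": 0.5, "team": 0.4, "vote": 0.1},
    "politics": {"game": 0.1, "team": 0.2, "vote": 0.7},
}

# Hypothetical topic proportions for a single document.
doc_topic_weights = {"sports": 0.75, "politics": 0.25}


def word_probability(word, topic_weights, topic_word_probs):
    """P(word | doc) = sum over topics of P(topic | doc) * P(word | topic)."""
    return sum(
        weight * topic_word_probs[topic].get(word, 0.0)
        for topic, weight in topic_weights.items()
    )


# Mixture distribution over the vocabulary for this document.
mixture = {
    word: word_probability(word, doc_topic_weights, topics)
    for word in ["game", "team", "vote"]
}
```

Because each topic's word probabilities sum to 1 and the topic weights sum to 1, the resulting mixture is itself a valid probability distribution over the vocabulary.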
Several approaches to Topic Modeling include:
Define Latent Dirichlet Allocation
What are Dirichlet Distributions?
What assumptions does Latent Dirichlet Allocation make?
What does LDA do?
Therefore, given a corpus of documents, LDA works backward from the observed words to infer the k topics present in each document and the word distribution of each topic, that is, the topics and words most likely to have generated the corpus.
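The generative story that LDA inverts can be sketched as follows. This is a simplified illustration under stated assumptions, not a fitting procedure: the vocabulary, the two topic-word distributions, and the `alpha` values are hypothetical, and the topic-word distributions are fixed by hand rather than drawn from their own Dirichlet prior as in the full model. The helper names `sample_dirichlet` and `generate_document` are invented for this example.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw from a Dirichlet by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(topic_word_probs, vocab, alpha, n_words, rng):
    """Generate one document by following the LDA generative story."""
    # 1. Draw this document's topic proportions from a Dirichlet.
    theta = sample_dirichlet(alpha, rng)
    words = []
    for _ in range(n_words):
        # 2. Pick a topic for this word position according to theta.
        topic = rng.choices(range(len(theta)), weights=theta)[0]
        # 3. Pick a word from the chosen topic's word distribution.
        word = rng.choices(vocab, weights=topic_word_probs[topic])[0]
        words.append(word)
    return theta, words


rng = random.Random(0)  # seeded so the run is repeatable
vocab = ["game", "team", "vote", "law"]  # hypothetical vocabulary
topic_word_probs = [
    [0.5, 0.4, 0.05, 0.05],  # hypothetical "sports" topic
    [0.05, 0.05, 0.5, 0.4],  # hypothetical "politics" topic
]
theta, doc = generate_document(
    topic_word_probs, vocab, alpha=[0.5, 0.5], n_words=8, rng=rng
)
```

Fitting LDA runs in the opposite direction: only `doc` is observed, and the algorithm estimates plausible values for `theta` and the topic-word distributions.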
What does the LDA() function do?
How do you find the best number of topics or best value of K?
Several approaches:
Topic coherence.
Quantitative measures of fit.
- Log-likelihood
- Perplexity
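The two measures of fit are closely related: perplexity is commonly defined as the exponential of the negative log-likelihood per token, so a higher log-likelihood means a lower perplexity. The sketch below assumes that definition and uses hypothetical held-out log-likelihood values for models fit with different K. The numbers are invented for illustration and are not results from any real model.

```python
import math


def perplexity(log_likelihood, n_tokens):
    """Perplexity = exp(-log-likelihood / number of tokens); lower is better."""
    return math.exp(-log_likelihood / n_tokens)


# Hypothetical held-out log-likelihoods, keyed by number of topics K.
held_out_log_likelihood = {2: -7200.0, 4: -6900.0, 6: -6950.0}
n_tokens = 1000  # hypothetical number of tokens in the held-out set

# Convert each log-likelihood to a perplexity score.
scores = {
    k: perplexity(ll, n_tokens) for k, ll in held_out_log_likelihood.items()
}

# One common convention: prefer the K with the lowest held-out perplexity.
best_k = min(scores, key=scores.get)
```

In practice these scores are usually plotted against K and read alongside topic coherence, since the K with the best fit is not always the most interpretable one.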
What is topic coherence?
What is Log-likelihood?
What is Perplexity?
Method for finding the best K
To find the best value of K using the quantitative measures of fit, we:
Strengths of LDA?
Weaknesses of LDA?
- Must know the number of topics in advance.
- Dirichlet topic distribution cannot capture correlations among topics.
Some applications of Topic Modeling: