Topic Modeling Paradigms (machine learning in the context of NLP)
Canonical: Match a preestablished list of topics for the domain (the Library of Congress, the Vatican, the Thompson Chain-Reference Bible - one of the earliest examples of topic modeling before computers - or an anointed topic expert).
Organic: Discover the “natural” topics of a corpus; let topics bubble up out of the “lake” of unstructured documents. The most popular paradigm because it is a purely statistical approach.
Entity-centric: Topics are strongly tied to sets of named entities (NEs) that may change over time (often people). Starts from an established list of named entities (e.g., find all topics related to this list of people’s names).
Canonical Topic Modeling
Match a preestablished list of topics for the domain (the Library of Congress, the Vatican, the Thompson Chain-Reference Bible - one of the earliest examples of topic modeling before computers - or an anointed topic expert).
Organic Topic Modeling
Discover the “natural” topics of a corpus; let topics bubble up out of the “lake” of unstructured documents. The most popular paradigm because it is a purely statistical approach.
LSA: Latent semantic analysis
LDA: Latent Dirichlet allocation (not linear discriminant analysis!)
NMF: non-negative matrix factorization
Latent semantic analysis (LSA)
Discovers clusters of words that form a topic across a collection of documents. Starts from sparse vectors over a broad vocabulary and reduces them to a smaller number of dimensions, yielding a smaller set of words for each group. Similar to clustering.
Starts with a large term-document matrix
Then derives a term-term co-occurrence matrix (how many times two terms appeared in the same document) and factors it into lower-dimensional topic matrices
Keeps the dimensions (topics) that do a good job of separating documents; a lot of separation is desirable
Example: “fish” would likely show up in a topic when modeling menus, but “onions” would not.
To LSA, a topic is a mix of words that commonly occur together
Latent Dirichlet Allocation (LDA)
-Groups words with high co-occurrence in a corpus of documents
-Output topics can overlap in keywords
-Uses probability distributions over words rather than a topic-topic matrix
-Formula: a word’s probability in a document = (each topic’s distribution over words) × (the document’s distribution over topics), summed over topics
-Starts from a random seed; we decide how many topics we want to create
Latent Dirichlet Allocation (LDA)
LDA is the k-means of topic modeling; it has largely pushed out LSA
The most important aspects of LDA are its two hyperparameters, alpha and beta:
Alpha - a high value means each document is likely to contain a mixture of many topics. A low value means a document contains only a few topics.
Beta - a high value means each topic is likely to contain a mixture of many words. A low value means a topic is more likely to contain just a few words.
-Can be controlled independently
-Arguably the most favored method because people like being able to tweak alpha/beta parameters
-High alpha will lead to documents being more similar in terms of what topics they contain.
-A high beta value will similarly lead to topics being more similar in terms of what words they contain.
Define k, the number of topics, up front
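The knobs above map directly onto scikit-learn’s LDA, where alpha is `doc_topic_prior` and beta is `topic_word_prior`; the tiny corpus below is hypothetical:

```python
# A minimal LDA sketch. alpha -> doc_topic_prior, beta -> topic_word_prior;
# low values push toward few topics per document / few words per topic.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "fish menu grilled fish salmon",
    "onion soup onion tart cheese",
    "salmon grilled lemon fish",
    "cheese tart onion cream",
]
X = CountVectorizer().fit_transform(docs)

k = 2  # we must choose the number of topics up front
lda = LatentDirichletAllocation(
    n_components=k,
    doc_topic_prior=0.1,   # alpha: low -> each document holds few topics
    topic_word_prior=0.1,  # beta: low -> each topic holds few words
    random_state=0,        # LDA starts from a random seed
)
doc_topic = lda.fit_transform(X)  # per-document distribution over topics
print(doc_topic.round(2))         # each row sums to ~1
```

Raising `doc_topic_prior` makes documents more similar in topic content; raising `topic_word_prior` makes topics more similar in word content, matching the notes above.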
Non-Negative Matrix Factorization (NMF)
Behaves like a version of LDA in which parameters have been tweaked to enforce a sparse set of topics: it factors the term-document matrix into two non-negative matrices. Tends to model a small number of topics really well; not good for a large number of topics.
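A minimal NMF sketch with scikit-learn, again over a hypothetical toy corpus, showing the two non-negative factors:

```python
# NMF factors the term-document matrix X into W (document-topic weights)
# and H (topic-word weights), both constrained to be non-negative.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF

docs = [
    "grilled fish with lemon and capers",
    "fried fish and chips with tartar sauce",
    "onion soup with melted cheese",
    "caramelized onion and cheese tart",
]
X = TfidfVectorizer(stop_words="english").fit_transform(docs)

nmf = NMF(n_components=2, init="nndsvd")  # small k, per the note above
W = nmf.fit_transform(X)  # document-topic weights
H = nmf.components_       # topic-word weights
```

The non-negativity is what keeps the factors interpretable: a document is an additive mix of topics, and a topic is an additive mix of words.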
Working with Organic Topic Models (LSA, LDA, NMF)
If the output topics contain words like “the,” “a,” “an,” “in”: you forgot to remove stopwords
Applications of Topic Modelers
Canonical Topic Modeling
An authorized source defines the topics; we can only select topics from that list. The source may be a:
Standards org
Boss
Topic expert
How is it different from classification? We want to enable topic-driven exploration of the corpus by end users.
Constrain an organic topic model to the canonical list of topics (cut off words that don’t occur in the topic list)
Use an Information Retrieval approach
Extension vs. Intension (a topic’s extension is the set of things it covers; its intension is the set of properties that define it)
Entity-Centric Topic Modeling
Centered on a list of named entities, e.g., people, sports teams, countries, shows, leagues, years, seasons, franchises, etc.
One approach (entities first):
-Canonical list of named entities (names)
-Context harvesting - try to find topics that come up in the context of those names
-Contextual organic topics
Another approach (topics first)
Singular Value Decomposition (SVD)
-Any matrix of vectors (A) can be factored as A = U S Vᵀ: the vectors are expressed in terms of the lengths of their projections (scaled by the singular values in S) onto a set of orthogonal axes (the rows of Vᵀ)
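The factorization can be checked numerically with NumPy on a small made-up matrix:

```python
# SVD sketch: A = U @ diag(S) @ Vt, where the rows of Vt are orthonormal
# axes and S holds the projection lengths (singular values).
import numpy as np

A = np.array([[3.0, 1.0],
              [1.0, 3.0],
              [0.0, 2.0]])
U, S, Vt = np.linalg.svd(A, full_matrices=False)

# The factorization exactly reconstructs A.
assert np.allclose(U @ np.diag(S) @ Vt, A)
# The axes in Vt are orthonormal.
assert np.allclose(Vt @ Vt.T, np.eye(2))
```

Truncating S to the largest few singular values gives the low-rank approximation that LSA relies on.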
Lift score
How much more likely a word is to appear in the context of some action (such as visiting a certain website) than it is overall. Used to rank words by importance; thresholding then yields a set of words highly associated with the context.
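A sketch of one common formulation, lift = P(word | context) / P(word), over hypothetical word counts:

```python
# Lift score sketch: how much more likely a word is inside a context
# (e.g., sessions that visited a given site) than in the whole corpus.
# All counts below are hypothetical.
overall_counts = {"fish": 50, "menu": 40, "login": 200, "page": 210}
context_counts = {"fish": 20, "menu": 15, "login": 5, "page": 6}

n_overall = sum(overall_counts.values())
n_context = sum(context_counts.values())

def lift(word):
    # lift = P(word | context) / P(word)
    p_context = context_counts.get(word, 0) / n_context
    p_overall = overall_counts[word] / n_overall
    return p_context / p_overall

# Threshold the ranked words to keep those highly associated with the context.
associated = [w for w in overall_counts if lift(w) > 1.0]
```

Words with lift well above 1 are over-represented in the context; common-everywhere words like “login” fall below the threshold.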