Topic Modeling Paradigms (machine learning in the context of NLP)
Canonical: Match a preestablished list of topics for the domain (the Library of Congress, the Vatican, the Thompson Chain-Reference Bible - one of the earliest examples of topic modeling before computers - or an anointed topic expert).
Organic: Discover the “natural” topics of a corpus; let topics bubble up out of the “lake” of unstructured documents. The most popular paradigm because it is a purely statistical approach.
Entity-centric: Topics are strongly tied to sets of named entities (NEs) that may change over time (often people). Starts from an established list of named entities (e.g., find all topics related to this list of people’s names).
Canonical Topic Modeling
Match a preestablished list of topics for the domain (the Library of Congress, the Vatican, the Thompson Chain-Reference Bible - one of the earliest examples of topic modeling before computers - or an anointed topic expert).
Organic Topic Modeling
Discover the “natural” topics of a corpus; let topics bubble up out of the “lake” of unstructured documents. The most popular paradigm because it is a purely statistical approach.
LSA: Latent semantic analysis
LDA: Latent Dirichlet allocation (not linear discriminant analysis!)
NMF: non-negative matrix factorization
Latent semantic analysis (LSA)
Discovers clusters of words that form a topic across a collection of documents. Starts from sparse vectors over a broad vocabulary and reduces them to a smaller number of dimensions, yielding a smaller set of words for each group. Similar to clustering.
Starts with a large term-document matrix
Then derives a term-term co-occurrence matrix (how many times two terms appeared in the same document) and factors it into lower-dimensional topic matrices
Keeps the dimensions (topics) that do a good job of separating documents; a lot of separation is desirable
Example: “fish” would likely show up in a topic when modeling menus, but “onions” would not.
To LSA, a topic is a mix of words that commonly occur together
Latent Dirichlet Allocation (LDA)
-Groups words with high co-occurrence in a corpus of documents
-Output topics can overlap in keywords
-Uses probability distributions over words rather than a topic-topic matrix
-Formula: a word’s probability in a document = (each topic’s distribution over words) × (the document’s distribution over topics), summed over topics
-Starts from a random seed; we decide how many topics we want to create
Latent Dirichlet Allocation (LDA)
LDA is the k-means of topic modeling; it has largely pushed out LSA
The most important aspects of LDA are its two hyperparameters, alpha and beta:
Alpha - a high value means each document is likely to contain a mixture of many topics. A low value means a document contains only a few topics.
Beta - a high value means each topic is likely to contain a mixture of many words. A low value means a topic is more likely to contain just a few words.
-Can be controlled independently
-Arguably the most favored method because people like being able to tweak alpha/beta parameters
-High alpha will lead to documents being more similar in terms of what topics they contain.
-A high beta value will similarly lead to topics being more similar in terms of what words they contain.
Define k, the number of topics, up front
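The knobs above map directly onto scikit-learn’s LDA, where alpha is `doc_topic_prior` and beta is `topic_word_prior`; the tiny corpus below is hypothetical:

```python
# A minimal LDA sketch. alpha -> doc_topic_prior, beta -> topic_word_prior;
# low values push toward few topics per document / few words per topic.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "fish menu grilled fish salmon",
    "onion soup onion tart cheese",
    "salmon grilled lemon fish",
    "cheese tart onion cream",
]
X = CountVectorizer().fit_transform(docs)

k = 2  # we must choose the number of topics up front
lda = LatentDirichletAllocation(
    n_components=k,
    doc_topic_prior=0.1,   # alpha: low -> each document holds few topics
    topic_word_prior=0.1,  # beta: low -> each topic holds few words
    random_state=0,        # LDA starts from a random seed
)
doc_topic = lda.fit_transform(X)  # per-document distribution over topics
print(doc_topic.round(2))         # each row sums to ~1
```

Raising `doc_topic_prior` makes documents more similar in topic content; raising `topic_word_prior` makes topics more similar in word content, matching the notes above.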
Non-Negative Matrix Factorization (NMF)
Behaves like a version of LDA in which parameters have been tweaked to enforce a sparse set of topics: it factors the term-document matrix into two non-negative matrices. Tends to model a small number of topics really well; not good for a large number of topics.
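A minimal NMF sketch with scikit-learn, again over a hypothetical toy corpus, showing the two non-negative factors:

```python
# NMF factors the term-document matrix X into W (document-topic weights)
# and H (topic-word weights), both constrained to be non-negative.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF

docs = [
    "grilled fish with lemon and capers",
    "fried fish and chips with tartar sauce",
    "onion soup with melted cheese",
    "caramelized onion and cheese tart",
]
X = TfidfVectorizer(stop_words="english").fit_transform(docs)

nmf = NMF(n_components=2, init="nndsvd")  # small k, per the note above
W = nmf.fit_transform(X)  # document-topic weights
H = nmf.components_       # topic-word weights
```

The non-negativity is what keeps the factors interpretable: a document is an additive mix of topics, and a topic is an additive mix of words.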
Working with Organic Topic Models (LSA, LDA, NMF)
If the output topics contain words like “the,” “a,” “an,” “in”: you forgot to remove stopwords
Applications of Topic Modelers
Canonical Topic Modeling
An authorized source defines the topics; we can only select topics from that list. The source may be a:
Standards org
Boss
Topic expert
How is it different from classification? We want to enable topic-driven exploration of the corpus by end users.
Constrain an organic topic model to the canonical list of topics (cut off words that don’t occur in the topic list)
Use an Information Retrieval approach
Extension vs. Intension (a topic’s extension is the set of things it covers; its intension is the set of properties that define it)
Entity-Centric Topic Modeling
Centered on a list of named entities, e.g., people, sports teams, countries, shows, leagues, years, seasons, franchises, etc.
One approach (entities first):
-Canonical list of named entities (names)
-Context harvesting - try to find topics that come up in the context of those names
-Contextual organic topics
Another approach (topics first)
Singular Value Decomposition (SVD)
-Any matrix of vectors (A) can be factored as A = U S Vᵀ: the vectors are expressed in terms of the lengths of their projections (scaled by the singular values in S) onto a set of orthogonal axes (the rows of Vᵀ)
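The factorization can be checked numerically with NumPy on a small made-up matrix:

```python
# SVD sketch: A = U @ diag(S) @ Vt, where the rows of Vt are orthonormal
# axes and S holds the projection lengths (singular values).
import numpy as np

A = np.array([[3.0, 1.0],
              [1.0, 3.0],
              [0.0, 2.0]])
U, S, Vt = np.linalg.svd(A, full_matrices=False)

# The factorization exactly reconstructs A.
assert np.allclose(U @ np.diag(S) @ Vt, A)
# The axes in Vt are orthonormal.
assert np.allclose(Vt @ Vt.T, np.eye(2))
```

Truncating S to the largest few singular values gives the low-rank approximation that LSA relies on.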
Lift score
How much more likely a word is to appear in the context of some action (such as visiting a certain website) than it is overall. Used to rank words by importance; thresholding then yields a set of words highly associated with the context.
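A sketch of one common formulation, lift = P(word | context) / P(word), over hypothetical word counts:

```python
# Lift score sketch: how much more likely a word is inside a context
# (e.g., sessions that visited a given site) than in the whole corpus.
# All counts below are hypothetical.
overall_counts = {"fish": 50, "menu": 40, "login": 200, "page": 210}
context_counts = {"fish": 20, "menu": 15, "login": 5, "page": 6}

n_overall = sum(overall_counts.values())
n_context = sum(context_counts.values())

def lift(word):
    # lift = P(word | context) / P(word)
    p_context = context_counts.get(word, 0) / n_context
    p_overall = overall_counts[word] / n_overall
    return p_context / p_overall

# Threshold the ranked words to keep those highly associated with the context.
associated = [w for w in overall_counts if lift(w) > 1.0]
```

Words with lift well above 1 are over-represented in the context; common-everywhere words like “login” fall below the threshold.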