sources of bias
importance of data
text classification
corpora help us with text classification
goal: assign a label or category to a specific piece of text
why use text classification
sentiment analysis
goal: predict the sentiment expressed in a piece of text (positive, negative, or a rating on a scale)
why is sentiment analysis hard
other text classification problems
questions when building a sentiment classifier
data-driven evaluation
choose a dataset for evaluation before you build a system
why is data-driven evaluation important
where to get a corpus
gold labels
annotations used to evaluate and compare sentiment analyzers
these can be
1. derived automatically from the original data artifact (metadata such as star ratings)
2. added by a human annotator who reads the text (open question: what to do when annotators struggle to decide or disagree with each other)
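The first option can be sketched in a few lines. This is only a minimal sketch: it assumes a 1-5 star scale and a hypothetical rule where 4-5 stars counts as positive, 1-2 as negative, and 3 is dropped as too ambiguous. The function name `stars_to_label`, the cutoffs, and the example reviews are all illustrative assumptions, not taken from any particular dataset.

```python
def stars_to_label(stars):
    """Map a star rating (assumed to be 1-5) to a gold sentiment label.

    The cutoffs are an assumption for illustration:
    4-5 -> "pos", 1-2 -> "neg", 3 -> None (too ambiguous to label).
    """
    if stars >= 4:
        return "pos"
    if stars <= 2:
        return "neg"
    return None


# Hypothetical reviews paired with their star-rating metadata.
reviews = [
    ("Loved every minute of it", 5),
    ("It was fine, I guess", 3),
    ("Terrible service", 1),
]

# Build (text, gold label) pairs, skipping reviews that got no label.
gold = []
for text, stars in reviews:
    label = stars_to_label(stars)
    if label is not None:
        gold.append((text, label))

print(gold)
```

Labels derived this way are cheap, but they inherit whatever noise the metadata has (for example, a sarcastic 5-star review).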
sentiment analysis training data
(X, Y) pairs used to learn a function h(X)
–> each pair is (input, output): a text and its label
–> relies heavily on accurately labeled data
–> this is text classification
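To make the idea of learning h from labeled pairs concrete, here is a toy sketch. It is not a standard algorithm from these notes: it simply counts which words appear under which label and predicts the label whose training words overlap most with the new text. The names `train`, `h`, and the tiny dataset are assumptions made for illustration.

```python
from collections import Counter, defaultdict


def train(pairs):
    """Count how often each word appears under each label."""
    counts = defaultdict(Counter)
    for text, label in pairs:
        counts[label].update(text.lower().split())
    return counts


def h(counts, text):
    """Predict the label whose training word counts best match the text."""
    words = text.lower().split()
    scores = {label: sum(c[w] for w in words) for label, c in counts.items()}
    return max(scores, key=scores.get)


# Hypothetical labeled training data: (input text, output label).
train_pairs = [
    ("great movie loved it", "pos"),
    ("great acting", "pos"),
    ("boring and awful", "neg"),
    ("awful plot", "neg"),
]

model = train(train_pairs)
print(h(model, "great plot"))    # "great" outweighs "plot"
print(h(model, "awful acting"))  # "awful" outweighs "acting"
```

Because the counts come straight from the labels, a mislabeled training pair pushes words toward the wrong class, which is one way to see why accurate labels matter.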
accuracy
confusion matrix
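A small sketch of how a binary confusion matrix and accuracy can be computed from gold labels and predictions. The function names and the example labels are made up for illustration; "pos" is assumed to be the positive class.

```python
def confusion_counts(gold, predicted, positive="pos"):
    """Return (tp, fp, fn, tn) for a binary classification task."""
    tp = fp = fn = tn = 0
    for g, p in zip(gold, predicted):
        if p == positive and g == positive:
            tp += 1  # predicted positive, actually positive
        elif p == positive:
            fp += 1  # predicted positive, actually negative
        elif g == positive:
            fn += 1  # predicted negative, actually positive
        else:
            tn += 1  # predicted negative, actually negative
    return tp, fp, fn, tn


def accuracy(gold, predicted):
    """Fraction of predictions that match the gold labels."""
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)


# Hypothetical gold labels and system predictions.
gold_labels = ["pos", "pos", "pos", "neg", "neg", "neg", "neg", "neg"]
predictions = ["pos", "pos", "neg", "pos", "neg", "neg", "neg", "neg"]

tp, fp, fn, tn = confusion_counts(gold_labels, predictions)
print("tp, fp, fn, tn =", tp, fp, fn, tn)
print("accuracy =", accuracy(gold_labels, predictions))
```

Accuracy is (tp + tn) divided by the total number of cases, so it can look good on imbalanced data even when few positives are found.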
precision
precise model
might not find all positives, but the ones that the model does classify as positive are very likely to be correct
not precise model
may find a lot of positives, but its selection method is noisy: it wrongly flags many cases as positive that aren't true positives
recall (sensitivity)
model with high recall
finds all or nearly all positive cases, even though it might also wrongly identify some negative cases as positive
model with low recall
misses all or a large share of the positive cases
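The two descriptions above correspond to the usual formulas: precision = tp / (tp + fp) and recall = tp / (tp + fn). The sketch below applies them to two hypothetical models. The counts are invented for illustration, and returning 0.0 when the denominator is zero is a convention chosen here, not something the notes specify.

```python
def precision(tp, fp):
    """Of everything labeled positive, what fraction really is positive?"""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """Of all real positives, what fraction did the model find?"""
    return tp / (tp + fn) if (tp + fn) else 0.0


# Hypothetical counts for two models on the same 20 positive cases.
# Model A is cautious: few false positives, but it misses many positives.
a_tp, a_fp, a_fn = 8, 2, 12
# Model B is eager: it finds most positives, but flags many negatives too.
b_tp, b_fp, b_fn = 18, 18, 2

print("A: precision", precision(a_tp, a_fp), "recall", recall(a_tp, a_fn))
print("B: precision", precision(b_tp, b_fp), "recall", recall(b_tp, b_fn))
```

Model A is the precise model with low recall; model B is the high-recall model with low precision.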
when to use precision vs recall
tuning for high precision
the system should not raise a false alarm: when it labels a case positive, it should be right (false positives are the costly error)
tuning for high recall
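the system should not miss a case: every true positive should be found, even at the cost of some false positives (false negatives are the costly error)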
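One common way to tune for precision or recall, assuming the classifier outputs a score per case, is to move the decision threshold. The sketch below is illustrative only: the scores, gold labels, and the function name `precision_recall_at` are made up, and a real system would tune on held-out data.

```python
def precision_recall_at(threshold, scores, gold):
    """Label a case positive when its score >= threshold, then score the result."""
    tp = fp = fn = 0
    for score, is_positive in zip(scores, gold):
        predicted_positive = score >= threshold
        if predicted_positive and is_positive:
            tp += 1
        elif predicted_positive:
            fp += 1
        elif is_positive:
            fn += 1
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical classifier scores and whether each case is truly positive.
scores = [0.95, 0.90, 0.70, 0.60, 0.40, 0.30, 0.20]
gold_positive = [True, True, False, True, True, False, False]

# A high threshold favors precision; a low threshold favors recall.
print("threshold 0.80:", precision_recall_at(0.80, scores, gold_positive))
print("threshold 0.35:", precision_recall_at(0.35, scores, gold_positive))
```

With these made-up numbers, the strict threshold makes no false positives but misses half the positives, while the lenient threshold finds every positive at the cost of one false positive.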