Types of Classification
1) Content-based classification: at least two classes (a binary classifier) or more than two (a multi-class classifier). Different applications may require different types of classification. The shorter the text, the less evidence there is to go on. This is supervised learning, trained from example documents.
2) Descriptor-based classification: you are given a written description of what the classes of content are. Someone is making a request and there are no example documents yet. Typical of legal discovery ("give me all the company's emails that discuss this product") or a FOIA request. You write a description, perhaps 10 or 30 sentences long, and someone decides what matches that description, so it is classification by description rather than a search task. There is less published work today on descriptor-based text classification.
Subject-based Classification
Two Approaches to Subject-Based Automated Document Classification:
1) Multinomial Naive Bayes classifiers
2) SVM-based classifiers
Naive Bayes Models
The class is the category being predicted. The predictors are the attributes or features used to make the prediction. Naive Bayes calculates the posterior probability of the class given the predictors.
Prior Probability
Prior probability of the class: global distribution of individuals into that class. Look at the past.
Prior probability of the predictor: global distribution of the individuals into that predictor.
Posterior Probability
Posterior probability of the predictor: probability of observing the predictor attributes given the class.
Posterior probability of the class: probability of falling into the target class after an observation, such as the term frequencies of a document. The observation is additional information that is taken into consideration.
Prior and Posterior
Before and after an observation
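The prior/posterior distinction can be made concrete with a tiny worked example of Bayes' rule. The counts below (documents, word frequencies) are illustrative assumptions, not numbers from the lecture:

```python
# Minimal sketch of Bayes' rule for one class and one predictor.
# All counts are made up for illustration.

# Prior probability of the class: global distribution of past documents.
n_sports, n_total = 30, 100
p_sports = n_sports / n_total                # P(sports) = 0.3

# Likelihood: how often the predictor (the word "goal") appears per class.
p_word_given_sports = 20 / 30                # "goal" in 20 of 30 sports docs
p_word_given_other = 7 / 70                  # "goal" in 7 of 70 other docs

# Prior probability of the predictor: marginal over both classes.
p_word = (p_word_given_sports * p_sports
          + p_word_given_other * (1 - p_sports))

# Posterior probability of the class, after the observation.
p_sports_given_word = p_word_given_sports * p_sports / p_word
print(round(p_sports_given_word, 3))         # 0.741
```

Seeing the word "goal" raises the probability of the sports class from the 0.3 prior to a 0.741 posterior; that shift is exactly the "before and after an observation" idea.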
SVMs
Features are weighted term frequencies (TF-IDF).
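A bare-bones TF-IDF computation looks like the following (the three-document corpus is an illustrative assumption; real systems use library implementations with smoothing and normalization):

```python
import math

# Toy corpus for illustration only.
docs = [
    "the game ended in a goal",
    "the market closed higher",
    "a late goal won the game",
]
tokenized = [d.split() for d in docs]
N = len(tokenized)

def tf_idf(term, doc_tokens):
    """TF = term frequency in the doc; IDF = log(N / doc frequency)."""
    tf = doc_tokens.count(term) / len(doc_tokens)
    df = sum(1 for toks in tokenized if term in toks)
    idf = math.log(N / df)
    return tf * idf

# "goal" appears in 2 of 3 docs, so it gets a positive weight;
# "the" appears in all 3 docs, so its IDF (and weight) is 0.
print(tf_idf("goal", tokenized[0]))
print(tf_idf("the", tokenized[0]))    # 0.0
```

This is why common words like "the" contribute nothing as features: their document frequency equals the corpus size, driving the IDF term to zero.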
Architecting a Classification System
1) Training docs. We have example documents that are sports news/not sports news
2) Sentence tokenizer
3) Word tokenizer
4) Stemmer
5) Possibly throw out stopwords, expand contractions
6) Feature extraction to pull out features (the word "the" should not be a training feature)
7) Run it through a classifier (NB, SVM, etc.) to build a model
8) Test docs.
9) Normalize (follow 2 - 6 above)
10) Run it through the model and get a prediction
11) Compare to actual (model performance evaluation)
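The steps above can be sketched end to end as a toy pipeline. The training docs, stopword list, and regex tokenizer below are illustrative assumptions (and a real pipeline would also include a stemmer and a proper evaluation step, which are skipped here):

```python
import math
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "in", "on", "of", "and", "to"}

def normalize(text):
    """Steps 2-6: tokenize, lowercase, drop stopwords (no stemmer here)."""
    words = re.findall(r"[a-z']+", text.lower())
    return [w for w in words if w not in STOPWORDS]

# Step 1: toy training docs (sports news / not sports news).
train = [
    ("the team scored a late goal in the match", "sports"),
    ("the coach praised the players after the game", "sports"),
    ("the market rallied and stocks closed higher", "other"),
    ("the company reported strong quarterly earnings", "other"),
]

# Step 7: build a multinomial Naive Bayes model with Laplace smoothing.
class_docs = Counter(label for _, label in train)
word_counts = {label: Counter() for label in class_docs}
for text, label in train:
    word_counts[label].update(normalize(text))
vocab = {w for counts in word_counts.values() for w in counts}

def predict(text):
    """Steps 9-10: normalize the test doc, then score it against the model."""
    tokens = normalize(text)
    best_label, best_score = None, -math.inf
    for label in class_docs:
        score = math.log(class_docs[label] / len(train))   # log prior
        total = sum(word_counts[label].values())
        for w in tokens:
            score += math.log((word_counts[label][w] + 1) / (total + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

print(predict("the goal came late in the game"))   # "sports"
```

Step 11 would then compare such predictions against held-out labels to evaluate the model.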
Descriptor-Based Classifiers - Case 1
Two-phase process:
1) Information Retrieval problem
The user enters a robust description of the desired class of documents. Pull chunks out of the description and use them to do keyword searches. Examine all the docs and identify strong hits. A strong hit is affected by the number of TF-IDF keywords found, phrase matches, the keyword sentence density of the document, proximity of keywords, and keywords found in the title, subtitle, URL, filenames, image captions or alt tags, metadata keywords, hyperlinks, and the breadcrumb trail.
2) Content-Based Classification
Train, e.g., an SVM on the strong-hit documents
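A toy scorer for phase 1 might combine a few of the listed signals like this. The weights, the substring title match, and the density scaling are all assumptions for illustration; the lecture only names which signals matter, not how to weight them:

```python
# Toy "strong hit" scorer (signal weights are illustrative assumptions).
def hit_score(doc, keywords):
    body = doc["body"].lower().split()
    found = [k for k in keywords if k in body]
    score = len(found)                                       # keywords found
    score += 2 * sum(k in doc["title"].lower() for k in keywords)  # title hits
    if body:
        score += len(found) / len(body) * 10                 # keyword density
    return score

doc = {"title": "Championship game recap",
       "body": "the team won the game"}
print(hit_score(doc, ["game", "team", "score"]))
```

Documents scoring above some cutoff become the "strong hits" used as positive training examples in phase 2.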
Descriptor-Based Classifiers - Case 2
Instead of a paragraph of description we could be given an empty taxonomy: e.g., news to be classified into sports, business, and politics, but with no training data carrying those labels.
-Get queries out of the taxonomy, because that's all we have. Create Boolean queries: sports AND NOT (business OR politics), etc.
Works when the taxonomy fits the data well
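The Boolean-query idea can be sketched as follows; the seed keywords per taxonomy node are hypothetical, since in this scenario the taxonomy labels are all we actually have:

```python
# Sketch: turn an empty taxonomy into mutually exclusive Boolean queries.
# The keyword sets are hypothetical seeds, not real training data.
taxonomy = {
    "sports": {"game", "team", "score"},
    "business": {"market", "stocks", "earnings"},
    "politics": {"election", "senate", "vote"},
}

def boolean_label(text):
    """Label a doc only if exactly one class matches, i.e.
    "sports AND NOT (business OR politics)"; else leave it unlabeled."""
    tokens = set(text.lower().split())
    hits = [label for label, kws in taxonomy.items() if tokens & kws]
    return hits[0] if len(hits) == 1 else None

print(boolean_label("the team won the game"))          # "sports"
print(boolean_label("senate vote moves the market"))   # None (ambiguous)
```

Documents that match exactly one query become pseudo-labeled training data, which is why the approach only works when the taxonomy fits the collection well.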
Multi-class vs. multi-label
Multi-class: multiple classes, mutually exclusive, so each observation gets exactly one label.
Multi-label: assigning multiple classes to a single observation.
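The distinction shows up in how labels are encoded. A common representation (a binary indicator vector per document, one slot per class; the class names here are the taxonomy from above) is:

```python
# Multi-class: exactly one slot is 1. Multi-label: any number of slots may be 1.
def to_label_vector(labels, classes=("sports", "business", "politics")):
    """Binary indicator vector, one position per class (one-vs-rest style)."""
    return [1 if c in labels else 0 for c in classes]

print(to_label_vector({"sports"}))              # [1, 0, 0]  multi-class
print(to_label_vector({"sports", "business"}))  # [1, 1, 0]  multi-label
```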
Chimera
Large-scale classification using machine learning, rules, and crowdsourcing
How do you add a new rule when you have millions of customers? You can't take the system offline, so you add a band-aid fix: a line of code or a fuzzy mapping, e.g., if you see a new "Uber driver" product, create a category on the fly. This is a rule-based system. Set a threshold, e.g., after 10 or 15 accumulated rules, update and retrain the model.
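The band-aid pattern can be sketched as rules that override a live model until a retraining threshold is hit. The threshold value, rule format, and class below are illustrative assumptions, not the actual Chimera design:

```python
# Sketch: serve predictions from the live model, patch mistakes with rules,
# and flag a retrain once enough rules pile up. All names are hypothetical.
RULE_RETRAIN_THRESHOLD = 10

class PatchedClassifier:
    def __init__(self, model):
        self.model = model          # callable: text -> label (the live model)
        self.rules = []             # (substring, label) band-aid fixes

    def add_rule(self, substring, label):
        """Add a fix without taking the system offline."""
        self.rules.append((substring.lower(), label))
        return len(self.rules) >= RULE_RETRAIN_THRESHOLD  # time to retrain?

    def predict(self, text):
        for substring, label in self.rules:   # rules win over the model
            if substring in text.lower():
                return label
        return self.model(text)

clf = PatchedClassifier(model=lambda text: "misc")
clf.add_rule("uber driver", "rideshare")      # category created on the fly
print(clf.predict("new Uber driver signup"))  # "rideshare"
```

Once `add_rule` reports the threshold is reached, the accumulated rule outputs are folded back into the training data and the model is retrained.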
Product Title Classification versus Text Classification
Symbolic vs. Neural approaches for Product Search
-Neural IR is the future, but right now the best-performing systems are symbolic IR. In the next two years we could transition to neural IR.
The NLP pipeline is fragile: it has many steps, and a breakdown in any one step causes problems downstream.
-Image encoding can be used for things like “t-shirt logo on back”. This could be combined with text encoding.
-Get training set through user interactions such as clicks, add to wishlist, and purchase
-Zalando.com example