What is NLP?
Natural language processing (NLP): a field that produces machine-driven analyses of text.
Why is NLP a hard problem?
Language is ambiguous; different people may interpret the same text differently.
Applications of NLP (amn)
What is a corpus?
A collection of written texts that serves as a dataset.
What are a token and tokenization?
Token: a string of contiguous characters between two spaces; it can also be an integer, a real number, or a number with a colon (such as a time).
Tokenization: converting text to tokens.
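A minimal sketch of the definition above, assuming whitespace is the only separator (real tokenizers also handle punctuation). The function name `tokenize` and the sample sentence are illustrative, not part of these notes.

```python
def tokenize(text):
    """Split text into tokens on runs of whitespace."""
    return text.split()


# Numbers, reals, and times survive as single tokens.
tokens = tokenize("The train leaves at 10:30 and costs 3.5 euros")
print(tokens)
```

Note that "10:30" and "3.5" each stay one token, which matches the "number with a colon" and "real" cases in the definition.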
What is text preprocessing, and what are its 3 steps?
Raw text data is not analyzable without preprocessing. The three steps are:
- Noise removal
- Lexicon normalization
- Object standardization
What is noise removal?
Removal of all noisy entities in the text, that is, anything not relevant to the data.
What are stopwords?
Common words such as "is" and "am".
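Stopword removal can be sketched as filtering tokens against a list of common words. The `STOPWORDS` set below is a tiny hypothetical list chosen for illustration; real lists are much longer.

```python
# Hypothetical stopword list, for illustration only.
STOPWORDS = {"is", "am", "are", "a", "the"}


def remove_stopwords(text):
    """Lowercase, split on whitespace, and drop any token in STOPWORDS."""
    tokens = text.lower().split()
    return [t for t in tokens if t not in STOPWORDS]


cleaned = remove_stopwords("I am sure the sky is blue")
print(cleaned)
```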
What is a general approach to noise removal?
What is lexicon normalization?
Converts all the disparate forms of a word into one normalized form.
Converts a high-dimensional feature space (many word variants) into a low-dimensional one.
e.g. player, played -> play
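The dimensionality claim can be made concrete by comparing vocabulary size before and after normalization. This is only a sketch: `NORMAL_FORMS` is a hypothetical hand-written lookup table, not a real normalizer.

```python
# Hypothetical lookup table mapping word variants to a normal form.
NORMAL_FORMS = {"player": "play", "played": "play", "plays": "play", "playing": "play"}


def normalize(tokens):
    """Replace each token with its normal form if one is known."""
    return [NORMAL_FORMS.get(t, t) for t in tokens]


tokens = ["play", "player", "played", "plays", "playing"]
raw_vocab = set(tokens)                    # one feature per distinct surface form
normalized_vocab = set(normalize(tokens))  # one feature per normal form
print(len(raw_vocab), "->", len(normalized_vocab))
```

Five distinct surface forms collapse to a single feature, which is the high-to-low dimensionality reduction described above.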
What are the most common normalization practices?
Stemming and lemmatization
What is lemmatization?
Reduces a word to its root, i.e. its dictionary headword form (lemma).
am, are, is -> be
car, cars, car’s -> car
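Because lemmatization maps to a dictionary headword, it can handle irregular forms like "is" -> "be" that no suffix rule would catch. A minimal sketch, assuming a tiny hand-written dictionary (`LEMMAS` is hypothetical; real lemmatizers use a full vocabulary and morphological analysis):

```python
# Hypothetical lemma dictionary covering only the examples above.
LEMMAS = {
    "am": "be", "are": "be", "is": "be",
    "cars": "car", "car's": "car", "car’s": "car",
}


def lemmatize(word):
    """Look up the headword; fall back to the lowercased word itself."""
    word = word.lower()
    return LEMMAS.get(word, word)


print([lemmatize(w) for w in ["am", "are", "is", "car", "cars", "car's"]])
```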
What are morphemes?
Small meaningful units that make up words.
What is stemming?
Stemming is a rudimentary rule-based process that strips suffixes from a word.
- automate(s), automatic, automation are all reduced to automat
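The rule-based idea can be sketched with a toy suffix stripper. The suffix list below is an assumption picked only to cover the example above; it is not the Porter stemmer or any library's rule set.

```python
# Toy suffix rules, checked in order (longer suffixes first).
SUFFIXES = ("ion", "ic", "es", "e", "s")


def stem(word):
    """Strip the first matching suffix, keeping at least 3 characters of stem."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


words = ["automate", "automates", "automatic", "automation"]
stems = [stem(w) for w in words]
print(stems)
```

Unlike a lemma, the result ("automat") does not have to be a real dictionary word.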
Other text preprocessing steps (examples)
encoding-decoding noise
grammar checker
spelling correction
What are text-to-features techniques used for, and what are the techniques? (SESW)
What is syntactical parsing, what does it involve, and what are its important attributes?
What is dependency grammar?
What is POS (part-of-speech) tagging?
Describe the POS tagging problem
- Words often have more than one possible POS, so the right tag must be chosen from context
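The ambiguity can be illustrated with a toy lexicon that lists every tag a word may take. `POSSIBLE_TAGS` and its tag names are hypothetical; a real tagger would then use context to choose one tag per occurrence.

```python
# Hypothetical lexicon: each word maps to the set of tags it can take.
POSSIBLE_TAGS = {
    "book": {"NOUN", "VERB"},
    "play": {"NOUN", "VERB"},
    "flight": {"NOUN"},
}


def ambiguous_words(tokens):
    """Return the tokens that have more than one possible tag in the lexicon."""
    return [t for t in tokens if len(POSSIBLE_TAGS.get(t, ())) > 1]


ambiguous = ambiguous_words("please book my flight".split())
print(ambiguous)
```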
Where can POS tagging be used? (WINE)
Word sense disambiguation (e.g. "book" as a noun vs. as a verb)
Improving word-based features
Normalization and lemmatization
Efficient stopword removal
What are the most important chunks of a sentence?
Which algorithms are generally ensemble models of rule-based parsing, etc.?
Entities, Entity Detection algorithms
What is Named Entity Recognition (NER)?
What are the three blocks of NER? (NPE)