Define natural language processing
The goal is to make machines understand and interpret human language the way it is written or spoken.
What are the different levels of linguistic analysis used in NLP
What does NLU do
tries to understand the meaning of a given text; for NLU, the nature and structure of each word in the text
must be known
What are applications of natural language understanding
What is sentence segmentation
identify sentence boundaries in the given text, i.e., where one sentence ends and
where another begins; the end of a sentence is often marked by the punctuation mark ‘.’
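A minimal sketch of punctuation-based sentence segmentation. The function name split_sentences and the rule "split after ‘.’, ‘!’ or ‘?’ followed by whitespace" are assumptions for illustration; a rule this simple will wrongly split after abbreviations such as ‘Dr.’.

```python
import re

def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' when followed by whitespace."""
    # The lookbehind keeps the punctuation mark attached to its sentence.
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

sentences = split_sentences("I like tea. Do you? Yes!")
print(sentences)
```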
What is tokenisation
identify the individual words, numbers, and punctuation symbols in the text
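A rough sketch of tokenisation with a regular expression. The function name tokenize and the pattern are illustrative assumptions; real tokenisers handle contractions, hyphens, and decimals with more care.

```python
import re

def tokenize(text):
    # \w+ matches a run of letters, digits, or underscores (a word or number);
    # [^\w\s] matches a single punctuation symbol.
    return re.findall(r"\w+|[^\w\s]", text)

tokens = tokenize("I paid 20 dollars, really!")
print(tokens)
```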
What is stemming
strip the endings of words, e.g. ‘eating’ is reduced to ‘eat’
What is part of speech (POS) tagging
assign each word in a sentence its own part-of-speech tag, for example marking a
word as a noun or an adverb
What is named entity recognition
identify entities such as persons, locations, and times within the documents
What is coreference (discourse) resolution
determine how a given word in a sentence relates to the previous and
next sentences, for example which earlier entity a pronoun refers to
What does stemming do
try to find the base form of words
Stemming usually refers to a crude heuristic process that chops off the ends of words
in the hope of achieving this goal correctly most of the time, and often includes the removal
of derivational affixes.
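To make the "crude heuristic" concrete, here is a toy suffix-stripping stemmer. The suffix list and the minimum stem length of 3 are hypothetical choices for illustration, not a standard algorithm such as Porter's.

```python
# Hypothetical, illustrative suffix list; checked in this order.
SUFFIXES = ("ing", "ed", "es", "s")

def crude_stem(word):
    """Chop off the first matching suffix, keeping at least 3 characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for w in ["eating", "walked", "cats", "running", "sing"]:
    print(w, "->", crude_stem(w))
```

Note that ‘running’ becomes ‘runn’, which is not a real word; this is the kind of error a heuristic stemmer accepts in exchange for simplicity.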
What does a lemmatiser do
Lemmatization usually refers to doing things properly with the use of a vocabulary and
morphological analysis of words, normally aiming to remove inflectional endings only
and to return the base or dictionary form of a word, which is known as the lemma.
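A toy sketch of the vocabulary-lookup side of lemmatization. The table LEMMA_TABLE, its entries, and the part-of-speech labels are hypothetical; a real lemmatiser would use a large dictionary together with morphological rules.

```python
# Hypothetical toy vocabulary: (word, part of speech) -> lemma
LEMMA_TABLE = {
    ("ate", "VERB"): "eat",
    ("eating", "VERB"): "eat",
    ("mice", "NOUN"): "mouse",
    ("better", "ADJ"): "good",
}

def lemmatize(word, pos):
    """Look up the dictionary form; fall back to the lowercased word itself."""
    key = (word.lower(), pos)
    return LEMMA_TABLE.get(key, word.lower())

print(lemmatize("ate", "VERB"))
print(lemmatize("Mice", "NOUN"))
```

Unlike a suffix-stripping stemmer, a lookup like this can map irregular forms such as ‘ate’ and ‘mice’ to their lemmas.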
What are stop words
Stop words usually refer to the most common words in a language, such as “and”, “the”, and “a”,
but there is no single universal list of stop words. The list of stop words can change
depending on your application.
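A small sketch of stop-word filtering. The set STOP_WORDS holds only the three example words from the notes and is an assumption; in practice the list is chosen per application.

```python
# Illustrative stop-word list; replace with one suited to the application.
STOP_WORDS = {"and", "the", "a"}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Keep only tokens whose lowercased form is not a stop word."""
    return [t for t in tokens if t.lower() not in stop_words]

print(remove_stop_words(["The", "cat", "and", "a", "dog"]))
```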
What is part of speech
The part of speech explains how a word is used in a sentence. There are eight main parts of speech - nouns,
pronouns, adjectives, verbs, adverbs, prepositions, conjunctions and interjections.
What is bag of words
Any information about the order or structure of words is discarded. That’s why it’s called
a bag of words. The model only records whether a known word occurs in a
document, not where in the document it occurs.
The intuition is that similar documents have similar content. From the content alone, we can
also learn something about the meaning of the document.
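A sketch showing that word order is discarded. Storing word counts (rather than only presence or absence) and splitting on whitespace are assumptions made here for simplicity.

```python
from collections import Counter

def bag_of_words(text):
    """Map each lowercased word to how many times it occurs; order is lost."""
    return Counter(text.lower().split())

a = bag_of_words("the dog bit the man")
b = bag_of_words("the man bit the dog")
print(a == b)  # the two sentences differ only in word order
```

The two sentences mean different things, yet they produce the same bag, which is exactly the information the model gives up.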
What is Term Frequency (TF)
a scoring of the frequency of the word in the current document.
This part measures how often a word appears in a document.
The more frequently a word appears, the higher its TF score for that document.
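A sketch of one common TF variant: the count of the word divided by the document length. The normalisation by length is an assumption; raw counts or log-scaled counts are also used.

```python
from collections import Counter

def term_frequency(term, document_tokens):
    """Fraction of the document's tokens that are equal to `term`."""
    if not document_tokens:
        return 0.0
    return Counter(document_tokens)[term] / len(document_tokens)

tokens = "the cat sat on the mat".split()
print(term_frequency("the", tokens))  # 2 occurrences out of 6 tokens
```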
What is Inverse Document Frequency (IDF)
a scoring of how rare the word is across documents.
This part measures how unique or rare a word is across multiple documents in the corpus.
The rarer the word, the higher its IDF score.
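A sketch of one smoothed IDF variant, log((1 + N) / (1 + df)) + 1, where N is the number of documents and df is the number of documents containing the word. The smoothing and the representation of each document as a set of tokens are assumptions; other formulas exist.

```python
import math

def inverse_document_frequency(term, documents):
    """Smoothed IDF: higher for terms that appear in fewer documents."""
    n_docs = len(documents)
    n_containing = sum(1 for doc in documents if term in doc)
    return math.log((1 + n_docs) / (1 + n_containing)) + 1

docs = [{"the", "cat"}, {"the", "dog"}, {"the", "bird"}]
print(inverse_document_frequency("the", docs))  # appears in every document
print(inverse_document_frequency("cat", docs))  # appears in one document
```

The word ‘the’ occurs in every document and gets the lowest score, while ‘cat’ occurs in only one and scores higher.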
What does N-Gram do
The basic idea underlying the statistical approach to word prediction is to use the probabilities of sequences of
words to choose the most likely next word.
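A minimal sketch of next-word prediction from counts of two-word sequences (a bigram model, i.e. N = 2). The function names, the tiny corpus, and the choice of N = 2 are assumptions for illustration.

```python
from collections import Counter, defaultdict

def train_bigrams(tokens):
    """For each word, count which words follow it in the token sequence."""
    following = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        following[prev][nxt] += 1
    return following

def predict_next(following, word):
    """Return the most frequent follower of `word`, or None if unseen."""
    if word not in following:
        return None
    return following[word].most_common(1)[0][0]

tokens = "i like tea . i like coffee . i like tea .".split()
model = train_bigrams(tokens)
print(predict_next(model, "like"))  # 'tea' follows 'like' twice, 'coffee' once
```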
What is the Markov assumption
only prior local context – the last few words – affects the next word
* making the Markov assumption for word prediction means assuming
that the probability of a word only depends on the previous N-1 words
(N-GRAM model)
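Written as a formula, the assumption replaces the full history with the last N-1 words. The symbols w_1, ..., w_n for the words of the sequence are notation introduced here, not taken from the notes.

```latex
P(w_n \mid w_1, \dots, w_{n-1}) \approx P(w_n \mid w_{n-N+1}, \dots, w_{n-1})
```

For a bigram model (N = 2) the right-hand side reduces to P(w_n | w_{n-1}): only the immediately preceding word is used.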