corpus
properties of corpora
markup (formatting)
common formats for structuring linguistic data
–> e.g., XML, CoNLL, JSON
preprocessing
transforming text into a useful format
–> e.g., tokenization
splitting sentences into words/tokens
tokenization
each word, including punctuation marks, is treated as a separate token and assigned a sequential position
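a minimal tokenization sketch (assuming a simple regex-based approach, not a full NLP tokenizer): words and punctuation become separate tokens, each with a sequential position

```python
import re

def tokenize(sentence):
    # \w+ captures word characters as one token; [^\w\s] captures each
    # punctuation mark as its own token
    return re.findall(r"\w+|[^\w\s]", sentence)

tokens = tokenize("The cat sat, then slept.")
# enumerate assigns each token its sequential position
positions = list(enumerate(tokens))
```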
stemming + process
consider words within sentences that share a common root but appear in different forms
reduce word to base/root form
–> this enables us to treat them as a single unit for analysis
process:
1. tokenize sentences
2. use stemming to reduce words to common stems
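the two-step process above can be sketched as follows (a toy suffix-stripping stemmer for illustration; a real stemmer such as Porter's applies many more rules)

```python
import re

# toy suffix list; order matters so longer suffixes are tried first
SUFFIXES = ["ing", "ed", "es", "s"]

def stem(word):
    # strip the first matching suffix, keeping at least a 3-letter stem
    for suf in SUFFIXES:
        if word.endswith(suf) and len(word) - len(suf) >= 3:
            return word[: -len(suf)]
    return word

def stem_sentence(sentence):
    # step 1: tokenize the sentence
    tokens = re.findall(r"\w+|[^\w\s]", sentence)
    # step 2: reduce each token to its stem
    return [stem(t) for t in tokens]
```

with this sketch, "jumping", "jumped", and "jumps" all reduce to the single unit "jump", so they can be counted together in analysis.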
lemmatization
reduce words to common lemmas
takes context into account to convert words to their base dictionary form
results in more linguistically correct forms than stemming
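a minimal sketch of dictionary-based lemmatization (the lemma table here is a toy assumption; real lemmatizers use full dictionaries plus part-of-speech context): unlike a stemmer, it can map irregular forms like "better" to "good"

```python
# toy lemma dictionary; a real lemmatizer's lookup is far larger
LEMMAS = {
    "better": "good",   # irregular form a suffix-stripper cannot handle
    "ran": "run",
    "running": "run",
    "mice": "mouse",
}

def lemmatize(word):
    # dictionary lookup; fall back to the word itself if unknown
    return LEMMAS.get(word, word)
```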
bag of words
text is represented by the frequency of its words, disregarding grammar and word order
word occurrences are used as features for analysis
simplifies text into numerical data
BOW process
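a minimal bag-of-words sketch (assuming simple whitespace tokenization): the text is reduced to word counts, discarding grammar and order

```python
from collections import Counter

def bag_of_words(text):
    # lowercase and split into tokens, then count occurrences;
    # word order and grammar are discarded
    tokens = text.lower().split()
    return Counter(tokens)

bow = bag_of_words("the cat sat on the mat")
# the resulting counts serve as numerical features, e.g. bow["the"] is 2
```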
n-gram
n-gram probabilities
calculation (relative-frequency / maximum-likelihood estimate): count of the n-gram ÷ total number of n-grams in the corpus; for a conditional probability, divide instead by the count of the (n−1)-token prefix
words in a corpus can appear multiple times; these repeated occurrences are exactly what the counts, and therefore the estimated probabilities, are based on
the n-gram model should therefore define a probability distribution over all possible token sequences built from the corpus vocabulary
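a minimal sketch of estimating bigram probabilities from counts (toy corpus for illustration): repeated words directly affect the counts and hence the probabilities

```python
from collections import Counter

corpus = ["the", "cat", "sat", "on", "the", "mat"]  # toy token list

unigrams = Counter(corpus)                 # "the" occurs twice
bigrams = Counter(zip(corpus, corpus[1:])) # adjacent token pairs

def p_bigram(w1, w2):
    # MLE of P(w2 | w1): count of the bigram divided by the count
    # of its one-word prefix
    return bigrams[(w1, w2)] / unigrams[w1]
```

because "the" appears twice but is followed by "cat" only once, P(cat | the) comes out as 1/2.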
what do n-gram models evaluate
the probability of a sequence of n tokens
this is useful for determining the likelihood of specific word sequences and is represented by the joint probability
chain rule
product of conditional probabilities
indicates that the probability of the entire sequence is the product of the individual probabilities of each word given all the preceding words
this allows us to systematically compute the probability of a sequence by considering one word at a time, given the context of all previous words in the sequence
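the chain rule can be sketched in code; the full rule conditions each word on all preceding words, and this toy example uses the common bigram (Markov) approximation where each word depends only on its immediate predecessor

```python
from collections import Counter

corpus = ["the", "cat", "sat", "on", "the", "mat"]  # toy token list

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
total = len(corpus)

def sequence_probability(seq):
    # chain rule with a bigram approximation:
    # P(w1..wn) ~= P(w1) * product of P(wi | w(i-1))
    p = unigrams[seq[0]] / total
    for w1, w2 in zip(seq, seq[1:]):
        p *= bigrams[(w1, w2)] / unigrams[w1]
    return p
```

this computes the joint probability one word at a time: P(the) · P(cat | the) · P(sat | cat) = (2/6) · (1/2) · (1/1) = 1/6.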