Text Preprocessing
Break up the text:
Sentence Segmentation
Treat each sentence as a token. A period, exclamation point, or question mark can be used to segment sentences. Periods in abbreviations need special handling so they aren't mistakenly treated as sentence breaks; sentence tokenizers take these into account.
-The use case determines whether sentence segmentation is needed or whether you can go straight to word tokenization
-About 95% of use cases can segment sentences with three rules:
1) A period ends a sentence,
2) unless the preceding token is in a hand-compiled list of abbreviations,
3) but if the next token is capitalized, the period ends the sentence anyway.
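A minimal sketch of those three rules in Python. The abbreviation list here is illustrative only; a real list would be much longer:

```python
# Hand-compiled abbreviation list (illustrative, not exhaustive).
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "e.g.", "i.e.", "etc."}

def segment_sentences(text):
    """Split text on '.', '!', '?' using the three heuristic rules above."""
    tokens = text.split()
    sentences, current = [], []
    for i, tok in enumerate(tokens):
        current.append(tok)
        if tok.endswith(("!", "?")):
            sentences.append(" ".join(current))
            current = []
        elif tok.endswith("."):
            # Rule 2: a known abbreviation suppresses the break...
            is_abbrev = tok.lower() in ABBREVIATIONS
            # Rule 3: ...unless the next token is capitalized.
            next_capitalized = i + 1 < len(tokens) and tokens[i + 1][0].isupper()
            if not is_abbrev or next_capitalized:
                sentences.append(" ".join(current))
                current = []
    if current:
        sentences.append(" ".join(current))
    return sentences
```

Note this is exactly the 95% heuristic: it will still break after an abbreviation when the following word happens to be capitalized (e.g., a proper noun after "Dr.").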
Sentence tokenizers: Solr StandardTokenizer, OpenNLP english.Tokenizer, OpenNLP SimpleTokenizer
Word Tokenization
Once sentences have been segmented, words can be tokenized. Contractions like "didn't" should be kept as single tokens rather than split apart. Punctuation shouldn't be included in tokens, e.g., "jumping," should become "jumping". You end up with a Bag of Words (BoW), from which you can compute word frequencies.
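A small sketch of word tokenization plus a Bag of Words count, using a regex that strips punctuation but keeps contractions whole:

```python
import re
from collections import Counter

def tokenize_words(sentence):
    """Lowercase, drop punctuation, keep contractions like "didn't" intact."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", sentence.lower())

# Bag of Words: token -> frequency
bow = Counter(tokenize_words("The dog didn't stop jumping, jumping over the dog."))
```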
Text Normalization
This involves cleaning or adjusting text so that it can be compared to other documents. Some examples include:
Function words
Function words are a closed class and lack content; they connect content words. They don't answer the 6 Ws (who, what, why, when, where, how). Examples: is, am, are, was, were, he, she, you, we, they, if, then, therefore, possibly. "Y'all" was the last new function word. The default list of stop words is the function words of a language, but it depends on the case at hand; content words can also be included as stop words.
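Stop-word removal can be sketched as a simple set filter. The list below is a tiny illustrative subset; real stop-word lists are much longer and, as noted above, should be adapted to the use case:

```python
# Tiny illustrative stop-word list (function words only).
STOP_WORDS = {"is", "am", "are", "was", "were", "he", "she", "you", "we",
              "they", "if", "then", "the", "a", "an", "of", "to"}

def remove_stop_words(tokens):
    """Drop stop words, case-insensitively."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]
```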
Content Words
Content words are newly formed all the time and are considered an open class. There are many more content words than function words.
Handling misspellings
Two fundamental approaches:
Special short-circuits: some simple corrections can be made before invoking a spell corrector, like collapsing repeated characters ("mostllly") or fixing a space accidentally inserted inside a word.
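The repeated-character short-circuit can be sketched with a regex: collapse any run of three or more identical characters down to two and to one, and let a later dictionary check (not shown) pick the valid candidate:

```python
import re

def repeat_candidates(word):
    """For runs of 3+ identical characters, propose collapsing to 2 and to 1.
    A dictionary lookup would then choose the valid spelling."""
    two = re.sub(r"(.)\1{2,}", r"\1\1", word)  # "mostllly" -> "mostlly"
    one = re.sub(r"(.)\1{2,}", r"\1", word)    # "mostllly" -> "mostly"
    return [two, one]
```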
Stemming
Break a word into morphemes (the morphemes of "cats" are "cat" and "s"): a stem plus affixes such as prefixes and suffixes. In "reimagining", "imagine" is the stem and "re-" is an affix. Affixes can't be words by themselves.
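A naive suffix-stripping stemmer sketch; real stemmers (e.g., Porter) apply far more careful rules, and stems need not be dictionary words:

```python
def simple_stem(word):
    """Strip a common suffix, keeping at least a 3-character stem."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word
```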
Low-level document feature extraction
Primary features:
Secondary features:
Terminology Extraction
- Get collocations (bigrams: two words; trigrams: three; or n-grams generally)
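A minimal bigram-extraction sketch: frequent adjacent pairs are collocation candidates.

```python
from collections import Counter

def bigrams(tokens):
    """Adjacent word pairs."""
    return list(zip(tokens, tokens[1:]))

tokens = "new york is not like new york city".split()
top = Counter(bigrams(tokens)).most_common(1)  # most frequent pair
```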
Differential Frequency Analysis
-Analyze frequency comparing to other texts (turtle example)
Term Frequency Inverse Document Frequency (TF-IDF)
This is a way to determine a word's importance in a document relative to how common it is across the whole collection: multiply term frequency by inverse document frequency.
If a term is rare across documents, its score is bigger; if a word is common everywhere, its score is smaller. The higher the number, the more distinctive that word is for the document.
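A from-scratch sketch of the basic formulation (documents as token lists; variants exist, e.g., smoothed IDF):

```python
import math

def tf_idf(term, doc, corpus):
    """tf-idf = (term frequency in this doc) x log(N / docs containing term)."""
    tf = doc.count(term) / len(doc)
    df = sum(1 for d in corpus if term in d)
    idf = math.log(len(corpus) / df)  # assumes the term appears somewhere
    return tf * idf
```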
Lexical Diversity
How broad the vocabulary is compared to the length of the document: the same words over and over versus a varied vocabulary. Useful for comparing one document to others.
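One common measure is the type-token ratio, distinct words over total words:

```python
def lexical_diversity(tokens):
    """Type-token ratio: distinct words / total words (1.0 = no repetition)."""
    return len(set(tokens)) / len(tokens)
```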
Readability Formula
Automatic Tagging
Tag depends on the word and its context within a sentence.
-Lookup Tagger: find the x most frequent words and store their most likely tag. If a word isn't in the table, it falls back to a default tagger.
-N-gram tagging: considered the standard way to tag.
Context is the current word together with the POS tags of the n-1 preceding tokens (e.g., n = 3).
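A sketch of the lookup tagger with default-tag backoff. The training data format (word, tag pairs) and the "NN" default are assumptions for illustration:

```python
from collections import Counter, defaultdict

def build_lookup_tagger(tagged_words, top_n=100, default_tag="NN"):
    """Store the most likely tag for the top_n most frequent words;
    every other word backs off to the default tag."""
    word_counts = Counter(w for w, _ in tagged_words)
    tag_counts = defaultdict(Counter)
    for w, t in tagged_words:
        tag_counts[w][t] += 1
    table = {w: tag_counts[w].most_common(1)[0][0]
             for w, _ in word_counts.most_common(top_n)}
    return lambda w: table.get(w, default_tag)
```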