Text Preprocessing
Break up the text:
Sentence Segmentation
Treat each sentence as a token. A period, exclamation point, or question mark can be used to segment sentences. Periods in abbreviations need special handling so they aren't mistakenly treated as sentence breaks; sentence tokenizers take these into account.
-The use case determines whether sentence segmentation is needed or whether you can go straight to word tokenization
-About 95% of use cases can segment sentences with three rules:
1) A period ends a sentence,
2) unless the preceding token is in a hand-compiled list of abbreviations,
3) but if the next token is capitalized, the period ends the sentence anyway.
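A minimal sketch of those three rules in Python. The abbreviation list here is illustrative only; a real list would be much longer:

```python
# Hand-compiled abbreviation list (illustrative, not exhaustive).
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "e.g.", "i.e.", "etc."}

def segment_sentences(text):
    """Split text on '.', '!', '?' using the three heuristic rules above."""
    tokens = text.split()
    sentences, current = [], []
    for i, tok in enumerate(tokens):
        current.append(tok)
        if tok.endswith(("!", "?")):
            sentences.append(" ".join(current))
            current = []
        elif tok.endswith("."):
            # Rule 2: a known abbreviation suppresses the break...
            is_abbrev = tok.lower() in ABBREVIATIONS
            # Rule 3: ...unless the next token is capitalized.
            next_capitalized = i + 1 < len(tokens) and tokens[i + 1][0].isupper()
            if not is_abbrev or next_capitalized:
                sentences.append(" ".join(current))
                current = []
    if current:
        sentences.append(" ".join(current))
    return sentences
```

Note this is exactly the 95% heuristic: it will still break after an abbreviation when the following word happens to be capitalized (e.g., a proper noun after "Dr.").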
Sentence tokenizers: Solr StandardTokenizer, OpenNLP english.Tokenizer, OpenNLP SimpleTokenizer
Word Tokenization
Once sentences have been segmented, words can be tokenized. Contractions like "didn't" should be kept as single tokens rather than split apart. Punctuation shouldn't be included in tokens, e.g., "jumping," should become "jumping". You end up with a Bag of Words (BoW), from which you can compute word frequencies.
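A small sketch of word tokenization plus a Bag of Words count, using a regex that strips punctuation but keeps contractions whole:

```python
import re
from collections import Counter

def tokenize_words(sentence):
    """Lowercase, drop punctuation, keep contractions like "didn't" intact."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", sentence.lower())

# Bag of Words: token -> frequency
bow = Counter(tokenize_words("The dog didn't stop jumping, jumping over the dog."))
```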
Text Normalization
This involves cleaning or adjusting text so that it can be compared to other documents. Some examples include:
Function words
Function words are a closed class and lack content; they connect content words. They don't answer the 6 Ws (who, what, why, when, where, how). Examples: is, am, are, was, were, he, she, you, we, they, if, then, therefore, possibly. "Y'all" was the last new function word. The default list of stop words is the function words of a language, but it depends on the case at hand; content words can also be included as stop words.
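Stop-word removal can be sketched as a simple set filter. The list below is a tiny illustrative subset; real stop-word lists are much longer and, as noted above, should be adapted to the use case:

```python
# Tiny illustrative stop-word list (function words only).
STOP_WORDS = {"is", "am", "are", "was", "were", "he", "she", "you", "we",
              "they", "if", "then", "the", "a", "an", "of", "to"}

def remove_stop_words(tokens):
    """Drop stop words, case-insensitively."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]
```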
Content Words
Content words are newly formed all the time and are considered an open class. There are many more content words than function words.
Handling misspellings
Two fundamental approaches:
Special short-circuits: some simple corrections can be made before invoking a spell corrector, like collapsing repeated characters ("mostllly") or fixing a space accidentally inserted inside a word.
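The repeated-character short-circuit can be sketched with a regex: collapse any run of three or more identical characters down to two and to one, and let a later dictionary check (not shown) pick the valid candidate:

```python
import re

def repeat_candidates(word):
    """For runs of 3+ identical characters, propose collapsing to 2 and to 1.
    A dictionary lookup would then choose the valid spelling."""
    two = re.sub(r"(.)\1{2,}", r"\1\1", word)  # "mostllly" -> "mostlly"
    one = re.sub(r"(.)\1{2,}", r"\1", word)    # "mostllly" -> "mostly"
    return [two, one]
```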
Stemming
Break a word into morphemes (the morphemes of "cats" are "cat" and "s"): a stem plus affixes such as prefixes and suffixes. In "reimagining", "imagine" is the stem and "re-" is an affix. Affixes can't be words by themselves.
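A naive suffix-stripping stemmer sketch; real stemmers (e.g., Porter) apply far more careful rules, and stems need not be dictionary words:

```python
def simple_stem(word):
    """Strip a common suffix, keeping at least a 3-character stem."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word
```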
Low-level document feature extraction
Primary features:
Secondary features:
Terminology Extraction
- Get collocations (bigrams: two words; trigrams: three; or n-grams generally)
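A minimal bigram-extraction sketch: frequent adjacent pairs are collocation candidates.

```python
from collections import Counter

def bigrams(tokens):
    """Adjacent word pairs."""
    return list(zip(tokens, tokens[1:]))

tokens = "new york is not like new york city".split()
top = Counter(bigrams(tokens)).most_common(1)  # most frequent pair
```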
Differential Frequency Analysis
-Analyze frequency comparing to other texts (turtle example)
Term Frequency Inverse Document Frequency (TF-IDF)
This is a way to determine a word's importance in a document relative to how common it is across the whole collection: multiply term frequency by inverse document frequency.
If a term is rare across documents, its score is bigger; if a word is common everywhere, its score is smaller. The higher the number, the more distinctive that word is for the document.
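A from-scratch sketch of the basic formulation (documents as token lists; variants exist, e.g., smoothed IDF):

```python
import math

def tf_idf(term, doc, corpus):
    """tf-idf = (term frequency in this doc) x log(N / docs containing term)."""
    tf = doc.count(term) / len(doc)
    df = sum(1 for d in corpus if term in d)
    idf = math.log(len(corpus) / df)  # assumes the term appears somewhere
    return tf * idf
```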
Lexical Diversity
How broad the vocabulary is compared to the length of the document: the same words over and over versus a varied vocabulary. Useful for comparing one document to others.
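One common measure is the type-token ratio, distinct words over total words:

```python
def lexical_diversity(tokens):
    """Type-token ratio: distinct words / total words (1.0 = no repetition)."""
    return len(set(tokens)) / len(tokens)
```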
Readability Formula
Automatic Tagging
Tag depends on the word and its context within a sentence.
-Lookup Tagger: find the x most frequent words and store their most likely tag. If a word isn't in the table, it falls back to a default tagger.
-N-gram tagging: considered the standard way to tag.
Context is the current word together with the POS tags of the n-1 preceding tokens (e.g., n = 3).
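A sketch of the lookup tagger with default-tag backoff. The training data format (word, tag pairs) and the "NN" default are assumptions for illustration:

```python
from collections import Counter, defaultdict

def build_lookup_tagger(tagged_words, top_n=100, default_tag="NN"):
    """Store the most likely tag for the top_n most frequent words;
    every other word backs off to the default tag."""
    word_counts = Counter(w for w, _ in tagged_words)
    tag_counts = defaultdict(Counter)
    for w, t in tagged_words:
        tag_counts[w][t] += 1
    table = {w: tag_counts[w].most_common(1)[0][0]
             for w, _ in word_counts.most_common(top_n)}
    return lambda w: table.get(w, default_tag)
```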