What are the 4 main steps in the “text mining” process?
Data pre-processing -> text analysis extraction -> text metrics -> validity assessment
What are the 6 steps in data “pre-processing”?
Data acquisition, tokenisation, cleaning, removing stop words, spelling, stemming and lemmatisation
What is the first (most foundational) step in data “pre-processing”?
DATA ACQUISITION
(either by downloading the data or by collecting it with web scrapers)
What is the “tokenisation” step in data “pre-processing”?
Deciding what your unit of analysis is going to be.
-> e.g. the unit of analysis is the individual tweet; the words in the tweet are the TOKENS of that document
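A minimal sketch of the tweet example above, assuming each tweet is one document and that a token is any run of letters, digits or apostrophes. The function name `tokenise` and the regular expression are illustrative choices, not a fixed standard.

```python
import re

def tokenise(document):
    """Lower-case one document (e.g. a tweet) and split it into word tokens."""
    # Assumption: a token is a run of letters, digits or apostrophes.
    return re.findall(r"[a-z0-9']+", document.lower())

tweet = "Loving my new car, it's great!"
print(tokenise(tweet))
```

Changing the unit of analysis (sentence, review, whole article) only changes what is passed in as `document`; the tokens are still the words inside it.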
What is the “cleaning” step in data “pre-processing”?
Removal of all non-meaningful text.
(HTML tags are removed. Emojis are not removed but converted into words.)
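A rough sketch of this cleaning step using only the Python standard library. It assumes that anything between `<` and `>` is an HTML tag and that an emoji can be recognised as a character in the Unicode category "So" (symbol, other); both are simplifications, and the helper names are made up for illustration.

```python
import html
import re
import unicodedata

TAG_RE = re.compile(r"<[^>]+>")  # assumption: anything between < and > is a tag

def emoji_to_words(ch):
    """Replace a symbol character (e.g. an emoji) with its Unicode name."""
    if unicodedata.category(ch) == "So":
        name = unicodedata.name(ch, "")
        return f" {name.lower().replace(' ', '_')} " if name else " "
    return ch

def clean(text):
    text = TAG_RE.sub(" ", text)          # drop HTML tags
    text = html.unescape(text)            # turn entities like &amp; into characters
    text = "".join(emoji_to_words(ch) for ch in text)
    return " ".join(text.split())         # collapse leftover whitespace

print(clean("<p>Love it \U0001F600</p>"))
```

The emoji is kept as a word-like token because it can carry sentiment that later steps may want to count.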
What is the “removing stop words” step in data “pre-processing”?
Removal of common words that carry little meaning on their own.
-> e.g. “a” or “the”
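A small sketch of stop-word removal on an already tokenised document. The stop-word list here is a tiny hypothetical one; real projects usually take a longer list from a library or build one for their own domain.

```python
# Assumption: a tiny illustrative stop-word list, not a standard one.
STOP_WORDS = {"a", "an", "the", "is", "of", "and", "to", "in"}

def remove_stop_words(tokens):
    """Keep only the tokens that are not in the stop-word list."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

print(remove_stop_words(["the", "car", "is", "a", "bargain"]))
```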
What is the “spelling” step in data “pre-processing”?
Correcting misspelt words.
What is the “stemming and lemmatisation” step in data “pre-processing”?
Standardising words by reducing them to their common stem or lemma.
-> e.g. “car” and “cars” are stemmed to just “car” (stem)
-> e.g. “cars” is turned into the lemma “car”, and “better” into the lemma “good” (lemma)
Stem = The basic form you get by chopping off endings. It is often mechanical and may not be a real dictionary word.
Lemma = The true dictionary/base form of a word, based on its meaning and part of speech.
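To make the difference between the two concrete, here is a toy sketch: a mechanical suffix-chopping stemmer next to a dictionary lookup for lemmas. Neither is a real algorithm such as the Porter stemmer; the suffix list and the lemma table are hypothetical and only cover the example words.

```python
# Assumption: toy suffix list and lemma table, for illustration only.
SUFFIXES = ("ies", "ing", "ed", "s")
LEMMAS = {"cars": "car", "studies": "study", "better": "good"}

def naive_stem(word):
    """Chop off the first matching suffix, keeping at least 3 characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def lookup_lemma(word):
    """Return the dictionary form if the word is in the table, else the word."""
    return LEMMAS.get(word, word)

for w in ["cars", "studies", "better"]:
    print(w, "-> stem:", naive_stem(w), "| lemma:", lookup_lemma(w))
```

The stemmer turns “studies” into “stud”, which is not the intended base form, while the lookup returns “study”. Only the lemma table can map “better” to “good”, because that link depends on meaning rather than on spelling.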
Why is the pre-processing vital?
remember: rubbish in = rubbish out
What are the 3 types of text analysis “Extraction”?
Entity extraction, topic extraction, relation extraction
What is “entity extraction”?
Extraction of individual entities (words) from text.
(most basic and common level of text mining)
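One simple way to picture entity extraction is a dictionary lookup: scan the tokens and keep those that match a list of known entities. This is only a sketch; the entity list and its labels below are hypothetical, and real tools often use trained models instead of a fixed list.

```python
import re

# Assumption: a hypothetical dictionary of entities we care about.
KNOWN_ENTITIES = {"tesla": "BRAND", "london": "PLACE"}

def extract_entities(text):
    """Return (entity, label) pairs for every known entity found in the text."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [(t, KNOWN_ENTITIES[t]) for t in tokens if t in KNOWN_ENTITIES]

print(extract_entities("Saw a Tesla in London today"))
```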
What is “topic extraction”?
Identifies general topics that appear in a text document.
(reduces to a few general topics)
What are some applications of “entity extraction”?
What is “relation extraction”?
Focuses on discovering textual relationships among extracted text entities.
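A very crude proxy for a relationship between two entities is how often they appear in the same document. The sketch below counts such co-occurrences; it assumes the entities have already been extracted per document, and it says nothing about the type or direction of the relationship, which proper relation extraction would try to capture.

```python
from collections import Counter
from itertools import combinations

def cooccurrence(docs_entities):
    """Count how many documents each pair of entities appears in together."""
    pairs = Counter()
    for entities in docs_entities:
        # sorted(set(...)) so each unordered pair is counted once per document
        for a, b in combinations(sorted(set(entities)), 2):
            pairs[(a, b)] += 1
    return pairs

docs = [["tesla", "london"], ["tesla", "battery"], ["london", "tesla"]]
print(cooccurrence(docs))
```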
What are some applications of “topic extraction”?
What are some applications of “relation extraction”?
(very limited use today)
What is the most common level of text mining (text analysis “Extraction”)?
Entity extraction
Most common + Most basic level
Which level of text mining (“Extraction”) is focused on in this module?
TOPIC extraction
What are the 4 text metrics used?
What is done in step 3 (text metrics)?
Text is converted into quantifiable measures
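As an illustration of turning text into a number, the sketch below computes relative word frequencies for one tokenised document. Word frequency is used here only as an example of a quantifiable measure; it is an assumption that it corresponds to one of the metrics the module has in mind.

```python
from collections import Counter

def term_frequencies(tokens):
    """Return each token's share of all tokens in the document."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {token: count / total for token, count in counts.items()}

tokens = ["car", "fast", "car", "cheap"]
print(term_frequencies(tokens))
```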
What is the last step (#4) of text mining?
Assessing the validity of extracted text and measures
What are the two types of validity that are dealt with in the last step (#4) of text mining?
Internal validity and external validity
What is “internal validity”?
Internal validity concerns whether the text accurately measures the constructs and the relationships between them.
What is “external validity”?
External validity concerns whether the text-based findings also apply to phenomena outside the study.