8b. Text Mining Process Flashcards

(25 cards)

1
Q

What are the 4 main steps in the “text mining” process?

A
  1. Data pre-processing
  2. Text analysis ("Extraction")
  3. Text metrics
  4. Assessing validity
2
Q

What are the 6 steps in data “pre-processing”?

A
  • Data Acquisition
  • Tokenisation
  • Cleaning
  • Remove stop words
  • Spelling
  • Stemming and lemmatisation
3
Q

What is the first (most foundational) step in data “pre-processing”?

A

DATA ACQUISITION
(either downloading or through web scrapers)

4
Q

What is the “tokenisation” step in data “pre-processing”?

A

Deciding what your unit of analysis is going to be.

-> eg. the unit of analysis is the individual tweet - the words in the tweet are the TOKENS of the document
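
A minimal sketch of tokenising one tweet into word tokens, using only the standard library. The function name `tokenise` and the sample tweet are assumptions made for illustration; real projects often use a dedicated tokeniser instead.

```python
import re

def tokenise(document: str) -> list[str]:
    """Split one document (e.g. a tweet) into lowercase word tokens."""
    # Match runs of letters/digits, optionally followed by an apostrophe part ("isn't").
    return re.findall(r"[a-z0-9]+(?:'[a-z]+)?", document.lower())

tweet = "Loving the new phone, but the battery isn't great!"  # made-up example
tokens = tokenise(tweet)
print(tokens)
```

Here the tweet is the document (unit of analysis) and each word in the returned list is a token.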

5
Q

What is the “cleaning” step in data “pre-processing”?

A

Removal of all non-meaningful text.

(HTML tags are removed. Emojis are not removed but converted into words.)
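
A rough sketch of the cleaning step: HTML tags are stripped and emojis are replaced with words. The `EMOJI_WORDS` mapping and the `clean` function are hypothetical and deliberately tiny; a real project would likely rely on a fuller emoji lexicon and an HTML parser.

```python
import html
import re

# Hypothetical, hand-made mapping used only for this illustration.
EMOJI_WORDS = {"😀": "grinning_face", "👍": "thumbs_up"}

def clean(text: str) -> str:
    text = re.sub(r"<[^>]+>", " ", text)      # drop HTML tags
    text = html.unescape(text)                # decode entities such as &amp;
    for emoji, word in EMOJI_WORDS.items():   # keep the emoji's meaning as a word
        text = text.replace(emoji, f" {word} ")
    return re.sub(r"\s+", " ", text).strip()  # collapse leftover whitespace

raw = "<p>Great service 👍 &amp; fast delivery</p>"  # made-up example
cleaned = clean(raw)
print(cleaned)
```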

6
Q

What is the “removing stop words” step in data “pre-processing”?

A

Removal of common words that carry little meaning on their own

-> eg. “a” or “the”
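
A small sketch of stop-word removal on an already tokenised document. The `STOP_WORDS` set is an illustrative assumption and far from exhaustive; published stop-word lists are much longer.

```python
# Illustrative stop-word list (assumption, not a standard list).
STOP_WORDS = {"a", "an", "the", "is", "of", "and", "to", "in"}

def remove_stop_words(tokens: list[str]) -> list[str]:
    """Keep only the tokens that are not in the stop-word list."""
    return [t for t in tokens if t not in STOP_WORDS]

tokens = ["the", "battery", "of", "the", "phone", "is", "a", "disappointment"]
filtered = remove_stop_words(tokens)
print(filtered)
```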

7
Q

What is the “spelling” step in data “pre-processing”?

A

Correcting misspelt words.
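
One possible way to sketch spelling correction is to map each token to its closest match in a known vocabulary. The `VOCABULARY` set, the `correct` function and the 0.8 cutoff are assumptions for illustration; `difflib.get_close_matches` is part of the Python standard library.

```python
import difflib

# Hypothetical vocabulary of correctly spelt words.
VOCABULARY = ["battery", "delivery", "service", "great"]

def correct(word: str) -> str:
    """Return the closest vocabulary word, or the word unchanged if none is close."""
    if word in VOCABULARY:
        return word
    matches = difflib.get_close_matches(word, VOCABULARY, n=1, cutoff=0.8)
    return matches[0] if matches else word

corrected = [correct(w) for w in ["batery", "servise", "great"]]
print(corrected)
```

Words with no sufficiently similar vocabulary entry are left as they are, which avoids "correcting" rare but valid words.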

8
Q

What is the “stemming and lemmatisation” step in data “pre-processing”?

A

Standardising words by reducing them to their common stem or lemma.

-> eg. “car” and “cars” are both reduced to “car” (stem)
-> eg. “cars” is turned into the lemma “car”, and “better” into the lemma “good” (lemma)

Stem = The basic form you get by chopping off endings. It is often mechanical and may not be a real dictionary word.
Lemma = True dictionary/base form of a word, based on its meaning and part of speech.
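
To make the difference concrete, here is a toy sketch. `naive_stem` is a crude suffix stripper written for this card (it is not the Porter stemmer), and `LEMMAS` is a hypothetical lookup table standing in for a real lemmatiser.

```python
def naive_stem(word: str) -> str:
    """Mechanically chop a common ending; the result may not be a real word."""
    for suffix in ("ies", "ing", "ed", "es", "s"):
        # Only strip when at least three characters remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Hypothetical dictionary lookup: word form -> dictionary (base) form.
LEMMAS = {"cars": "car", "studies": "study", "better": "good", "was": "be"}

def lemmatise(word: str) -> str:
    return LEMMAS.get(word, word)

for w in ["cars", "studies", "better"]:
    print(w, "-> stem:", naive_stem(w), "| lemma:", lemmatise(w))
```

Note that the stem of “studies” comes out as “stud”, which is not the dictionary form, while the lemma lookup returns “study”.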

9
Q

Why is the pre-processing vital?

A

Remember: rubbish in = rubbish out. Analysis run on noisy, unstandardised text gives unreliable results.

10
Q

What are the 3 types of text analysis “Extraction”?

A
  1. Entity extraction
  2. Topic extraction
  3. Relation extraction
11
Q

What is “entity extraction”?

A

Extraction of individual entities (words) from text.
(most basic and common level of text mining)
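
A very simple sketch of entity extraction as a dictionary lookup: every token that appears in a list of entities of interest is pulled out of the text. `ENTITY_DICTIONARY`, `extract_entities` and the review text are assumptions for illustration only.

```python
import re

# Hypothetical dictionary of entities (product features) we care about.
ENTITY_DICTIONARY = {"battery", "screen", "camera", "delivery"}

def extract_entities(text: str) -> list[str]:
    """Return every dictionary entity mentioned in the text, in order of appearance."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t in ENTITY_DICTIONARY]

review = "The camera is superb, but the battery drains fast. Camera wins overall."
entities = extract_entities(review)
print(entities)
```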

12
Q

What is “topic extraction”?

A

Identifies general topics that appear in a text document.
(reduces to a few general topics)

13
Q

What are some applications of “entity extraction”?

A
  • Social media monitoring
  • Predictive modelling
  • Input to dictionaries
14
Q

What is “relation extraction”?

A

Focuses on discovering textual relationships among extracted text entities.

15
Q

What are some applications of “topic extraction”?

A
  • Useful to explore non-observable customer preferences
16
Q

What are some applications of “relation extraction”?

A
  • Understand customer sentiment/preferences in relation to specific product or service features

(very limited use today)

17
Q

What is the most common level of text mining (text analysis “Extraction”)?

A

Entity extraction

Most common + Most basic level

18
Q

Which level of text mining (“Extraction”) is focused on in this module?

A

TOPIC extraction

19
Q

What are the 4 text metrics used?

A
  • Count (measure the frequency of each entity’s occurrence)
  • Similarity (measure the similarity of the text between documents)
  • Accuracy (judges the accuracy of text measures relative to human-coded or externally validated documents)
  • Readability (judges readability level of text)
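
The four metrics above can be sketched in a few lines each. Everything here is an illustrative assumption: cosine similarity on word counts is only one of several similarity measures, accuracy is shown as simple agreement with human-coded labels, and average sentence length is a crude readability proxy rather than an established readability formula.

```python
import math
import re
from collections import Counter

def tokens(text: str) -> list[str]:
    return re.findall(r"[a-z]+", text.lower())

def count(text: str) -> Counter:
    """Count: how often each word occurs."""
    return Counter(tokens(text))

def cosine_similarity(text_a: str, text_b: str) -> float:
    """Similarity: cosine of the angle between two word-count vectors."""
    a, b = count(text_a), count(text_b)
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def accuracy(predicted: list[str], human_coded: list[str]) -> float:
    """Accuracy: share of documents where the text measure agrees with the human code."""
    matches = sum(p == h for p, h in zip(predicted, human_coded))
    return matches / len(human_coded)

def avg_sentence_length(text: str) -> float:
    """Readability proxy: average number of words per sentence."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return len(tokens(text)) / len(sentences)

print(count("good phone good price").most_common(1))
print(cosine_similarity("good phone", "bad service"))
print(accuracy(["pos", "neg", "pos", "neg"], ["pos", "neg", "neg", "neg"]))
print(avg_sentence_length("Good phone. Battery is bad."))
```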
20
Q

What is done in step 3 (text metrics)?

A

Text is converted into quantifiable measures

21
Q

What is the last step (#4) of text mining?

A

Assessing the validity of extracted text and measures

22
Q

What are the two types of validity that are dealt with in the last step (#4) of text mining?

A
  • Internal validity
  • External validity
23
Q

What is “internal validity”?

A

Internal validity concerns whether the text accurately measures the constructs and the relationships between them

24
Q

What is “external validity”?

A

External validity looks at whether the text-based findings also apply to phenomena outside the study

25
Overview of the text mining process: