8b. Text Mining Process Flashcards

Question

Overview of the text mining process:

Answer 1

A

Data Acquisition
Tokenisation
Cleaning
Remove stop words
Spelling
Stemming and lemmatisation

Answer 2

A

DATA ACQUISITION
(either downloading or through web scrapers)

Answer 3

A

Deciding what your unit of analysis is going to be.

-> e.g. if the unit of analysis is the individual tweet, each tweet is a document and the words in the tweet are the TOKENS of that document
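
A minimal sketch of tokenisation, assuming the unit of analysis is a single tweet and that lowercase word tokens are wanted. The function name `tokenise` and the sample tweet are illustrative, not part of the original deck.

```python
import re

def tokenise(document: str) -> list[str]:
    """Split one document (here, a tweet) into lowercase word tokens."""
    # Keep runs of letters, digits and apostrophes; punctuation is dropped.
    return re.findall(r"[a-z0-9']+", document.lower())

tweet = "Loving the new phone, battery life is great!"
tokens = tokenise(tweet)
print(tokens)
```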

Answer 4

A

Removal of all non-meaningful text.

(HTML tags are removed. Emojis are not removed but converted into words.)
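
One possible cleaning step, sketched with the standard library only. The `EMOJI_WORDS` mapping is a tiny hypothetical lookup made up for illustration; a real project would likely use a fuller emoji dictionary.

```python
import html
import re

# Hypothetical, deliberately tiny emoji-to-word mapping (an assumption).
EMOJI_WORDS = {
    "\U0001F600": "grinning_face",
    "\U0001F44D": "thumbs_up",
}

def clean(text: str) -> str:
    """Strip HTML tags, decode HTML entities and turn emojis into words."""
    text = re.sub(r"<[^>]+>", " ", text)      # remove HTML tags
    text = html.unescape(text)                 # e.g. "&amp;" becomes "&"
    for emoji, word in EMOJI_WORDS.items():
        text = text.replace(emoji, f" {word} ")  # keep the emoji's meaning as a word
    return re.sub(r"\s+", " ", text).strip()   # collapse leftover whitespace

print(clean("<p>Great phone \U0001F600</p>"))
```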

Answer 5

A

Just the removal of common non-meaningful words

-> eg. “a” or “the”
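
A short sketch of stop-word removal on an already tokenised document. The `STOP_WORDS` set here is a small assumed list for illustration; real stop-word lists are much longer.

```python
# Assumed, abbreviated stop-word list for illustration only.
STOP_WORDS = {"a", "an", "the", "is", "of", "and", "to", "in"}

def remove_stop_words(tokens: list[str]) -> list[str]:
    """Drop common words that carry little meaning on their own."""
    return [token for token in tokens if token not in STOP_WORDS]

print(remove_stop_words(["the", "battery", "is", "a", "problem"]))
```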

Answer 6

A

Correcting misspelt words.
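
One simple way to sketch spelling correction is to snap each token to its closest match in a known vocabulary. The `VOCABULARY` list and the 0.8 cutoff are assumptions chosen for the example, and `difflib` is only one of several possible matching approaches.

```python
import difflib

# Assumed vocabulary of correctly spelt domain words.
VOCABULARY = ["battery", "screen", "camera", "delivery"]

def correct(token: str, vocabulary: list[str] = VOCABULARY, cutoff: float = 0.8) -> str:
    """Return the closest vocabulary word, or the token unchanged if none is close."""
    if token in vocabulary:
        return token
    matches = difflib.get_close_matches(token, vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else token

print(correct("batery"))
```

Tokens with no sufficiently close match are left as they are, so unknown but valid words are not overwritten.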

Answer 7

A

Standardising words by reducing them to their common stem or lemma.

-> e.g. “car” and “cars” are stemmed to just “car” (stem)
-> e.g. “better” is mapped to the lemma “good”, and “was” to “be” (lemma)

Stem = The basic form you get by chopping off endings. It is often mechanical and may not be a real dictionary word.
Lemma = True dictionary/base form of a word, based on its meaning and part of speech.
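
To make the stem/lemma contrast concrete, here is a toy sketch. `naive_stem` is a deliberately crude suffix stripper (not the Porter algorithm), and `LEMMAS` is a small hand-made lookup standing in for a real dictionary-based lemmatiser; both are assumptions for illustration.

```python
def naive_stem(word: str) -> str:
    """Mechanically chop a common ending off the word (crude, rule-based)."""
    for suffix in ("ing", "ies", "es", "s"):
        # Only strip if at least three characters would remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Hand-made lookup standing in for a dictionary-based lemmatiser.
LEMMAS = {"cars": "car", "studies": "study", "better": "good", "was": "be"}

def lemmatise(word: str) -> str:
    """Return the dictionary form if known, else the word unchanged."""
    return LEMMAS.get(word, word)

for word in ("cars", "studies", "better"):
    print(word, naive_stem(word), lemmatise(word))
```

Note how the stemmer only trims endings ("studies" loses "ies"), while the lookup can return a form that shares no suffix pattern with the input ("better" to "good").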

Answer 8

A

remember: rubbish in = rubbish out

Answer 9

A

Entity extraction
Topic extraction
Relation extraction

Answer 10

A

Extraction of individual entities (words) from text.
(most basic and common level of text mining)

Answer 11

A

Identifies general topics that appear in a text document.
(reduces to a few general topics)

Answer 12

A

Social media monitoring
Predictive modelling
Input to dictionaries

Answer 13

A

Focuses on discovering textual relationships among extracted text entities.
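
As a rough illustration, one very simple proxy for a relationship between entities is how often they appear in the same document. This co-occurrence count is an assumption chosen for the sketch, not necessarily the method the deck has in mind; the `ENTITIES` set and sample documents are made up.

```python
from collections import Counter
from itertools import combinations

def cooccurrences(documents: list[list[str]], entities: set[str]) -> Counter:
    """Count how often each pair of entities appears in the same document."""
    pairs = Counter()
    for tokens in documents:
        present = sorted(set(tokens) & entities)   # entities found in this document
        pairs.update(combinations(present, 2))     # every unordered pair, counted once
    return pairs

ENTITIES = {"battery", "screen", "camera"}
docs = [
    ["battery", "drains", "fast"],
    ["battery", "and", "screen"],
    ["screen", "battery", "great"],
]
pair_counts = cooccurrences(docs, ENTITIES)
print(pair_counts)
```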

Answer 14

A

Useful to explore non-observable customer preferences

Answer 15

A

Understand customer sentiment/preferences in relation to specific product or service features

(very limited use today)

Answer 16

A

Entity extraction

Most common + Most basic level

Answer 17

A

TOPIC extraction

Answer 18

A

Count (measure the frequency of each entity’s occurrence)
Similarity (measure the similarity of the text between documents)
Accuracy (judges the accuracy of text measures relative to human-coded or externally validated documents)
Readability (judges readability level of text)
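
Three of the measures above can be sketched in a few lines. Cosine similarity over raw token counts is just one assumed choice of similarity measure, and the sample documents and labels are invented for the example. Readability is left out because it needs syllable or sentence statistics.

```python
import math
from collections import Counter

def cosine_similarity(tokens_a: list[str], tokens_b: list[str]) -> float:
    """Similarity of two documents based on their token-count vectors."""
    a, b = Counter(tokens_a), Counter(tokens_b)
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def accuracy(predicted: list[str], human_coded: list[str]) -> float:
    """Share of documents where the text measure agrees with the human code."""
    matches = sum(p == h for p, h in zip(predicted, human_coded))
    return matches / len(human_coded)

doc_a = ["battery", "life", "battery"]
doc_b = ["battery", "screen"]

counts = Counter(doc_a)                          # Count: frequency of each entity
similarity = cosine_similarity(doc_a, doc_b)     # Similarity between two documents
agreement = accuracy(["pos", "neg", "pos", "pos"], ["pos", "neg", "neg", "pos"])
print(counts, similarity, agreement)
```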