Chapter 7 Part 1.5 Flashcards

(135 cards)

1
Q

True or False: Most corporate data is structured in databases.

A

False

2
Q

How fast is unstructured corporate data doubling in size?

A

Every 18 months

3
Q

According to the slides, tapping into unstructured information sources is not an option but a ______ to stay competitive.

A

need

4
Q

What is text mining (exact slide definition)?

A

A semi-automated process of extracting knowledge from unstructured data sources

5
Q

Text mining is also called what (exact slide wording)?

A

Text data mining or knowledge discovery in textual databases

6
Q

True or False: The benefits of text mining are especially obvious in text-rich data environments.

A

True

7
Q

Give one example of a text-rich environment mentioned in the slides.

A

Law, academic research, finance, medicine, biology, technology, or marketing

8
Q

In law, what type of text is given as an example for text mining?

A

Court orders

9
Q

In academic research, what type of text is given as an example for text mining?

A

Research articles

10
Q

In finance, what type of text is given as an example for text mining?

A

Quarterly reports

11
Q

In medicine, what type of text is given as an example for text mining?

A

Discharge summaries

12
Q

In biology, what type of text is given as an example for text mining?

A

Molecular interactions

13
Q

In technology, what type of text is given as an example for text mining?

A

Patent files

14
Q

In marketing, what type of text is given as an example for text mining?

A

Customer comments

15
Q

Email is an example of what kind of records mentioned in the slides?

A

Electronic communication records

16
Q

List one application of text mining for email records.

A

Spam filtering

17
Q

List another application of text mining for email records.

A

Email prioritization and categorization

18
Q

List another application of text mining for email records.

A

Automatic response generation

19
Q

What is text analytics (exact slide wording)?

A

A broader concept that includes information retrieval, text mining, data mining, web mining, and NLP

20
Q

True or False: Text analytics is narrower than text mining.

A

False

21
Q

What is information retrieval (exact slide definition)?

A

Searching and identifying relevant documents for a given set of key terms
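
The definition above can be illustrated with a minimal sketch of keyword-based retrieval. This is not a method taken from the slides: the function name `retrieve`, the scoring rule (count of key-term occurrences), and the sample documents are all assumptions made for this example.

```python
def retrieve(documents, key_terms):
    """Return ids of documents containing any key term, best match first.

    Hypothetical helper: scores each document by how many of its tokens
    are key terms. Real search engines use far richer ranking.
    """
    key_terms = {t.lower() for t in key_terms}
    scored = []
    for doc_id, text in documents.items():
        tokens = text.lower().split()
        score = sum(1 for t in tokens if t in key_terms)
        if score > 0:
            scored.append((score, doc_id))
    # Highest score first; ties broken by document id.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored]


# Made-up documents, loosely based on the text-rich domains in this deck.
docs = {
    "d1": "quarterly reports summarize finance results",
    "d2": "patent files describe technology",
    "d3": "finance teams read quarterly finance reports",
}
hits = retrieve(docs, ["finance", "reports"])
print(hits)  # ['d3', 'd1']
```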

22
Q

According to Figure 7.2, text analytics is enabled by which disciplines (list as shown)?

A

Statistics, machine learning, management science, artificial intelligence, computer science, and other disciplines

23
Q

In Figure 7.2, name one item shown under Information Retrieval.

A

Document matching

24
Q

In Figure 7.2, name another item shown under Information Retrieval.

A

Link analysis

25
In Figure 7.2, name another item shown under Information Retrieval.
Search engines
26
In Figure 7.2, name one item shown under Natural Language Processing.
POS tagging
27
In Figure 7.2, name another item shown under Natural Language Processing.
Lemmatization
28
In Figure 7.2, name another item shown under Natural Language Processing.
Word disambiguation
29
In Figure 7.2, name one type of Web Mining shown.
Web content mining
30
In Figure 7.2, name another type of Web Mining shown.
Web structure mining
31
In Figure 7.2, name another type of Web Mining shown.
Web usage mining
32
In Figure 7.2, name one technique shown under Data Mining.
Classification
33
In Figure 7.2, name another technique shown under Data Mining.
Clustering
34
In Figure 7.2, name another technique shown under Data Mining.
Association
35
In Figure 7.2, what is the label inside the central blue circle?
Text mining (knowledge discovery in textual data)
36
True or False: Data mining and text mining both seek novel and useful patterns.
True
37
True or False: Data mining and text mining are both semi-automated processes.
True
38
What is the key difference between data mining and text mining (exact slide wording)?
The nature of the data: structured versus unstructured data
39
Structured data is found where (exact slide wording)?
In databases
40
Unstructured data examples listed on the slide include what?
Word documents, PDF files, text excerpts, XML files, and so on
41
To perform text mining, what must be done first (exact slide wording)?
First impose structure to the data, then mine the structured data
42
What is information extraction?
Identifying key phrases and relationships within text by looking for predefined objects and sequences in text by way of pattern matching
43
What is topic tracking (exact slide definition as shown)?
Identifying key phrases and relationships within text by looking for predefined objects and sequences in text by way of pattern matching
44
What is summarization (exact slide definition)?
Summarizing a document to save the reader time
45
What is categorization (exact slide definition)?
Identifying the main themes of a document and then placing the document into a predefined set of categories based on those themes
46
What is clustering?
Grouping similar documents without having a predefined set of categories
47
What is concept linking (exact slide definition as shown)?
Grouping similar documents without having a predefined set of categories
48
What is question answering (exact slide definition)?
Finding the best answer to a given question through knowledge-driven pattern matching
49
True or False: Text mining terminology includes unstructured or semistructured data.
True
50
Definition: A large and structured set of texts (now usually stored and processed electronically), prepared for the purpose of conducting knowledge discovery
Corpus
51
Fill in the blank: The plural of corpus is ______.
corpora
52
Definition: A single word or multiword phrase extracted directly from the corpus of a specific domain by means of NLP methods
Term
53
Definition: Features generated from a collection of documents by means of manual, statistical, rule-based, or hybrid categorization methodology.
Concepts
54
Compared to terms, concepts are the result of what (exact slide wording)?
Higher-level abstraction
55
Definition: The process of reducing inflected words to their stem or base or root form
Stemming
56
According to the slide, stemmer, stemming, and stemmed are all based on what root?
stem
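
The stemmer/stemming/stemmed example can be sketched as naive suffix stripping. This is only an illustration, not the slides' method and not an established algorithm such as Porter's; the name `naive_stem` and the suffix list are assumptions made for this example, and the rule will mis-stem many real words.

```python
# Assumed, deliberately tiny suffix list for illustration only.
SUFFIXES = ("ing", "ed", "er")


def naive_stem(word):
    """Reduce a word to a rough stem by stripping one common suffix."""
    word = word.lower()
    for suffix in SUFFIXES:
        # Keep at least three characters so short words like "red" survive.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            word = word[: -len(suffix)]
            break
    # Undo consonant doubling left behind by the suffix: "stemm" -> "stem".
    if len(word) >= 2 and word[-1] == word[-2] and word[-1] not in "aeiou":
        word = word[:-1]
    return word


stems = {w: naive_stem(w) for w in ["stemmer", "stemming", "stemmed", "stem"]}
print(stems)  # every word maps to 'stem'
```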
57
Definition: Words that are filtered out prior to or after processing natural language data
Stop Words / Noise Words
58
Give examples of stop words listed in the slide.
a, an, the, of, on
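
Filtering stop words can be sketched in a few lines. The stop list below contains only the five examples from the card; the function name `remove_stop_words` and the whitespace tokenization are assumptions made for this example.

```python
# Stop list limited to the examples on the card; real lists are much longer.
STOP_WORDS = {"a", "an", "the", "of", "on"}


def remove_stop_words(text):
    """Lowercase the text, split on whitespace, and drop stop words."""
    tokens = text.lower().split()
    return [t for t in tokens if t not in STOP_WORDS]


filtered = remove_stop_words("The bow of a ship rests on the water")
print(filtered)  # ['bow', 'ship', 'rests', 'water']
```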
59
Definition: Syntactically different words (spelled differently) with identical or at least similar meanings
Synonyms
60
Which example synonyms are listed in the slide?
movie, film, and motion picture
61
Definition: Syntactically identical words (spelled exactly the same) with different meanings.
Polysemes
62
The word bow can mean what three things (as listed on the slide)?
To bend forward, the front of the ship, and the weapon that shoots arrows
63
Bow has multiple meanings. This makes bow which terminology of text mining?
Polysemes or homonyms
64
Definition: A categorized block of text in a sentence.
Token (assigning these categories to blocks of text is called tokenizing)
65
How is the block of text corresponding to the token categorized?
According to the function it performs
66
Definition: A collection of terms specific to a narrow field that can be used to restrict the extracted terms within a corpus
Term Dictionary
67
Definition: The number of times a word is found in a specific document
Word Frequency
68
Definition: The process of marking the words in a text as corresponding to a particular part of speech, based on a word's definition and the context in which it is used
Part-of-speech tagging
69
What is morphology?
The branch of linguistics and a part of NLP that studies the internal structure of words.
70
Definition: The common representation schema of the frequency-based relationship between the terms and documents in tabular format
Term-by-document matrix
71
In a term-by-document matrix, terms are listed in which direction?
Columns
72
In a term-by-document matrix, documents are listed in which direction?
Rows
73
In a term-by-document matrix, what is stored in the cells (exact slide wording)?
The frequency between the terms and documents as integer values
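
The layout described in the last three cards (documents in rows, terms in columns, integer frequencies in cells) can be sketched as follows. The function name `build_tdm`, the whitespace tokenization, and the two sample documents are assumptions made for this example.

```python
from collections import Counter


def build_tdm(documents):
    """Return (terms, matrix): one row per document, one column per term."""
    counts = [Counter(doc.lower().split()) for doc in documents]
    # Column order: all distinct terms, sorted alphabetically.
    terms = sorted(set().union(*counts))
    # A Counter returns 0 for terms that a document does not contain.
    matrix = [[c[t] for t in terms] for c in counts]
    return terms, matrix


docs = ["text mining finds patterns", "data mining finds patterns in data"]
terms, tdm = build_tdm(docs)
print(terms)  # ['data', 'finds', 'in', 'mining', 'patterns', 'text']
for row in tdm:
    print(row)
# [0, 1, 0, 1, 1, 1]
# [2, 1, 1, 1, 1, 0]
```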
74
What is singular value decomposition or latent semantic indexing (exact slide definition)?
This dimensionality reduction method is used to transform the term-by-document matrix to a manageable size by generating an intermediate representation of the frequencies using a matrix manipulation method
75
What term in text mining is a dimensionality reduction method?
Singular value decomposition
76
What is the point of singular value decomposition according to the slides?
To transform the term-by-document matrix to a manageable size
77
How does singular value decomposition make the term-by-document matrix manageable (exact slide wording)?
By generating an intermediate representation of the frequencies
78
How does singular value decomposition generate the intermediate representation (exact slide wording)?
Using a matrix manipulation method
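
A minimal sketch of the idea in the four cards above, assuming NumPy is available. The matrix values, the function name `reduce_tdm`, and the choice of k = 2 are invented for this example; the slides do not specify an implementation.

```python
import numpy as np

# Made-up term-by-document matrix: 4 documents (rows) x 4 terms (columns).
tdm = np.array(
    [
        [2, 1, 0, 0],
        [1, 2, 0, 0],
        [0, 0, 1, 2],
        [0, 0, 2, 1],
    ],
    dtype=float,
)


def reduce_tdm(matrix, k):
    """Project each document onto the k strongest singular directions."""
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    # Each row is a document expressed in k "latent" dimensions
    # instead of one dimension per term.
    return u[:, :k] * s[:k]


reduced = reduce_tdm(tdm, k=2)
print(reduced.shape)  # (4, 2)
```

Here each document is described by 2 numbers instead of 4 term counts; on a real corpus with thousands of terms the reduction is what makes the matrix manageable.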
79
NLP is described as a very important concept in what area?
Text mining
80
NLP is a subfield of what two areas?
Artificial intelligence and computational linguistics
81
NLP studies what (exact slide wording)?
Understanding the natural human language
82
According to the slide, natural language is vague and ______ driven.
context
83
True understanding requires extensive knowledge of a ______.
topic
84
What question does the slide ask about computers and natural language understanding?
Can (or will) computers ever understand natural language the same, accurate way we do?
85
What are challenges in NLP according to the slide?
Issues related to spoken language, different meanings of words, and the context in which the words are spoken
86
What is the dream of the AI community (exact slide wording)?
To have algorithms that are capable of automatically reading and obtaining knowledge from text
87
What is WordNet (exact slide definition)?
A laboriously hand-coded database of English words, their definitions, sets of synonyms, and various semantic relations between synonym sets
88
Why is WordNet very expensive (exact slide wording)?
To build and maintain manually
89
WordNet is a major resource for what (exact slide wording)?
NLP applications
90
According to the slide, WordNet needs what to be completed?
Automation
91
Where has WordNet shown impact according to the slide?
CRM and sentiment analysis
92
What is sentiment analysis (exact slide definition)?
A technique used to detect favorable and unfavorable opinions toward specific products and services
93
Name an NLP task category listed on the slide.
Question answering
94
Name an NLP task category listed on the slide.
Automatic summarization
95
Name an NLP task category listed on the slide.
Natural language generation
96
Name an NLP task category listed on the slide.
Natural language understanding
97
Name an NLP task category listed on the slide.
Machine translation
98
Name an NLP task category listed on the slide.
Foreign language reading & writing
99
Name an NLP task category listed on the slide.
Speech recognition
100
Name an NLP task category listed on the slide.
Text-to-speech
101
Name an NLP task category listed on the slide.
Text proofing
102
Name an NLP task category listed on the slide.
Optical character recognition
103
In the context diagram, what are the two inputs shown?
Unstructured data (text) and structured data (databases)
104
In the context diagram, what is the main activity labeled?
Extract knowledge from available data sources
105
In the context diagram, what is the output shown?
Context-specific knowledge
106
List the controls (constraints) shown in the context diagram.
Software/hardware limitations, privacy issues, linguistic limitations
107
List the mechanisms shown in the context diagram.
Domain expertise, tools and techniques
108
Figure 7.6 Task 1 is called what?
Establish the corpus: collect and organize the domain-specific unstructured data
109
Figure 7.6 Task 1 output is what?
A collection of documents in some digitized format for computer processing
110
Figure 7.6 gives which example of digitized format?
ASCII text files
111
Figure 7.6 Task 2 is called what?
Create the term-document matrix: introduce structure to the corpus
112
Figure 7.6 Task 2 output is what?
A flat file called a term-document matrix where the cells are populated with the term frequencies
113
Figure 7.6 Task 3 is called what?
Extract knowledge: discover novel patterns from the T-D matrix
114
Figure 7.6 Task 3 output is what?
Problem-specific classification, association, and clustering models and visualizations
115
Step 1 of the text mining process (establish the corpus): what is the first action (exact slide wording)?
Collect all relevant unstructured data
116
List the unstructured data examples given in Step 1.
Textual documents, XML files, emails, web pages, short notes, voice recordings
117
In Step 1, the collection is digitized and standardized, for example by converting everything to what?
ASCII text files
118
In Step 1 of the text mining process, where is the collection placed?
In a common place, such as in a flat file or in a directory as separate files
119
Step 2 is called what?
Create the term-by-document matrix
120
In the TDM diagram, what do documents correspond to?
Rows
121
In the TDM diagram, what do terms correspond to?
Columns
122
In the TDM, what do the numbers in the cells represent?
Frequencies of terms in documents
123
When creating the TDM, should all terms be included?
Not necessarily
124
When creating the TDM, which issues are listed to consider?
Stop words, include words, synonyms, homonyms, stemming
125
The TDM is a sparse matrix. What is one manual way to reduce its dimensionality?
A domain expert goes through it
126
What is another way to reduce TDM dimensionality mentioned?
Eliminate terms with very few occurrences in very few documents
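
One simple way to act on this rule is to drop every term that appears in fewer than some minimum number of documents. The sketch below simplifies the card's wording by thresholding on document count only; the threshold `min_docs`, the function name `prune_rare_terms`, and the sample matrix are assumptions made for this example.

```python
def prune_rare_terms(terms, matrix, min_docs=2):
    """Keep only the columns whose term occurs in at least min_docs rows."""
    keep = [
        j
        for j in range(len(terms))
        if sum(1 for row in matrix if row[j] > 0) >= min_docs
    ]
    kept_terms = [terms[j] for j in keep]
    kept_matrix = [[row[j] for j in keep] for row in matrix]
    return kept_terms, kept_matrix


# Made-up TDM: 3 documents (rows) x 4 terms (columns).
terms = ["data", "mining", "patent", "text"]
matrix = [
    [2, 1, 0, 0],
    [0, 1, 0, 3],
    [1, 1, 1, 0],
]
kept_terms, kept_matrix = prune_rare_terms(terms, matrix)
print(kept_terms)   # ['data', 'mining']
print(kept_matrix)  # [[2, 1], [0, 1], [1, 1]]
```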
127
What transformation method is mentioned to reduce dimensionality?
Singular value decomposition (SVD)
128
Step 3 of text mining process is called what?
Extract patterns/knowledge
129
In Step 3 of the text mining process, classification corresponds to what phrase in parentheses?
Text categorization
130
In Step 3 of the text mining process, clustering corresponds to what phrase in parentheses?
Natural groupings of text
131
List one more Step 3 technique.
Association
132
List one more Step 3 technique.
Trend analysis
133
Which terminology term means a collection of terms specific to a narrow field used to restrict extracted terms within a corpus?
Term dictionary
134
Which terminology term means the number of times a word is found in a specific document?
Word frequency
135
Which terminology term means marking words as nouns, verbs, adjectives, adverbs, etc., based on definition and context?
Part-of-speech tagging