Lecture 2: Document Processing Flashcards

(34 cards)

1
Q

problems parsing documents

A
  • format
  • language
  • reading direction
  • character set
2
Q

token

A

individual words in a text (syntactic unit)

3
Q

term/type

A

individual words in the vocabulary of a text (each distinct word counted once)
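
The difference between tokens and types can be sketched in a few lines. This is only an illustration: the function name `tokens_and_types` and the whitespace-based splitting are assumptions, not part of the lecture.

```python
def tokens_and_types(text):
    """Return (tokens, types) for a text, using naive whitespace splitting."""
    tokens = text.lower().split()   # every occurrence counts as a token
    types = set(tokens)             # each distinct word counts once as a type
    return tokens, types


tokens, types = tokens_and_types("the cat sat on the mat")
print(len(tokens), len(types))  # 6 tokens, 5 types ("the" occurs twice)
```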

4
Q

morpheme

A

smallest grammatical/meaningful unit in a language

5
Q

inflection

A

changing the form of a word to mark case, conjugation, or comparison

6
Q

basic form

A

word without any inflection: nominative singular for nouns, infinitive for verbs, adjectives without comparative

7
Q

derivation

A

forming a new word from an existing word, often by adding prefix or suffix

8
Q

compound

A

a word that consists of more than one stem

9
Q

noun phrase

A

a phrase which generally has a noun as its head word

10
Q

tokenization problems
specific problems

A
  • elimination of stop words
  • end of sentences
  • contraction
  • where to separate words
  • numbers
  • apostrophes
  • special words
  • same word, different meaning (lower/upper case)
  • differences in languages
  • no whitespaces (compound words)
  • accents
  • accents are often not used even if grammatically needed
  • umlauts
11
Q

tokenization

A

splitting text into smaller semantic units
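
A minimal regex tokenizer shows how some of the specific problems (apostrophes, contractions, numbers) might be handled. The pattern and the name `tokenize` are assumptions chosen for illustration; real tokenizers are language dependent and considerably more involved.

```python
import re

# Assumed pattern: numbers with internal separators (1,250.50) are tried first,
# then words that may contain apostrophes (O'Neill, didn't).
TOKEN_PATTERN = re.compile(r"\d+(?:[.,]\d+)*|\w+(?:'\w+)*")


def tokenize(text):
    """Split text into tokens, dropping punctuation and whitespace."""
    return TOKEN_PATTERN.findall(text)


print(tokenize("Mr. O'Neill paid 1,250.50 dollars, didn't he?"))
# ['Mr', "O'Neill", 'paid', '1,250.50', 'dollars', "didn't", 'he']
```

Note that the period after "Mr" is discarded here, which illustrates why abbreviations and sentence ends are listed as tokenization problems.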

12
Q

tokenization problems
general problems

A
  • how are users likely to write their queries
  • tokenization steps for queries must be identical to steps for documents
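
The second point can be sketched with a toy inverted index in which documents and queries pass through the same analysis function. All names here (`analyze`, `build_index`, `search`) are hypothetical, and the analysis is deliberately reduced to word splitting plus lower-casing.

```python
import re


def analyze(text):
    """Shared analysis step: split into words and lower-case them."""
    return [token.lower() for token in re.findall(r"\w+", text)]


def build_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = {}
    for doc_id, text in docs.items():
        for term in analyze(text):
            index.setdefault(term, set()).add(doc_id)
    return index


def search(index, query):
    """Return ids of documents containing every query term."""
    postings = [index.get(term, set()) for term in analyze(query)]
    return set.intersection(*postings) if postings else set()


docs = {1: "Windows are open.", 2: "The window is closed."}
index = build_index(docs)
print(search(index, "WINDOW"))  # {2}
```

If the query skipped the lower-casing step, "WINDOW" would not match the indexed term "window" at all.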
13
Q

normalization

A

reduce text to a pre-defined form

14
Q

normalization techniques

A

grammatical markings
stemming
lemmatization

15
Q

(normalization techniques)
grammatical markings

A
  • used to distinguish the different forms of a word
16
Q

(normalization techniques)
stemming

A
  • focus on syntax
  • an operation that strips off grammatical markings to leave the stem
  • chopping off the end of the word
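
A deliberately naive suffix stripper illustrates the "chopping" idea. The suffix list, the minimum stem length, and the name `naive_stem` are assumptions made for this sketch; it is not the Porter algorithm or any other published stemmer.

```python
# Assumed suffix list, checked in this order (longer suffixes first).
SUFFIXES = ("ing", "ed", "es", "s")


def naive_stem(word, min_stem_len=3):
    """Strip the first matching suffix if enough of the word remains."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word


for word in ("walking", "walked", "walks", "running", "sing"):
    print(word, "->", naive_stem(word))
# walking -> walk
# walked -> walk
# walks -> walk
# running -> runn
# sing -> sing
```

The result "runn" is not a real word, which shows that a stemmer works on the surface form only and does not need to return a dictionary entry.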
17
Q

(normalization techniques)
lemmatization

A
  • focus on semantics
  • an operation that specifies the lemma corresponding to a word form together with corresponding morpho-syntactic and possible morpho-derivational information
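
In contrast to stemming, lemmatization needs lexical knowledge. The sketch below uses a tiny hand-made lexicon keyed by word form and part of speech; the lexicon entries, the tag names, and the function `lemmatize` are all hypothetical.

```python
# Hypothetical mini-lexicon: (word form, part of speech) -> lemma.
LEXICON = {
    ("better", "ADJ"): "good",
    ("went", "VERB"): "go",
    ("mice", "NOUN"): "mouse",
    ("saw", "VERB"): "see",
    ("saw", "NOUN"): "saw",
}


def lemmatize(word, pos):
    """Look up the lemma; fall back to the lower-cased word form."""
    return LEXICON.get((word.lower(), pos), word.lower())


print(lemmatize("saw", "VERB"))  # see
print(lemmatize("saw", "NOUN"))  # saw
print(lemmatize("Mice", "NOUN"))  # mouse
```

The two results for "saw" show why the part of speech matters: the same word form maps to different lemmas.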
18
Q

case folding

A
  • reduces all letters to lower case
  • exceptions possible
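
Case folding with exceptions might look like the following sketch. The exception set `KEEP_CASE` and the function name `case_fold` are assumptions; which tokens to protect depends on the collection.

```python
# Assumed exceptions: tokens whose meaning changes when lower-cased.
KEEP_CASE = {"US", "WHO", "IT"}


def case_fold(tokens):
    """Lower-case every token except the listed exceptions."""
    return [t if t in KEEP_CASE else t.lower() for t in tokens]


print(case_fold(["The", "US", "helped", "us"]))
# ['the', 'US', 'helped', 'us']
```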
19
Q

stop words

A
  • extremely common words
  • usually not helpful
  • exceptions
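
A simple stop word filter, with one of the exceptions made visible. The stop list below is a small assumed sample, not a standard list, and `remove_stop_words` is a hypothetical name.

```python
# Small assumed stop list for illustration only.
STOP_WORDS = {"a", "an", "and", "be", "is", "not", "of", "or", "the", "to"}


def remove_stop_words(tokens):
    """Drop tokens that appear in the stop list (case-insensitive)."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]


print(remove_stop_words("the history of the internet".split()))
# ['history', 'internet']
print(remove_stop_words("to be or not to be".split()))
# []
```

The second call shows an exception: a phrase made up entirely of stop words disappears, so such a query could no longer be answered.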
20
Q

(equivalence classing)
soundex

A
  • uses heuristics to group terms into equivalence classes, or to expand them, by phonetic similarity
  • used for foreign/complex words
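
A sketch of the commonly described American Soundex variant: keep the first letter, map the remaining consonants to digits, collapse adjacent equal codes, and pad to four characters. It assumes plain ASCII names; the function name `soundex` and the handling of empty input are choices made for this example.

```python
# Consonant groups and their Soundex digits.
CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for ch in letters:
        CODES[ch] = digit


def soundex(name):
    """Return the four-character Soundex code of a name."""
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0].upper()
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters with equal codes
        code = CODES.get(ch, "")
        if code and code != prev:
            result += code
        prev = code  # vowels reset prev, so a repeated code is kept
    return (result + "000")[:4]


print(soundex("Robert"), soundex("Rupert"))  # R163 R163
print(soundex("Tymczak"))                    # T522
```

"Robert" and "Rupert" fall into the same equivalence class, which is the intended effect for names that sound alike but are spelled differently.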
21
Q

(equivalence classing)
thesauri

A
  • semantic equivalence
22
Q

(equivalence classing)
context dependent typings

A

the same word has different meanings depending on context

23
Q

(equivalence classing)
transliteration

A
  • science concerned with the 1:1 mapping of the symbols of one script onto another
  • a 1:1 mapping is not always possible
24
Q

problems with stemming

A

  • error-prone
  • language dependent
  • often strips both inflectional and derivational affixes

25
Q

stemming algorithms

A
  • table lookup approach
  • successor variety
  • n-gram stemmers
  • affix removal stemmers
26
Q

(stemming algorithms)
table lookup approach

A
  • table of stems and the terms that belong to them
  • look up a term and return the associated stem
27
Q

(stemming algorithms)
table lookup approach problems

A
  • labor intensive
  • missing exceptional cases
  • storage overhead
  • often only available for technical terms
28
Q

(stemming algorithms)
n-gram stemmers

A
  • association measures are calculated between pairs of terms based on their shared unique n-grams
  • Dice's coefficient is used to determine similarity
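
Dice's coefficient over unique n-grams is S = 2C / (A + B), where A and B are the numbers of unique n-grams in the two terms and C is the number they share. The sketch below uses bigrams; the function names are assumptions.

```python
def unique_ngrams(term, n=2):
    """Return the set of distinct character n-grams of a term."""
    return {term[i:i + n] for i in range(len(term) - n + 1)}


def dice(a, b, n=2):
    """Dice's coefficient 2C / (A + B) over unique n-grams."""
    x, y = unique_ngrams(a, n), unique_ngrams(b, n)
    if not x and not y:
        return 0.0  # both terms shorter than n: no n-grams to compare
    return 2 * len(x & y) / (len(x) + len(y))


# "statistics" has 7 unique bigrams, "statistical" has 8, and they share 6.
print(dice("statistics", "statistical"))  # 0.8
```

Terms whose pairwise similarity exceeds a chosen threshold would then be clustered and treated as variants of one stem.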
29
Q

(stemming algorithms)
affix removal stemmers

A
  • remove suffixes and/or prefixes
  • examples: Lovins, Dawson, Porter, Paice
30
Q

(stemming algorithms)
Porter algorithm

A
  • standard for English
  • set of context-sensitive rewriting rules
31
Q

Part-of-Speech Tagging

A
  • labeling each word in a sentence with its appropriate part of speech
32
Q

(POS Tagging)
Simple Stochastic Tagger

A
  • each word is assigned its most frequent tag
  • may create an unacceptable tag sequence
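
A most-frequent-tag baseline can be sketched as follows. The toy training corpus, the tag names, the default tag for unknown words, and the function names `train` and `tag` are all assumptions made for this example.

```python
from collections import Counter, defaultdict


def train(tagged_sentences):
    """Learn the most frequent tag for every word in the training data."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, pos in sentence:
            counts[word.lower()][pos] += 1
    return {word: c.most_common(1)[0][0] for word, c in counts.items()}


def tag(model, words, default="NOUN"):
    """Assign each word its most frequent tag, ignoring all context."""
    return [(w, model.get(w.lower(), default)) for w in words]


corpus = [
    [("the", "DET"), ("can", "NOUN"), ("is", "VERB"), ("empty", "ADJ")],
    [("I", "PRON"), ("can", "AUX"), ("swim", "VERB")],
    [("you", "PRON"), ("can", "AUX"), ("go", "VERB")],
]
model = train(corpus)
print(tag(model, ["the", "can"]))  # [('the', 'DET'), ('can', 'AUX')]
```

Because "can" is most often an auxiliary in the training data, the tagger outputs the sequence DET AUX for "the can", an example of an unacceptable tag sequence.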
33
Q

Named Entity Recognition

A
  • locates and classifies named entities in an unstructured text into predefined categories
34
Q

Dependency Parsing

A
  • words are connected to each other by direct links
  • parses a sentence into its grammatical structure