Vl 2 Document Processing Flashcards by Clara Zornow

problems parsing documents

format
language
reading direction
character set

token

an individual occurrence of a word in a text (syntactic unit)

term/type

a distinct word in the vocabulary of a text
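
A minimal sketch of the token/type distinction, assuming a naive regex tokenizer (the `tokenize` helper is hypothetical and only for illustration): tokens count every occurrence, types count each distinct word once.

```python
import re

def tokenize(text):
    # Assumption: a token is a maximal run of letters, digits, or apostrophes,
    # compared case-insensitively.
    return re.findall(r"[A-Za-z0-9']+", text.lower())

text = "The cat sat on the mat"
tokens = tokenize(text)   # every occurrence counts
types = set(tokens)       # each distinct word counts once

print(len(tokens))  # 6
print(len(types))   # 5
```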

morpheme

smallest grammatical/meaningful unit in a language

inflection

changes to a word's form for case, conjugation, or comparison

basic form

word without any inflection: nominative singular for nouns, infinitive for verbs, adjectives without comparative

derivation

forming a new word from an existing word, often by adding a prefix or suffix

compound

a word that consists of more than one stem

noun phrase

a phrase which generally has a noun as its head word

tokenization problems
specific problems

elimination of stop words
end of sentences
contraction
where to separate words
numbers
apostrophes
special words
same word, different meaning (lower/upper case)
differences in languages
no whitespaces (compound words)
accents
accents are often not used even if grammatically needed
umlauts
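
Because writers often drop accents, one common workaround is to strip them on both the document and the query side. A minimal sketch using only the standard library follows; the `strip_accents` name is an assumption, and note that this maps "ü" to "u", which may or may not be the desired handling for German umlauts.

```python
import unicodedata

def strip_accents(text):
    # NFD splits an accented letter into base letter + combining mark(s).
    decomposed = unicodedata.normalize("NFD", text)
    # Drop the combining marks and keep the base letters.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(strip_accents("résumé"))  # resume
print(strip_accents("Müller"))  # Muller
```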

tokenization

splitting text into smaller semantic units
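
A rough sketch of why tokenization is more than splitting on whitespace. Both functions and the regular expression are illustrative assumptions, not a standard tokenizer: the regex keeps decimal numbers and simple contractions together and drops sentence-final punctuation.

```python
import re

def whitespace_tokenize(text):
    # Simplest possible approach: split on runs of whitespace.
    return text.split()

def regex_tokenize(text):
    # First alternative: numbers with an optional decimal part.
    # Second alternative: words with an optional internal apostrophe part.
    return re.findall(r"\d+(?:\.\d+)?|\w+(?:'\w+)?", text)

sample = "Don't pay 3.50 for it."
print(whitespace_tokenize(sample))  # ["Don't", 'pay', '3.50', 'for', 'it.']
print(regex_tokenize(sample))       # ["Don't", 'pay', '3.50', 'for', 'it']
```

The whitespace version leaves the full stop glued to "it."; whether "Don't" should stay one token or become two is exactly the kind of decision the tokenization problems cards describe.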

tokenization problems
general problems

how are users likely to write their queries
tokenization steps for queries must be identical to steps for documents
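
One way to guarantee identical processing is to route documents and queries through the same function. The sketch below assumes a toy in-memory inverted index; `analyze`, `index`, and `search` are hypothetical names.

```python
import re

def analyze(text):
    # Shared pipeline: case folding followed by a simple regex tokenizer.
    return re.findall(r"\w+", text.lower())

docs = {1: "Tokenization splits text", 2: "Stemming strips endings"}

# Build the inverted index with the shared pipeline.
index = {}
for doc_id, text in docs.items():
    for term in analyze(text):
        index.setdefault(term, set()).add(doc_id)

def search(query):
    # The query goes through exactly the same steps as the documents.
    terms = analyze(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result

print(search("TOKENIZATION"))  # {1}
```

If the query were looked up without `analyze`, the upper-case "TOKENIZATION" would not match the lower-cased index entry.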

normalization

reduce text to a predefined form

normalization techniques

grammatical markings
stemming
lemmatization

(normalization techniques)
grammatical markings

used to distinguish the different forms of a word

(normalization techniques)
stemming

focus on syntax
an operation that strips off grammatical markings to leave the stem
chopping off the end

(normalization techniques)
lemmatization

focus on semantics
an operation that specifies the lemma corresponding to a word form together with corresponding morpho-syntactic and possible morpho-derivational information
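
A minimal sketch of dictionary-based lemmatization, assuming a tiny hand-made lexicon (all entries and the `lemmatize` helper are hypothetical). It illustrates why part-of-speech information matters: the same word form can map to different lemmas.

```python
# Hypothetical toy lexicon: (word form, part of speech) -> lemma.
LEXICON = {
    ("better", "ADJ"): "good",
    ("saw", "VERB"): "see",
    ("saw", "NOUN"): "saw",
    ("mice", "NOUN"): "mouse",
}

def lemmatize(word, pos):
    # Fall back to the lower-cased form when the lexicon has no entry.
    return LEXICON.get((word.lower(), pos), word.lower())

print(lemmatize("saw", "VERB"))    # see
print(lemmatize("saw", "NOUN"))    # saw
print(lemmatize("better", "ADJ"))  # good
```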

case folding

reduces all letters to lower case
exceptions possible
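
A small sketch of case folding with an exception list. The `keep` parameter is an assumption used to illustrate the exceptions; for example, "US" (the country) should not collapse into "us" (the pronoun).

```python
def case_fold(tokens, keep=frozenset()):
    # Lower-case every token unless it is listed as an exception.
    return [t if t in keep else t.lower() for t in tokens]

print(case_fold(["The", "US", "Army"]))               # ['the', 'us', 'army']
print(case_fold(["The", "US", "Army"], keep={"US"}))  # ['the', 'US', 'army']
```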

stop words

extremely common words
usually not helpful
exceptions possible
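
A minimal sketch of stop word removal, assuming a tiny hand-picked stop list (real lists are longer and language dependent). The second call shows a typical exception: a phrase made up entirely of stop words disappears.

```python
# Hypothetical, deliberately short stop list.
STOP_WORDS = {"the", "a", "an", "of", "to", "be", "or", "not", "and", "in"}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    # Keep only tokens whose lower-cased form is not a stop word.
    return [t for t in tokens if t.lower() not in stop_words]

print(remove_stop_words(["the", "history", "of", "tokenization"]))
# ['history', 'tokenization']
print(remove_stop_words("to be or not to be".split()))
# []
```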

(equivalence classing)
soundex

uses heuristics to equivalence-class terms or expand them with phonetic equivalents
used for foreign/complex words
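
A sketch of the classic American Soundex variant (other variants exist and differ in detail). It keeps the first letter, encodes the following consonants as digits, collapses adjacent equal codes, and pads to four characters. The function name and table layout are illustrative choices.

```python
# Consonant groups and their Soundex digits.
CODES = {}
for digit, letters in {"1": "bfpv", "2": "cgjkqsxz", "3": "dt",
                       "4": "l", "5": "mn", "6": "r"}.items():
    for letter in letters:
        CODES[letter] = digit

def soundex(word):
    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0].upper()
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters with equal codes
        code = CODES.get(ch, "")  # vowels (and y) get no code
        if code and code != prev:
            result += code
        prev = code  # a vowel resets prev, so a repeated code is kept
    return (result + "000")[:4]

print(soundex("Robert"))   # R163
print(soundex("Rupert"))   # R163
print(soundex("Tymczak"))  # T522
```

"Robert" and "Rupert" land in the same equivalence class, which is the intended effect for names that sound alike but are spelled differently.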

(equivalence classing)
thesauri

semantic equivalence

(equivalence classing)
context dependent typings

same word, meaning depends on context

(equivalence classing)
transliteration

science concerned with the 1:1 mapping of symbols from one script to another
1:1 mapping not always possible
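
A small sketch of table-driven transliteration from Cyrillic to Latin. The table is a hypothetical partial mapping covering only the letters in the examples; it shows why a strict 1:1 mapping breaks down, since one Cyrillic letter can need several Latin letters.

```python
# Partial, illustrative mapping; a real table covers the whole alphabet.
CYRILLIC_TO_LATIN = {
    "б": "b", "о": "o", "р": "r", "щ": "shch",
    "м": "m", "с": "s", "к": "k", "в": "v", "а": "a",
}

def transliterate(text):
    # Characters without an entry are passed through unchanged.
    return "".join(CYRILLIC_TO_LATIN.get(ch, ch) for ch in text)

print(transliterate("москва"))  # moskva
print(transliterate("борщ"))    # borshch
```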

problems with stemming

- error prone
- language dependent
- often inflectional and derivational

stemming algorithms

- table lookup approach
- successor variety
- n-gram stemmers
- affix removal stemmers

(stemming algorithms) table lookup approach

- table of terms and their corresponding stems
- a term is looked up and the associated stem is returned
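
A minimal sketch of the table lookup approach, assuming a tiny hand-built table (the entries and the `lookup_stem` helper are hypothetical). Terms missing from the table fall through unchanged, which hints at why the approach is labor intensive.

```python
# Hypothetical stem table: term -> stem.
STEM_TABLE = {
    "engineer": "engineer",
    "engineered": "engineer",
    "engineering": "engineer",
}

def lookup_stem(term):
    term = term.lower()
    # Return the stored stem, or the term itself if it is not in the table.
    return STEM_TABLE.get(term, term)

print(lookup_stem("Engineering"))  # engineer
print(lookup_stem("engineers"))    # engineers (not in the table)
```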

(stemming algorithms) table lookup approach problems

- labor intensive
- missing exceptional cases
- storage overhead
- often only with technical terms

(stemming algorithms) N-gram Stemmers

- association measures are calculated between pairs of terms based on shared unique n-grams
- Dice's coefficient is used to determine similarity
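
A sketch of the similarity computation, assuming bigrams (n = 2) as the n-gram size. With A and B as the sets of unique bigrams of two terms, Dice's coefficient is 2 * |A ∩ B| / (|A| + |B|). The helper names are illustrative.

```python
def unique_bigrams(term):
    # Set of distinct adjacent letter pairs in the term.
    return {term[i:i + 2] for i in range(len(term) - 1)}

def dice(a, b):
    grams_a, grams_b = unique_bigrams(a), unique_bigrams(b)
    if not grams_a and not grams_b:
        return 0.0  # assumption: terms too short for bigrams count as dissimilar
    shared = len(grams_a & grams_b)
    return 2 * shared / (len(grams_a) + len(grams_b))

# "statistics" has 7 unique bigrams, "statistical" has 8, and they share 6.
print(dice("statistics", "statistical"))  # 0.8
```

Terms whose pairwise similarity exceeds a chosen threshold can then be clustered and treated as sharing a stem.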

(stemming algorithms) affix removal stemmers

- remove suffixes and/or prefixes
- examples: Lovins, Dawson, Porter, Paice

(stemming algorithms) Porter Algorithm

- standard for English
- set of context-sensitive rewriting rules
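
To give a feel for the rewriting rules, the sketch below covers only step 1a of the Porter algorithm (plural endings); the later steps, which add conditions on the remaining stem, are omitted. The rules are tried in order and the first matching suffix wins.

```python
# Porter step 1a: (suffix, replacement), longest suffix first.
STEP_1A = [("sses", "ss"), ("ies", "i"), ("ss", "ss"), ("s", "")]

def porter_step_1a(word):
    for suffix, replacement in STEP_1A:
        if word.endswith(suffix):
            return word[:-len(suffix)] + replacement
    return word  # no rule applies

for w in ["caresses", "ponies", "caress", "cats"]:
    print(w, "->", porter_step_1a(w))
# caresses -> caress
# ponies -> poni
# caress -> caress
# cats -> cat
```

Note that "poni" is not a dictionary word: a stem only needs to be shared by related word forms, it does not need to be a valid word itself.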

Part-of-Speech-Tagging

- labeling each word in a sentence with its appropriate part of speech

(POS Tagging) Simple Stochastic Tagger

- each word is assigned its most frequent tag
- may create unacceptable tag sequences
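
A minimal sketch of such a tagger, assuming a tiny hand-tagged corpus and a made-up tag set (`train`, `tag`, and the default tag for unknown words are illustrative choices). The example output shows an unacceptable sequence: a determiner followed by an auxiliary.

```python
from collections import Counter, defaultdict

def train(tagged_sentences):
    # Count how often each word appears with each tag.
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag_label in sentence:
            counts[word.lower()][tag_label] += 1
    # Keep only the most frequent tag per word.
    return {word: c.most_common(1)[0][0] for word, c in counts.items()}

def tag(words, model, default="NOUN"):
    # Assumption: unknown words receive a default tag.
    return [(w, model.get(w.lower(), default)) for w in words]

corpus = [
    [("the", "DET"), ("can", "NOUN"), ("is", "VERB"), ("empty", "ADJ")],
    [("I", "PRON"), ("can", "AUX"), ("swim", "VERB")],
    [("you", "PRON"), ("can", "AUX"), ("go", "VERB")],
]
model = train(corpus)
print(tag(["the", "can"], model))  # [('the', 'DET'), ('can', 'AUX')]
```

Because "can" is most often an auxiliary in the training data, it is tagged AUX even directly after "the", where only the noun reading makes sense.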

Named Entity Recognition

- locates and classifies named entities in unstructured text into predefined categories

Dependency Parsing

- words are connected to each other by directed links
- parses a sentence into its grammatical structure

Vl 2 Document Processing Flashcards

(34 cards)