problems parsing documents
token
an individual word occurrence in a text (syntactic unit)
term/type
a distinct word in a text's vocabulary
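A minimal sketch of the token/type distinction, assuming whitespace-and-punctuation splitting and lowercasing (both are simplifications, and the function name is hypothetical):

```python
import re

def tokens_and_types(text):
    # Tokens: every word occurrence; types: the distinct words (the vocabulary).
    tokens = re.findall(r"\w+", text.lower())
    types = set(tokens)
    return tokens, types

tokens, types = tokens_and_types("to be or not to be")
print(len(tokens), len(types))  # 6 tokens, 4 types
```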
morpheme
smallest grammatical/meaningful unit in a language
inflection
grammatical variation of a word: case (declension), conjugation, comparison of adjectives
basic form
word without any inflection: nominative singular for nouns, infinitive for verbs, positive (non-comparative) form for adjectives
derivation
forming a new word from an existing word, often by adding a prefix or suffix
compound
a word that consists of more than one stem
noun phrase
a phrase that generally has a noun as its head word
tokenization problems
specific problems
tokenization
splitting text into smaller semantic units
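A rough sketch of a rule-based tokenizer, assuming English text and a hypothetical rule that keeps internal apostrophes and hyphens inside a token. It also hints at typical problems: the period of "Mr." is dropped, and "New York" is split into two tokens.

```python
import re

def tokenize(text):
    # Runs of letters/digits, optionally joined by an internal apostrophe or hyphen.
    return re.findall(r"[A-Za-z0-9]+(?:['-][A-Za-z0-9]+)*", text)

print(tokenize("Mr. O'Neill isn't in New York-based firms."))
```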
tokenization problems
general problems
normalization
reduce text to a pre-defined form
normalization techniques
grammatical markings
stemming
lemmatization
(normalization techniques)
grammatical markings
(normalization techniques)
stemming
(normalization techniques)
lemmatization
case folding
stop words
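Case folding and stop-word removal can be sketched as follows. The stop-word list is a tiny hypothetical example, not a standard list.

```python
STOP_WORDS = {"the", "a", "an", "of", "and", "is"}  # hypothetical, illustrative only

def normalize(tokens):
    # Case folding: map every token to a single case-insensitive form.
    folded = [t.casefold() for t in tokens]
    # Stop-word removal: drop very frequent words that carry little content.
    return [t for t in folded if t not in STOP_WORDS]

print(normalize(["The", "Art", "of", "WAR"]))  # ['art', 'war']
```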
(equivalence classing)
soundex
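A compact sketch of the classic American Soundex code, which maps similar-sounding names to the same four-character key. It assumes Latin-alphabet input; other letters are treated like vowels.

```python
CODES = {}
for digit, letters in enumerate(["bfpv", "cgjkqsxz", "dt", "l", "mn", "r"], start=1):
    for ch in letters:
        CODES[ch] = str(digit)

def soundex(name):
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0].upper()          # the first letter is kept as-is
    prev = CODES.get(letters[0], "")     # its code still suppresses a repeat
    for ch in letters[1:]:
        if ch in "hw":
            continue                     # h and w do not separate equal codes
        code = CODES.get(ch, "")         # vowels get no code and reset prev
        if code and code != prev:
            result += code
        prev = code
    return (result + "000")[:4]          # pad or truncate to four characters

print(soundex("Robert"), soundex("Rupert"))  # R163 R163
```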
(equivalence classing)
thesauri
(equivalence classing)
context-dependent typings
- same word, with its meaning depending on the context
(equivalence classing)
transliteration
problems with stemming
- error-prone
- language-dependent
- often removes both inflectional and derivational affixes
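These problems show up even in a toy suffix stripper. The sketch below uses a hypothetical English-only rule list, not a real stemming algorithm such as Porter's:

```python
SUFFIXES = ["ing", "ed", "es", "s"]  # hypothetical English-only rules

def naive_stem(word):
    # Strip the first matching suffix, keeping at least three characters.
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for w in ["walking", "walked", "walks", "news", "ran", "Häuser"]:
    print(w, "->", naive_stem(w))
```

"news" is wrongly cut to "new" (error-prone), "ran" is not mapped to "run", and the German plural "Häuser" is left untouched because the rules only cover English (language-dependent).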