NLP using rules:
Rule-based vs statistical methods
Hand-crafted decision trees
Why “Hand-crafted”?
Hand-crafted vs statistical decision trees
When to use?
For which tasks to use?
When to use?
- Decision tree structures get complicated fast
- The number of decision criteria to consider should be small
- The decision criteria should not be too interdependent
- Rule of thumb: few criteria with clear connections to outcomes
For which tasks to use?
- Theoretically, there is no real restriction; practically, they are mostly used for shallow lexical or syntactic analyses
- Rule of thumb: the surface form of a text is enough for the decisions
Tokenization and sentence splitting
Tokenization:
Sentence splitting:
What first?
The default is to tokenize first, but both schedules exist
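As a sketch of rule-based tokenization, the following regular-expression tokenizer splits off punctuation while keeping multi-period abbreviations intact. The patterns are illustrative, not a complete rule set:

```python
import re

# A minimal rule-based tokenizer. Alternatives are tried in order:
# multi-period abbreviations first, then words, then punctuation.
TOKEN_PATTERN = re.compile(r"""
    (?:[A-Za-z]\.){2,}   # abbreviations like U.S. or e.g.
  | \w+(?:-\w+)*         # words, optionally hyphenated
  | [.,!?;:"()]          # punctuation marks as separate tokens
""", re.VERBOSE)

def tokenize(text):
    """Return the list of tokens found in `text`."""
    return TOKEN_PATTERN.findall(text)
```

Note that the rule ordering matters: if the word pattern came first, "U.S." would be split into "U", ".", "S", ".".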
Sentence splitting with a decision tree:
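Such a tree can be written down as nested conditions. The sketch below assumes three decision criteria for illustration (end-of-sentence character, known abbreviation, capitalized next token); the abbreviation list is invented:

```python
# Illustrative abbreviation list; a real system would need many more.
ABBREVIATIONS = {"dr.", "mr.", "e.g.", "etc.", "u.s."}

def is_sentence_boundary(token, next_token):
    """Hand-crafted decision tree: does a sentence end after `token`?"""
    if not token.endswith((".", "!", "?")):
        return False          # no end-of-sentence character
    if token.lower() in ABBREVIATIONS:
        return False          # period belongs to a known abbreviation
    if next_token is None:
        return True           # end of text
    if next_token[0].isupper():
        return True           # next token starts a new sentence
    return False

def split_sentences(tokens):
    """Group a token list into sentences using the decision tree."""
    sentences, current = [], []
    for i, tok in enumerate(tokens):
        current.append(tok)
        nxt = tokens[i + 1] if i + 1 < len(tokens) else None
        if is_sentence_boundary(tok, nxt):
            sentences.append(current)
            current = []
    if current:
        sentences.append(current)
    return sentences
```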
Potential decision criteria for Tokenization and sentence splitting
Issues with decision criteria
Issues with decision trees
- Decision trees get complex fast, even for just a few decision criteria
- The mutual effects of decision rules are hard to foresee
- Adding new decision criteria may change a tree drastically
Benefits and limitations of decision trees
Benefits
Limitations
Finite-State Transducers
FSA: A finite-state automaton is a state machine that reads a string from a regular language; it represents the set of all strings belonging to that language
FST: A finite-state transducer extends an FSA in that it reads one string and writes another; it represents a relation between two sets of strings (a set of string pairs)
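A deterministic FST can be sketched as a transition table mapping (state, input symbol) to (next state, output string). The toy transducer below handles one fragment of English plural orthography; the state names and rules are illustrative:

```python
import string

def run_fst(transitions, final_states, start, symbols):
    """Run a deterministic FST on an input symbol sequence.
    Returns the output string, or None if the input is rejected."""
    state, out = start, []
    for sym in symbols:
        if (state, sym) not in transitions:
            return None               # no transition: reject
        state, piece = transitions[(state, sym)]
        out.append(piece)
    return "".join(out) if state in final_states else None

# Toy FST: copies letters and realizes the morpheme boundary '+'
# followed by 's' as 'es' after 'x', otherwise as plain 's'.
transitions = {}
for c in string.ascii_lowercase:
    transitions[("q0", c)] = ("qx" if c == "x" else "q0", c)
    transitions[("qx", c)] = ("qx" if c == "x" else "q0", c)
transitions[("q0", "+")] = ("plus", "")     # boundary after non-x
transitions[("qx", "+")] = ("plus_x", "")   # boundary after x
transitions[("plus", "s")] = ("q0", "s")    # cat+s -> cats
transitions[("plus_x", "s")] = ("q0", "es") # fox+s -> foxes
final = {"q0", "qx"}
```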
Ways of employing an FST:
Morphological analysis as rewriting
Knowledge needed:
- Lexicon: Stems with affixes, together with morphological information
- Morphotactics: a model that explains which morpheme classes can follow others inside a word
- Orthographic rules: a model of the changes that may occur in a word, particularly when two morphemes combine
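The three knowledge sources can be combined into a toy analyzer that maps a surface form to a stem-plus-feature analysis. The lexicon, morphotactics, and the single orthographic rule below are invented stand-ins for illustration:

```python
# Lexicon: stems with their morphological class (illustrative).
LEXICON = {"cat": "N", "fox": "N", "walk": "V"}

# Morphotactics: which affixes may follow which stem class.
MORPHOTACTICS = {"N": {"s": "+PL"}, "V": {"s": "+3SG", "ed": "+PAST"}}

def analyze(word):
    """Return the possible stem+feature analyses of a surface word."""
    analyses = []
    for stem, cls in LEXICON.items():
        if word == stem:
            analyses.append(stem + "+" + cls)
        for affix, feature in MORPHOTACTICS[cls].items():
            # Orthographic rule: 's' is realized as 'es' after 'x'.
            if stem.endswith("x") and affix == "s":
                surface = stem + "e" + affix
            else:
                surface = stem + affix
            if word == surface:
                analyses.append(stem + "+" + cls + feature)
    return analyses
```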
Word Normalization
- The conversion of all words in a text into some defined canonical form
- Used in NLP to identify different forms of the same word
Common character-level word normalizations:
- Case folding: Converting all letters to lower-case
- Removal of special characters: Keep only letters and digits
- Removal of diacritical marks: Keep only plain letters without diacritics
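These three normalizations can be sketched with the Python standard library; the pipeline and its ordering are one possible choice:

```python
import unicodedata

def normalize(text):
    """Character-level normalization: case folding, diacritic
    removal, and removal of special characters."""
    text = text.casefold()                      # case folding
    # Decompose characters so diacritics become combining marks.
    text = unicodedata.normalize("NFKD", text)
    text = "".join(c for c in text if not unicodedata.combining(c))
    # Keep only letters, digits, and whitespace.
    return "".join(c for c in text if c.isalnum() or c.isspace())
```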
Morphological normalization
- Identification of a single canonical representative for morphologically related wordforms
- Reduces inflections (and partly also derivations) to a common base
- Two alternative techniques: stemming and lemmatization
Stemming with FST
With affix elimination:
Porter stemmer
Steps:
1. Rewrite the longest possible match of the given token against a set of defined character-sequence patterns
2. Repeat Step 1 until no pattern matches the token anymore
Signature
- Input: A string s (representing a word)
- Output: The identified stem of s
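The two steps can be sketched as a loop over ordered suffix rules. This is not the actual Porter stemmer, which has many more rules plus measure-based conditions on the remaining stem; the few rules here are illustrative:

```python
# Ordered suffix-rewriting rules, longest patterns first.
RULES = [("sses", "ss"), ("ies", "i"), ("ing", ""), ("ed", ""), ("s", "")]

def stem(s):
    """Repeatedly rewrite the longest matching suffix pattern
    (Step 1) until no pattern matches anymore (Step 2)."""
    changed = True
    while changed:
        changed = False
        for suffix, replacement in RULES:
            # Require that some stem material remains after rewriting.
            if s.endswith(suffix) and len(s) > len(suffix) + 1:
                s = s[: -len(suffix)] + replacement
                changed = True
                break
    return s
```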
Issues of Porter stemmer
Observations:
Benefits and Limitations of FST
Benefits of FST
Limitations of FST:
Template-based generation
Template-based generation
Case Study
Data-to-text (Template-based generation)
Content determination:
Discourse planning:
Sentence aggregation:
Lexicalization:
Referring expression generation
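The stages above can be sketched end-to-end for a toy weather record; the template, field names, and lexicon below are invented for illustration:

```python
# A hand-written surface template with slots to fill.
TEMPLATE = "{city} will be {condition} with a high of {high} degrees."

def generate(record):
    """Minimal data-to-text pipeline over one data record."""
    # Content determination: pick the fields worth reporting.
    content = {k: record[k] for k in ("city", "condition", "high")}
    # Lexicalization: map raw data values to words.
    lexicon = {"sunny": "mostly sunny", "rain": "rainy"}
    content["condition"] = lexicon.get(content["condition"],
                                       content["condition"])
    # Surface realization: fill the slots in the template.
    return TEMPLATE.format(**content)
```

Discourse planning and sentence aggregation only become visible with multiple records and templates; here a single template fixes both.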