Text pre-processing
Document level preparation
- Document conversion
- Language / Domain identification
Tokenisation
- Case folding
Basic lexical pre-processing
- Lemmatisation
- Stemming
- Spelling corrections
Case folding
Make everything the same case:
Parsnips -> parsnips
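A minimal sketch of case folding in Python, using only the standard library. The helper name fold_case is hypothetical, chosen for illustration. str.casefold() is used rather than str.lower() because it also handles a few non-English cases (for example the German sharp s), which is an assumption about what "the same case" should mean for multilingual text.

```python
def fold_case(text: str) -> str:
    """Return text with case distinctions removed (hypothetical helper)."""
    # casefold() is a more aggressive variant of lower() intended for
    # caseless matching; for plain English text the two give the same result.
    return text.casefold()


print(fold_case("Parsnips"))  # parsnips
```

Folding case early means that "Parsnips" at the start of a sentence and "parsnips" elsewhere are counted as the same token, at the cost of losing distinctions such as "US" versus "us".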
Lemmatisation
Reduce a word to its dictionary form (lemma):
parsnips -> parsnip
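The parsnips example can be sketched as a dictionary lookup. This is a toy illustration, not a real lemmatiser: the LEMMAS table and the lemmatise function are hypothetical, and a practical system would rely on a full lexicon and usually on part-of-speech information as well.

```python
# Toy lemma table (hypothetical); a real lemmatiser uses a full lexicon.
LEMMAS = {
    "parsnips": "parsnip",
    "mice": "mouse",
    "went": "go",
}


def lemmatise(token: str) -> str:
    """Map a token to its lemma, falling back to the token itself."""
    token = token.lower()
    # Unknown words are returned unchanged rather than guessed at.
    return LEMMAS.get(token, token)


print(lemmatise("parsnips"))  # parsnip
```

Because the result is looked up rather than derived by chopping characters, the output is always a real word, including for irregular forms such as "mice".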
Stemming
Strip affixes to leave a stem, which need not be a dictionary word:
automated -> automat
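A minimal sketch of a suffix-stripping stemmer that reproduces the automated example. The SUFFIXES list, the minimum stem length of 3, and the stem function are assumptions made for illustration. This is deliberately naive and is not the Porter algorithm or any other published stemmer, which apply many more rules and may return a different stem for the same word.

```python
# Suffixes to strip, checked in this order (hypothetical rule set).
SUFFIXES = ("ing", "ed", "es", "s")


def stem(token: str) -> str:
    """Strip the first matching suffix, keeping at least 3 characters."""
    token = token.lower()
    for suffix in SUFFIXES:
        # Only strip when enough of the word remains to be a usable stem.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


print(stem("automated"))  # automat
```

With these rules "automated", "automates" and "automating" all collapse to the same stem, which is the point of stemming even though "automat" is not itself the word's dictionary form.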