Text normalization: what does it depend on?
Text normalization is the process of transforming a text into some predefined standard form.
There is no all-purpose normalization procedure: the appropriate steps depend on the language of the text and on the downstream task or application.
Text normalization: contractions, punctuation, and special characters
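A minimal normalization sketch covering the three aspects named above. The contraction map and the exact sequence of steps are illustrative assumptions, not a standard procedure:

```python
import re

# Illustrative (not exhaustive) contraction map
CONTRACTIONS = {"don't": "do not", "it's": "it is", "we're": "we are"}

def normalize(text):
    # lowercase, expand contractions, strip punctuation/special characters
    text = text.lower()
    for short, full in CONTRACTIONS.items():
        text = text.replace(short, full)
    # remove anything that is not a letter, digit, or whitespace
    text = re.sub(r"[^\w\s]", "", text)
    # collapse runs of whitespace
    return re.sub(r"\s+", " ", text).strip()

print(normalize("It's raining, but we're NOT leaving!"))
# → "it is raining but we are not leaving"
```

Note that the order of steps matters: contractions must be expanded before the apostrophes are stripped.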
Tokenization: families of techniques
Tokenization is the process of segmenting text into units called tokens.
Tokenization techniques can be grouped into three families: word, character, and subword tokenization.
Tokens are then organized into a vocabulary and, depending on the specific NLP application, may later be mapped into natural numbers.
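The mapping from tokens to natural numbers can be sketched as follows; whitespace splitting stands in for a real tokenizer here:

```python
# Build a vocabulary from the tokens of a corpus and map tokens to
# natural numbers (token ids).
def build_vocab(texts):
    vocab = {}
    for text in texts:
        for token in text.split():       # placeholder tokenizer
            if token not in vocab:
                vocab[token] = len(vocab)  # next free id
    return vocab

def encode(text, vocab):
    # map each token of a new text to its id
    return [vocab[token] for token in text.split()]

vocab = build_vocab(["the cat sat", "the dog sat"])
print(vocab)                     # → {'the': 0, 'cat': 1, 'sat': 2, 'dog': 3}
print(encode("the dog", vocab))  # → [0, 3]
```

A real system would also reserve ids for special tokens (e.g., an unknown-word token), which is exactly the problem subword tokenization addresses later.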
Word tokenization: in which languages is it used? Exceptions…
Word tokenization is the most common approach for European languages.
Important exceptions for English:
There are certain language-independent tokens that require specialized processing:
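A rule-based word tokenizer can handle such tokens with an ordered regular expression: the pattern order below is an assumption, chosen so that specialized tokens (URLs, numbers with separators, abbreviations with internal periods) are matched before plain words and thus kept as single tokens:

```python
import re

TOKEN_RE = re.compile(r"""
    https?://\S+            # URLs
  | \d+(?:[.,]\d+)*         # numbers, incl. 3.14 or 1,000.50
  | (?:[A-Za-z]\.){2,}      # abbreviations such as U.S.A.
  | \w+(?:'\w+)?            # words, optionally with a clitic ('s, n't)
  | [^\w\s]                 # any remaining single punctuation mark
""", re.VERBOSE)

def tokenize(text):
    return TOKEN_RE.findall(text)

print(tokenize("It costs $1,000.50 in the U.S.A., see https://example.com"))
# → ['It', 'costs', '$', '1,000.50', 'in', 'the', 'U.S.A.', ',', 'see',
#    'https://example.com']
```

Each alternative is tried in order at every position, so moving a branch changes the segmentation; this brittleness is one motivation for learned tokenizers.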
Character tokenization
Major East Asian languages (e.g., Chinese, Japanese, Korean, and Thai) write text without any spaces between words.
Each character generally represents a single unit of meaning.
Word tokenization results in a huge vocabulary, with a large number of very rare words.
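Character tokenization sidesteps both problems: no word boundaries are needed and the vocabulary stays small. A minimal sketch:

```python
# Character tokenization: every non-whitespace character is a token.
def char_tokenize(text):
    return [ch for ch in text if not ch.isspace()]

print(char_tokenize("我爱自然语言处理"))
# → ['我', '爱', '自', '然', '语', '言', '处', '理']
```

The vocabulary is bounded by the character set of the language, at the cost of much longer token sequences per sentence.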
Subword tokenization: token learner & segmenter
Many NLP systems need to deal with unknown words. To handle this problem, modern tokenizers automatically induce sets of tokens that include tokens smaller than words, called subwords.
Subword tokenization schemes have two parts: a token learner, which induces a vocabulary (a set of tokens) from a training corpus, and a token segmenter, which splits new text into tokens from that vocabulary.
Subword tokenization: Byte-pair encoding (BPE) tokenization
The BPE token learner is usually run inside words (not merging across word boundaries).
The algorithm iterates through the following steps: count the frequency of every pair of adjacent symbols in the training corpus, merge the most frequent pair into a single new symbol, and add the new symbol to the vocabulary.
After k merge iterations, BPE has learned a vocabulary (the initial characters plus k merged symbols) and an ordered list of the k merges.
The BPE token segmenter applies to test data the merges learned from the training data. Merges are applied greedily, in the order in which they were learned.
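The learner and segmenter described above can be sketched compactly. This follows the classic within-word formulation: each word ends with a special end-of-word marker `_`, and merges never cross word boundaries (the toy corpus is an illustrative assumption):

```python
from collections import Counter

def get_pair_counts(corpus):
    # corpus: dict mapping a tuple of symbols to its word frequency
    pairs = Counter()
    for symbols, freq in corpus.items():
        for i in range(len(symbols) - 1):
            pairs[(symbols[i], symbols[i + 1])] += freq
    return pairs

def merge_pair(corpus, pair):
    # rewrite every word, fusing each occurrence of `pair` into one symbol
    merged = {}
    for symbols, freq in corpus.items():
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

def bpe_learn(words, k):
    # token learner: start from characters, repeatedly merge the most
    # frequent adjacent pair, and record the k merges in order
    corpus = Counter(tuple(w) + ("_",) for w in words)
    merges = []
    for _ in range(k):
        pairs = get_pair_counts(corpus)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        corpus = merge_pair(corpus, best)
    return merges

def bpe_segment(word, merges):
    # token segmenter: apply the learned merges greedily, in learned order
    symbols = {tuple(word) + ("_",): 1}
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(next(iter(symbols)))

merges = bpe_learn(["low"] * 5 + ["lower"] * 2 + ["newest"] * 6, k=4)
print(merges)                         # e.g. [('w','e'), ('l','o'), ...]
print(bpe_segment("lowest", merges))  # → ['lo', 'we', 's', 't', '_']
```

Note that "lowest" never occurred in training, yet the segmenter still produces known subwords rather than an unknown-word token.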
Stop-word removal, stemming, and lemmatization
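A minimal sketch of stop-word removal and suffix-stripping stemming. The stop-word list and suffix rules here are tiny illustrative assumptions; real systems use curated lists and algorithms such as the Porter stemmer, or dictionary-based lemmatization:

```python
# Illustrative, deliberately tiny resources
STOP_WORDS = {"the", "a", "an", "is", "are", "of", "and"}
SUFFIXES = ("ing", "ed", "es", "s")

def remove_stop_words(tokens):
    return [t for t in tokens if t not in STOP_WORDS]

def stem(token):
    # strip the first matching suffix, keeping a minimal stem length
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

tokens = "the cats are chasing birds".split()
print([stem(t) for t in remove_stop_words(tokens)])
# → ['cat', 'chas', 'bird']
```

The output "chas" shows the difference from lemmatization: a stemmer produces a truncated stem, while a lemmatizer would return the dictionary form "chase".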
Substitution operator
In NLP, the substitution operator refers to the process of replacing one word or phrase with another.
This is usually done with regular expressions (REs).
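In Python this is `re.sub`; the replacement rules below (a placeholder for numbers, expanding an abbreviation) are illustrative assumptions:

```python
import re

text = "The US shipped 1200 units in 2023."
# expand an abbreviation, using word boundaries to avoid partial matches
text = re.sub(r"\bUS\b", "United States", text)
# replace digit sequences with a placeholder token
text = re.sub(r"\d+", "<NUM>", text)
print(text)  # → The United States shipped <NUM> units in <NUM>.
```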