Social media problem
Social media solutions
• Option 1 = manual annotation: expensive, never ends
• Option 2 = use a model (uptraining/self-training): quality questionable
• Option 3 = adapt through language modeling (see lecture 3); van der Goot (2017) ->
train word2vec on Twitter data, then train a POS tagger on annotated news data,
initialized with those word embeddings
• Option 4 = inject "noise" into annotated training data: less explored
• Using a (word) n-gram based model = limited to "known" n-grams
• Solution: use character n-grams
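A minimal sketch of why character n-grams sidestep the "known n-grams" limitation: a non-standard spelling still shares most of its character n-grams with the standard form, so a character-level model can generalize to words it never saw. The function name and padding symbol are illustrative.

```python
def char_ngrams(word, n=3, pad="#"):
    """Character n-grams of a word, padded at the word boundaries."""
    padded = pad + word + pad
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

print(char_ngrams("tomorrow"))  # ['#to', 'tom', 'omo', ...]
print(char_ngrams("tmrw"))      # unseen word still yields usable features
# The non-standard "goood" shares most trigrams with standard "good":
shared = set(char_ngrams("goood")) & set(char_ngrams("good"))
print(shared)
```

A word-level n-gram model would treat "goood" as completely unknown; the character-level view keeps four of its five trigrams in common with "good".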
• Make the data more like standard data by normalizing
• Issue: often we don't want to normalize, because information is lost (e.g. a capital
letter means something); also, what counts as "standard" is highly subjective
• Lexical normalization (= word-level normalization)
= the task of transforming an utterance into its standard form, word by word,
including both 1-to-many and many-to-1 replacements (a word may be split into
two, or two words merged into one, but words may not be swapped or reordered)
o Benchmark: LexNorm 519 anno
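The word-by-word constraint above can be illustrated with a toy dictionary-based normalizer (the dictionary entries and function name are illustrative, not part of any benchmark): replacements may split one token into several or merge adjacent tokens, but never reorder them.

```python
# Illustrative replacement dictionaries (not from LexNorm).
ONE_TO_MANY = {"gonna": ["going", "to"], "u": ["you"]}
MANY_TO_ONE = {("every", "thing"): "everything"}

def normalize(tokens):
    """Word-level normalization: split, merge, or keep - never reorder."""
    out, i = [], 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if pair in MANY_TO_ONE:          # many-to-1: merge two tokens
            out.append(MANY_TO_ONE[pair])
            i += 2
        elif tokens[i] in ONE_TO_MANY:   # 1-to-many: split one token
            out.extend(ONE_TO_MANY[tokens[i]])
            i += 1
        else:                            # keep the token as-is
            out.append(tokens[i])
            i += 1
    return out

print(normalize("u gonna fix every thing".split()))
# ['you', 'going', 'to', 'fix', 'everything']
```

Real systems learn these replacements from data rather than hard-coding them, but the alignment constraints (split/merge allowed, reordering forbidden) are the same.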
How to solve normalisation?
Normalisation evaluation
Err
Eisenstein: Used Solutions & Critique
• I argue that the two main computational approaches to dealing with bad language —
normalization and domain adaptation — are based on theories of social media
language that are not descriptively accurate.
1. Normalization: adapting text to fit the tools
o The logic of normalization presupposes that the “norm” can be identified
unambiguously, and that there is a direct mapping from non-standard words
to the elements in this normal set.
o Normalization is often impossible without changing the meaning of the text.
2. Domain adaptation: adapting tools to fit the text
o E.g. preprocessing, new annotation schemes
o By adopting a model of "domain adaptation," we confuse a medium with a
coherent domain. Adapting language technology towards the median Tweet
can improve accuracy on average, but it is certain to leave many forms of
language out.
o Social media is not a coherent domain; Twitter itself is not a unified genre, it is
composed of many different styles and registers
Eisenstein: Lexical coherence of social media
• The internal coherence of social media — and its relationship to other types of text —
can be quantified in terms of the similarity of distributions over bigrams. The
relationship between OOV rate and domain adaptation has been explored by
McClosky et al. (2010), who use it as a feature to predict how well a parser will
perform when applied across domains.
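One common way to compare two distributions over bigrams is the Jensen–Shannon divergence, which is 0 for identical distributions and 1 (with log base 2) for disjoint ones. A sketch under that assumption (the source does not specify which similarity measure is used):

```python
import math
from collections import Counter

def bigram_dist(tokens):
    """Relative-frequency distribution over token bigrams."""
    counts = Counter(zip(tokens, tokens[1:]))
    total = sum(counts.values())
    return {bg: c / total for bg, c in counts.items()}

def js_divergence(p, q):
    """Jensen-Shannon divergence (log base 2) between sparse distributions."""
    keys = set(p) | set(q)
    m = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in keys}
    def kl(a, b):
        return sum(a[k] * math.log2(a[k] / b[k]) for k in a if a[k] > 0)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

tweets = "omg im so tired im so done".split()
news = "the minister said the talks had failed".split()
print(js_divergence(bigram_dist(tweets), bigram_dist(news)))
```

Corpora with similar bigram distributions (e.g. two samples of newswire) will score near 0; a tweet sample against newswire will score much higher, quantifying how far apart the "domains" are.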
Eisenstein: OOV (out-of-vocabulary words)
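The OOV rate mentioned above — the fraction of tokens not found in a reference vocabulary — can be computed directly. A minimal sketch (the toy vocabulary and tweet are illustrative):

```python
def oov_rate(tokens, vocab):
    """Fraction of tokens not found in the reference vocabulary."""
    if not tokens:
        return 0.0
    return sum(t.lower() not in vocab for t in tokens) / len(tokens)

# Toy reference vocabulary, e.g. built from annotated news data.
news_vocab = {"the", "parser", "performs", "well", "on", "this", "text"}
tweet = "omg the parser performs gr8 on this txt".split()
print(oov_rate(tweet, news_vocab))  # 0.375 ("omg", "gr8", "txt" are OOV)
```

In the spirit of McClosky et al. (2010), a higher OOV rate against the training-domain vocabulary signals a larger domain gap, and hence a likely drop in parser accuracy.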