Describe the task of tokenisation
Break a piece of text down into individual tokens, such as words, punctuation and numbers
What is the initial approach to tokenisation?
Split the text on whitespace
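A minimal sketch of whitespace tokenisation in Python. The function name `whitespace_tokenise` and the example sentence are illustrative assumptions, not part of the notes. Note how punctuation stays attached to the neighbouring word.

```python
def whitespace_tokenise(text):
    # str.split() with no arguments splits on any run of whitespace
    # (spaces, tabs, newlines) and drops empty strings.
    return text.split()


tokens = whitespace_tokenise("Dr. Smith said they're late.")
print(tokens)  # ['Dr.', 'Smith', 'said', "they're", 'late.']
```

The last token is 'late.' rather than 'late' and '.', which is why whitespace splitting is only a starting point.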
Where does ambiguity come from in tokenisation?
Sentence-final periods (“mary.”), abbreviations (“dr.”), and word-internal punctuation (“they’re”)
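A rough sketch of the period ambiguity, assuming a small hand-written abbreviation list. `ABBREVIATIONS`, `split_trailing_period` and `tokenise` are hypothetical names chosen for illustration. A trailing period is split off as its own token unless the token is a known abbreviation.

```python
# Hypothetical abbreviation list; a real one would be far larger
# and depend on the domain.
ABBREVIATIONS = {"dr.", "mr.", "mrs."}


def split_trailing_period(token):
    # Treat a final "." as sentence punctuation unless the whole
    # token is a known abbreviation.
    if len(token) > 1 and token.endswith(".") and token.lower() not in ABBREVIATIONS:
        return [token[:-1], "."]
    return [token]


def tokenise(text):
    tokens = []
    for raw in text.split():
        tokens.extend(split_trailing_period(raw))
    return tokens


print(tokenise("dr. jones met mary."))  # ['dr.', 'jones', 'met', 'mary', '.']
```

This rule still fails when an abbreviation ends a sentence, since one period then plays both roles.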
What are the three classes of token?
Morphosyntactic word, punctuation or symbol, number
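One possible way to sort tokens into the three classes, using regular expressions. The patterns and the function name `classify` are assumptions made for illustration; they are rough heuristics, not a definitive rule set.

```python
import re


def classify(token):
    # Digits, optionally grouped by "." or "," (e.g. 3.14 or 1,000).
    if re.fullmatch(r"\d+(?:[.,]\d+)*", token):
        return "number"
    # One or more characters that are neither word characters nor whitespace.
    if re.fullmatch(r"[^\w\s]+", token):
        return "punctuation or symbol"
    # Everything else is treated as a morphosyntactic word.
    return "morphosyntactic word"


for tok in ["they're", ",", "1,000", "$"]:
    print(tok, "->", classify(tok))
```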
What is the challenge of character encoding?
Deciding whether to support ASCII only or Unicode, which is needed to include emojis.
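A small sketch of the encoding choice. The helper name `is_ascii_only` is an assumption for illustration. An emoji is a valid Unicode character but cannot be represented in ASCII, so an ASCII-only pipeline has to reject or drop it.

```python
def is_ascii_only(text):
    # encode("ascii") raises UnicodeEncodeError for any character
    # outside the 7-bit ASCII range.
    try:
        text.encode("ascii")
        return True
    except UnicodeEncodeError:
        return False


thumbs_up = "\U0001F44D"  # the thumbs-up emoji, written as an escape
message = "good " + thumbs_up

print(is_ascii_only("good"))            # True
print(is_ascii_only(message))           # False
print(len(thumbs_up.encode("utf-8")))   # 4 (bytes in UTF-8)
print(len(message.split()))             # 2 (the emoji is its own token)
```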
What are the challenges of tokenisation? (7)
What challenges arise from domain dependence? (3)