what is a corpus
A collection of pieces of language that are selected and ordered according to explicit linguistic criteria in order to be used as a sample of the language.
we can classify corpora by (6)
problem with tokenization
& solved by:
there is ambiguity = solved by POS-tagging
what is POS tagging?
identification of word class
word class determined by:
(thus context needed for tagging)
Zipf's law is F(z) = |C| / z^a
what are these things?
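F(z) is the frequency of the word (token) with rank z, |C| is a constant that depends on the corpus (by the formula it equals the frequency of the rank 1 word), and a is an exponent, usually close to 1. frequency is thus inversely proportional to rank: with a = 1, the word of rank 2 should occur 1/2 times as often as the word of rank 1.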
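a minimal sketch of the formula, to make the rank/frequency relation concrete. the function name `zipf_frequency` and the constant 1000 are made up for illustration; a = 1 is an assumed default, not a value given in these notes.

```python
def zipf_frequency(rank, c, a=1.0):
    """Expected frequency F(z) = c / z**a for the word at the given rank."""
    if rank < 1:
        raise ValueError("rank starts at 1")
    return c / rank ** a


# Hypothetical constant: assume the rank 1 word occurs 1000 times.
for z in (1, 2, 3, 10):
    print(z, zipf_frequency(z, c=1000))
```

with a = 1 the rank 2 word gets half the frequency of the rank 1 word, and the rank 10 word one tenth.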
what are tokens
statistical units of study (usually words)
what are types
the unique words! (each distinct word is counted only once)
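a small sketch of the token/type distinction. it assumes lowercasing plus whitespace splitting as the tokenizer, which is a simplification (real tokenization also has to handle punctuation, clitics and so on); `tokens_and_types` and the sample sentence are invented for illustration.

```python
from collections import Counter


def tokens_and_types(text):
    """Return (tokens, types) using naive whitespace tokenization."""
    tokens = text.lower().split()   # every running word is a token
    types = set(tokens)             # each distinct word form is one type
    return tokens, types


tokens, types = tokens_and_types("the cat saw the dog and the dog saw the cat")
print(len(tokens), "tokens,", len(types), "types")

# Frequency per type, ordered from most to least frequent (i.e. by rank).
print(Counter(tokens).most_common())
```

the sentence has 11 tokens but only 5 types, because "the", "cat", "saw" and "dog" are repeated.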
give 2 purposes of corpora
- analytical: empirical basis on the distribution of constructions and language phenomena
a balanced corpus needs to be (5)
what is the link level in parallel corpora?
the level at which the languages are linked (aligned) to each other: sentence level or word level.
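a toy illustration of the two link levels, assuming an English-Dutch sentence pair. the data layout (tuples of sentences, pairs of word positions) and the helper `linked_words` are hypothetical, not a standard corpus format.

```python
# Sentence-level link: whole sentences are paired with each other.
sentence_links = [("the cat sleeps", "de kat slaapt")]

# Word-level link: (source position, target position) pairs inside one sentence pair.
word_links = [(0, 0), (1, 1), (2, 2)]


def linked_words(source, target, links):
    """Resolve word-position links into (source word, target word) pairs."""
    src, tgt = source.split(), target.split()
    return [(src[i], tgt[j]) for i, j in links]


source, target = sentence_links[0]
print(linked_words(source, target, word_links))
```

a sentence-level corpus only stores the first structure; a word-level corpus also needs the second.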