corpus
properties of corpora
markup (formatting)
common formats for structuring linguistic data
–> e.g., XML, CoNLL, JSON
preprocessing
transforming text into a useful format
–> e.g., tokenization
splitting sentences into words/tokens
tokenization
each word, including punctuation marks, is treated as a separate token and assigned a sequential position
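a minimal tokenization sketch (assuming a simple regex-based approach, not a full NLP tokenizer): words and punctuation become separate tokens, each with a sequential position

```python
import re

def tokenize(sentence):
    # \w+ captures word characters as one token; [^\w\s] captures each
    # punctuation mark as its own token
    return re.findall(r"\w+|[^\w\s]", sentence)

tokens = tokenize("The cat sat, then slept.")
# enumerate assigns each token its sequential position
positions = list(enumerate(tokens))
```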
stemming + process
consider words within sentences that share a common root but appear in different forms
reduce word to base/root form
–> this enables us to treat them as a single unit for analysis
process:
1. tokenize sentences
2. use stemming to reduce words to common stems
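the two-step process above can be sketched as follows (a toy suffix-stripping stemmer for illustration; a real stemmer such as Porter's applies many more rules)

```python
import re

# toy suffix list; order matters so longer suffixes are tried first
SUFFIXES = ["ing", "ed", "es", "s"]

def stem(word):
    # strip the first matching suffix, keeping at least a 3-letter stem
    for suf in SUFFIXES:
        if word.endswith(suf) and len(word) - len(suf) >= 3:
            return word[: -len(suf)]
    return word

def stem_sentence(sentence):
    # step 1: tokenize the sentence
    tokens = re.findall(r"\w+|[^\w\s]", sentence)
    # step 2: reduce each token to its stem
    return [stem(t) for t in tokens]
```

with this sketch, "jumping", "jumped", and "jumps" all reduce to the single unit "jump", so they can be counted together in analysis.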
lemmatization
reduce words to common lemmas
takes context into account to convert words to their base dictionary form
results in more linguistically correct forms than stemming
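a minimal sketch of dictionary-based lemmatization (the lemma table here is a toy assumption; real lemmatizers use full dictionaries plus part-of-speech context): unlike a stemmer, it can map irregular forms like "better" to "good"

```python
# toy lemma dictionary; a real lemmatizer's lookup is far larger
LEMMAS = {
    "better": "good",   # irregular form a suffix-stripper cannot handle
    "ran": "run",
    "running": "run",
    "mice": "mouse",
}

def lemmatize(word):
    # dictionary lookup; fall back to the word itself if unknown
    return LEMMAS.get(word, word)
```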
bag of words
text is represented by the frequency of its words, disregarding grammar and word order
word occurrences are used as features for analysis
simplifies text into numerical data
BOW process
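a minimal bag-of-words sketch (assuming simple whitespace tokenization): the text is reduced to word counts, discarding grammar and order

```python
from collections import Counter

def bag_of_words(text):
    # lowercase and split into tokens, then count occurrences;
    # word order and grammar are discarded
    tokens = text.lower().split()
    return Counter(tokens)

bow = bag_of_words("the cat sat on the mat")
# the resulting counts serve as numerical features, e.g. bow["the"] is 2
```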
n-gram
n-gram probabilities
calculation (relative-frequency / maximum-likelihood estimate): count of the n-gram ÷ total number of n-grams in the corpus; for a conditional probability, divide instead by the count of the (n−1)-token prefix
words in a corpus can appear multiple times; these repeated occurrences are exactly what the counts, and therefore the estimated probabilities, are based on
the n-gram model should therefore define a probability distribution over all possible token sequences built from the corpus vocabulary
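a minimal sketch of estimating bigram probabilities from counts (toy corpus for illustration): repeated words directly affect the counts and hence the probabilities

```python
from collections import Counter

corpus = ["the", "cat", "sat", "on", "the", "mat"]  # toy token list

unigrams = Counter(corpus)                 # "the" occurs twice
bigrams = Counter(zip(corpus, corpus[1:])) # adjacent token pairs

def p_bigram(w1, w2):
    # MLE of P(w2 | w1): count of the bigram divided by the count
    # of its one-word prefix
    return bigrams[(w1, w2)] / unigrams[w1]
```

because "the" appears twice but is followed by "cat" only once, P(cat | the) comes out as 1/2.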
what do n-gram models evaluate
the probability of a sequence of n tokens
this is useful for determining the likelihood of specific word sequences and is represented by the joint probability
chain rule
product of conditional probabilities
indicates that the probability of the entire sequence is the product of the individual probabilities of each word given all the preceding words
this allows us to systematically compute the probability of a sequence by considering one word at a time, given the context of all previous words in the sequence
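the chain rule can be sketched in code; the full rule conditions each word on all preceding words, and this toy example uses the common bigram (Markov) approximation where each word depends only on its immediate predecessor

```python
from collections import Counter

corpus = ["the", "cat", "sat", "on", "the", "mat"]  # toy token list

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
total = len(corpus)

def sequence_probability(seq):
    # chain rule with a bigram approximation:
    # P(w1..wn) ~= P(w1) * product of P(wi | w(i-1))
    p = unigrams[seq[0]] / total
    for w1, w2 in zip(seq, seq[1:]):
        p *= bigrams[(w1, w2)] / unigrams[w1]
    return p
```

this computes the joint probability one word at a time: P(the) · P(cat | the) · P(sat | cat) = (2/6) · (1/2) · (1/1) = 1/6.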