W4 Vector Space Model Flashcards

Question 1

Q

What is Ranked Retrieval?
Formalization Definition of Ranked Retrieval

Answer

A

Also called Relevance Ranking. The system returns an ordering of the (top) documents in the collection for a query.
For each query Qi and doc Dj, compute a score RSV(Qi, Dj)
* RSV: Retrieval Status Value
* RSV theory neural, could be distance based, probability based
* Ordinal scale (supports > < =)
* Value increase with relevance

Question 2

Q

1 What’s set-based approach for Ranked Retrieval?

Answer

A

Query and doc are each a set of terms

Question 3

Q

Give the calculation of Jaccard Coefficient

Answer

A

jaccard (A,B) = |A ∩ B| / |A ∪ B|

|A ∩ B|: Commonality
|A ∪ B|: Length Normalization
A and B don’t have to be the same size
We always assigns a number from [0, 1]

RSV(Q,D) = jaccard(terms(Q), terms(D))

Question 4

Q

What’s the query-doc match score computed by jaccard coefficient for each of the 2 docs: (common stop words are removed):

Query: Ides of March -> {ide, march}
Doc 1: Caesar died in March -> {caesar, die, march}
Doc 2: the long march -> {long, march}

Answer

A

jaccard (A,B) = |A ∩ B| / |A ∪ B|

jaccard ({ide, march}, {caesar, dy, march}) = 1 / 4
jaccard ({ide, march}, {long, march}) = 1 / 3

Question 5

Q

Give calculation of:
Jaccard Distance?
Dice coefficient?

Answer

A

Jaccard Distance = 1 - Jaccard Similarity

Dice(A, B) = 2x|A ∩ B| / (|A| + |B| )=2J/(J+1)

Question 6

Q

What is a metric?

Answer

A

In fact a distance function.
* identity: d(x, y)=0 => x=y
* Symmetry: d(x, y) = d(y, x)
* Triangle inequality: d(x, z) <= d(x, y) + d(y, z)
* Non negativity: d(x, y) >= 0

Question 7

Q

What are the issues with Jaccard for scoring?

Answer

A

we need to normalize for length (of the document or query)
Jaccard does not consider term frequency
Jaccard does not consider that rare tems in a collection are more informative than frequent terms

Question 8

Q

2 What’s term-weighting approach for Ranked Retrieval?

Answer

A

Measure term importance in docs
Vector Space Model (VSM)

Question 9

Q

What is bag of words model?

Answer

A

multiset of words, we don’t care the ordering in a doc.
john loves mary
mary loves john
they have the same vectors
The positional index can distinguish two docs

Question 10

Q

What is term frequency (TF)?

Answer

A

The term frequency tf (t,d) of term t in document d is defined as
the number of times that t occurs in d.

raw term frequency is not what we want because (Relevance does not increase proportionally with term frequency) so we use log normalization:

The log frequency weight of term t in d is:
w_(t, d) = 1 + log(tf) if tf>0
= 0 else
w is the quantitative evidence of d being relevant for query t

score a doc for a query: sum the w

Question 11

Q

What is document frequency (DF)? How do we interpret it?

Answer

A

df(t) is the document frequency of t: the number of
documents that contain t

Rare terms are more informative than frequent terms (eg. stop words and the word arachnocentric in a doc) A document containing arachnocentric is more likely to be relevant than a document that does not contain it

Question 12

Q

What is Inverse Document Frequency (IDF)?
How to calculate TF-IDF?

Answer

A

idf = log (N/df)

N: number of docs in the collection
idf quantifies how surprised we are to see term t in a doc, again, logarithm is used for better approximation

TF-IDF = TF * IDF
TF-IDF increases with number of occurances within the doc, and with rarity of the term in the collection.
TF-IDF is evidence of d being relevant when looking for t

Question 13

Q

Document ranking in VSM

Answer

A

Docs are vectors of term weights in the space

The vector space is |V|-dimensional (|V|: no. of terms in the vocabulary)
terms are axes of the space

***** Vectors are High-dimensional, sparse

Question 14

Q

Queries as vectors too, how do we rank?

Answer

A

represent them as vectos in space, rank docs according to their proximity to the query in the space = similarity of the vectors

The angle between vectors shows the similarity

Question 15

Q

Give formula for cosine similarity (dot product) of two vectors

Answer

A

cos(q, d) = (q * d) /(|q|*|d|)
= \sum (qi * di) / \sqrt (\sum qi^2) * \sqrt (\sum di^2)

q, d: length-normalized: ||x|| = \sqrt (\sum xi^2)

Question 16

Q

@#@ LET OP @#@

The calculation in this chapter is very important and will present in the exam therefore please refer to the slides for examples and do the calculations yourself