1 What’s set-based approach for Ranked Retrieval?
Query and doc are each a set of terms
Give the calculation of Jaccard Coefficient
jaccard (A,B) = |A ∩ B| / |A ∪ B|
|A ∩ B|: Commonality
|A ∪ B|: Length Normalization
A and B don’t have to be the same size
We always assigns a number from [0, 1]
RSV(Q,D) = jaccard(terms(Q), terms(D))
What’s the query-doc match score computed by jaccard coefficient for each of the 2 docs: (common stop words are removed):
Query: Ides of March -> {ide, march}
Doc 1: Caesar died in March -> {caesar, die, march}
Doc 2: the long march -> {long, march}
jaccard (A,B) = |A ∩ B| / |A ∪ B|
jaccard ({ide, march}, {caesar, dy, march}) = 1 / 4
jaccard ({ide, march}, {long, march}) = 1 / 3
Give calculation of:
Jaccard Distance?
Dice coefficient?
Jaccard Distance = 1 - Jaccard Similarity
Dice(A, B) = 2x|A ∩ B| / (|A| + |B| )=2J/(J+1)
What is a metric?
In fact a distance function.
* identity: d(x, y)=0 => x=y
* Symmetry: d(x, y) = d(y, x)
* Triangle inequality: d(x, z) <= d(x, y) + d(y, z)
* Non negativity: d(x, y) >= 0
What are the issues with Jaccard for scoring?
2 What’s term-weighting approach for Ranked Retrieval?
Measure term importance in docs
Vector Space Model (VSM)
What is bag of words model?
multiset of words, we don’t care the ordering in a doc.
john loves mary
mary loves john
they have the same vectors
The positional index can distinguish two docs
What is term frequency (TF)?
The term frequency tf (t,d) of term t in document d is defined as
the number of times that t occurs in d.
raw term frequency is not what we want because (Relevance does not increase proportionally with term frequency) so we use log normalization:
The log frequency weight of term t in d is:
w_(t, d) = 1 + log(tf) if tf>0
= 0 else
w is the quantitative evidence of d being relevant for query t
score a doc for a query: sum the w
What is document frequency (DF)? How do we interpret it?
df(t) is the document frequency of t: the number of
documents that contain t
Rare terms are more informative than frequent terms (eg. stop words and the word arachnocentric in a doc) A document containing arachnocentric is more likely to be relevant than a document that does not contain it
N: number of docs in the collection
idf quantifies how surprised we are to see term t in a doc, again, logarithm is used for better approximation
Document ranking in VSM
Docs are vectors of term weights in the space
The vector space is |V|-dimensional (|V|: no. of terms in the vocabulary)
terms are axes of the space
***** Vectors are High-dimensional, sparse
Queries as vectors too, how do we rank?
The angle between vectors shows the similarity
Give formula for cosine similarity (dot product) of two vectors
cos(q, d) = (q * d) /(|q|*|d|)
= \sum (qi * di) / \sqrt (\sum qi^2) * \sqrt (\sum di^2)
q, d: length-normalized: ||x|| = \sqrt (\sum xi^2)
@#@ LET OP @#@
The calculation in this chapter is very important and will present in the exam therefore please refer to the slides for examples and do the calculations yourself
tf, df, idf, TF-IDF, length normalization, dot product, cosine similarity
Summary: Vector Space ranking with term weighting
scope vs. verbosity hypothesis
scope: a long document consists of a number of unrelated short documents concatenated together
verbosity: a long document covers a similar scope to a short document, but uses more words
Criterias for comparing IR models/systems
effectiveness, efficiency, explainability, parsimony (Occam’s Razor)