why do we need to evaluate
what do we need to evaluate
what is the IR experimental setup
maintain a test collection of documents, queries, and relevance assessments (the ground truth)
- performance measures, e.g. precision and recall
- systems to compare, e.g. TF vs TF-IDF ranking
- experimental design
what are the assumptions for the evaluation
the system returns a ranked list of documents in response to a query
- a better system will provide a better ranked list
- a better ranked list generally satisfies the users
what is precision
retrieved docs that are relevant / all retrieved docs
what is recall
retrieved docs that are relevant / all relevant docs
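the two definitions above can be sketched in code; a minimal example (function and variable names are illustrative, not from the notes):

```python
def precision(retrieved, relevant):
    """Fraction of retrieved docs that are relevant."""
    retrieved = list(retrieved)
    hits = sum(1 for d in retrieved if d in relevant)
    return hits / len(retrieved) if retrieved else 0.0

def recall(retrieved, relevant):
    """Fraction of relevant docs that were retrieved."""
    hits = sum(1 for d in retrieved if d in relevant)
    return hits / len(relevant) if relevant else 0.0

relevant = {1, 4, 5}          # ground-truth relevant doc ids
retrieved = [1, 2, 4, 7]      # ranked list returned by the system
print(precision(retrieved, relevant))  # 2 of 4 retrieved are relevant -> 0.5
print(recall(retrieved, relevant))     # 2 of 3 relevant were retrieved -> 0.667
```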
ranking effectiveness
what are the 3 methods of summarising ranking effectiveness
what is mean average precision (MAP)
summarise rankings from multiple queries by taking the mean of each query's average precision
- assume user is interested in finding many relevant documents for each query
- requires many relevance judgments in test collection
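a sketch of how MAP is computed from ranked lists and relevance judgments (names are illustrative):

```python
def average_precision(ranking, relevant):
    """Average of the precision values at the rank of each relevant doc;
    relevant docs never retrieved contribute 0 (we divide by |relevant|)."""
    hits = 0
    precision_sum = 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant) if relevant else 0.0

def mean_average_precision(rankings, relevants):
    """MAP: mean of the per-query average precision values."""
    aps = [average_precision(r, rel) for r, rel in zip(rankings, relevants)]
    return sum(aps) / len(aps)

# query 1: relevant docs found at ranks 1 and 3 -> AP = (1/1 + 2/3) / 2
# query 2: relevant doc found at rank 2        -> AP = (1/2) / 1
print(mean_average_precision([[1, 2, 3], [9, 4]], [{1, 3}, {4}]))
```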
recall-precision graphs
raw recall-precision points differ from query to query, so plotting them directly cannot show a clear pattern; interpolation and averaging at standard recall levels are needed
interpolation
defines precision at any recall level as the maximum precision observed in any recall-precision point at that recall level or higher
the summary graph is drawn by joining the average interpolated precision points at standard recall levels
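the interpolation rule can be sketched as follows (the sample recall-precision points are made up for illustration):

```python
def interpolated_precision(points, recall_level):
    """Interpolated precision at recall_level: the maximum precision among
    all observed (recall, precision) points at or above that recall level."""
    candidates = [p for r, p in points if r >= recall_level]
    return max(candidates) if candidates else 0.0

# observed (recall, precision) points for one query (illustrative values)
points = [(0.2, 1.0), (0.4, 0.67), (0.6, 0.5), (0.8, 0.44), (1.0, 0.5)]

# standard recall levels 0.0, 0.1, ..., 1.0
standard = [i / 10 for i in range(11)]
curve = [interpolated_precision(points, r) for r in standard]
print(curve)  # non-increasing by construction
```

averaging these interpolated values across queries at each standard level gives the points that are joined to form the summary graph.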
…