information retrieval
finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections
how do IR systems support the search process?
how does a search engine work? (basic)
how do recommender systems work? (basic)
4 principles of IR
principles of
- relevance
- ranking
- text processing
- user interaction
principles of relevance
evaluation of relevance
needed:
- a set of queries
- a document collection
- relevance assessments: for each query, a set of documents labelled as relevant or non-relevant
the retrieval system returns a ranking of all documents
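example (sketch): given the three ingredients above, a returned ranking can be scored against the relevance assessments. The notes do not name a metric, so precision at k and average precision are used here only as common illustrations. The query ids, document ids, and the `qrels`/`runs` names are hypothetical toy data.

```python
from typing import Dict, List, Set


def precision_at_k(ranking: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k ranked documents that are labelled relevant."""
    return sum(1 for doc_id in ranking[:k] if doc_id in relevant) / k


def average_precision(ranking: List[str], relevant: Set[str]) -> float:
    """Mean of the precision values at each rank where a relevant document appears."""
    hits = 0
    total = 0.0
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


# Hypothetical relevance assessments: query id -> set of relevant document ids.
qrels: Dict[str, Set[str]] = {"q1": {"d1", "d3"}, "q2": {"d2"}}

# Hypothetical system output: query id -> ranking of all documents.
runs: Dict[str, List[str]] = {
    "q1": ["d1", "d2", "d3", "d4"],
    "q2": ["d4", "d2", "d1", "d3"],
}

# Average the per-query scores to get one number for the whole query set.
mean_ap = sum(average_precision(runs[q], qrels[q]) for q in qrels) / len(qrels)
```

Documents without a label are treated as non-relevant here, which is a simplifying assumption of this sketch.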
principles of ranking
machine learning for ranking
Idea: learn the relevance of a document based on human-labelled training data (relevance assessments)
Why is machine learning for ranking different from machine learning for classification?
Relevance depends on the query, so we cannot train a global classifier over all relevant and irrelevant documents in a labelled dataset
=> need a machine learning paradigm that includes the query
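example (sketch): one way to include the query is to compute features on (query, document) pairs and to compare documents only within the same query. The pairwise perceptron-style update below is an assumption chosen for brevity, not a method named in these notes. The feature function and toy data are hypothetical. Note that the same document is relevant for one query and non-relevant for the other, which is exactly why a global classifier over documents cannot work.

```python
from typing import Dict, List, Tuple


def features(query: str, doc: str) -> List[float]:
    """Query-dependent features: matching term count and query term coverage."""
    q_terms = set(query.lower().split())
    d_terms = doc.lower().split()
    overlap = sum(1 for t in d_terms if t in q_terms)
    coverage = len(q_terms & set(d_terms)) / len(q_terms)
    return [float(overlap), coverage]


def score(w: List[float], x: List[float]) -> float:
    return sum(wi * xi for wi, xi in zip(w, x))


def train_pairwise(data: Dict[str, List[Tuple[str, int]]], epochs: int = 10) -> List[float]:
    """Learn weights so that relevant documents outscore non-relevant ones per query."""
    w = [0.0, 0.0]
    for _ in range(epochs):
        for query, labelled in data.items():
            rel = [features(query, d) for d, y in labelled if y == 1]
            non = [features(query, d) for d, y in labelled if y == 0]
            for xp in rel:
                for xn in non:
                    # Update only when a pair is ordered wrongly (or tied).
                    if score(w, xp) <= score(w, xn):
                        w = [wi + p - n for wi, p, n in zip(w, xp, xn)]
    return w


def rank(query: str, docs: List[str], w: List[float]) -> List[str]:
    return sorted(docs, key=lambda d: score(w, features(query, d)), reverse=True)


# Hypothetical training data: query -> [(document text, relevance label)].
data = {
    "cheap flights": [("cheap flights to rome", 1), ("rome travel guide", 0)],
    "rome guide": [("rome travel guide", 1), ("cheap flights to rome", 0)],
}
weights = train_pairwise(data)
docs = ["cheap flights to rome", "rome travel guide"]
```

Calling `rank("rome guide", docs, weights)` and `rank("cheap flights", docs, weights)` orders the same two documents differently, because the scores depend on the query.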
two-stage retrieval first stage
two-stage retrieval second stage
2 different relevance models for similarity
index time
query time
term-based retrieval models
problem with term-based retrieval
a vocabulary mismatch between query and document
solution: semantic matching, based on embedding representations of texts
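example (sketch): the vocabulary mismatch and the embedding-based fix can be shown with toy vectors. The word vectors below are hand-made and hypothetical (a real system would use a trained embedding model), and averaging word vectors is just one simple way to represent a text.

```python
import math
from typing import Dict, List

# Hypothetical 3-dimensional word vectors; near-synonyms get similar vectors.
WORD_VECTORS: Dict[str, List[float]] = {
    "car": [1.0, 0.0, 0.0],
    "automobile": [0.9, 0.1, 0.0],
    "repair": [0.0, 1.0, 0.0],
    "maintenance": [0.1, 0.9, 0.0],
    "banana": [0.0, 0.0, 1.0],
    "bread": [0.0, 0.0, 1.0],
}


def term_overlap(query: str, doc: str) -> int:
    """Term-based matching: number of distinct terms shared by query and document."""
    return len(set(query.lower().split()) & set(doc.lower().split()))


def embed(text: str) -> List[float]:
    """Represent a text as the mean of its known word vectors."""
    vectors = [WORD_VECTORS[t] for t in text.lower().split() if t in WORD_VECTORS]
    if not vectors:
        return [0.0, 0.0, 0.0]
    return [sum(dim) / len(vectors) for dim in zip(*vectors)]


def cosine(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


query = "car repair"
doc_match = "automobile maintenance"   # relevant, but shares no terms with the query
doc_other = "banana bread"             # unrelated

sim_match = cosine(embed(query), embed(doc_match))
sim_other = cosine(embed(query), embed(doc_other))
```

Here `term_overlap(query, doc_match)` is 0, so a purely term-based model cannot match the two texts, while `sim_match` is much higher than `sim_other`.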
the role of users in IR
challenges with user interaction