Web search: Challenges compared to ‘traditional’ IR
Types of web queries
Web queries: Important
Precision often more important than recall
Query language must be lightweight (mostly phrase queries)
The web graph
We refer to the hyperlinks into a page as its in-links, and its in-degree is the number of such links; the out-degree of a web page is the number of links out of it.
Strongly connected: a set of pages is strongly connected if, from any page in the set, one can reach any other page in the set by following hyperlinks. If there is a pair of pages for which this is not possible, the set is not strongly connected.
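As a small illustration of this definition (the adjacency-list representation and helper names here are assumptions, not from the source):

```python
def reachable(graph, start):
    """Return the set of pages reachable from `start` by following hyperlinks."""
    seen, stack = {start}, [start]
    while stack:
        node = stack.pop()
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

def strongly_connected(graph, pages):
    """True if every page can reach every other page in `pages`."""
    return all(reachable(graph, p) >= set(pages) for p in pages)

# A -> B -> C -> A: every page reaches every other, so strongly connected.
web = {"A": ["B"], "B": ["C"], "C": ["A"]}
print(strongly_connected(web, ["A", "B", "C"]))  # True

# Remove B's out-link and C is no longer reachable from A or B.
web2 = {"A": ["B"], "B": [], "C": ["A"]}
print(strongly_connected(web2, ["A", "B", "C"]))  # False
```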
Spamdexing
Manipulation of web content to appear artificially high on search results for particular keywords.
Common spamming techniques:
Counter measures:
Web crawling and indexing
This is the process by which we gather pages from the web in order to index them and support a search engine. The objective of crawling is to quickly and efficiently gather as many useful web pages as possible.
A crawler must be:
Robust
The web contains servers that create spider traps: generators of web pages that mislead crawlers into getting stuck fetching an infinite number of pages in a particular domain.
Polite
Avoid flooding servers and only crawl allowed pages
A crawler should be
Distributed
Should have the ability to execute in a distributed fashion across multiple machines
Scalable
The architecture should permit scaling up the crawl rate by adding extra machines and bandwidth
Quality
Must be biased toward fetching useful pages. The fetched page is parsed, to extract both the text and the links from the page. The extracted text is fed to a text indexer. The extracted URLs/links are then added to a URL frontier, which at all times consists of URLs whose corresponding pages have yet to be fetched by the crawler. This process may be viewed as traversing the web graph.
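The fetch-parse-index loop described above can be sketched as follows. Here `fetch`, `parse`, and `index` are hypothetical helpers standing in for real HTTP, HTML-parsing, and indexing components, and the page cap is an illustrative choice:

```python
from collections import deque
from urllib.parse import urljoin

def crawl(seed_urls, fetch, parse, index, max_pages=1000):
    """Traverse the web graph: fetch(url) -> html, parse(html) -> (text, links),
    index(url, text) feeds the text indexer. All three are assumed helpers."""
    frontier = deque(seed_urls)   # URL frontier: pages not yet fetched
    seen = set(seed_urls)
    fetched = 0
    while frontier and fetched < max_pages:
        url = frontier.popleft()
        html = fetch(url)
        text, links = parse(html)
        index(url, text)                  # extracted text goes to the indexer
        for link in links:                # extracted URLs go to the frontier
            absolute = urljoin(url, link)
            if absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)
        fetched += 1
```

A real crawler would add robots.txt checks, politeness delays, and URL normalization on top of this skeleton.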
Crawling architecture
A crawler should crawl webpages that are of high quality and/or frequently updated more often than other pages.
URL frontier
The URL frontier has front queues and back queues to take prioritization and politeness into account. The two major modules are a set of F front queues and a set of B back queues, both FIFO. The front queues implement prioritization and the back queues politeness.
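A toy sketch of such a frontier, assuming a fixed per-host politeness delay and a small number of priority levels (both values, and the class/method names, are illustrative):

```python
import heapq
from collections import deque

class URLFrontier:
    """Front queues: one FIFO per priority level (prioritization).
    Back queues: one FIFO per host, served no more often than
    `delay` seconds apart (politeness)."""

    def __init__(self, num_front=3, delay=2.0):
        self.front = [deque() for _ in range(num_front)]
        self.back = {}      # host -> FIFO of URLs for that host
        self.ready = []     # min-heap of (earliest allowed contact time, host)
        self.delay = delay

    def add(self, url, host, priority=0):
        self.front[priority].append((url, host))

    def _refill(self):
        """Move one URL from the highest-priority front queue to a back queue."""
        for q in self.front:
            if q:
                url, host = q.popleft()
                if host not in self.back:
                    self.back[host] = deque()
                    heapq.heappush(self.ready, (0.0, host))
                self.back[host].append(url)
                return True
        return False

    def next_url(self, now):
        """Return a URL whose host may politely be contacted at time `now`,
        or None if nothing is available yet."""
        while True:
            if not self.ready:
                if not self._refill():
                    return None
                continue
            t, host = heapq.heappop(self.ready)
            if not self.back[host]:
                if self._refill():               # back queue drained: refill
                    heapq.heappush(self.ready, (t, host))
                    continue
                del self.back[host]              # nothing left for this host
                continue
            if t > now:
                heapq.heappush(self.ready, (t, host))
                return None                      # too early: wait, be polite
            url = self.back[host].popleft()
            heapq.heappush(self.ready, (now + self.delay, host))
            return url
```

Two URLs on the same host are then released at least `delay` seconds apart, while URLs on different hosts can be fetched back to back.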
Web indexing
Two types of index partitioning:
Link analysis
The study on link analysis builds on two intuitions:
PageRank
The most well-known algorithm for ranking the quality of webpages according to their link structure (used by Google). It assigns a numerical score between 0 and 1 to each page, known as its PageRank. Given a web-search query, the search engine computes a score for each web page in which, amongst other factors, PageRank is used.
Given a web surfer, he can switch to a new node (webpage) in two ways:
In assigning a PageRank score to each node, we use the teleport operation in two ways:
Markov Chain
A random walk on a web graph can be represented as a Markov chain, defined by an N x N transition probability matrix P. Every entry must be a number between 0 and 1, and each row must sum to 1. The entry Pij gives the probability that the state at the next time step is j, conditioned on the current state being i (i = row, j = column). Each entry Pij is known as a transition probability and depends only on the current state i. This is known as the Markov property.
How to compute the matrix P
One state for each web page, with each transition probability representing the probability of moving from one web page to another. If there is a hyperlink from page i to page j, then Aij = 1; otherwise Aij = 0. We can derive P from A. If a row of A has no 1s, replace each element by 1/N. For all other rows, proceed as follows: divide each 1 by the number of 1s in its row; multiply the resulting matrix by 1 - a, where a is the teleport probability; then add a/N to every entry of the resulting matrix.
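This construction can be sketched as follows. The teleport probability alpha = 0.1 is an illustrative choice, and transition_matrix is a hypothetical helper name:

```python
import numpy as np

def transition_matrix(A, alpha=0.1):
    """Derive the transition matrix P from adjacency matrix A
    with teleport probability alpha."""
    A = np.asarray(A, dtype=float)
    N = A.shape[0]
    P = np.empty_like(A)
    for i in range(N):
        outdeg = A[i].sum()
        if outdeg == 0:
            P[i] = 1.0 / N                           # row of no 1s: uniform
        else:
            P[i] = (1 - alpha) * A[i] / outdeg + alpha / N
    return P

A = [[0, 1, 0],
     [0, 0, 1],
     [1, 1, 0]]
P = transition_matrix(A)
# every row of P sums to 1, as the Markov chain requires
```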
The PageRank computation
The probability distribution of the surfer's position at any time is given by a probability vector x. The probability distribution at t = 1 (having started at t = 0) is given by the probability vector xP, at t = 2 by (xP)P = xP^2, and so on.
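Repeating this multiplication until x stops changing yields the steady-state distribution, which is the PageRank vector. A minimal power-iteration sketch (the tolerance, iteration cap, and function name are illustrative assumptions):

```python
import numpy as np

def pagerank(P, tol=1e-10, max_iter=1000):
    """Iterate x <- xP from the uniform distribution until convergence."""
    N = P.shape[0]
    x = np.full(N, 1.0 / N)       # start uniformly over all pages (t = 0)
    for _ in range(max_iter):
        x_next = x @ P            # one step of the random walk
        if np.abs(x_next - x).sum() < tol:
            return x_next
        x = x_next
    return x
```

At convergence x = xP, so x is unchanged by a further step of the walk; its entries are the PageRank scores.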
The PageRank is query-independent (a static quality score). On the other hand, the relative ordering of pages should, intuitively, depend on the query being served. For this reason, search engines use static quality measures such as PageRank as just one of many factors in scoring a web page.