In this article we are going to discuss Google’s PageRank algorithm. So what is this all about? The original problem is that we want to rank websites in their search engine results. We can do this either with HITS (Hypertext Induced Topic Search) or PageRank algorithm.
PageRank works by counting the number and quality of links to a page to determine a rough estimate of how important the website is. The underlying assumption is that more important websites are likely to receive more links from other websites”
Graph representation of the WWW
First of all we have to construct a model. How to model the entire World Wide Web (WWW)? We can do it with the help of a directed graph. The WWW hyperlink structure forms a huge directed graph. The nodes represent the web pages and the directed edges are the hyperlinks. There are several types of links:
- inbound links: links pointing to the given website from other pages
- outbound links: links pointing from the given page to other webpages
- dangling links: links pointing to any page with no outgoing links
Crawling the web with BFS
OK, we know how to model the WWW with a huge directed graph. But how to find the nodes (pages) and edges (hyperlinks)? We have to detect the websites and hyperlinks accordingly. We can do it with graph traversal algorithms: breadth-first search or depth-first search. These algorithms are called web-crawlers. They crawl the web and detect websites and connections between them.
The original formula
OK, we know how to represent the WWW and how to crawl it with the relevant graph algorithms. So how to calculate page rank for a given website?
As you can see, a given A website page rank depends on other websites’ page rank that are linking to A. C(T) parameter is the number of outgoing links as far as website T is concerned. d is the damping factor: a value in the range 0 and 1. We have to initialize the page ranks at the beginning. The usual approach is to initialize every page rank to be 1/n. n is the total number of pages / websites on the web (a huge number).
The formula in the previous chapter is working fine. We update the values one by one. We can use matrix operations in order to do multiple calculation at the same time. So we can construct a so called “transition-matrix” out of the directed graph. This matrix is a column-stochastic one: the sum of the entries by column yields 1. We can use matrix-vector multiplication to end up with the final page ranks.
Random surfer model
We have mentioned that the importance of a webpage is measured by its popularity. So how many incoming links it has. Page rank can be defined by the probability of a random surfer on the web starting on a random page. This surfer follows hyperlinks and visits the webpages accordingly. It is similar to Markov-chains. We have to define the transition-matrix again. The entries of this matrix are the probabilities that the surfer takes that given hyperlink. The stationary distribution of the matrix is a vector. This vector contains the page ranks for the webpages.
This model is not able to cope with disconnected components as well as dangling nodes (nodes that has only incoming links). So we have to modify the formula a little bit.
This is the final formula we are looking for. M is the PageRank-matrix or “Google-matrix”. It is a column-stochastic matrix as well. The d damping factor is crucial. A random surfer follows the links on the actual website with probability (1-d). Sometimes – with probability d – the surfer leaves the actual page and navigates to another URL. This is called teleportation.