
PageRank

1. PageRank

The famous PageRank algorithm was introduced by Page and Brin in 1999 and forms the basis of the algorithm Google's search engine uses to rank web pages. Let's see how it works and how it can be used in social networks.

2. The PageRank algorithm

Ranking web pages for a search engine amounts to estimating the probability of visiting each one. The PageRank algorithm achieves this by simulating surfing behavior, assuming that surfers click on one of the links on the current webpage to get to the next one. The web pages and the links between them therefore form a huge network. To reach node J in the network here, the surfer has to be at either H or I. In the former case there is a one-third probability of going to J, and if the surfer is on I, the probability of going to J next is 25%. In addition, the ranking of J depends on the rankings of H and I, since they indicate how likely the surfer is to be on H or I in the first place. This means that the PageRank of a webpage depends on two components: the PageRank of the web pages that link to it, and the number of links those linking web pages have.

To make the surfing simulation more realistic, the surfer can, in addition to clicking on one of the links on the webpage, randomly jump to any other webpage in the network, as indicated by the red arrow in the figure here. The probability of this random jump is called the restart probability and is denoted by one minus alpha.

Here you can see how to compute the PageRank of J. For each node that has a link to J, here H and I, we multiply the likelihood of going to J by the PageRank of that node and sum up the values. This sum is multiplied by alpha, the probability of following links. To this we add the restart term: one minus alpha times the restart value of node J, indicated by e_J here. Typically the restart value is assumed to be uniformly distributed over all web pages.
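Written out, the update just described for node J (a sketch based on the figure as narrated, with three out-links from H and four from I) is:

$$\mathrm{PR}(J) = \alpha \left( \frac{\mathrm{PR}(H)}{3} + \frac{\mathrm{PR}(I)}{4} \right) + (1 - \alpha)\, e_J$$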

3. The PageRank algorithm

Here you can see the PageRank formula to compute the PageRanks of all nodes simultaneously. It requires A, the adjacency matrix of the network. Furthermore, e is the restart vector, which gives each webpage's likelihood of being the target of a random jump. The default value for alpha is 0.85, which means the surfer has an 85% likelihood of clicking on links and a 15% likelihood of making a random jump. The equation can be solved either iteratively, updating the PageRanks in each step using the PageRanks of the previous step, or deterministically using matrix inversion. It is quite easy to compute PageRank in R using the igraph package: you simply apply the `page.rank` function to the network, as seen here. The result is a list, and its first element is the vector of PageRanks.
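As a minimal sketch (the toy graph, its node names, and the number of iterations are made up for illustration), the following R code computes PageRanks with igraph's `page.rank` and then reproduces them by iterating the update described above:

```r
library(igraph)

# A small made-up directed web graph
g <- graph_from_literal(H -+ J, H -+ A, H -+ B,
                        I -+ J, I -+ A, I -+ B, I -+ C,
                        J -+ H, A -+ H, B -+ I, C -+ I)

# PageRank via igraph; the $vector element holds the scores
# (recent igraph versions name this function page_rank)
pr <- page.rank(g, damping = 0.85)
pr$vector

# The same scores by iterating PR <- alpha * t(M) %*% PR + (1 - alpha) * e,
# where M is the row-normalised adjacency matrix (assumes no dangling nodes)
A     <- as_adjacency_matrix(g, sparse = FALSE)
M     <- A / rowSums(A)
alpha <- 0.85
n     <- vcount(g)
e     <- rep(1 / n, n)      # uniform restart vector
PR    <- rep(1 / n, n)      # initial guess
for (i in 1:100) {
  PR <- alpha * t(M) %*% PR + (1 - alpha) * e
}
drop(PR)                    # matches pr$vector up to numerical tolerance
```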

4. Personalized PageRank

Personalized PageRank is a version of the algorithm in which the restart vector has been customized, giving more weight to some nodes. So when a random jump happens, we can steer it towards certain nodes. For example, in the network of data scientists, A might often have very good ideas for new courses and thus influence those around him. In the R code, we add the `personalized` argument, in which A has value 1 and all other nodes have value 0. As a result, nodes that are closer to A will be ranked higher, because whenever a random jump happens, it jumps to A.
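As a sketch of that call (the graph and node names are again made up for illustration), the `personalized` argument can be used like this:

```r
library(igraph)

g <- graph_from_literal(A -+ B, B -+ C, C -+ A, C -+ D, D -+ B)

# Restart vector: all random jumps land on A (1 for A, 0 elsewhere)
restart <- as.numeric(V(g)$name == "A")

ppr <- page.rank(g, damping = 0.85, personalized = restart)
ppr$vector   # nodes close to A are ranked higher
```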

5. Let's practice!

Now let's compute some PageRanks for the churn network.
