PageRank Algorithm

PageRank

EN CN

PageRank (PR) is an algorithm used by Google Search to rank web pages in their search engine results. PageRank was named after Larry Page, one of the founders of Google. PageRank is a way of measuring the importance of website pages. According to Google: PageRank works by counting the number and quality of links to a page to determine a rough estimate of how important the website is. The underlying assumption is that more important websites are likely to receive more links from other websites.

Currently, PageRank is not the only algorithm used by Google to order search results, but it is the first algorithm that was used by the company, and it is the best known. As of September 24, 2019, PageRank and all associated patents are expired.

Computation

PageRank can be computed either iteratively or algebraically. The iterative method can be viewed as the power iteration method or the power method. The basic mathematical operations performed are identical.

\[PR(p_i;t+1) = \frac{1-d}{N} + d \sum_{p_j \in M(p_i)} \frac{PR (p_j; t)}{L(p_j)}\]

where d is the damping factor,or in matrix notation

\[\mathbf{R}(t+1) = d \mathcal{M}\mathbf{R}(t) + \frac{1-d}{N} \mathbf{1}\]

where \(\mathbf {R}_{i}(t)=PR(p_{i};t)\) and \(\mathbf {1}\) is the column vector of length N containing only 1.

The matrix \(\mathcal {M}\) is defined

\[\mathcal{M}_{ij} = \begin{cases} 1 /L(p_j) , & \mbox{if }j\mbox{ links to }i\ \\ 0, & \mbox{otherwise} \end{cases}\]

Markov Chain

Markov Chain A Markov chain is a stochastic model describing a sequence of possible events in which the probability of each event depends only on the state attained in the previous event. In continuous-time, it is known as a Markov process. It is named after the Russian mathematician Andrey Markov.

Markov chains have many applications as statistical models of real-world processes, such as studying cruise control systems in motor vehicles, queues or lines of customers arriving at an airport, currency exchange rates and animal population dynamics.

Markov processes are the basis for general stochastic simulation methods known as Markov chain Monte Carlo, which are used for simulating sampling from complex probability distributions, and have found application in Bayesian statistics and artificial intelligence.

The adjective Markovian is used to describe something that is related to a Markov process

Source code

import numpy as np
from scipy.sparse import csc_matrix


def pageRank(G, s=.85, maxerr=.0001):
    """
    Computes the pagerank for each of the n states
    Parameters
    $$R = ( d M + \frac { 1 - d } { n } E ) R = A R$$
    ----------
    G: matrix representing state transitions
       Gij is a binary value representing a transition from state i to j.
    s: probability of following a transition. 1-s probability of teleporting
       to another state.
    maxerr: if the sum of pageranks between iterations is bellow this we will
            have converged.
    """
    n = G.shape[0]

    # transform G into markov matrix A
    A = csc_matrix(G, dtype=np.float)
    #Retrun one dimensional array that the count of every rows' nonzero numbers
    rsums = np.array(A.sum(1))[:, 0]
    #return the indexs of nonzero which include (row,col)
    ri, ci = A.nonzero()
    
    A.data /= rsums[ri]
    # bool array of sink states
    sink = rsums == 0

    # Compute pagerank r until we converge
    ro, r = np.zeros(n), np.ones(n)
    while np.sum(np.abs(r - ro)) > maxerr:
        ro = r.copy()
        # calculate each pagerank at a time
        for i in range(0, n):
            # inlinks of state i
            Ai = np.array(A[:, i].todense())[:, 0]
            # account for sink states
            Di = sink / float(n)
            # account for teleportation to state i
            Ei = np.ones(n) / float(n)

            r[i] = ro.dot(Ai * s + Di * s + Ei * (1 - s))

    # return normalized pagerank
    return r / float(sum(r))
if __name__ == '__main__':
    #adjacency matrix
    G = np.array([[0,0,1,0,0,0,0],
                  [0,1,1,0,0,0,0],
                  [1,0,1,1,0,0,0],
                  [0,0,0,1,1,0,0],
                  [0,0,0,0,0,0,1],
                  [0,0,0,0,0,1,1],
                  [0,0,0,1,1,0,1]])
    print(pageRank(G,s=.86))

References

PageRank

任意含有N个结点的有向图上,可以定义一个随机游走模型,即一阶马尔可夫链,转移矩阵由两部分的线性组合组成,其中一部分按照转移矩阵\(M\),从一个结点到连接出的所有结点的转移概率相等,另一部分按照完全随机转移矩阵,从任一结点到任一结点的转移概率都是\(1/n\)。这个马尔可夫链存在平稳分布,平稳分布向量R称为这个有 PageRank向图的一般,满足\(R = d M R + \frac { 1 - d } { N } {1}\)

PR值是介于0和1之间的一个数,得分越大表示网页越重要

其中\(d ( 0 \leq d \leq 1 )\)是阻尼因子,1是所有分量为1的N维向量。

阻尼系数 d(damping factor),并声明 d=0.85, 其意义是:任意时刻,用户访问到某页面后继续访问下一个页面的概率,相对应的 1-d=0.15 则是用户停止点击,随机浏览新网页的概率。d 的大小由一般上网者使用浏览器书签功能的频率的平均值估算得到。 加入阻尼系数d可以有效的防止网页黑洞, 如果有一个网页没有链出(即没有投票贡献),则所有的pr值都会被它吸引,最终pr的概率分布趋于0向量

对于某个页面i,其对应PR值大小的计算公式如下:

\[{\rm {PR}}(p_{i})={\frac {1-d}{N}}+d\sum _{p_{j}\in M(p_{i})}{\frac (p_{j})}{L(p_{j})}}\]

这里,\(p_{1},p_{2},...,p_{N}\)是目标页面\(p_{i}\) ,\(M(p_{i})\)是链入\(p_{i}\)页面的集合,\(L(p_{j})\)是页面\(p_{j}\)链出页面的数量,而\(N\)是集合中所有页面的数量。 集合中所有页面的PR值可以由一个特殊的邻接矩阵的特征向量表示。这个特征向量R为:

\[\mathbf {R} ={\begin{bmatrix}{\rm {PR}}(p_{1})\\{\rm {PR}}(p_{2})\\\vdots \\{\rm {PR}}(p_{N})\end{bmatrix}}\]

同时,R也是下面的方程组的解:

\[\mathbf {R} ={\begin{bmatrix}{(1-d)/N}\\{(1-d)/N}\\\vdots \\{(1-d)/N}\end{bmatrix}}+d{\begin{bmatrix}\ell (p_{1},p_{1})&\ell (p_{1},p_{2})&\cdots &\ell (p_{1},p_{N})\\\ell (p_{2},p_{1})&\ddots &&\\\vdots &&\ell (p_{i},p_{j})&\\\ell (p_{N},p_{1})&&&\ell (p_{N},p_{N})\end{bmatrix}}\mathbf {R}\]

这里的邻接函数(adjacency function) \(\ell (p_{i},p_{j})\) 代表“从页面j指向页面i的链接数”与“页面j中含有的外部链接总数”的比值 如果\(p_{j}\)不链向\(p_{i}\),则前面提到的“从页面j指向页面i的链接数”为零。将情况一般化:对于特定的j,应有:

\[\sum _{i=1}^{N}\ell (p_{i},p_{j})=1,\]

马尔可夫链

Markov Chain 马尔可夫链(英语:Markov chain),又称离散时间马尔可夫链(discrete-time Markov chain,缩写为DTMC[1]),因俄国数学家安德烈·马尔可夫得名,为状态空间中经过从一个状态到另一个状态的转换的随机过程。该过程要求具备“无记忆”的性质:下一状态的概率分布只能由当前状态决定,在时间序列中它前面的事件均与之无关。这种特定类型的“无记忆性”称作马尔可夫性质。马尔科夫链作为实际过程的统计模型具有许多应用。

在马尔可夫链的每一步,系统根据概率分布,可以从一个状态变到另一个状态,也可以保持当前状态。状态的改变叫做转移,与不同的状态改变相关的概率叫做转移概率。随机漫步就是马尔可夫链的例子。随机漫步中每一步的状态是在图形中的点,每一步可以移动到任何一个相邻的点,在这里移动到每一个点的概率都是相同的(无论之前漫步路径是如何的)

其中一个重要的概念就是平稳分布, 给定一个初始状态, 经过马尔可夫链n步之后趋向于一个固定概率分布,即 \(\pi {\mathbf {P}}=\pi .\,\) \(\{\pi_{i}\}\)是平稳分布

PageRank算法的缺点

  1. 没有区分站内导航链接。
  2. 没有过滤广告链接和功能链接(“分享到微博”)。
  3. 对新网页不友好(新网页的链入少)。

源码

import numpy as np
from scipy.sparse import csc_matrix


def pageRank(G, s=.85, maxerr=.0001):
    """
    计算每个网页的pr值
    $$R = ( d M + \frac { 1 - d } { n } E ) R = A R$$
    ----------
    G: 表示邻接矩阵
	Gij是一个二进制值,表示从i 网页到j网页是否有链接,有则是1, 没有为0
    s: 阻尼因子. 1-s 随机浏览新网站的概率.
    maxerr: 迭代多次之后,pageranks收敛趋于稳定
    """
    n = G.shape[0]

    # 按列压缩矩阵
    A = csc_matrix(G, dtype=np.float)
    #返回每列非零元素的个数的一维矩阵(表示某个网页节点收到其它网页节点的投票/链入数)
    rsums = np.array(A.sum(1))[:, 0]
    #返回A矩阵中非零元素下标(rows, cols)
    ri, ci = A.nonzero()
    #计算转移概率,成为转移矩阵
    A.data /= rsums[ri]
    # bool array of sink states
    sink = rsums == 0

    # 计算pagerank 直到收敛
    ro, r = np.zeros(n), np.ones(n)
    while np.sum(np.abs(r - ro)) > maxerr:
    	#初始化为一阶矩阵元素全为1
        ro = r.copy()
        # 计算每一列的pagerank
        for i in range(0, n):
            # 把压缩稀疏矩阵解压后提取i列的概率列矩阵
            Ai = np.array(A[:, i].todense())[:, 0]
            # account for sink states
            Di = sink / float(n)
            # 单位矩阵/n
            Ei = np.ones(n) / float(n)
            #得出点积
            r[i] = ro.dot(Ai * s + Di * s + Ei * (1 - s))

    #
    return r / float(sum(r))

if __name__ == '__main__':
    #邻接矩阵G
    G = np.array([[0,0,1,0,0,0,0],
                  [0,1,1,0,0,0,0],
                  [1,0,1,1,0,0,0],
                  [0,0,0,1,1,0,0],
                  [0,0,0,0,0,0,1],
                  [0,0,0,0,0,1,1],
                  [0,0,0,1,1,0,1]])
    print(pageRank(G,s=.86))

应用

  1. 面向Twitter的分析系统研究
  2. 微博舆情分析中信息转发路径提取方法研究

参考