The Theory Behind Google’s Theme-Based or Thematic Ranking Algorithm

SEO experts have for many years considered the significance of theme-based ranking algorithms. There are a large number of theoretical approaches that address the incorporation of thematic information as a ranking criterion for search engines. What they all have in common is that they consider not only a web page’s content but also its context, that is, what it has in common with the pages it is linked to. Put another way, the contents of an entire website influence the ranking of a single page on that website. The takeaway is that a single page’s ranking is based on its own content, on the pages that link to it, and on the pages it links to.

It’s somewhat controversial to discuss how thematic ranking is implemented in the Google algorithm, because it’s practically impossible to determine the precise nature of that algorithm. SEO experts debate this back and forth all the time, but one working hypothesis, which we’ll treat here, is that inbound links from thematically similar pages have a more significant effect on a page’s PageRank than inbound links from unrelated pages. We’ll take a look at two approaches to incorporating this idea. The first, developed by Richardson and Domingos, is the notion of the “intelligent surfer”. The second, developed by Haveliwala, is called Topic-Sensitive PageRank.

Intelligent Surfer

Richardson and Domingos started with the random surfer model as a jumping-off point to explain their thematic modification of the PageRank algorithm. The basic idea is to consider an intelligent surfer who only follows links related to the original search query and who, after getting bored, only jumps to new pages that are related to that query. This is a significant departure from the random surfer model, and arguably a more realistic picture of how people browse. It means that only pages containing the original search term are relevant to the intelligent surfer. The question that naturally arises is how a user’s behavior should influence the algorithm, since the algorithm only considers the pages themselves. The solution is to run a separate calculation for each term that occurs on a page, and to consider only the links between pages that contain that term. This is good in theory, but it makes the computation of PageRank difficult. Rare search terms cause problems, for instance, because a term has to appear not only on a page but also on the pages that link to it. This produces questionable search results for many terms.
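The restriction described above can be sketched in a few lines of Python. This is a simplified reading of the paragraph, not Richardson and Domingos’s exact formulation: the function names, the toy link graph and the page_terms mapping are all made up for illustration, and the damping factor d = 0.85 is an assumed value.

```python
def pagerank(links, d=0.85, iterations=50):
    """Iterate PR(A) = (1 - d) + d * sum(PR(T) / C(T)) over pages T linking to A."""
    pr = {page: 1.0 for page in links}
    for _ in range(iterations):
        new = {}
        for page in links:
            # Each linking page T passes on its rank divided by its outbound link count C(T).
            inbound = sum(pr[t] / len(links[t]) for t in links if page in links[t])
            new[page] = (1 - d) + d * inbound
        pr = new
    return pr


def term_pagerank(links, page_terms, term, d=0.85, iterations=50):
    """PageRank over only the pages that contain `term` and the links among them."""
    relevant = {p for p in links if term in page_terms.get(p, set())}
    sub_links = {p: [t for t in links[p] if t in relevant] for p in relevant}
    return pagerank(sub_links, d, iterations)


# Hypothetical four-page web: page -> pages it links to, and page -> terms it contains.
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
page_terms = {
    "a": {"jazz", "piano"},
    "b": {"jazz"},
    "c": {"jazz", "guitar"},
    "d": {"guitar"},
}

jazz_ranks = term_pagerank(links, page_terms, "jazz")    # covers a, b and c only
piano_ranks = term_pagerank(links, page_terms, "piano")  # covers a only, with no links left
```

The “piano” case shows the rare-term problem in miniature: only one page contains the term, so no links survive the filter and the page is left with nothing but the baseline value. The sketch also makes the cost visible, since one full calculation is needed per term.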

There’s an additional problem of scalability. The model requires about 100 to 200 times as much time to run as the original PageRank algorithm, which is a huge downside in a world where calculations must be run many, many times. It’s possible to run the algorithm on real-world data, but it is notably slower than the original.

More memory is required to run the algorithm as well, but this is less of an issue because the memory required for PageRank is relatively low to begin with. The main issue is the computational time. If a PageRank calculation takes five hours to run, the intelligent surfer model would require around three weeks even at a factor of 100 (500 hours). This is a prohibitive increase, and unless the algorithm is run on a supercomputer against a relatively stable set of data, it won’t be as practical as the original PageRank.

Topic-Sensitive Ranks

Haveliwala has developed a model called Topic-Sensitive PageRank that is less demanding in terms of computational resources. Like the intelligent surfer model, Haveliwala’s model seeks to incorporate different ranks for different terms. The difference is that Topic-Sensitive PageRank does not require a huge number of PageRanks, one for every possible search query. Rather, it relies on a few ranks for different topics, which makes it significantly more realistic to apply in the real world. It’s based on the link structure of the world wide web, with a different weighting for each topic under consideration.

The basic idea is to influence the PageRank manually. We add a function E(A) to the PageRank algorithm:

PR(A) = E(A) (1-d) + d (PR(T1)/C(T1) + … + PR(Tn)/C(Tn))

This is the same modification as in the Yahoo bonus discussion. Haveliwala goes one step further, though. He assigns different initial values for different topics. Each topic has associated authority pages, and on that basis a unique PageRank is calculated for each topic, separately, but each time for the entire world wide web.
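To make the role of E(A) concrete, here is a minimal Python sketch that iterates the formula above. The function name, the three-page link graph, the E values and the damping factor d = 0.85 are illustrative assumptions, not data from any real crawl.

```python
def pagerank_with_e(links, e, d=0.85, iterations=50):
    """Iterate PR(A) = E(A) * (1 - d) + d * sum(PR(T) / C(T))."""
    pr = {page: 1.0 for page in links}
    for _ in range(iterations):
        new = {}
        for page in links:
            # Sum the contributions of every page T that links to this page.
            inbound = sum(pr[t] / len(links[t]) for t in links if page in links[t])
            new[page] = e[page] * (1 - d) + d * inbound
        pr = new
    return pr


# Hypothetical three-page web: page -> pages it links to.
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}

# E(A) = 1 for every page is the unmodified formula.
uniform = pagerank_with_e(links, {"a": 1.0, "b": 1.0, "c": 1.0})

# Manually favouring page "b"; the values still average 1 across the pages.
boosted = pagerank_with_e(links, {"a": 0.5, "b": 2.0, "c": 0.5})
```

Because the E values average 1 and every page in this toy graph has at least one outbound link, the PageRanks still add up to the number of pages; the intervention only shifts rank towards the favoured page and the pages it links to.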

Haveliwala chose the 16 top-level categories of the ODP (Open Directory Project) to identify important topics and to decide where to intervene in PageRank. For each topic, he assigns a higher initial value to the pages listed in the corresponding ODP category. As an example, when the algorithm calculates a PageRank for the topic of music, every ODP page in the music category receives a higher initial value, which is then propagated through the link graph by the PageRank calculation. This gives us higher PageRanks for pages related to a given topic.
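The per-topic calculation might look roughly like the following Python sketch. Everything specific in it is an assumption made for illustration: the five-page link graph, the two-topic stand-in for the ODP categories, the boost of 3.0 for listed pages, and the choice to rescale E so that it averages 1.

```python
def pagerank_with_e(links, e, d=0.85, iterations=50):
    """Iterate PR(A) = E(A) * (1 - d) + d * sum(PR(T) / C(T))."""
    pr = {page: 1.0 for page in links}
    for _ in range(iterations):
        new = {}
        for page in links:
            inbound = sum(pr[t] / len(links[t]) for t in links if page in links[t])
            new[page] = e[page] * (1 - d) + d * inbound
        pr = new
    return pr


def topic_e(pages, topic_pages, boost=3.0):
    """Give pages listed under the topic a higher initial value, scaled to average 1."""
    raw = {p: (boost if p in topic_pages else 1.0) for p in pages}
    mean = sum(raw.values()) / len(raw)
    return {p: value / mean for p, value in raw.items()}


# Hypothetical web: two small clusters joined by a hub page.
links = {
    "m1": ["m2"],
    "m2": ["m1", "hub"],
    "s1": ["s2"],
    "s2": ["s1", "hub"],
    "hub": ["m1", "s1"],
}

# Stand-in for the directory: topic -> pages listed in that category.
directory = {"music": {"m1"}, "sports": {"s1"}}

# One complete PageRank calculation per topic, each over the whole graph.
topic_ranks = {
    topic: pagerank_with_e(links, topic_e(links, listed))
    for topic, listed in directory.items()
}
```

Note that page m2 is not listed in the directory at all, yet it ranks higher under “music” than under “sports” because the boost given to m1 reaches it through the link from m1.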

Of course, using the ODP is a bit limiting when it comes to identifying topics. Problems include a dependency on ODP editors and an only rough, preliminary division into topics. Significantly, none of these issues is insurmountable; they call for refinement rather than replacement. This suggests that the topic-sensitive algorithm is both promising and practical, which helps explain the attention it has received from researchers.

One of the most important aspects of Topic-Sensitive PageRank is that the user’s preferences are taken into account. A thematic ranking algorithm accomplishes nothing if we don’t know what the user is looking for. The major drawback is that, without this knowledge, the topic-sensitive algorithm can’t function properly.
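One plausible way to bring the user’s preferences into the ranking is to weight the per-topic PageRanks by how interested the user is in each topic. The Python sketch below assumes a simple weighted sum; the function name, the topic scores and the user profile are made-up illustrative values, not measured results.

```python
def blend_topic_ranks(topic_ranks, preferences):
    """Combine per-topic PageRanks into one score per page, weighted by user preference."""
    total = sum(preferences.values())
    if total <= 0:
        # Without any known preference there is nothing to weight by.
        raise ValueError("preferences must contain at least one positive weight")
    pages = next(iter(topic_ranks.values()))
    return {
        page: sum(
            weight / total * topic_ranks[topic][page]
            for topic, weight in preferences.items()
        )
        for page in pages
    }


# Illustrative per-topic PageRanks for two pages (invented numbers).
topic_ranks = {
    "music": {"m1": 1.6, "s1": 0.7},
    "sports": {"m1": 0.7, "s1": 1.6},
}

# Hypothetical user profile: mostly interested in music.
profile = {"music": 0.8, "sports": 0.2}

scores = blend_topic_ranks(topic_ranks, profile)
```

With this profile the music page m1 ends up ahead of the sports page s1. An empty or all-zero profile raises an error, which mirrors the drawback described above: without knowledge of the user, there is no basis for choosing between the topic rankings.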

A few ways to acquire this knowledge have been proposed. The first is to have the user highlight terms on a web page, which shows us what on that page the user might be looking for. The Google Toolbar or a Google PageRank checker is also useful here, as these tools submit data regarding queries and the density of those query terms on pages that a user visits via Google. This data can be used to create regularly updated user profiles that incorporate each user’s preferences into the model.