
Google’s Crawling Techniques

Google’s Caffeine update in 2010 was a major shift in the way Google works, and it helped Google move towards real-time indexing. Google Instant and Instant Previews are the proof of this. If we carefully analyse what Instant Previews offers, it gives us a clear indication, and enough proof, of the role that Chrome (or Firefox) plays in the search stack. The tabbed design of Chrome (or Chromium, if we go back to the Selenium or WebKit projects) is there to provide multi-threaded crawling; tabbed browsing is just a smart side effect of it.
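To make the "one tab, one crawl thread" idea concrete, here is a minimal sketch of multi-threaded crawling. It is not how Googlebot or Chrome actually work; the `FAKE_WEB` dictionary, the `fetch` stand-in and the `crawl` helper are all hypothetical names invented for this illustration, and no real network access happens.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory "web": URL -> (page text, outgoing links).
FAKE_WEB = {
    "a.example/": ("home", ["a.example/about", "a.example/blog"]),
    "a.example/about": ("about", ["a.example/"]),
    "a.example/blog": ("blog", ["a.example/post-1"]),
    "a.example/post-1": ("post", []),
}


def fetch(url):
    """Stand-in for loading a page in one 'tab'; returns (url, text, links)."""
    text, links = FAKE_WEB.get(url, ("", []))
    return url, text, links


def crawl(seed, max_workers=4):
    """Breadth-first crawl that fetches each level of the frontier in parallel."""
    seen = {seed}
    frontier = [seed]
    index = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while frontier:
            next_frontier = []
            # Each URL in the frontier is fetched on its own worker thread.
            for url, text, links in pool.map(fetch, frontier):
                index[url] = text
                for link in links:
                    if link not in seen:
                        seen.add(link)
                        next_frontier.append(link)
            frontier = next_frontier
    return index


index = crawl("a.example/")
print(sorted(index))
```

Each worker thread plays the part of a tab here: pages at the same depth are loaded side by side, and the links they expose feed the next round.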

 

After this major update there has been a flurry of patents from Google, shifting more and more towards an algorithm dominated by user behaviour. Whether it is click path, bounce rate, exit rate, page load time or even link analysis, everything is weighted according to how users interact with the website or its links. Even the buzz about whether social media plays a part comes down to how users interact with the content or image being shared.
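As a rough illustration of what "weighted according to how users interact" could look like, here is a toy scoring sketch. The signal names, the weights and the `behaviour_score` formula are assumptions made up for this example; nothing here is taken from a Google patent or from Google's actual algorithm.

```python
from dataclasses import dataclass


@dataclass
class PageSignals:
    url: str
    click_through_rate: float  # fraction of impressions that were clicked, 0..1
    bounce_rate: float         # fraction of visits that left immediately, 0..1
    exit_rate: float           # fraction of views that ended the session, 0..1
    load_time_s: float         # page load time in seconds


def behaviour_score(page):
    """Combine behaviour signals into one score (weights are illustrative only)."""
    speed = 1.0 / (1.0 + page.load_time_s)  # faster pages get closer to 1.0
    return (
        0.4 * page.click_through_rate
        + 0.3 * (1.0 - page.bounce_rate)
        + 0.2 * (1.0 - page.exit_rate)
        + 0.1 * speed
    )


def rank(pages):
    """Sort pages so the highest behaviour score comes first."""
    return sorted(pages, key=behaviour_score, reverse=True)


pages = [
    PageSignals("slow.example", 0.10, 0.80, 0.70, 6.0),
    PageSignals("fast.example", 0.35, 0.30, 0.25, 1.2),
]
for page in rank(pages):
    print(page.url, round(behaviour_score(page), 3))
```

The point is only that a page users click on, stay on and load quickly floats to the top; the real weighting, if it exists in this form at all, is not public.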

 

Here’s a list of a few of these patents, with excerpts from some of them:

  • Ranking documents based on user behaviour and/or feature data

“The user behavior data might be obtained from a web browser or a browser assistant associated with clients 210. A browser assistant may include executable code, such as a plug-in, an applet, a dynamic link library (DLL), or a similar type of executable object or process that operates in conjunction with (or separately from) a web browser. The web browser or browser assistant might send information to server 220 concerning a user of a client 210.”

  • Document Segmentation Based on Visual Gaps

“In an attempt to increase the relevancy and quality of the web pages returned to the user, a search engine may attempt to sort the list of hits so that the most relevant and/or highest quality pages are at the top of the list of hits returned to the user. For example, the search engine may assign a rank or score to each hit, where the score is designed to correspond to the relevance and/or importance of the web page.”

  • Browser Shimming
  • Website Satisfaction profiling based on User action

Spidering Techniques

So, if Googlebot is Chrome, how long will it be before Google crowdsources its crawling and Chrome evolves into Googlebot?
