SEO for Beginners: Understanding Internet Crawlers

Posted on 20 April 2013 by Zafar @seompdotcom

Many people find the whole SEO subject mind-numbingly confusing. Trying to understand it involves so many technical terms it can be like learning a foreign language.

So let’s start with one of the basics. One of those strange words thrown about is ‘Crawlers’. Also known as Spiders or Ants, these are robot programs that run automated searches of the internet. The information they find is copied, processed and indexed for use by search engines.

To find all the public pages on the internet and work out which ones best meet a user’s needs, search engines start with a selection of known high-quality sites called ‘seeds’. Using the links on these pages, crawlers migrate to other pages and look for more links. Moving from link to link traces the map that connects the internet together, so the crawlers can literally crawl their way through billions of interlinked documents. The crawlers also use the links on a page to help them analyze its content: if the links are relevant to the page, that helps the crawler assess how relevant the page is to a user’s keyword query.
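
For the curious, here is a rough Python sketch of that crawl loop: start from a seed URL, fetch each page, collect its links, and queue any new ones. It is only an illustration of the idea, not how any real search engine is built, and the seed address is just a placeholder.

    # A minimal crawl loop: fetch a page, collect its links, queue new ones.
    # Real crawlers also respect robots.txt, rate limits and politeness rules.
    from collections import deque
    from html.parser import HTMLParser
    from urllib.parse import urljoin
    from urllib.request import urlopen

    class LinkCollector(HTMLParser):
        """Collects the href of every <a> tag on a page."""
        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                href = dict(attrs).get("href")
                if href:
                    self.links.append(href)

    def crawl(seeds, max_pages=50):
        """Breadth-first crawl outward from the seed pages."""
        queue = deque(seeds)
        seen = set(seeds)
        while queue and len(seen) <= max_pages:
            url = queue.popleft()
            try:
                html = urlopen(url, timeout=5).read().decode("utf-8", "ignore")
            except Exception:
                continue  # skip pages that fail to load
            collector = LinkCollector()
            collector.feed(html)
            for href in collector.links:
                absolute = urljoin(url, href)  # resolve relative links
                if absolute.startswith("http") and absolute not in seen:
                    seen.add(absolute)
                    queue.append(absolute)
            yield url, collector.links

    # Example: crawl outward from one placeholder seed.
    # for page, links in crawl(["https://example.com"]):
    #     print(page, "->", len(links), "links found")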

This link-crawling process is repeated until whole series of interrelated or meaningful sections from pages can be stored under certain keywords and held in a huge array of hard drives, ready for when a human user conducts a search. These databases are updated constantly by crawlers as new sites come on board. There are trillions of public pages on the internet, so this is an enormously complex process, but the crawlers achieve it at a much faster rate than any human possibly could.
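
As a very loose illustration of that ‘stored under certain keywords’ step, the toy code below builds what is called an inverted index: a mapping from each word to the pages it appears on. Real search indexes are distributed across thousands of machines and far more sophisticated; the page data here is made up.

    # Toy inverted index: word -> set of pages containing that word.
    from collections import defaultdict

    def build_index(pages):
        """pages: dict of url -> page text."""
        index = defaultdict(set)
        for url, text in pages.items():
            for word in text.lower().split():
                index[word].add(url)
        return index

    pages = {
        "https://example.com/sausages": "award winning sausages and sausage recipes",
        "https://example.com/whiskey": "a beginner's guide to irish whiskey",
    }
    index = build_index(pages)
    print(index["sausages"])   # {'https://example.com/sausages'}
    print(index["irish"])      # {'https://example.com/whiskey'}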

Other data recorded includes a list of all the possible keywords (no small feat), a map of all the links, whether those links are likely to be adverts, and the ‘anchor text’ used in links, i.e. the words often (but not always) underlined in blue. If those words are ‘click here’ they tell the crawler nothing. But if the anchor text says ‘sausages’, then a valuable piece of information is stored about that sausage-themed landing page, which helps place it in the page rankings for the keyword ‘sausages’.
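
To make the anchor-text point concrete, here is a small sketch, using made-up HTML, of how a program could record each link alongside its anchor text. A generic anchor like ‘click here’ says nothing about the landing page, while ‘sausages’ does.

    # Pairs each link's href with the visible anchor text inside the <a> tag.
    from html.parser import HTMLParser

    class AnchorCollector(HTMLParser):
        def __init__(self):
            super().__init__()
            self.anchors = []        # list of (href, anchor text) pairs
            self._href = None
            self._text = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                self._href = dict(attrs).get("href")
                self._text = []

        def handle_data(self, data):
            if self._href is not None:
                self._text.append(data)

        def handle_endtag(self, tag):
            if tag == "a" and self._href is not None:
                self.anchors.append((self._href, "".join(self._text).strip()))
                self._href = None

    collector = AnchorCollector()
    collector.feed('<a href="/sausages">sausages</a> and <a href="/offers">click here</a>')
    print(collector.anchors)   # [('/sausages', 'sausages'), ('/offers', 'click here')]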

The algorithm that determines the relevance of anchor text was refined by Google after a practice known as “Google bombing” became a problem. The most notorious campaign used the anchor text ‘miserable failure’ to link to George W. Bush’s biography page on the official White House website, giving that page the number-one ranking for searches on ‘miserable failure’. The campaigners achieved this by getting as many people as possible to put this anchor text on their web pages, with an inevitable impact on relevance. So now it is advisable to have anchor text that is relevant to both the publishing site and the landing site, even if that relevance is tenuous. For example, if you had a website dedicated to an author of crime novels, many of which were set in Ireland, and you published anchor text that linked to a landing site about Irish whiskey, then a crawler would consider this a relevant link. Crawlers don’t judge; they just read words.

This relevance increases the page ranking of the Irish whiskey site, because of a complicated formula whereby Site A (the crime writer) passes a small portion of its page rank to Site B (Irish whiskey) by linking to it and having keywords in common (‘Irish’).

The size of Site B’s share depends on how much page rank Site A had to begin with. So to climb up the SERPs it’s advisable to have inbound links from popular sites with relevant subject matter.
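
The sketch below shows the general shape of that idea with a simplified, PageRank-style calculation: each page repeatedly shares its current score among the pages it links to, so a link from a high-ranking page passes on more weight than a link from an obscure one. The link graph and damping factor here are invented for illustration; the real formula search engines use is more elaborate and kept secret.

    # Simplified PageRank-style iteration: a page's score is shared equally
    # among the pages it links to on each pass.
    def rank(links, iterations=20, damping=0.85):
        """links: dict of page -> list of pages it links to."""
        pages = set(links) | {p for targets in links.values() for p in targets}
        scores = {page: 1.0 / len(pages) for page in pages}
        for _ in range(iterations):
            new_scores = {page: (1 - damping) / len(pages) for page in pages}
            for page, targets in links.items():
                for target in targets:
                    new_scores[target] += damping * scores[page] / len(targets)
            scores = new_scores
        return scores

    # Hypothetical graph: a popular portal links to the crime-writer site,
    # and the crime-writer site links to the Irish whiskey site.
    graph = {
        "big-portal": ["crime-writer", "irish-whiskey"],
        "crime-writer": ["irish-whiskey"],
        "irish-whiskey": [],
    }
    print(rank(graph))   # the whiskey site picks up score from both links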

This algorithm refinement for relevance in linking is part of a search engine’s constant battle against techniques that manipulate search engines in order to build page ranking, termed ‘Black Hat SEO’. Search engines take a very dim view of this practice, as it easily spoils the quality of service for the user. Who wants to see a search results page full of irrelevant information?

These days popular blogging sites find themselves bombarded with bogus guest-posting requests aimed at getting anchor text inserted in their pages. This is often an automated process, so the offending pages inevitably rely on repetition to achieve their aims. To counteract this kind of practice, Google has refined its algorithms to ignore excessive repetition, too many instances of anchor text, and content that hasn’t been updated regularly. So webpage content that repeats taglines or sections of prose will find itself dropped from the search results pages pretty sharpish.
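
Purely as an illustration of what ‘too much repetition’ might look like to a program (and emphatically not Google’s actual method), the snippet below counts repeated five-word sequences, so a tagline pasted several times into a page stands out immediately.

    # Counts repeated five-word "shingles" as a crude measure of repetition.
    from collections import Counter

    def repeated_shingles(text, size=5):
        words = text.lower().split()
        shingles = [" ".join(words[i:i + size]) for i in range(len(words) - size + 1)]
        return {s: n for s, n in Counter(shingles).items() if n > 1}

    tagline = "buy the finest irish whiskey online today "
    page_text = tagline * 4 + "plus a little genuinely original content"
    print(repeated_shingles(page_text))   # the repeated tagline shows up over and over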

Search engines are trying to make the user’s experience of the internet as useful as possible. Web designers have to improve content to get noticed, which can only be good news for everyone.

Thursa Wilde is a content writer for Gladwords.com, a link-building company.  
