Decoding SEO Meta Tags

I decided to take a little break from denouncing "Search Engine Optimization Myths" and focus on something else this weekend. Don’t worry, I will continue to break those myths, as there are still many more left in the pipeline! ;)

Today let us discuss the role of nofollow and dofollow tags in SEO. There was a time when I was ignorant of the meaning of those tags and would generally use them in a "mechanical" fashion. Well, you live and learn, so to speak, and today I have got a little wiser about those tags.

META tags are the tags that go like <META NAME=> inside the <HEAD> section of your webpage, before the closing </HEAD> tag. Meta tags serve several purposes. Some meta tags instruct browsers how to display a particular webpage, while others instruct search engines how to index a webpage. In this article I will discuss the latter!

The following are some of the most popular meta tags used for the purpose of search engine optimization:

1) "Index" vs. "Noindex": If you use a command like:

<meta name="ROBOTS" content="index">

It tells the search engine robots to index the webpage. In my opinion, this is a redundant tag since search engines will index your page whether or not you include the tag.

On the other hand, if you use a meta tag like:

<meta name="ROBOTS" content="noindex">

It forbids the robots from indexing the webpage. You should add this tag before the closing </head> tag of every webpage you don’t want to be indexed. This command is respected by all the major search engines.

2) "Follow" vs. "Nofollow": These tags instruct a search engine whether or not to follow the links within your webpage.
Let us say that your webpage has a lot of links, internal and/or external. If you add a meta tag like:

<meta name="ROBOTS" content="follow">

It tells the search engine spiders to follow all the links contained in the webpage.

On the other hand, a tag like:

<meta name="ROBOTS" content="nofollow">

tells them to do just the opposite, that is, "DO NOT follow any of the links in the webpage".

You may also come across links that carry a rel="follow" attribute, such as:

<a href="http://domain.com" rel="follow">Keyword</a>

NOTE: "follow" is not a value that search engines recognize for the rel attribute, so it is pretty much useless. Your links are followed by default whether or not you add it, and it does not override a "nofollow" meta tag on the same webpage!

On the other hand, let’s say that your webpage has both internal and external links, and you want to block the robots from following the external links only. In such a case, you can add the rel="nofollow" attribute to each of those links, such as:

<a href="http://domain.com" rel="nofollow">Keyword</a>

Links that carry the rel="nofollow" attribute are generally not followed by search engines, although some of them, such as Google, now treat it as a hint rather than a strict command!
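If you want to see which links on a page are marked this way, a rough sketch in Python could look like the one below. It uses only the standard library; the class name LinkCollector, the function collect_links and the sample markup are made up for illustration, and a real crawler does a lot more than this.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects (href, is_nofollow) pairs from the <a> tags of a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr_map = dict(attrs)
        href = attr_map.get("href")
        if href is None:
            return
        # rel may hold several space-separated values, e.g. "nofollow noopener"
        rel_values = (attr_map.get("rel") or "").lower().split()
        self.links.append((href, "nofollow" in rel_values))


def collect_links(html):
    """Return every link in the markup with a flag telling if it is nofollow."""
    parser = LinkCollector()
    parser.feed(html)
    return parser.links


# Hypothetical sample markup: one plain link and one nofollow link.
sample = (
    '<a href="/about.html">About</a>'
    '<a href="http://domain.com" rel="nofollow">Keyword</a>'
)

for href, nofollow in collect_links(sample):
    print(href, "-> skip" if nofollow else "-> follow")
```

The check looks for "nofollow" among the rel values rather than comparing the whole attribute, since a link can carry more than one rel value at a time.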

3) "All" vs. "None": The meta tag <meta name="robots" content="all"> is shorthand for "index,follow": it tells the search engines to index the webpage and follow its links. This is yet another example of a useless tag, as search engines would do both by default.

On the other hand, <meta name="robots" content="none"> is shorthand for "noindex,nofollow". If you want to tell robots NOT to index an entire website, or say, remove the website from the search engines’ index, you can add this tag on all your webpages.

In my opinion this is rarely worth the effort. If you don’t want robots to spider an entire website or certain directories within the website, nothing beats the ease of using the robots.txt file (which I will be discussing below)!

4) "Archive" vs. "Noarchive": This tag tells a search engine whether or not to cache a certain webpage. It is most closely associated with the Googlebot, although it is not exclusive to Google. The full meta tag should go like:

<meta name="GOOGLEBOT" content="noarchive"> (if your intention is to tell Google NOT to cache your webpage)

<meta name="GOOGLEBOT" content="archive"> (if you want to allow Google to cache a webpage)

Note that:

a) The <meta name="GOOGLEBOT" content="archive"> tag is pretty much useless, since Google would cache a webpage by default.

b) The "noarchive" tag was popularized by Google, but other major search engines such as Yahoo and Bing have honored it as well. To address all of them at once, use the ROBOTS name instead of GOOGLEBOT.

c) There is little point in using the "noarchive" tag side by side with the "noindex" tag! Think about it, if you disallow a bot from indexing a webpage, how would it be able to cache it? The combination is harmless, but "noindex" alone already does the job.

The "noarchive" tag is best used when you want Google to index your website but NOT to cache it. In such a case, you can use a meta tag like this:

<meta name="GOOGLEBOT" content="index,noarchive">

ROBOTS vs. GOOGLEBOT: GOOGLEBOT is the spidering robot of Google, while ROBOTS addresses every spider (it can be Yahoo, MSN, Google, or even your own local search engine, just in case you run one).

There may be times when you want to offer a different set of instructions to different search engine bots. For example, let us say that you want your webpage to be indexed by all robots EXCEPT Google. In such a case, you can use a meta tag like:

<meta name="GOOGLEBOT" content="noindex">
<meta name="ROBOTS" content="index">
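To check how such a pair of tags plays out for a given bot, here is a small Python sketch. It assumes that a robot combines the generic ROBOTS tag with the tag addressed to it by name and lets the restrictive value win, which is how Google documents its own behavior; other robots may resolve things differently. The names RobotsMetaParser and may_index are hypothetical.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the content values of <meta> tags, keyed by lowercased name."""

    def __init__(self):
        super().__init__()
        self.directives = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr_map = dict(attrs)
        name = (attr_map.get("name") or "").lower()
        content = attr_map.get("content") or ""
        if name:
            values = {v.strip().lower() for v in content.split(",") if v.strip()}
            self.directives.setdefault(name, set()).update(values)


def may_index(html, bot_name):
    """Return False if the generic or the bot-specific tag says noindex/none.

    Assumption: restrictive values from both tags are combined.
    """
    parser = RobotsMetaParser()
    parser.feed(html)
    generic = parser.directives.get("robots", set())
    specific = parser.directives.get(bot_name.lower(), set())
    return not ({"noindex", "none"} & (generic | specific))


page = (
    '<head><meta name="GOOGLEBOT" content="noindex">'
    '<meta name="ROBOTS" content="index"></head>'
)
print("googlebot:", may_index(page, "googlebot"))
print("slurp:", may_index(page, "slurp"))
```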

How to Combine Meta Tags: Since meta robot tags are very flexible by nature, you can use them in any way you feel. I am giving a few examples below:

a) Example 1: You want to block all robots from both indexing as well as following the links within a webpage:

<meta name="ROBOTS" content="noindex,nofollow">

b) Example 2: You want to allow all robots to index a webpage, but block them from following the links contained therein:

<meta name="ROBOTS" content="index,nofollow">

c) Example 3: You want to allow all robots to follow the links contained within a given webpage, but block them from indexing it:

<meta name="ROBOTS" content="noindex,follow">

d) Example 4: You want to block Google from indexing a webpage as well as following the links within it:

<meta name="GOOGLEBOT" content="noindex,nofollow">

e) Example 5: You want to allow Google to index a webpage, but block it from following the links contained therein:

<meta name="GOOGLEBOT" content="index,nofollow">

f) Example 6: You want to allow Google to follow the links contained within a given webpage, but block it from indexing it:

<meta name="GOOGLEBOT" content="noindex, follow">

g) Example 7: You want to allow all robots EXCEPT Google to index and follow links of a webpage:

<meta name="GOOGLEBOT" content="noindex,nofollow">
<meta name="ROBOTS" content="index,follow">

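h) Example 8: You want to allow ONLY Google to index and follow the links of a webpage:

<meta name="GOOGLEBOT" content="index, follow">
<meta name="ROBOTS" content="noindex,nofollow">

Be careful with this one: Google’s documentation says that when both tags are present it applies the restrictive values from each, so this pair may well keep Google out too. Test it before relying on it.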

i) Example 9: You want Google to index AND follow the links of a webpage, but don’t want to have it cached in Google’s database:

<meta name="GOOGLEBOT" content="index,follow,noarchive">
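If you generate your pages from a script, the combinations above can be produced by one small helper instead of being typed by hand. The following is only a sketch; the function name robots_meta and its parameters are my own invention for illustration.

```python
def robots_meta(name="ROBOTS", index=True, follow=True, archive=True):
    """Build a single-line robots meta tag from three yes/no choices.

    The name and parameters are hypothetical; "noarchive" is added only
    when archive is False, because caching is allowed by default.
    """
    values = [
        "index" if index else "noindex",
        "follow" if follow else "nofollow",
    ]
    if not archive:
        values.append("noarchive")
    return '<meta name="{}" content="{}">'.format(name, ",".join(values))


# Example 1: block all robots from indexing and following.
print(robots_meta(index=False, follow=False))

# Example 9: let Google index and follow, but not cache.
print(robots_meta("GOOGLEBOT", archive=False))
```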

Meta Robots Tags vs. Robots.txt: Meta robots tags are generally used to optimize a particular webpage, and/or tell search engine robots how to treat its content! The meta robots tags I discussed above are usually specific to certain webpages. If you want to offer a global instruction regarding your website to the search engine spiders, you should be using the robots.txt file.

Here is an example of a typical robots.txt file I use for one of my websites:

User-agent: *
Disallow: /cgi-bin/
Disallow: /folderx/

User-agent: ia_archiver
Disallow: /

User-agent: slurp
Disallow: /

The "Disallow" tag basically blocks search engine robots from crawling, and therefore normally from indexing, a particular directory within a website or even an entire website!
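You can try rules like these on your own machine before uploading anything. The sketch below feeds a tidied copy of the sample file (one group per robot, no empty "Disallow:" line) to the robots.txt parser that ships with Python. The helper name is_allowed and the test URLs are assumptions made for this example, and real search engines may interpret edge cases differently from Python’s parser.

```python
from urllib.robotparser import RobotFileParser

# Tidied copy of the sample file, used here only for illustration.
ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /folderx/

User-agent: ia_archiver
Disallow: /

User-agent: slurp
Disallow: /
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if user_agent may fetch url under the given rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


checks = [
    ("Googlebot", "http://domain.com/index.html"),
    ("Googlebot", "http://domain.com/cgi-bin/form.pl"),
    ("slurp", "http://domain.com/index.html"),
    ("ia_archiver", "http://domain.com/index.html"),
]
for agent, url in checks:
    print(agent, url, is_allowed(ROBOTS_TXT, agent, url))
```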

The following are some of the most popular tags used in the robots.txt file:

a) User-agent: *: Any command that starts with "User-agent: *" applies equally to all robots.

There are several user-agents or robots, most notable of them being:

i) Slurp: Slurp is the name of Yahoo’s search engine spider. More information can be found here.

ii) Googlebot: This is the name of Google’s robot. More information can be found here.

iii) ia_archiver: This is the robot used by the Wayback Machine to keep archives of your website’s history. This bot is also shared by Alexa and Amazon.com. More information on this bot can be found here and here.

iv) Arindambot: The secret robot I use to spy on and kill my competitors with grace :D (only kidding)!

b) Disallow: This tag tells the robots whether or not to index certain folders on your website. The tag "Disallow:" (with nothing after it) allows robots to index your whole website, while the tag "Disallow: /" does exactly the opposite. If you want to block robots from indexing an entire website, use:

User-agent: *
Disallow: /

On the other hand, if you want to allow all robots to index your website, you can either create an empty robots.txt file or add the following tags in it:

User-agent: *
Disallow:

If you don’t want Yahoo to index your website, just use this tag:

User-agent: slurp
Disallow: /

This is one universal tag I use for almost all my websites. In my experience, Yahoo eats far more bandwidth than the minuscule traffic it offers is worth. Thus, allowing Yahoo is not only a waste of server bandwidth but also a good way to have your hosting account shut down!

To disallow Google from indexing your website, use:

User-agent: Googlebot
Disallow: /

To allow Google but disallow Yahoo, use:

User-agent: Googlebot
Disallow:

User-agent: slurp
Disallow: /

To disallow all robots from indexing your website EXCEPT Google, use:

User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /
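The reason this works is that a robot obeys the group that names it and falls back to the "User-agent: *" group only when no group does. Here is a quick way to confirm that with Python’s built-in robots.txt parser; the helper name allowed_agents is hypothetical, and other parsers may differ in edge cases.

```python
from urllib.robotparser import RobotFileParser

GOOGLE_ONLY = """\
User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /
"""


def allowed_agents(robots_txt, agents, url):
    """Return the agents from the list that may fetch url under the rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [agent for agent in agents if parser.can_fetch(agent, url)]


agents = ["Googlebot", "slurp", "ia_archiver"]
print(allowed_agents(GOOGLE_ONLY, agents, "http://domain.com/"))
```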

To disallow any other robot, you can use:

User-agent: Robotname
Disallow: /

To allow any other robot, you can use:

User-agent: Robotname
Disallow:

To block all robots from indexing certain folders, use:

User-agent: *
Disallow: /folder1/
Disallow: /folder2/

To stop the Wayback Machine from keeping a history of your website, you can block it by adding this tag in your robots.txt file:

User-agent: ia_archiver
Disallow: /

This is yet another universal tag I use across all my websites, and I will tell you why (but please don’t tell anyone else, okay?). You see, when I start a website, I make a lot of goofy mistakes, and I don’t want posterity to see those mistakes! I want them to believe I have always been "perfect". :D

If you disallow the user-agent "ia_archiver", it not only removes the history of your website from the index of the Wayback Machine but also blocks Alexa from hitting your site (thus saving you some valuable bandwidth)! That said, there are reasons why you may want to allow your website to be archived, such as the ones discussed here.

NOTE: DO NOT use the robots.txt file to disallow robots from spidering your (digital) download folders; your desire would be fulfilled for sure but human thieves and unspecified malware bots would still be able to read the location of your download folders from the robots.txt file! I don’t need to tell you what would happen next! :)

To protect your downloads:

a) Use a cryptic, 8-10 character word for your download folder. Search engines can hardly guess the names of cryptic and vague folders or filenames, let alone index them; the same holds true for human thieves as well!

b) Compress your download files using either WinZip or WinRAR. It is best not to put your downloads in formats supported by search engines, such as .PDF, .DOC, .XLS, .TXT, .HTML, .RTF, etc., as files of these formats can be indexed by Google! On the other hand, Google and other search engines usually don’t index compressed files. This is based solely on my own experience.

c) While search engine robots usually cannot follow pages and files not linked to from the main website, Google sometimes goes against the norm. To be on the safe side, upload a blank index.html file in your download folder. If you are paranoid, you can make your case foolproof by adding the following meta tags BEFORE its closing </head> tag:

<meta name="ROBOTS" content="noindex">
<meta name="ROBOTS" content="nofollow">

This tells all robots NOT to index your index.html file!

d) If possible, keep your downloads above the root directory of your website (usually it takes the form of "public_html" or "www"). Some hosts allow the creation of custom folders above the root while others don’t. If your host doesn’t allow this, you may need to ask for it.
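For tip (a), you don’t have to invent the cryptic folder name yourself. A minimal Python sketch, assuming lowercase letters and digits are acceptable in your folder names (the function name random_folder_name is made up for illustration):

```python
import secrets
import string

# Assumption: lowercase letters and digits are safe in folder names everywhere.
ALPHABET = string.ascii_lowercase + string.digits


def random_folder_name(length=10):
    """Return a random, hard-to-guess folder name of the given length."""
    return "".join(secrets.choice(ALPHABET) for _ in range(length))


print(random_folder_name())   # 10 characters by default
print(random_folder_name(8))  # or 8, if you prefer it shorter
```

The secrets module is used instead of random because it is meant for values that other people should not be able to predict.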

Testing Your Robots.txt file: Testing things is always a good idea. Google currently offers a free tool to help webmasters test the robots.txt file. You will need a Google™ account to access the tool!

Further reading on robots.txt:
http://www.robotstxt.org/robotstxt.html

Further reading on robots meta tags:
http://noarchive.net/refs/

As always, your comments are most welcome. I am usually very tired after writing such a long boring article, so please replenish my "lost energy" by posting sweet comments below! :D

3 Comments

Emma
June 21, 2009 at 10:00 am

Hi Arindam,

Another simple yet effective post! I think in some of your examples, you could make the code slightly more efficient i.e. reduce the code to content ratio, which is better for search engine rankings.

For example, you gave:
<meta name="ROBOTS" content="noindex">
<meta name="ROBOTS" content="nofollow">

But this can actually be:
<meta name="ROBOTS" content="noindex,nofollow">

Not forgetting a forward slash before the closing bracket if you code in XHTML of course :)

It is good to be reminded of some of the page tweaks from time to time though – it’s so easy to get hung up on the content and forget about how your site appears in the search engines.

Thanks,
Emma
1. Arindam
  June 21, 2009 at 10:28 am
  
  That is a good tip indeed. Less code on page that way! ;)
  
  You won’t believe 90% of my pages are built that way (out of ignorance), and it is only recently that I have started implementing the “one line” meta tag! ;)
7 Simple Steps to Keep Angela’s Backlinks Alive | Arindam Chakraborty.com
June 9, 2011 at 6:23 am

[…] I digress from the main point. If you read my old article on SEO meta tags you would know how the robots.txt file works. Assuming that you have the Search Status plugin […]
