
Keeping robots, spiders and wanderers away from your site using robots.txt, meta tags and other methods

Trying to keep those search engine spiders, wanderers and other cataloging robots away from your top secret web pages? Here are some measures you can take:
The "Robots Exclusion Protocol", designed to let web administrators and the authors of web spiders agree on how sites are navigated and cataloged, requires that you place a plain text file named "robots.txt", containing spidering rules, in the root directory of your site. It is important to note that this file must reside in the root directory of the main site, not in any subdirectory. For example, if your site is www.chami.com, the file must be accessible at http://www.chami.com/robots.txt
The robots.txt file consists mostly of two commands: "User-agent" and "Disallow".
The "User-agent:" command specifies the name or signature of the robot to which the spidering rules that follow apply. Set it to * to apply those rules to any robot not identified elsewhere in the robots.txt file.
The other command, "Disallow:", specifies a partial URL that should be ignored (not indexed) by the previously identified robot. If you leave this field empty, the specified robot is free to navigate any and all pages on your site.
Let's take a look at some example robots.txt files:
• Tell all robots to go away (do not index any page in this site):
User-agent: *
Disallow: /
Listing #2 : TEXT code.
• Tell "WebCrawler" robot, for example, to leave this site alone. All other robots are welcome:
User-agent: WebCrawler
Disallow: /
Listing #3 : TEXT code.
• All robots should stay away from /~mydir/. Other directories are not restricted:
User-agent: *
Disallow: /~mydir/
Listing #4 : TEXT code.
• WebCrawler can access all directories except /~mydir/. All other robots may access all directories except /docs/, /private/ and /cgi-bin/:
User-agent: *
Disallow: /docs/
Disallow: /private/
Disallow: /cgi-bin/
User-agent: WebCrawler
Disallow: /~mydir/
Listing #5 : TEXT code.
    NOTE: Since the Robots Exclusion Protocol is not acknowledged by all web robot authors, it is not possible to stop every robot from wandering through your site. The good news, however, is that the majority of well-known search engines and tools support this protocol. Refer to their documentation to verify this.
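To see how a well-behaved robot interprets these rules, here is a short sketch (not part of the original listings) that feeds the rules from Listing #5 to Python's standard urllib.robotparser module; the robot name "SomeBot" and the page URLs are made up for the example:

```python
# Check the rules from Listing #5 with Python's built-in robots.txt parser.
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: *
Disallow: /docs/
Disallow: /private/
Disallow: /cgi-bin/

User-agent: WebCrawler
Disallow: /~mydir/
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

# Robots not named in the file fall under the "*" group:
print(rp.can_fetch("SomeBot", "http://www.chami.com/docs/manual.html"))    # False
print(rp.can_fetch("SomeBot", "http://www.chami.com/welcome.html"))        # True

# WebCrawler matches its own group, so only /~mydir/ is off limits to it:
print(rp.can_fetch("WebCrawler", "http://www.chami.com/~mydir/a.html"))    # False
print(rp.can_fetch("WebCrawler", "http://www.chami.com/docs/manual.html")) # True
```

Note that a robot with its own "User-agent" group is governed only by that group, not by the "*" rules, which is why WebCrawler may still read /docs/ here.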

One major disadvantage of the robots.txt file is that you must be able to place it in the root web directory. If you don't have such access, or if your web space provider is unable to give you a hand with this, you'll need a different way to stop robots. This is the main reason the "ROBOTS" META tag was created. Unfortunately, even fewer robots look for this META tag at the moment.
ROBOTS META tag in action:
• Tell all robots to go away (do not index any pages in this site):
<meta name="ROBOTS" content="NOINDEX, NOFOLLOW">
Listing #6 : HTML code.
• Allow indexing of the current page, but ask robots not to follow links inside the page for further cataloging:
<meta name="ROBOTS" content="INDEX, NOFOLLOW">
Listing #7 : HTML code.
• Disallow indexing of the current page, yet allow following links inside the page:
<meta name="ROBOTS" content="NOINDEX, FOLLOW">
Listing #8 : HTML code.
• Allow indexing and following links inside the page. This is the default behavior for robots, so it's not necessary to use this tag:
<meta name="ROBOTS" content="INDEX, FOLLOW">
Listing #9 : HTML code.
Note that all META tags should be placed inside the HEAD section of your HTML document. For example:

<html>
<head>
<meta name="ROBOTS" content="NOINDEX, NOFOLLOW">
  <!-- more meta and other tags -->
</head>
<body>
  <!-- document body -->
</body>
</html>

Listing #10 : HTML code.
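As a quick illustration of the robot's side of this agreement (not part of the original listings), a spider can detect the ROBOTS META tag with a few lines of Python's standard html.parser module; the class name RobotsMetaParser and the sample page are made up for this sketch:

```python
# Extract ROBOTS META directives from an HTML page using only the stdlib.
from html.parser import HTMLParser

class RobotsMetaParser(HTMLParser):
    """Collects the directives of any <meta name="robots"> tag."""
    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag == "meta":
            d = dict(attrs)
            if d.get("name", "").lower() == "robots":
                self.directives += [
                    t.strip().lower() for t in d.get("content", "").split(",")
                ]

page = ('<html><head>'
        '<meta name="ROBOTS" content="NOINDEX, FOLLOW">'
        '</head><body>hello</body></html>')

p = RobotsMetaParser()
p.feed(page)
print(p.directives)  # ['noindex', 'follow']
```

A robot that finds "noindex" in the result would keep the page out of its catalog but could still follow its links.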

Still unsure how to proceed?
  1. Use the robots.txt file if possible.
  2. Use the ROBOTS META tag if you can't create the above file. It's okay to use both methods if possible.
  3. If you know which robot you're trying to prevent from indexing your pages (a particular search engine, for example), go to the source of the robot and remove your pages if possible. In other words, many search engines provide ways for you to remove your URLs from their indexes without using any of the above methods.
  4. Make your page stand-alone if possible; that is, remove links to the page you're trying to keep away from robots. The more links there are to your page, the easier it is for a search engine robot to find it. If your page is already in search engine indexes, it's too late for this preventative step.
  5. If you must have absolute protection from robots, password protect the pages in question. Since all the other methods are "agreements" that both parties must acknowledge in order to work in full, preventing the page from being served is the only way to guarantee that robots will not be able to touch your pages.
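As an example of the last point, on an Apache server (an assumption; other servers have their own equivalent mechanisms) a directory can be password protected with an .htaccess file like the one below. The .htpasswd path is a placeholder you would replace with the real location of your password file:

```apache
# .htaccess -- require a login for every file in this directory
AuthType Basic
AuthName "Private area"
AuthUserFile /full/path/to/.htpasswd
Require valid-user
```

Because the server refuses to serve the page at all without credentials, this works against robots that ignore both robots.txt and the ROBOTS META tag.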
  • The Web Robots Pages
    The standard, FAQ, list of active robots, mailing list and other related sites.
  • BotSpot
    A resource for all things Intelligent Agent and bot related, including Bot of the Week.
  • Agents Abroad
    A list of bots; robots, spiders, agents, crawlers, and other automated intelligent agents.
  • Robot Exclusion Standard Revisited
    "A document intended to highlight some issues involving the current standard for robot exclusion, as well as to propose some suggestions for future expansion of the standard." -- June 2, 1996
Listing #1 : Web robots related resources
Applicable Keywords : HTML, Internet, Mini Tutorial, World Wide Web
Copyright © 2009 Chami.com. All Rights Reserved. | Advertise | Created in HTML Kit editor