Keeping robots at bay
In general, robots are innocuous creatures, normally used to scan websites for indexing. Occasionally they cause problems when they get themselves into loops (normally an undergraduate project gone wrong), and you want to keep them off your website.
How to detect robots
The only indication most people have that they are being visited by a robot is a large number of requests for all of their web pages in a short space of time, visible in either the httpd server log files or the firewall log files. Another sign is requests for a file called robots.txt (or failed GETs for it if the file doesn't exist). This file is the key to controlling robots.
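As a rough illustration of the two signs described above, the following Python sketch scans access log lines for hosts that requested /robots.txt and for hosts with many requests. It assumes the log is in Common Log Format; the function name `find_probable_robots`, the `min_requests` threshold, and the sample addresses are hypothetical. It counts every line it is given, so pass it only the lines covering the short period you care about.

```python
import re
from collections import Counter

# Assumed format: Common Log Format, i.e.
# host ident user [timestamp] "METHOD path protocol" status size
LOG_LINE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]+\] "(\S+) (\S+)')


def find_probable_robots(log_lines, min_requests=100):
    """Return (hosts that asked for /robots.txt, hosts with >= min_requests hits)."""
    hits = Counter()
    asked_for_robots_txt = set()
    for line in log_lines:
        match = LOG_LINE.match(line)
        if not match:
            continue  # skip lines that are not in the assumed format
        host, _method, path = match.groups()
        hits[host] += 1
        # Ignore any query string when comparing the requested path.
        if path.split("?", 1)[0] == "/robots.txt":
            asked_for_robots_txt.add(host)
    busy = {host for host, count in hits.items() if count >= min_requests}
    return asked_for_robots_txt, busy


# Hypothetical sample lines; the addresses are documentation-only ranges.
sample = [
    '192.0.2.7 - - [10/Oct/2000:13:55:36 -0700] "GET /robots.txt HTTP/1.0" 404 209',
    '192.0.2.7 - - [10/Oct/2000:13:55:37 -0700] "GET /index.html HTTP/1.0" 200 2326',
    '198.51.100.4 - - [10/Oct/2000:13:56:01 -0700] "GET /index.html HTTP/1.0" 200 2326',
]
robots_txt_hosts, busy_hosts = find_probable_robots(sample, min_requests=2)
```

A host that appears in both sets is a likely robot; a host that is busy but never asked for robots.txt may be a robot that ignores the convention.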
Controlling robots with robots.txt
By convention, before beginning to index your web site a robot should read robots.txt, which is the first file it will request upon arrival. To set rules for robots, create a robots.txt file in the document root directory.
# /robots.txt file for http://www.mydomain.com/
# all comments are preceded by a hash #

User-agent: webcrawler
Disallow:
# specifies that the robot webcrawler is NOT disallowed - it can read anything it finds

User-agent: studentbot
Disallow: /
# the robot studentbot is forbidden to read anything on the site

User-agent: *
Disallow: /cgi-bin
Disallow: /logs
# most useful - ALL robots are disallowed from reading files in the specified directories
# e.g. http://www.mydomain.com/cgi-bin
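One way to check that rules like these do what you expect is to feed them to a robots.txt parser. The sketch below uses Python's standard `urllib.robotparser` module as a stand-in for a well-behaved robot; the robot name `otherbot` and the paths under www.mydomain.com are made up for the test. Real robots may interpret edge cases differently from this parser.

```python
from urllib.robotparser import RobotFileParser

# The same three records as in the example file, without the comments.
RULES = """\
User-agent: webcrawler
Disallow:

User-agent: studentbot
Disallow: /

User-agent: *
Disallow: /cgi-bin
Disallow: /logs
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())  # parse from text instead of fetching a URL

site = "http://www.mydomain.com"
# webcrawler has an empty Disallow, so it may read anything.
webcrawler_cgi = parser.can_fetch("webcrawler", site + "/cgi-bin/search")
# studentbot is barred from the whole site.
studentbot_home = parser.can_fetch("studentbot", site + "/index.html")
# Any other robot falls under the "*" record.
otherbot_cgi = parser.can_fetch("otherbot", site + "/cgi-bin/search")
otherbot_home = parser.can_fetch("otherbot", site + "/index.html")

print(webcrawler_cgi, studentbot_home, otherbot_cgi, otherbot_home)
```

Note that a Disallow value is matched as a path prefix, so `/cgi-bin` covers everything beneath that directory.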
Controlling robots with META tags
If you can't create a robots.txt file, you can try to ward off a robot using META tags. If you put
<META NAME="ROBOTS" CONTENT="NOINDEX">
in your HTML document, that document won't get indexed by the robot.
<META NAME="ROBOTS" CONTENT="NOFOLLOW">
prevents the robot from following the <A HREF> links in that document (such links are normally the cause of loops).
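To see how a cooperative robot might read these tags, here is a minimal Python sketch built on the standard `html.parser` module. The class `RobotsMetaParser`, the helper `robots_directives`, and the sample page are hypothetical. The sketch also assumes that several directives can be combined in one CONTENT value separated by commas, which is common practice but not shown in the examples above.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives from every <META NAME="ROBOTS"> tag in a page."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names, but not values.
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() != "robots":
            return
        content = attributes.get("content") or ""
        for directive in content.split(","):
            directive = directive.strip().lower()
            if directive:
                self.directives.add(directive)


def robots_directives(html_text):
    """Return the set of lowercased robots directives found in html_text."""
    parser = RobotsMetaParser()
    parser.feed(html_text)
    parser.close()
    return parser.directives


# Hypothetical page that asks robots neither to index it nor to follow its links.
page = ('<HTML><HEAD><META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">'
        '</HEAD><BODY></BODY></HTML>')
directives = robots_directives(page)
may_index = "noindex" not in directives
may_follow = "nofollow" not in directives
```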
Of course, none of these can control a robot that refuses to read robots.txt or the META tags. The only course of action open to you then is to restrict access to the entire web site by IP address, using TCP wrappers or your firewall.
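The test that such a restriction performs, whichever tool applies it, amounts to checking the client address against a list of blocked addresses or networks. The Python sketch below shows only that check, using the standard `ipaddress` module; it is not a TCP wrappers or firewall configuration. The name `is_blocked` and the listed networks (documentation-only ranges) are hypothetical.

```python
import ipaddress

# Hypothetical block list: one whole network and one single host.
BLOCKED_NETWORKS = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.17/32"),
]


def is_blocked(client_ip, blocked=BLOCKED_NETWORKS):
    """Return True if client_ip falls inside any blocked network."""
    address = ipaddress.ip_address(client_ip)
    return any(address in network for network in blocked)


blocked_robot = is_blocked("192.0.2.44")     # inside the blocked /24
ordinary_visitor = is_blocked("203.0.113.5")  # not on the list
```

Blocking a whole network is blunt: it also shuts out any legitimate visitors who share that address range with the robot.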