Fun With Robots
February 25, 2014
I’ve had a bit of fun recently poking around and looking at a few websites’ robots.txt files. Yes, it has been an exciting past couple of days!
For those of you who don’t know, robots.txt is a file that gives instructions to web robots, also known as web spiders or crawlers (e.g. Google’s web crawler). These instructions tell, or more accurately suggest, which parts of your website the robots may and may not access and how often they may query it, among other things.
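To make that a bit more concrete, here is a small sketch of how a well-behaved crawler might read those suggestions, using Python’s standard `urllib.robotparser` module. The sample rules, the bot names (`FriendlyBot`, `BadBot`) and the `example.com` URLs are all made up for illustration and are not taken from any real site.

```python
from urllib.robotparser import RobotFileParser

# A made-up robots.txt: everyone is asked to stay out of /private/ and to
# wait 10 seconds between requests; the hypothetical "BadBot" is asked to
# stay away entirely.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 10

User-agent: BadBot
Disallow: /
"""


def load_rules(text):
    """Parse robots.txt text into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


rules = load_rules(SAMPLE_ROBOTS_TXT)

# Ask whether a given (hypothetical) bot may fetch a given URL.
print(rules.can_fetch("FriendlyBot", "https://example.com/index.html"))
print(rules.can_fetch("FriendlyBot", "https://example.com/private/notes.html"))
print(rules.can_fetch("BadBot", "https://example.com/index.html"))

# How long the site would like this bot to wait between requests.
print(rules.crawl_delay("FriendlyBot"))
```

Note that nothing here enforces anything: the parser only reports what the file asks for, and it is up to the crawler to honor it.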
However, crawlers can simply ignore your robots.txt suggestions, as in the case of the Yandex crawler from Russia, the Baidu crawler from China, or any malicious crawler. This can sometimes drive people to block bad bots using access rules on their web server.
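Real access rules usually live in the web server’s own configuration, but the idea can be sketched at the application level too. Below is a rough illustration as a tiny WSGI middleware that refuses requests whose User-Agent header contains a blocked name. The names `block_bad_bots`, `hello_app`, `badbot` and `evilscraper` are hypothetical, and since a User-Agent string is trivially spoofed, treat this as a sketch of the idea rather than a real defense.

```python
# Hypothetical bot names to refuse, matched case-insensitively as substrings.
BLOCKED_AGENT_SUBSTRINGS = ("badbot", "evilscraper")


def block_bad_bots(app, blocked=BLOCKED_AGENT_SUBSTRINGS):
    """Wrap a WSGI app so that blocked user agents get a 403 response."""
    def middleware(environ, start_response):
        agent = environ.get("HTTP_USER_AGENT", "").lower()
        if any(name in agent for name in blocked):
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Forbidden\n"]
        # Everyone else is passed through to the wrapped application.
        return app(environ, start_response)
    return middleware


def hello_app(environ, start_response):
    """A stand-in application that always answers with a greeting."""
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"Hello\n"]


guarded_app = block_bad_bots(hello_app)
```

Any WSGI server could then serve `guarded_app` in place of `hello_app`.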
Some robots.txt files are boring, like Wikipedia’s, but a lot of them contain Easter eggs, such as YouTube’s (for those who don’t know the reference), or ASCII art. And Stack Overflow apparently doesn’t like Yahoo’s bot.
There is also the less common and less well-known humans.txt, which tells you about the actual humans behind the website. Google, for example, makes up for a boring robots.txt file by having a bit of fun with its humans.txt and adding a recruiting spiel, similar to Glassdoor. Other websites, such as Facebook, LinkedIn or GitHub, take an understandably dim view of unauthorized crawling.
So there you have it - a 30,000-foot overview of robots.txt. Now get out there and hide some Easter eggs for robots (or humans!) to find.