Quite simply, if you are going to take data (ANY data) from someone's web site, you need to play by their rules. Robots.txt and the meta robots tags are the correct way to honor a web publisher's wishes about what to grab or index, and what not to. Webmasters have a standardized way to tell you, or your bot, what they do and don't want trafficked, and there are fairly simple ways for you to adhere to those wishes.
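To give a rough idea of how little work that takes, here is a minimal sketch using Python's standard-library robots.txt parser. The bot name "ExampleBot", the example.com URLs, and the sample rules are all made up for illustration; a real spider would load the site's actual robots.txt (for instance with set_url() and read()) instead of parsing a hard-coded string.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 10
"""

# Hypothetical bot name; use whatever your spider sends as its User-Agent.
USER_AGENT = "ExampleBot"


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def may_fetch(parser: RobotFileParser, url: str) -> bool:
    """Return True only if the rules allow our bot to fetch this URL."""
    return parser.can_fetch(USER_AGENT, url)


parser = build_parser(SAMPLE_ROBOTS_TXT)

allowed = may_fetch(parser, "http://example.com/index.html")
blocked = may_fetch(parser, "http://example.com/private/page.html")

# Seconds the site asks bots to wait between requests (None if not specified).
delay = parser.crawl_delay(USER_AGENT)
```

The idea is to call something like may_fetch() before every request and skip anything it rejects, and to sleep for the crawl delay between requests when the site specifies one.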
If you choose NOT to follow those requests, you are going to piss a lot of webmasters off. Try and spider my site: you won't get very far, I can guarantee you. (Your spider is not currently banned on my site, but if you try and run it there, and go ahead!, it will get itself banned before it gets to the third page.)
It is very simple to ban something like this, and I know a large, "silent" community of webmasters who spot these sorts of things quickly and pass the info on to others, who will then block the spider from gathering any info at all.
You probably do not want all kinds of unrestricted bots running crazy through your website, and you should respect similar wishes of other webmasters. If you want to read a fairly good discussion on this, I recommend this: Why should I obey robots.txt
dave
Big Cartoon DataBase
Big Comic Book DataBase