Gossamer Forum

[New Plugin] Spider!

[New Plugin] Spider!
I've now got around to completing my Spider plugin. Main features are below.

Idea: The idea behind this plugin is to spider another site for links. It goes through the page you enter (currently only one depth) and grabs all the links on that page. You can then select which pages you want to spider. Once you decide, you continue, and it goes to each of those pages and grabs the requested details (description, title and author). It's all done by meta-tag recognition, so it's not foolproof by any means, but I got about a 95% success rate with it accurately grabbing the details. You are then shown a list of all the sites that were spidered, with the details that were grabbed, and a direct link to the add form (via admin, with all the fields pre-completed). There is also inline checking to see if the URL already exists in the directory.
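
For those curious how the first step hangs together, here is a minimal sketch of the idea (not the actual plugin code); it grabs one page and pulls out its links with LWP and HTML::LinkExtor, and each of those URLs can then be fetched in turn and its meta tags read:

Code:

use strict;
use LWP::UserAgent;
use HTML::LinkExtor;

# Fetch one page and return the absolute http/https links found on it.
sub links_on_page {
    my ($url) = @_;
    my $ua  = LWP::UserAgent->new(timeout => 15);
    my $res = $ua->get($url);
    return () unless $res->is_success;
    # Passing the page URL as a base makes relative links come back absolute.
    my $ex = HTML::LinkExtor->new(undef, $url);
    $ex->parse($res->content);
    # Each link is [ tag, attr => url, ... ]; keep only <a href="..."> targets.
    return grep { /^https?:/i } map { $_->[2] } grep { $_->[0] eq 'a' } $ex->links;
}

print "$_\n" for links_on_page('http://www.example.com/');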

PRICE: $50

Current features:

  • You select a URL, and it spiders it for URLs.
  • Can save hundreds of hours of going to each new site you want to add and getting the meta-tag descriptions, or making up your own.
  • VERY easy to use. Simply enter the URL, and you're on your way... Smile

Limitations:

  • If the site you are grabbing the links from is an LSQL/Links 2 directory site, the spider cannot get the URLs correctly, so it has to list jump.cgi. This can be worked around by simply clicking on the title of the spidered page and then copying and pasting the URL from your browser.
  • I'll let you know once I find any more ;)


More details can be found at;

http://www.linkssql.net/plugins.html (go to the bottom link in the fee plugins section).

If you have any questions, please feel free to email me, PM me, or post your question here.

Cheers

Andy (mod)
andy@ultranerds.co.uk
Want to give me something back for my help? Please see my Amazon Wish List
GLinks ULTRA Package | GLinks ULTRA Package PRO
Links SQL Plugins | Website Design and SEO | UltraNerds | ULTRAGLobals Plugin | Pre-Made Template Sets | FREE GLinks Plugins!
Re: [Andy] [New Plugin] Spider! In reply to
Finally managed to get a demo up for this plugin too. You can see details on it at http://www.linkssql.net/plugins.html

Cheers

Andy (mod)
Re: [Andy] [New Plugin] Spider! In reply to
It's kinda difficult viewing a demo inside a tiny frame... also, having links turn blank when you mouse over them on a black background isn't too helpful :)

The plugin barfs too:

A fatal error has occured:

Missing base argument at /usr/lib/perl5/site_perl/5.6.1/HTTP/Response.pm line 172

Please enable debugging in setup for more details.

Last edited by Paul: Jan 14, 2003, 3:21 AM
Re: [Paul] [New Plugin] Spider! In reply to
Did you change the password? That is REALLY sad if you did!

Anyways... I can't get that error to reproduce... all seems to work fine for me Unsure

Andy (mod)

Last edited by Andy: Jan 14, 2003, 3:30 AM
Re: [Andy] [New Plugin] Spider! In reply to
Plugins > Spider Site > Spider It

Gives the error. Submit the form as it is.

It also starts spidering if you enter http://

Some of the links break the code too.......

URL: http://www.yahoo.com/r/cy
Title: Yahoo! Media Relations
Description: None Got
Author (owner): Someone
Email (owner): copyright@yahoo-inc.com">copyright@yahoo-inc.com
Options: copyright@yahoo-inc.com&Title=Yahoo! Media Relations" target="_blank">Add
Status: Ok....couldn't find in database...

Last edited by Paul: Jan 14, 2003, 3:35 AM
Re: [Andy] [New Plugin] Spider! In reply to
Oh, and you can refresh the spidering page and it always says the links can't be found even if you just added it.
Re: [Andy] [New Plugin] Spider! In reply to
Nothing gets inserted into the database.

Looks like it needs some more work yet.
Re: [Paul] [New Plugin] Spider! In reply to
In Reply To:
Nothing gets inserted into the database.

Looks like it needs some more work yet.

It might help if you read the Readme Wink You need to click on the 'Add' link before it is added to the database. This allows you to define the category it needs to be added to, and make any minor changes.

The other ones are small bugs... I'm working on getting them ironed out now.
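
Roughly the kind of fix involved, for anyone curious (a simplified sketch of the approach, not the actual patch; the script and field names here are just for illustration): escape the grabbed values before building the "Add" link, and refuse an empty or bare http:// URL up front.

Code:

use strict;
use URI::Escape qw(uri_escape);

# Stand-ins for values the spider grabbed from a page (from Paul's example above):
my ($url, $title, $email) =
    ('http://www.yahoo.com/r/cy', 'Yahoo! Media Relations', 'copyright@yahoo-inc.com');

# Refuse an empty or bare "http://" URL before fetching anything; an empty
# URL is a plausible cause of the "Missing base argument" error reported
# earlier (an assumption on my part).
die "Please enter a full URL\n" unless $url =~ m{^https?://[^/\s]+}i;

# Escape each value before interpolating it into the admin "Add" link, so a
# grabbed email address can't break out of the href attribute.
my $add_link = sprintf('add.cgi?URL=%s&Title=%s&Contact_Email=%s',
    uri_escape($url), uri_escape($title), uri_escape($email));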

Cheers

Andy (mod)
Re: [Paul] [New Plugin] Spider! In reply to
In Reply To:
Oh, and you can refresh the spidering page and it always says the links can't be found even if you just added it.

Untrue. You just were not adding them right Wink

Andy (mod)
Re: [Andy] [New Plugin] Spider! In reply to
I have now got a demo and screenshots set up for this plugin. Details of the demo can be found at:

http://new.linkssql.net/...amp;page=Spider#demo

Screenshots are also on this page.

Cheers

Andy (mod)
Re: [Andy] [New Plugin] Spider! In reply to
An update has been made. It seems that email addresses with a dot in the front part of the address caused the wrong email to be grabbed. I've updated the grabbing code, and also added a bit more regex so that mailto: and javascript: links on the grabbed pages are now skipped Smile
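
Roughly the sort of thing the new checks do, for the curious (simplified, not the exact code):

Code:

use strict;

# Stand-ins for hrefs pulled off a spidered page:
my @hrefs = ('http://www.example.com/', 'mailto:someone@example.com', 'javascript:void(0)');

for my $href (@hrefs) {
    # Skip non-http links so mailto: and javascript: never get queued for spidering.
    next if $href =~ /^\s*(?:mailto|javascript):/i;
    print "Would spider: $href\n";
}

# Grab an owner email, allowing dots (and other common characters) in the
# local part, e.g. first.last@example.com
my $page = '<a href="mailto:first.last@example.com">contact</a>';
my ($email) = $page =~ /mailto:([\w.+-]+\@[\w-]+(?:\.[\w-]+)+)/i;
print "Email: $email\n";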

Version is now 1.1.

Accounts URL: http://new.linkssql.net/page.php?page=account

Cheers

Andy (mod)
Re: [Andy] [New Plugin] Spider! In reply to
Andy:

Polite suggestion.

Have the spider read and obey a robots.txt file if it is going to spider any site other than your own.

It is not polite, and it will get your IP and/or UA banned on many sites when you go poking around where the webmasters have deemed it off limits.

2 cents, please...
dave

Big Cartoon DataBase
Big Comic Book DataBase
Re: [carfac] [New Plugin] Spider! In reply to
Mmmm...I'm just wondering how to implement this. The robots.txt file normally resides in the root of the domain, right?

Cheers

Andy (mod)
Re: [Andy] [New Plugin] Spider! In reply to
Andy:

Yes, for it to be "legal", it is in the root, and ONLY covers the domain it is in.

Not sure what all the rules are, but I can send you to a site that should help.

Go to http://www.webmasterworld.com and check out the Search Engine Spider Identification forum; people there are voracious about protecting their websites from spiders!

This is probably a good thread to start with: Why should I obey robots.txt
dave

Re: [carfac] [New Plugin] Spider! In reply to
Cheers..I'll have a look at this :)

Andy (mod)
Re: [Andy] [New Plugin] Spider! In reply to
Andy:

Happy to be of help. I THINK (not sure, but I think) there is a fairly easy to use/insert bit of code that works with LWP (not sure with trivial) that should make it robots.txt compatible, with very little sweat...

Anyway, if you do that, I am sure it will be a valuable addition to the plugin!

dave
Re: [carfac] [New Plugin] Spider! In reply to
I'm having a look around for that module now. Definitely would make a nice addition to the plugin :)
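
Looks like LWP::RobotUA might be what you mean (just guessing)? It's a drop-in replacement for LWP::UserAgent that fetches each site's robots.txt and refuses disallowed URLs for you. Something like:

Code:

use strict;
use LWP::RobotUA;

# Same interface as LWP::UserAgent, but it reads /robots.txt for each host
# and answers with a 403 for any URL the site has disallowed.
my $ua = LWP::RobotUA->new(
    agent => 'LinksSQL-Spider/1.1',    # made-up agent name, for illustration
    from  => 'andy@ultranerds.co.uk',  # contact address sent with requests
);
$ua->delay(5/60);  # wait at least 5 seconds between requests to the same host

my $res = $ua->get('http://www.example.com/somepage.html');
print $res->is_success ? $res->content : 'Blocked or failed: ' . $res->status_line . "\n";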

Cheers

Andy (mod)
Re: [Andy] [New Plugin] Spider! In reply to
Andy,

Another counterpoint...

A large number of developers do not use robots.txt files on their sites, but instead use page-level robots meta tags such as:
<META NAME="ROBOTS" CONTENT="INDEX">
<META NAME="ROBOTS" CONTENT="NOINDEX">
<META NAME="ROBOTS" CONTENT="INDEX,FOLLOW">
<META NAME="ROBOTS" CONTENT="INDEX,NOFOLLOW">
<META NAME="ROBOTS" CONTENT="NOINDEX,FOLLOW">
<META NAME="ROBOTS" CONTENT="NOINDEX,NOFOLLOW">

My suggestion would be to look at the robots meta tag as well as robots.txt... even though the .txt file is a bit old school, it is still needed, along with the meta tags, to confirm which pages are INDEX, NOINDEX, FOLLOW or NOFOLLOW.
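
Something along these lines would cover the meta tag side (just a rough sketch; HTML::TokeParser comes with the HTML::Parser distribution):

Code:

use strict;
use HTML::TokeParser;

# Returns (index_ok, follow_ok) for a page, defaulting to yes/yes when no
# robots meta tag is present.
sub robots_meta {
    my ($html) = @_;
    my ($index, $follow) = (1, 1);
    my $p = HTML::TokeParser->new(\$html);
    while (my $tag = $p->get_tag('meta')) {
        my $attr = $tag->[1];
        next unless lc($attr->{name} || '') eq 'robots';
        my $content = lc($attr->{content} || '');
        $index  = 0 if $content =~ /noindex/;
        $follow = 0 if $content =~ /nofollow/;
    }
    return ($index, $follow);
}

my ($ok_to_index, $ok_to_follow) =
    robots_meta('<head><META NAME="ROBOTS" CONTENT="NOINDEX,FOLLOW"></head>');
print "index=$ok_to_index follow=$ok_to_follow\n";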

Just a thought

Brian
Re: [Teambldr] [New Plugin] Spider! In reply to
I'm starting to wonder if this is actually needed in my plugin. It doesn't work in the same way as GT's one (which goes through and spiders to a certain depth). Mine simply grabs the page, parses out the URLs, and gives you the option to spider those URLs. It will not go any deeper than the page you selected to spider. It's literally a simple spider. The main aim of it is to stop people having to manually go to each page and grab the description/title/author etc.

Cheers

Andy (mod)
Re: [Andy] [New Plugin] Spider! In reply to
I think the name "Spider" is misleading people.

Your plugin just fetches a page and grabs the meta data; it doesn't spider.
Re: [Andy] [New Plugin] Spider! In reply to
That is a valid concern.

Quote:

The main aim of it is to stop people having to manually go to each page and grab the description/title/author etc.

Then make it a simple meta grabber:
<META NAME="keywords" CONTENT="keywords are here">
<META NAME="description" CONTENT="Description Here">
<META NAME="ROBOTS" CONTENT="INDEX">
<meta name="language" content="en-us">
<meta name="rating" content="GENERAL">
<meta name="distribution" content="GLOBAL">
<meta name="classification" content="Category Here">
<meta name="copyright" content="Copyright Tag Here">
<meta name="author" content="Author Here">
<meta name="revisit-after" content="7 Days">
<meta http-equiv="reply-to" content="Site Email Address Here">

It will give you everything you need to dump into a Links SQL database, or it can even be used to fill in the form for a person submitting their site to a Links SQL database.
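
Something like this would do the grabbing part (a rough sketch; HTML::HeadParser also ships with HTML::Parser, and the hash keys here are just examples, not necessarily the Links SQL column names):

Code:

use strict;
use LWP::UserAgent;
use HTML::HeadParser;

my $url = 'http://www.example.com/';
my $ua  = LWP::UserAgent->new(timeout => 15);
my $res = $ua->get($url);
die 'Could not fetch ' . $url . ': ' . $res->status_line . "\n" unless $res->is_success;

# HTML::HeadParser exposes <title> as the "Title" header and each
# <meta name="foo" content="..."> as an "X-Meta-Foo" pseudo-header.
my $head = HTML::HeadParser->new;
$head->parse($res->content);

my %details = (
    Title       => $head->header('Title'),
    Description => $head->header('X-Meta-Description'),
    Keywords    => $head->header('X-Meta-Keywords'),
    Author      => $head->header('X-Meta-Author'),
    Copyright   => $head->header('X-Meta-Copyright'),
);
print "$_: " . (defined $details{$_} ? $details{$_} : '') . "\n" for sort keys %details;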



Just a thought

Brian

Last edited by Teambldr: Jun 17, 2003, 4:33 PM
Re: [Teambldr] [New Plugin] Spider! In reply to
Quite simply, if you are going to take data, ANY data, from someone's web site, you need to play by their rules. Robots.txt and the robots meta tags are the correct way to listen to a web publisher's wishes about what to grab or index and what not to. Webmasters have a standardized way to tell you, or your bot, what they do and don't want trafficked, and there are fairly simple ways for you to adhere to those wishes.

If you choose NOT to follow those requests, you are going to piss a lot of webmasters off. Try and spider my site: you won't get very far, I can guarantee you. Your spider is not currently banned on my site, but if you try to run it there (and go ahead!) it will get itself banned before it gets to the third page.

It is very simple to ban something like this, and I know a large, "silent" community of webmasters who spot these sorts of things quickly and pass the info on to others, which will block the ability of spiders to gather any info at all.

You probably do not want all kinds of unrestricted bots running crazy through your website, and you should respect similar wishes of other webmasters. If you want to read a fairly good discussion on this, I recommend this: Why should I obey robots.txt

dave
Re: [carfac] [New Plugin] Spider! In reply to
Something else you should consider: I would make it so your bot does NOT hit any given website more than once every five seconds or so. Yes, you could go faster, but again, if you hammer someone's website you are gonna find yourself banned very quickly...
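
If you end up with LWP::RobotUA, its delay() method does this for you; with a plain LWP::UserAgent, a simple throttle like this rough sketch would do (my own illustration, nothing official):

Code:

use strict;
use URI;
use Time::HiRes qw(time sleep);

my %last_hit;     # host => time of last request to that host
my $min_gap = 5;  # seconds to wait between requests to the same host

sub polite_get {
    my ($ua, $url) = @_;
    my $host = eval { URI->new($url)->host } || '';
    my $wait = $min_gap - (time() - ($last_hit{$host} || 0));
    sleep($wait) if $wait > 0;
    $last_hit{$host} = time();
    return $ua->get($url);
}

# Usage:
#   my $ua  = LWP::UserAgent->new;
#   my $res = polite_get($ua, 'http://www.example.com/page1.html');
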
dave

Re: [carfac] [New Plugin] Spider! In reply to
Dave,

I think he has a different use in mind than stealing content. It is to grab the tags and make them ready to insert into the database, not to import content and place it on a website. If I am reading it correctly.

I am very strict about the misuse of bandwidth, as you have brought up, and like you I want to make only a specific path available to real spiders. Valid points. I do, however, look for isolated issues and review them manually before I ban an IP, as banning can be very bad for search engines and real users. I did read the thread you left a link for. They all had some valid points, but having read it I would think they are for the most part paranoid and extreme on both sides of that discussion, and were not looking for ways to improve the experience but rather to control things to a level that would choke off a site.

Real spiders are not used to steal data, only to index it and create links to it, and to follow links on the spidered page and index those as well. At least that is the way I look at spiders. HTML scrapers that are meant to take all the text and images off a page to be used on another site are a whole new ballgame. And as I said, I do not believe that is the use for this plugin. If that makes sense.



Just a thought

Brian
Re: [Teambldr] [New Plugin] Spider! In reply to
>>>I think he has a differnet use in mind then stealing content. It is to grab the tags and make them ready to insert into the database. Not to import content and place it on a website. If I am reading it correctly.<<<

Yeah, that's the aim of the plugin.

>>>. HTML scrapers that are meant to take all text and images off a page to be used on another site, that is a whole new ballgame.<<<

Exactly.

Cheers

Andy (mod)