Gossamer Forum
Spider questions
Hi Alex,

I am very interested in the spider plugin, but I am not sure if I can actually use it on my site. So I have a few questions:

1. Is it possible to run this plugin on a virtual server (i.e. not a dedicated server; it's the same one where LinksSQL 2 is currently running) with shell access but not root access?

2. Is there any experience with server load, memory and CPU usage, disk space needed, etc.? And what load does it put on the MySQL host (which in my case is not the same machine as the site server)?

3. Is it possible to manually restrict user searches to certain pages of the spidered ones, maybe by adding an indicator column to the spider's database?

There are two things I'd like to do:

a) Let the user choose whether to search all of the spidered pages or only pages from my own site (not the Links pages, but pages that are generated outside of LinksSQL).

b) Set up an individual search page from where only certain pages can be searched. I'd like to mark these pages in the spider's database by adding a custom column, so that only marked pages will be searched and not the whole database.

Having the spider would surely be a big plus for our site, but before I purchase it, I want to be sure that I can actually use it.

Andreas

--------------------------------------------
http://www.archaeologie-online.de
Re: Spider questions In reply to
Hi,

1. Yes, as long as you have shell access you can use it. To start spidering sites, you must run spider.pl from the shell, as it runs as a daemon. Once it's running, you can check on its status and stop it from the admin panel.

2. By default the spider will not hit the same site more than once per second, so it's quite friendly on load. We had it running for two days at around 5% CPU usage. The sites do not get indexed right away; the data is just inserted into MySQL. When you do index the sites, it can cause some server load, but only for a short time.

3. Pages that get spidered are not validated by default and won't be searchable until they've been validated (which yes, can easily be done in bulk).

a. This is possible: if you pass in spider=1, the spider database will be searched as well as the user database, and the results will be available as spider_hits and spider_results.

b. Not sure about this one. =)
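
As a rough illustration of the once-per-second limit in point 2, here is a minimal Python sketch of per-host throttling. This is not the plugin's code (spider.pl is a Perl script whose internals are not shown in this thread); the class name HostThrottle and its parameters are made up for the example.

```python
import time
from urllib.parse import urlsplit


class HostThrottle:
    """Allow at most one request per host every `min_interval` seconds.

    Hypothetical helper for illustration; not part of the spider plugin.
    """

    def __init__(self, min_interval=1.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last_hit = {}  # host -> time of the most recent request

    def wait(self, url):
        """Block until the host of `url` may be hit again; return seconds waited."""
        host = urlsplit(url).netloc.lower()
        now = self._clock()
        last = self._last_hit.get(host)
        delay = 0.0
        if last is not None:
            # Time still missing until a full interval has passed for this host.
            delay = max(0.0, self.min_interval - (now - last))
        if delay > 0:
            self._sleep(delay)
        self._last_hit[host] = now + delay
        return delay
```

Calling wait(url) before each fetch keeps requests to any single host at least one second apart, while requests to different hosts are not delayed by each other.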

Cheers,

Alex

--
Gossamer Threads Inc.
Re: Spider questions In reply to
Oh, see http://www.gossamer-threads.com/...pider/htmlsetup.html for more info on the integration with the search engine.
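
For illustration, a search request carrying the spider=1 flag might be built as in the sketch below. Only the spider=1 parameter comes from this thread; the script URL and the name of the query parameter are placeholders, so check the setup page for the real ones.

```python
from urllib.parse import urlencode

# Placeholder location of the search script; the thread does not give the real one.
SEARCH_URL = "http://www.example.com/cgi-bin/search.cgi"


def build_search_url(query, include_spider=True):
    """Return a search URL; spider=1 asks for spider results as well.

    The "query" parameter name is an assumption made for this example.
    """
    params = {"query": query}
    if include_spider:
        params["spider"] = "1"
    return SEARCH_URL + "?" + urlencode(params)
```

Leaving out the flag (include_spider=False) gives a request that searches only the Links database.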

Cheers,

Alex

--
Gossamer Threads Inc.
Re: Spider questions In reply to
Any chance of a demo?

http://www.ASciFi.com/ - The Science Fiction Portal
Re: Spider questions In reply to
Hi Alex,

Thank you for your answers.
However, some points are still not quite clear to me.

1. resources
------------
Isn't 5% server load quite a lot for a virtual server with, say, about 20 other virtual servers on the same machine?
You say that only the indexing causes some server load, for a short time. How short is short?
At the moment I have about 2,000 links in the Links database, and indexing the whole thing takes about 10 minutes. Assuming the spider fetches, say, 20,000 pages, will indexing then take 100 minutes?
How many resources will be needed during that time?
I ask because my provider doesn't allow MySQL-based banner systems, for example, on the grounds that they write to the database for every page called (banner served), and that MySQL is optimized for queries (read-only SELECTs) rather than for writes. I wonder if they will allow use of the spider...

2. Restrict searching
---------------------
You say it's possible to let users search only pages from my server by passing spider=0 as an argument. The spider help file says that in that case only the Links database will be searched, not the spider database.
My question was about pages which are generated outside of LinksSQL and therefore are not part of the Links database. So, to make them searchable, they have to be spidered. To me this means they can't be searched when passing spider=0.

I would like to offer my users the possibility of a site wide search (i.e. search only pages with URLs starting with "http://www.archaeologie-online.de") in addition to the option to search the complete spider database.
So I can't use the filtering mechanism to exclude pages other than mine from spidering, because I want to have other pages in the database, too. I just want to restrict the search, not the spidering.
Would that be possible by using a variable in the search form like "url=http://www.archaeologie-online.de*" or something like that?

Generally speaking: Is it possible to search certain fields of the spider database as it is with LinksSQL's internal search?

Andreas


--------------------------------------------
http://www.archaeologie-online.de
Re: Spider questions In reply to
Hi,

1. I would be hesitant with the host from what you describe. Think of what the spider is doing:

a. It has to go out and fetch 20,000 URLs. For each URL it has to store in MySQL the content data, the headers, the host information, the robots.txt file, etc.

b. Once done, you need to index that data to make it searchable. This needs to examine each page, split up the keywords and insert a record for every keyword.

This is a little better than an ad database, as you only have one process doing all the writes, whereas with an ad database you have multiple simultaneous processes.

2. That should be spider=1, not spider=0. As for filtering: no, you currently can't search links from the spider database with URLs like '...'. You must search the entire database.
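
To give a feel for the indexing step in point 1b (examine each page, split it into keywords, insert a record per keyword), here is a minimal Python sketch. It is an assumption-laden illustration, not the plugin's indexer: the tokenizer, the function names, and the choice of one record per distinct keyword per page are all made up for the example, and an in-memory dict stands in for the MySQL table.

```python
import re
from collections import defaultdict

# Simplistic tokenizer (assumption): lowercase runs of ASCII letters and digits.
WORD_RE = re.compile(r"[a-z0-9]+")


def keyword_records(page_id, text):
    """Yield one (keyword, page_id) record per distinct keyword in `text`."""
    for word in sorted(set(WORD_RE.findall(text.lower()))):
        yield (word, page_id)


def build_index(pages):
    """Map each keyword to the set of page ids that contain it.

    `pages` is a dict of page_id -> page text; in a real setup each record
    would be a row inserted into a database table instead.
    """
    index = defaultdict(set)
    for page_id, text in pages.items():
        for word, pid in keyword_records(page_id, text):
            index[word].add(pid)
    return index
```

The number of records grows with pages times distinct words per page, which is why indexing a large crawl means many writes in a short burst.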

Cheers,

Alex

--
Gossamer Threads Inc.
Re: Spider questions In reply to
What would you want to see? The admin panel in action? Since the spider must be started from shell, it's difficult to make this an interactive demo.

Cheers,

Alex

--
Gossamer Threads Inc.
Re: Spider questions In reply to
How about screenshots of the shell process?

Regards,

Eliot Lee
Re: Spider questions In reply to
Hi,

When you run spider.pl it says:

Launching Spider ..
Spider has Daemonized.

and you are returned to the shell. There is a spider.log showing spider activity (URLs fetched, etc.); however, this is also available from the admin.

Cheers,

Alex

--
Gossamer Threads Inc.
Re: Spider questions In reply to
Hi,

It seems that as long as I am not able to restrict the spider search, either by adding custom columns to the spider's table or by restricting the search to certain sites, the spider plugin does not fit my needs.
I regret this, because ever since the announcement that you were developing the spider I had hoped it would be the solution: at the moment I have no site-wide search for my own site, only for the Links database. This is somewhat confusing for my users, because they get interesting links as search results but almost no pages from my own site.

I even thought of moving the site to a dedicated server for performance reasons when using the spider, although the fee is quite expensive.

So it seems I have to look for another solution to enable a site wide search as long as the spider can't handle that.

Does anybody have any suggestion for a script that can be used to search "normal" HTML pages as well as the Links database (Perl or PHP if possible)?
(Using the spider plugin for this purpose only is unfortunately way too expensive for me. If I could use it as both a site search and a search engine, as described in my previous posts, I would surely buy it.)

Thank you.

Andreas

--------------------------------------------
http://www.archaeologie-online.de
Re: Spider questions In reply to
Hi,

To search your own site, why not look at Swish, or HTDig, or ICE, or some other site-based search engine? These programs index files on your own server and then let users search them. Or am I missing something?

Cheers,

Alex

--
Gossamer Threads Inc.
Re: Spider questions In reply to
Alex, I would like to be able to do some searching in a database you have spidered to see how results come back. I want to get an idea of what would happen if, after spidering all the links in my directory, I did a one-word search. How are the sites ranked in the search (keyword density?), that sort of thing...

http://www.ASciFi.com/ - The Science Fiction Portal