Gossamer Forum

Internal Search Engine

Internal Search Engine
Does anyone have any suggestions for an internal search engine (preferably one that can run via cron/SSH) that can handle 100,000+ pages?

It also needs an option to block specific folders, because the Links SQL install on the site in question has 500,000+ category pages :(

TIA

Andy (mod)
andy@ultranerds.co.uk
Want to give me something back for my help? Please see my Amazon Wish List
GLinks ULTRA Package | GLinks ULTRA Package PRO
Links SQL Plugins | Website Design and SEO | UltraNerds | ULTRAGLobals Plugin | Pre-Made Template Sets | FREE GLinks Plugins!
Re: [Andy] Internal Search Engine
"There is no such thing as a free lunch."

:-)

- wil
Re: [Andy] Internal Search Engine
Try ht://Dig. I haven't tried it myself, but from what I gather it's pretty good.

~Charlie
Re: [Chaz] Internal Search Engine
Yeah, Wil recommended that on MSN, but I'm not sure whether it supports selecting which categories to spider.

Cheers

Andy (mod)
Re: [Andy] Internal Search Engine
I spotted this after a quick read through the docs:

Quote:
#
# If there are particular pages that you definitely do NOT want to index, you
# can use the exclude_urls attribute. The value is a list of string patterns.
# If a URL matches any of the patterns, it will NOT be indexed. This is
# useful to exclude things like virtual web trees or database accesses. By
# default, all CGI URLs will be excluded. (Note that the /cgi-bin/ convention
# may not work on your web server. Check the path prefix used on your web
# server.)
#
exclude_urls: /cgi-bin/ .cgi

#
# Since ht://Dig does not (and cannot) parse every document type, this
# attribute is a list of strings (extensions) that will be ignored during
# indexing. These are *only* checked at the end of a URL, whereas
# exclude_url patterns are matched anywhere.
#
bad_extensions: .wav .gz .z .sit .au .zip .tar .hqx .exe .com .gif \
.jpg .jpeg .aiff .class .map .ram .tgz .bin .rpm .mpg .mov .avi .css

Will that not do what you want? It looks like you can exclude folders, files and even specific file extensions.

I really just glanced over the docs so you might want to dig (no pun intended :) ) into it a little deeper to see if it will work for you.
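Putting that together for your case, a config along these lines might do it — note the category path, the site URL, and the rundig/config locations below are all assumptions, so adjust them for the actual install:

```
# htdig.conf -- sketch only; start_url and the excluded path are hypothetical
start_url:    http://www.example.com/
exclude_urls: /cgi-bin/ .cgi /category/
```

And since you wanted it run via cron, something like this crontab entry should work, using the rundig indexing script that ships with ht://Dig:

```
# Re-index nightly at 3am; rundig's path varies by install
0 3 * * * /usr/local/bin/rundig -c /etc/htdig/htdig.conf
```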

~Charlie
Re: [Chaz] Internal Search Engine
Thanks, I didn't see that. I'll pass the link on to my mate, and hopefully this script won't crash the server every time it's run :)

Cheers

Andy (mod)
Re: [Andy] Internal Search Engine
Hi,

I've used Swish-e (http://www.swish-e.org/) before and it works quite well.
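For the folder-blocking requirement, a minimal Swish-e config might look roughly like this — the paths are hypothetical, but IndexDir, IndexFile, IndexOnly and FileRules are standard Swish-e directives:

```
# swish-e.conf -- sketch only; directory and index paths are assumptions
IndexDir  /home/site/htdocs
IndexFile /home/site/index.swish-e
IndexOnly .html .htm .txt
# Skip the generated Links SQL category pages
FileRules dirname contains category
```

You'd then build the index with `swish-e -c swish-e.conf` (cron-able) and query it with `swish-e -w keyword -f /home/site/index.swish-e`.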

Cheers,

Alex
--
Gossamer Threads Inc.
Re: [Alex] Internal Search Engine
Thanks, I'll have a look at that.

Cheers

Andy (mod)