Gossamer Forum

DMOZ Search Speed Question

Can someone please explain this?

After about two weeks of trial and error, we finally got the entire dmoz dump installed.

When we search for a less common phrase, e.g. "the simpsons", we get 377 results in under 2 seconds, but when we search for a popular term, e.g. "free", we get 44,217 results and it takes between 8 and 12 seconds.

Obviously my concern is to make this search faster, but my question is: why is it slower when there are more results? I had assumed it has to search through the entire table either way. Maybe my understanding of how the search works is wrong, so perhaps someone can explain it to me so that I can fix this and speed it up.

Any ideas on how to improve performance? Would mod_perl or any other module help speed up the search itself?
Thank you.


Re: DMOZ Search Speed Question
Hi,

The reason for the difference in search speed is that Links SQL is trying to give you the best matches first. So if you are searching on simpsons, it knows immediately that only 377 records match. It's then just a matter of sorting that result set to return the best match first, which it can do very quickly (on our dmoz it comes back in < 1.1 seconds on the first hit, 0.6 seconds on subsequent searches -- not under mod_perl, so if the script was cached it would be much quicker).

For a large result set like "free", Links SQL has to sort through 40,000+ results to find the best matches, which takes quite a while. You'll notice you do get the best result first:

! Free Ultima Online Server, Free UO- Free Shard based near the backbone on Eastern USA. (Added: Fri Apr 28 2000 Hits: 0 Rating: 0.00 Votes: 0)

You'll notice dmoz does some optimizations and only returns 2,500 results, so it's skipping some records. This may be a good solution to speed things up. Alternatively, you can cache the popular search terms (the first search took about 14 seconds; subsequent searches took about 4 seconds). DMOZ was taking about 3 seconds, but that included download time.
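To make the two suggestions above concrete, here is a minimal Python sketch of capping the result set and caching popular terms. It is not Links SQL code: `slow_search`, `cached_search`, `MAX_RESULTS`, and the toy `LINKS` list are hypothetical names standing in for the real database query.

```python
from functools import lru_cache

MAX_RESULTS = 2500  # the cap dmoz appears to use; the constant name is hypothetical

# Toy data standing in for the links table (hypothetical).
LINKS = ["Free Ultima Online Server", "The Simpsons Archive", "Free Shard List"]

stats = {"db_hits": 0}  # counts how often the "slow" search really runs


def slow_search(term):
    """Stand-in for the real database query (an assumption, not Links SQL)."""
    stats["db_hits"] += 1
    term = term.lower()
    return [title for title in LINKS if term in title.lower()]


@lru_cache(maxsize=1000)
def cached_search(term):
    """Return at most MAX_RESULTS matches, remembering the answer per term."""
    # A tuple is returned so callers cannot mutate the cached value.
    return tuple(slow_search(term)[:MAX_RESULTS])
```

Calling `cached_search("free")` twice runs `slow_search` only once; the second call is answered from memory. A real cache would also need to be cleared whenever links are added or changed.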

mod_perl helps bring searches down from the 2 second range to under 0.5, but won't help dramatically for the really large result set.

Let me know what you think,

Alex

--
Gossamer Threads Inc.
Re: DMOZ Search Speed Question
How do I get it to the point where it would only return 2,500 results and not 40K? Actually, it would be best if it returned no more than 500 results.

As for subsequent searches, how would that work? Does it store those results somewhere? What makes the second search faster? Isn't there a way to use this idea to make everything faster?

Thank you.

Re: DMOZ Search Speed Question
Hi,

We could tweak the search engine to limit the results, however, you would be losing the "best matches first" guarantee.
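A small Python sketch of why a limit can cost the "best matches first" guarantee, assuming each match carries a relevance score. The function names and sample rows are hypothetical, and this is not how Links SQL actually ranks results.

```python
import heapq


def top_k_full_scan(scored, k):
    """Best k by score: still examines every row, but avoids a full sort."""
    return heapq.nlargest(k, scored, key=lambda pair: pair[0])


def first_k_then_sort(scored, k):
    """Stop after the first k rows found, then sort only those."""
    return sorted(scored[:k], key=lambda pair: pair[0], reverse=True)


# (score, title) pairs in the order the database happens to return them.
rows = [(1, "a"), (5, "b"), (3, "c"), (9, "d"), (2, "e")]

best = top_k_full_scan(rows, 2)     # [(9, "d"), (5, "b")]
early = first_k_then_sort(rows, 2)  # [(5, "b"), (1, "a")]
```

Stopping early is cheaper because the remaining rows are never scored, but the highest-scoring row (`"d"` here) is missed whenever it lies beyond the cutoff.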

As for the speed-up, MySQL caches its results, so subsequent searches return significantly faster than the initial search.

Cheers,

Alex

--
Gossamer Threads Inc.
Re: DMOZ Search Speed Question
How long did it take to import the entire dmoz data? Mine has been running since last Thursday and is only at 670,000 links. It's driving me crazy taking so long. At this rate, it will take two additional weeks to complete the import.


My machine is a PII / 400 MHz / 196 MB RAM.

Re: [lisco] DMOZ Search Speed Question
I've been noticing a lot of unanswered posts. I'll try to help out where possible.

From start to finish, i.e. from slicing the dmoz dump into 17 slices with a Python application I developed myself to importing the data into Links SQL (also via the Python app plus nph-import.cgi), it took 4 to 8 hours. I say 4 to 8 because I started it before going to bed and it was done by the time I got up. I know it took at least 4 hours from the timestamps on the slices.

Hardware stats: P4 2 GHz, 1.5 GB RAM, dual 80 GB hard drives. Red Hat Linux 7.3.

Software: python application, and nph-import.cgi
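The slicing application itself is not shown in this thread. As a rough sketch of the idea only, assuming the dump has already been parsed into a list of records, splitting it into 17 nearly equal slices could look like this (`slice_records` is a hypothetical name):

```python
def slice_records(records, n_slices):
    """Split a list into n_slices contiguous, nearly equal parts."""
    base, extra = divmod(len(records), n_slices)
    slices = []
    start = 0
    for i in range(n_slices):
        # The first `extra` slices take one additional record each.
        size = base + (1 if i < extra else 0)
        slices.append(records[start:start + size])
        start += size
    return slices


slices = slice_records(list(range(40)), 17)
```

Each slice can then be imported separately, so a failure in one slice does not force the whole import to be rerun.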

Problems noted: nph-import.cgi appears not to have exception handling for null Father IDs. My World/Russian import has failed halfway through twice now with the same error. I can't determine what is missing in the raw data, as I can't read Russian, and since Russian is unlike a Romance language, I can't guess.
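One possible workaround is to screen each slice before handing it to the importer. This is a minimal Python sketch, assuming records are dictionaries; the `father_id` key and the `split_valid` helper are hypothetical names, not part of nph-import.cgi.

```python
def split_valid(records, key="father_id"):
    """Separate importable records from those with a missing parent ID."""
    good, bad = [], []
    for rec in records:
        # Treat an absent key the same as an explicit None.
        if rec.get(key) is None:
            bad.append(rec)
        else:
            good.append(rec)
    return good, bad


sample = [
    {"name": "World/Russian/A", "father_id": 12},
    {"name": "World/Russian/B", "father_id": None},
    {"name": "World/Russian/C"},
]
good, bad = split_valid(sample)
```

Writing the `bad` list to a file shows exactly which categories lack a parent, without needing to read the category names themselves.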