Gossamer Forum

I have a question about extracting the dmoz data. I have searched through the forums and read dozens of postings, but this has only made me more confused.

I have downloaded the content file from dmoz as well as the structure file. I have the script from the Hyperseek site at http://www.iwebsupport.com/files/rdf-hs.txt, but this appears to need the content file unzipped. My virtual server is limited to 400MB, so this is no good for me.

I read in these forums that there is a file called parse_rdf.pl that works with the file in its zipped format. I cannot find this file anywhere. I am guessing that it is only available to people who have purchased Links. I am a Hyperseek owner and therefore would not have access to this.

Can you please confirm that my guesses are correct, and let me know if there is any way I can get this script for free or, if not, whether there is another option open to me other than purchasing 'Links SQL'?

For example, can I split the dmoz data into two files before it is unzipped, or is there anywhere else on the web where I can get the up-to-date dmoz data in smaller chunks? Even better would be anyone who has been through this hassle and is willing to share their database with me for a small fee. I am looking for about 50,000 links for a general search engine.

Many thanks.

David Hayden

Re: DMOZ Data
You could download the file to your local machine, unzip it there, then use one of the text-slicing utilities to cut it into smaller pieces.

I've looked for something that will slice up the RDF file, but there doesn't seem to be much in the way of utilities for that.
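
One rough way to do the slicing without ever unzipping the whole file to disk is to stream the gzipped download and write it back out as smaller gzipped pieces. The Python sketch below is only an illustration: the function name split_gzip_by_lines and the output naming scheme are made up for this example, and it cuts purely on line counts, so a record can end up split across two pieces.

```python
import gzip


def split_gzip_by_lines(src_path, dest_prefix, lines_per_chunk):
    """Stream a gzipped text file and write numbered gzipped chunks.

    Hypothetical helper for illustration. Only one line is held in
    memory at a time, so the uncompressed file never touches the disk.
    Returns the list of chunk paths in order.
    """
    paths = []
    out = None
    # Binary mode keeps the bytes exactly as they are in the source.
    with gzip.open(src_path, "rb") as src:
        for i, line in enumerate(src):
            if i % lines_per_chunk == 0:
                # Start a new chunk every lines_per_chunk lines.
                if out is not None:
                    out.close()
                path = "%s.%03d.gz" % (dest_prefix, len(paths))
                out = gzip.open(path, "wb")
                paths.append(path)
            out.write(line)
    if out is not None:
        out.close()
    return paths
```

Something like split_gzip_by_lines("content.rdf.u8.gz", "part", 500000) would then leave part.000.gz, part.001.gz and so on, which can be uploaded and processed one at a time. The file name and chunk size here are just placeholders.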

Really, DMOZ should start backing their database up in chunks; one for each of the main categories would be a start.
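
Until that happens, the per-category split can be approximated on your own machine. The sketch below assumes (and this is an assumption about the dump layout, not something confirmed in this thread) that each link is an <ExternalPage ...> ... </ExternalPage> block containing a <topic>Top/Category/...</topic> line. The function name group_pages_by_top_category is invented for the example.

```python
import re
from collections import defaultdict

# Assumed layout: a <topic> element whose path starts with "Top/<Category>".
TOPIC_RE = re.compile(r"<topic>Top/([^/<]+)")


def group_pages_by_top_category(lines):
    """Group ExternalPage blocks by their top-level category.

    Hypothetical helper: 'lines' is any iterable of text lines, for
    example a file opened with gzip.open(path, "rt", encoding="utf-8").
    Returns a dict mapping category name to a list of page blocks.
    """
    groups = defaultdict(list)
    block = []        # lines of the page block currently being read
    category = None   # top-level category found inside that block
    for line in lines:
        stripped = line.strip()
        if stripped.startswith("<ExternalPage"):
            block = [line]
            category = None
        elif block:
            block.append(line)
            match = TOPIC_RE.search(line)
            if match:
                category = match.group(1)
            if stripped.startswith("</ExternalPage>"):
                groups[category or "Unknown"].append("".join(block))
                block = []
    return dict(groups)
```

This keeps every block in memory, which is fine for a sample but not for the full dump. For the real file you would more likely write each block straight to a per-category output file as soon as it is complete.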

FAQ: http://www.postcards.com/FAQ/LinkSQL/

Re: DMOZ Data

Send me an email and I can send you the file. However, you should probably contact Hyperseek, as I'm sure John has already done this and it will work much better for you.



Gossamer Threads Inc.