Gossamer Forum

The RDF dump

The RDF dump
Hi, I have installed Links 2.0 successfully, but how do I use it to parse the RDF dump I downloaded from dmoz.org?
Thanks, Julie

Re: The RDF dump
Download this: http://www.iwebsupport.com/files/rdf-hs.txt and rename it to a .cgi extension.

You also need a copy of Perl (http://www.activestate.com) and a web server (http://www.omnicron.com) on your computer. Put the files in one directory, set the variables in the script, and open a command prompt (Start => Run => command). cd to the directory where the files are and type "perl rdf-hs.cgi".

Also note that Links 2.0 cannot handle much more than 5000 links. If your plan was to use the whole dump, download POD from http://www.grohol.com/pod/ or buy Links SQL.

--Drew
Re: The RDF dump
What exactly will this script do for me, and will it take long? I'm sorry if the questions are stupid, but I don't know much Perl or CGI. Thanks, Julie.

Re: The RDF dump
The script takes those incredibly huge RDF files and converts them into Links format. How long it takes depends greatly on how much data you want. If all you want is a certain subcategory, the script can be set to skip everything except that subcategory. If you want the whole thing, it could take an hour.

--Drew
Re: The RDF dump
Hi Junko,

How do you set the variables in the script if you just want to index a subcategory, say Real Estate, which is a subsection of Business (i.e. /Business/Real_Estate)?

In particular, what should these two be set to:

$incats = "structure.rdf"; # DMOZ Category File
$inlinks = "content.rdf"; # DMOZ Links Database

Thanks


Re: The RDF dump
Look for:
Code:
## To select only certain categories ... uncomment
## next if($cat !~ "somestringinhere");
and:

Code:
## To select only certain categories ... uncomment
## next if($Category !~ "somestringinhere");
Replace "somestringinhere" with "Real_Estate" (or "Real Estate" or "Business/Real_Estate" or some variation of that) and uncomment the lines with "next if" in them. Sorry, I haven't experimented with this version yet, so I can't be sure.
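
A minimal sketch of how a filter line like this behaves, assuming the category value is a slash-separated path such as Top/Business/Real_Estate. The sample paths and the $wanted variable are made up for illustration and are not taken from rdf-hs.cgi. Because !~ treats its right-hand side as a regular expression, \Q...\E is used here so the string is matched literally.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample of category paths (not taken from the real dump).
my @categories = (
    'Top/Business/Real_Estate',
    'Top/Business/Real_Estate/Agents',
    'Top/Arts/Music',
    'Top/Regional/Europe/Real_Estate',
);

# The substring to keep; any path that does not contain it is skipped.
my $wanted = 'Business/Real_Estate';

my @kept;
for my $cat (@categories) {
    # Same shape as the commented-out line in the script:
    # move on to the next entry when the path does not match.
    next if ($cat !~ /\Q$wanted\E/);
    push @kept, $cat;
}

print "$_\n" for @kept;
```

With these sample paths, a filter of just "Real_Estate" would also keep Top/Regional/Europe/Real_Estate, so a longer string such as "Business/Real_Estate" is the narrower choice.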

--Drew
Re: The RDF dump
OK, thanks. I saw that later in the script. But I still do not understand what the above two variables should be set to. I guess since you haven't used it you would not know, but maybe someone else does.

Thanks

Re: The RDF dump
$incats = "structure.rdf"; # This is DMOZ's category file (after being unzipped)
$outcats = "categories.db"; # This will be your LINKS category file
$inlinks = "content.rdf"; # This is DMOZ's link database (after being unzipped)
$outlinks = "links.db"; # This will be your LINKS link database

You shouldn't have to change any of that. If those files aren't in the same directory as the script, you'll need to put in the full path to those files (e.g., C:\httpd\cgi-bin\dmoz\structure.rdf).
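
One thing to watch when putting a Windows path into these variables: inside a double-quoted Perl string, a backslash starts an escape sequence, so the path above would not survive as written. A short sketch of three ways to write it, using the example path from this post (the variable names here are only for illustration):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Double quotes: every backslash must be doubled. A single backslash
# would start an escape sequence (for example, \c begins a
# control-character escape) and the path would be mangled.
my $escaped = "C:\\httpd\\cgi-bin\\dmoz\\structure.rdf";

# Single quotes: backslashes before ordinary letters are kept literally.
my $single = 'C:\httpd\cgi-bin\dmoz\structure.rdf';

# Forward slashes: Perl's file functions generally accept these on Windows.
my $forward = "C:/httpd/cgi-bin/dmoz/structure.rdf";

print "$escaped\n$single\n$forward\n";
print "The escaped and single-quoted forms are the same string.\n"
    if $escaped eq $single;
```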

The script will prompt you for where to save the output if you don't create blank database files.

--Drew
Re: The RDF dump
Oh, I see. Well, it's time to rock and roll then. I think I will try it over the weekend.

Thanks for the info again.

Re: The RDF dump
Can anyone tell me how to unzip the RDF dump (content.rdf.u8.gz)? I have tried WinZip, gunzip, and gzip; none of them work. There is probably a simple solution I am not aware of. Please help. Thanks in advance.

- support@nodeception.com

Re: The RDF dump
Try downloading the plain gzip file; it's the one without the u8 in the name.

Re: The RDF dump
Yeah, I was able to do that, and everything worked smoothly after that. Thanks.

I am still curious how to decompress those .u8 files, though. If anyone knows, drop a note.

Thanks,

Tim
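
On the .u8 question, one possible approach is to decompress the file from Perl itself. This is only a sketch: it assumes the .u8.gz file is an ordinary gzip archive (the .u8 most likely just marks the UTF-8 encoded version of the dump) and that the IO::Uncompress::Gunzip module is available, as it is in recent Perl distributions. The gunzip_file helper is a made-up name for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use IO::Uncompress::Gunzip qw(gunzip $GunzipError);

# Hypothetical helper: decompress the gzip file $in into the file $out.
# Dies with the module's error message if decompression fails.
sub gunzip_file {
    my ($in, $out) = @_;
    gunzip($in => $out)
        or die "gunzip of $in failed: $GunzipError\n";
    return $out;
}

# File names assumed from this thread; adjust to match the download.
my $archive = 'content.rdf.u8.gz';
if (-e $archive) {
    my $result = gunzip_file($archive, 'content.rdf.u8');
    print "Wrote $result\n";
}
else {
    print "$archive not found in the current directory\n";
}
```

If this also fails, the error message from $GunzipError should at least say why, which is more than some GUI tools report.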