Gossamer Forum
DMOZ
I was wondering whether it is currently possible to import DMOZ, or just certain categories of it, into Links SQL. I couldn't tell from the threads. If so, is it just a matter of downloading content.rdf.gz, gunzipping it, and then running Parse_RDF.pl?

Thanks,

Kevin
Re: DMOZ In reply to
I don't have LWP.

I tried to download content.rdf.gz in IE 5 by right-clicking the link and saving the file, but the download stops at 99% and never finishes. Is there any other way to download it without LWP?

Thanks
Re: DMOZ In reply to
Netscape Navigator?? <G>
Re: DMOZ In reply to
Try to do it from your webserver using telnet (if you have it):

1. Try typing:

lwp-download http://www.dmoz.org/rdf/content.rdf.gz

2. If that doesn't work, try:

lynx -source http://www.dmoz.org/rdf/content.rdf.gz > content.rdf.gz

3. I haven't seen it anywhere via FTP, so otherwise you'll need to use your browser to get it and then ftp it up to your site.

If you get the latest Parse_RDF.pl you don't need to gunzip it, so it's (only) 80 MB.
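
The fallback order in steps 1 and 2 can be wrapped in a small shell function. This is only a sketch: fetch_rdf is a made-up name, and it assumes lwp-download accepts a local file name as its second argument. Pass it whichever URL and output name you are using.

```shell
# Try lwp-download first, then fall back to lynx, as in steps 1 and 2.
# fetch_rdf is a hypothetical helper name, not part of Links SQL.
fetch_rdf() {
    url=$1
    out=$2
    if command -v lwp-download >/dev/null 2>&1; then
        lwp-download "$url" "$out"
    elif command -v lynx >/dev/null 2>&1; then
        lynx -source "$url" > "$out"
    else
        # Step 3: nothing usable on the server.
        echo "no lwp-download or lynx; fetch it in a browser and FTP it up" >&2
        return 1
    fi
}
```

For example: fetch_rdf "$RDF_URL" content.rdf.gz, with RDF_URL set to the address given in the steps above.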

Cheers,

Alex
Re: DMOZ In reply to
I used command #2 and it downloaded the Open Directory, but the file came to about 450 MB (458,845,755 bytes).

I only wanted a small section of this. Do I download it to my computer, edit it down to the section I want, and then run it through Parse_RDF.pl?

Will this erase my present links.db?

Thanks,

Kevin
Re: DMOZ In reply to
Hello,

This is my problem too.

------------------
James L. Murray
VirtueTech, Inc.
www.virtuetech.com


Re: DMOZ In reply to
I think lynx automatically gunzips (much like Netscape does), as the .gzipped version should only be around 80 MB.

Try typing:

more content.rdf.gz

and if it is in clear text, then it's already unzipped and you should type:

mv content.rdf.gz content.rdf
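
If you would rather not eyeball a 450 MB file with more, the same check can be done by looking at the first two bytes, which are 1f 8b in a real gzip file. A rough sketch, where is_gzip and fix_rdf_name are hypothetical helper names and the file names are the ones used in this thread:

```shell
# Report whether a file starts with the gzip magic bytes (1f 8b).
is_gzip() {
    magic=$(od -An -tx1 -N2 "$1" | tr -d ' \n')
    [ "$magic" = "1f8b" ]
}

# Rename the download only when it is really plain text.
fix_rdf_name() {
    if [ -f content.rdf.gz ] && ! is_gzip content.rdf.gz; then
        mv content.rdf.gz content.rdf
        echo "was plain text, renamed to content.rdf"
    else
        echo "left as is"
    fi
}
```

Run fix_rdf_name in the directory that holds the download.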

Then you can edit Parse_RDF.pl and set what section you want, and start it going. It can take anywhere from 5-20 minutes depending on your server to parse out the data you want.
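
Before starting a parse, it may help to check that the section name you plan to use actually occurs in the dump. The sketch below counts matching category entries. It assumes the dump marks categories with lines of the form <Topic r:id="Top/Arts/Music">, which you should confirm against your own copy, and count_topics is a made-up name.

```shell
# Count Topic entries whose id starts with the given section name.
# Assumes category lines look like: <Topic r:id="Top/Arts/Music">
count_topics() {
    subset=$1
    file=$2
    # gzip -dcf passes plain text through unchanged, so this works
    # whether or not the download was already unzipped.
    gzip -dcf < "$file" | grep -cF "<Topic r:id=\"$subset"
}
```

For example, count_topics Top/Arts content.rdf.gz prints the number of matching categories; a result of 0 suggests the section name is misspelled.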

It will append the information and not erase your existing links (which means that if you run it twice, you'll get a lot of duplicates). ;)

Cheers,

Alex
Re: DMOZ In reply to
I have several questions to ask about running the Parse_RDF.pl file.

1. How do I run Parse_RDF.pl? Do I have to run it from Telnet, or can
I run it from my browser?

2. What does the RDF dump file have to be named,
content.rdf or content.rdf.gz?

3. Also, where does the RDF content file need to be uploaded to? My guess
would be the same directory as the Parse_RDF.pl file is located.

4. If I want to start the import from the top of the content file, is
this the correct change in the Parse_RDF.pl file: my $SUBSET = 'Top';?
In other words, do I need the trailing slash (/)?

5. What are DBI connection parameters and if mine are different how do I
find out what they are?


Thank You Very Much,

Mark Foster


Re: DMOZ In reply to
I'll try, but I think I may only get 4 of the 5.

>> 1. How do I run Parse_RDF.pl? Do I
>> have to run it from Telnet, or can
>> I run it from my browser?

Run it from Telnet. It will time out if you don't.
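
If your Telnet session is likely to drop during a long parse, one option is to start the script in the background with nohup and send its output to a log file. This is a sketch only: run_detached and parse.log are made-up names, and it assumes the script does not need any interactive input.

```shell
# Start a long-running command so it keeps going if the Telnet
# session drops, writing its output to a log file.
# Prints the process id of the background job.
run_detached() {
    log=$1
    shift
    nohup "$@" > "$log" 2>&1 &
    echo $!
}
```

For example: run_detached parse.log perl Parse_RDF.pl, then tail -f parse.log to watch the progress.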

>> 2. What does the RDF dump file
>> have to be named,
>> content.rdf or content.rdf.gz?

Check the top of the Parse_RDF.pl file. There are two different versions: the newer one reads the .gz file without unzipping it first, while the older one requires the content.txt file. Either way, you can change the name and location by editing the top of the script before running it.

>> 3. Also, where does the RDF content file
>> need to be uploaded to? My guess
>> would be the same directory as the
>> Parse_RDF.pl file is located.

See #2: you can change it in the header. I believe Parse_RDF.pl needs to be run from inside the setup directory so that it can find the database definition files, but that may not be necessary if you define the database parameters in the DBI connection line of the file, where you need to supply the ID and password for your system.

>> 4. If I want to start the import from
>> the top of the content file, is
>> this the correct change in the
>> Parse_RDF.pl file: my $SUBSET = 'Top';?
>> In other words, do I need the trailing
>> slash (/)?

Good question! I think you just need "Top" but the only way to know for sure is to run it.... and see what happens <G>

>> 5. What are DBI connection parameters and
>> if mine are different how do I
>> find out what they are?

I thought they were read out of the Links.pm and .def files, which the script looks for in the '..' directory by default; that is why you need to run it from inside the setup directory.

But I see in the version of the file I have they are set in the line:

# Set your DBI connection parameters here.
my @dbi_Links = ('DBI:mysql:Links:localhost', 'root', '', {PrintError => 0, RaiseError => 0});


You need to replace 'root' with your database ID and '' with your password. They are the same ones you entered into the setup program, and they are the same ones that are listed at the top of each of the .def files.
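
To make the first element of that line less cryptic: it is a DBI data source name of the form DBI:driver:database:host. The shell sketch below just splits such a string into its parts; parse_dsn is a made-up helper name, not something that ships with Links SQL.

```shell
# Split a DBI data source name such as DBI:mysql:Links:localhost
# into driver, database and host.
parse_dsn() {
    # Anything after the host (for example a port) ends up in "rest".
    IFS=: read -r scheme driver database host rest <<EOF
$1
EOF
    echo "driver=$driver database=$database host=$host"
}
```

With the default line above, parse_dsn 'DBI:mysql:Links:localhost' reports the mysql driver, the Links database and localhost. If you are unsure of your ID and password, you can try them outside the script with the mysql command-line client, for example mysql -h localhost -u your_id -p Links, which prompts for the password.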

Hope that helps!