Gossamer Forum
Dmoz.org Import Question

Hello,

I only want to include the category /World/Espanol/ into my DB.

Is this possible without downloading the whole RDF file from dmoz.org? The download window said it would take another 12 hours on my modem. Any ideas?

Am I not doing it right?

------------------
James L. Murray
VirtueTech, Inc.
www.virtuetech.com


Re: Dmoz.org Import Question
You have to download the whole thing. Then you use the script to get the category.

jerry

Re: Dmoz.org Import Question
What you need is an editor that "edits in place", meaning it doesn't try to load the whole file into memory.

I can't tell whether EditPlus does that; they don't list file size limits.

I do know I was able to edit the file on Unix with Joe (Joe's Own Editor). It takes a while to load (3-4 minutes sometimes), but it doesn't crash, and you can locate, mark and write out blocks of the file to smaller files.

I remember that under DOS I also used a program called "List". It was a viewer program, but you could search and highlight blocks of text to write out to smaller files. Because it was a viewer, it didn't have to keep the same sort of information on hand that an editor does, and it was very fast.

There are probably similar editors for Windows, but I don't have a list of them.
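
The same "don't load the whole file into memory" idea also works without an editor. Below is a minimal Perl sketch that reads the expanded dump one line at a time and copies only one category. It rests on assumptions about the dump layout that should be checked against your copy: each category starts with a line containing <Topic r:id="...">, the records that follow belong to that category until the next Topic line, and the file ends with </RDF>. The script name extract_category.pl and the sub name are made up for illustration.

```perl
#!/usr/local/bin/perl
use strict;
use warnings;

# Copy only the records under one category from an ODP RDF dump,
# reading a line at a time so the whole file is never held in memory.
# Assumption: every <Topic r:id="..."> line opens a category, and the
# records after it belong to that category until the next Topic line.
sub extract_category {
    my ($in, $out, $prefix) = @_;
    my $keep = 1;    # header lines before the first Topic are always kept
    while (my $line = <$in>) {
        if ($line =~ /<Topic\s+r:id="([^"]*)"/) {
            my $id = $1;
            # Keep the category itself and anything below it.
            $keep = ($id eq $prefix || index($id, "$prefix/") == 0) ? 1 : 0;
        }
        # Always keep the closing root tag (assumed to be </RDF>).
        print {$out} $line if $keep || $line =~ m{</RDF>};
    }
}

# Usage: perl extract_category.pl Top/World/Espanol < content.rdf > espanol.rdf
extract_category(\*STDIN, \*STDOUT, $ARGV[0]) if @ARGV;
```

If disk space is the problem, the dump does not have to be expanded on disk first; the script can read from a pipe instead: gzip -dc content.rdf.gz | perl extract_category.pl Top/World/Espanol > espanol.rdf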

Re: Dmoz.org Import Question
How about fetching it directly to your server? If you have Telnet access, you can use this tiny script:

Code:
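#!/usr/local/bin/perl
# Fetch the ODP dump over HTTP and save it in the current directory.
use LWP::Simple;
$url = "http://dmoz.org/rdf/content.rdf.gz";
# The saved file is still gzip-compressed despite the .txt name.
$rc = LWP::Simple::getstore($url, "content.txt");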

Depending on the speed of your connection, it could take anywhere from 20 minutes to 20 hours. But your server is usually connected to a larger pipe (at least 1M) than your modem.

The downside is that if you are charged for bandwidth, it's going to be 60 MB of bandwidth. Additionally, no matter WHERE you get the file, it expands to over 500 MB in size. That's ONE HALF GIG.

I run my own server, so I was able to snag it and expand it, and trash it when I was done. It's big.

ODP should really make the file available in sections by TLD; it would cut their bandwidth and make it easier for most people to obtain.



------------------
POSTCARDS.COM -- Everything Postcards on the Internet www.postcards.com
LinkSQL FAQ: www.postcards.com/FAQ/LinkSQL/

Re: Dmoz.org Import Question
Hi:

I just spent a day downloading dmoz, and I expanded it (huge!)

I cannot get anything to OPEN it to even begin extracting what I need. Nothing (Word or WordPad) can handle a file so big. Any suggestions?

Also, Jerry, you mentioned "you use the script to get the category"... what script?

Happy Holidays!

Dave

[This message has been edited by carfac (edited December 24, 1999).]

Re: Dmoz.org Import Question
You can use EditPlus for 30 days: http://www.editplus.com/
It's really good.

Re: Dmoz.org Import Question
Hello Pugdog,

This is exactly what I was looking to do. Thanks.

I ran the script through telnet like this: perl dmz.cgi

I ran it from the scripts directory and it just went to the next prompt. Does that mean it is working? Where is it downloading to? From the looks of the script you gave, it downloads to the directory where the script is run.

Any help?

------------------
James L. Murray
VirtueTech, Inc.
www.virtuetech.com


Re: Dmoz.org Import Question
pugdog, bmxer:

Thanks! I appreciate the help!

Happy New Year to all!

Dave

Re: Dmoz.org Import Question
GZ --

This script requires all the stuff that Links required in the last version, i.e. LWP.

It shouldn't return to the prompt until it's done, so something didn't run or you are on a REALLY fast connection. It should download to the current directory.

This script has NO error checking. I ran it last week, and it worked just fine. I just re-ran it, and it worked. It downloads a file called content.txt (I think DMOZ has the names wrong; you have to GUNZIP that file anyway).

All it really does is make an HTTP request for you through the Perl module LWP. (When I was looking to download the file the first time I couldn't find an FTP address, and I can't run a browser on my Unix box.)
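
Since a silent return to the prompt is hard to interpret, here is a minimal sketch of the same download with a status check added. It still uses getstore from LWP::Simple, which returns the HTTP status code, and is_success, which LWP::Simple also exports. The sub name fetch_dump, the script name and the output file name are illustrative choices, not part of the script above. Saving under a name that ends in .gz is deliberate: gunzip normally skips files whose suffix it does not recognize.

```perl
#!/usr/local/bin/perl
use strict;
use warnings;
use LWP::Simple qw(getstore is_success);

# Download $url into $file and die with the HTTP status if it fails.
# Returns the size in bytes of the saved file.
sub fetch_dump {
    my ($url, $file) = @_;
    my $rc = getstore($url, $file);    # HTTP status code of the response
    die "Download of $url failed with HTTP status $rc\n"
        unless is_success($rc);
    return -s $file;
}

# Usage: perl fetch_dump.pl http://dmoz.org/rdf/content.rdf.gz content.rdf.gz
if (@ARGV == 2) {
    my $bytes = fetch_dump(@ARGV);
    print "Saved $ARGV[1] ($bytes bytes)\n";
}
```

With this version, a failed request prints the status code instead of quietly dropping back to the prompt, and a successful one reports how many bytes were saved.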

------------------
POSTCARDS.COM -- Everything Postcards on the Internet www.postcards.com
LinkSQL FAQ: www.postcards.com/FAQ/LinkSQL/

Re: Dmoz.org Import Question
PugDog,

Can you do me a favor? I don't have enough space to expand the whole directory. Could I use your copy to get just the /World/Espanol part of the directory?

------------------
James L. Murray
VirtueTech, Inc.
www.virtuetech.com


Re: Dmoz.org Import Question
Let me see if I still have it. If I do, I'll zip up that area.

It's a BIG file....

Re: Dmoz.org Import Question
You can get the file at:

http://www.postcards.com/FAQ/LinkSQL/files/top_world_Esp.rdf.gz

It's just the Top/World/Espanol area.

I'll keep it there a few days.


------------------
POSTCARDS.COM -- Everything Postcards on the Internet www.postcards.com
LinkSQL FAQ: www.postcards.com/FAQ/LinkSQL/

Re: Dmoz.org Import Question
Thank you so much PugDog. You are a life saver!



------------------
James L. Murray
VirtueTech, Inc.
www.virtuetech.com


Re: Dmoz.org Import Question
PugDog,
I uploaded the file to my /admin/setup directory and tried to unzip it by typing this at telnet: gzip -d top_world_Esp_rdf.gz

It said it wasn't a gz file. How do I make this work? I tried just renaming the file to top_world_Esp.rdf and then using Parse_RDF.pl, and that didn't work either.

I could really use some help. Thanks.

------------------
James L. Murray
VirtueTech, Inc.
www.virtuetech.com


Re: Dmoz.org Import Question
Try the gunzip utility. I'm not sure which version of the program I used.

gunzip top_world_Esp_rdf.gz

I checked the file and it unzips for me.

Re: Dmoz.org Import Question
Also make sure you FTP'd it in binary mode.

Cheers,

Alex

Re: Dmoz.org Import Question
Hello,

OK. I re-uploaded the file in binary mode and used the gunzip method, but it still says that the file is not in gzip format.

????

Re: Dmoz.org Import Question
Try this:

Download the .gz to your local PC, use WinZip to gunzip it, and upload the .rdf to your server.

Then just run Parse_RDF.pl.

Hope that helps,
Emilio

Re: Dmoz.org Import Question
Try it now.

I don't know if the server was sending the file header properly to the browser. You should be prompted to download.
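
When gzip or gunzip reports that a file is not in gzip format, one quick way to see what actually arrived is to look at its first two bytes. Real gzip data always starts with the bytes 0x1f 0x8b. An HTML error page, or a file mangled by an ASCII-mode FTP transfer, usually will not. The following is a minimal sketch of that check; the sub name looks_like_gzip and the script name are made up for illustration.

```perl
#!/usr/local/bin/perl
use strict;
use warnings;

# Return true if the file starts with the two-byte gzip signature 0x1f 0x8b.
sub looks_like_gzip {
    my ($path) = @_;
    open my $fh, '<', $path or die "Cannot open $path: $!\n";
    binmode $fh;                 # read raw bytes, no newline translation
    my $magic = '';
    my $got = read($fh, $magic, 2);
    close $fh;
    return (defined $got && $got == 2 && $magic eq "\x1f\x8b") ? 1 : 0;
}

# Usage: perl check_gzip.pl top_world_Esp.rdf.gz
for my $file (@ARGV) {
    my $verdict = looks_like_gzip($file)
        ? "starts with the gzip signature"
        : "does NOT start with the gzip signature";
    print "$file: $verdict\n";
}
```

If the check fails, opening the file in a text viewer often shows whether it is an HTML page or an already expanded RDF file rather than compressed data.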

Re: Dmoz.org Import Question
Hi:

Got it downloaded fine, expanded fine, and got Parse_RDF running, but it always QUITS at about Top/Arts/Music/Bands.

Any ideas? I saw the 5000 limit on topics and increased that to 9999999, but it still quits. It goes through a bunch of topics, says skipping, and then QUITS at about record 5688. Always a different place, and never getting to the data I want.


Dave

Re: Dmoz.org Import Question
OK, so I spent about 7 hours parsing the file for the entire Reference directory last night, but when I went to 're-index' and 'build all' today, not a single link or category showed up. Any idea what might have caused the links and categories not to be saved, with no error given in telnet?

------------------
Ted Sindzinski
www.infinityinternet.com


[This message has been edited by infinity (edited December 31, 1999).]
Re: Dmoz.org Import Question
Hi,
I have just downloaded the whole content.rdf.

It's quite big: 80.7 MB, expanded to 477 MB.

I extracted this and then opened it without any difficulty using "Arachnophilia 3.9" Careware.

It's quite amazing: it takes 5 minutes to open, but once open you can scroll up and down and edit the file as if it were a 20K file.

Tony

Re: Dmoz.org Import Question
I think it's a new RDF format. Alex was going to check on it, and I don't think he's reported back on it yet.

Two things could have happened. One is that no links were read in at all.

The other is that they were all added to category "0" and will show up if you search for lost/bad links.

Of course, that won't help you at all, since moving that many links by hand is an unrealistic task.



------------------
POSTCARDS.COM -- Everything Postcards on the Internet www.postcards.com
LinkSQL FAQ: www.postcards.com/FAQ/LinkSQL/