Gossamer Forum

HELP: Want to import ALL dmoz data

I've seen several postings here concerning the import of the dmoz.org .rdf dump and I have a few questions.

I would like to import ALL of the dmoz data, 600-700 MB of .rdf data and need help with the command line.

I've got the content.rdf.gz file on a Linux machine, but it doesn't appear to be gzipped at all. The file is 139 MB and has a .gz extension, yet if I cat it, it displays text, not compressed data, and the file command also reports ASCII text. This is odd to me. If I copy the file to my home PC, the .gz is the same size as on the Linux machine but uncompresses to 702 MB. No big deal, it could be something on the Unix machine.
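One way to check what the file on the Linux machine really is: a gzip file starts with the two bytes 1f 8b, whatever its extension says. A minimal sketch, assuming a shell with head, od and tr available; the function names is_gzip and describe_dump are made up for illustration:

```shell
#!/bin/sh
# Return 0 if the file starts with the gzip magic bytes 1f 8b, 1 otherwise.
is_gzip() {
    # Read the first two bytes and print them as hex without offsets or spaces.
    magic=$(head -c 2 "$1" | od -An -tx1 | tr -d ' \n')
    [ "$magic" = "1f8b" ]
}

# Print a one-word verdict for a downloaded file (hypothetical helper).
describe_dump() {
    if is_gzip "$1"; then
        echo "gzip"
    else
        echo "not gzip"
    fi
}
```

Running describe_dump on the download path tells you whether gzip -d has anything to do. It does not explain why a 139 MB file would show up as text, only whether the bytes on disk are compressed.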

The question I have is:
How do I import ALL of the dmoz data? I will more than likely import twice a month to keep the links fresh. After I populate the database once (with all data), do I need to re-import all of the data or can I provide a flag to only import the data that is NEW or has CHANGED?

Please provide command line parameters for both. I'm new to this and would appreciate it if they could be spelled out exactly as needed: once for a FULL import and once for an INCREMENTAL import.

Thank you !!!

Re: HELP: Want to import ALL dmoz data
Are you sure your server can cope with millions of links?

Paul Wilson.
(Don't blame me if I'm wrong!)
Re: HELP: Want to import ALL dmoz data
As a matter of interest, this is becoming a big "Why". The only real reason to do this is the same as GT's reason -- to show it can be done.

There is a mound of garbage in their system. 90% of the adult links -- 50k+ are garbage. They are just dead or circular references to banner farms. I'm sure there are many other areas like that where "spam" prevails.

DMOZ's biggest value is picking up a set of niche links that _have_ been carefully cleaned and developed by editors. Selectively importing categories is much, much more useful than just trying to gulp the whole thing.

It's a lot of work to selectively import, but it pays off in the quality of your site. There is no reason to set up another generic portal, offering access to the same old spammish links that are in everyone else's portal site. Create something with effort, thought, and care, and you'll have a viable product. Make a copy of DMOZ, and that is exactly what you'll have. A copy. Not the real thing.

Just some thoughts for the general public.

PUGDOG®
PUGDOG® Enterprises, Inc.
FAQ: http://LinkSQL.com/FAQ


Re: HELP: Want to import ALL dmoz data
Good point but I think a lot of the people who import do it for the same reasons....for a sense of satisfaction.

Seeing "You have 1,000,000 resources to choose from"..is pretty satisfying and you can show it off..lol

I'm just importing either the Computer or Business category for the db I am working on, but as a test I imported the Adult category to see how it worked. I used the Adult category as it was the first in the content.rdf, so I didn't have to wait for Parse_RDF.pl to skip past the other categories.
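Before picking categories, it can help to see how many links each top-level category holds. A rough sketch, assuming each link in the uncompressed dump carries a line of the form <topic>Top/Category/...</topic> (that layout is an assumption about the dump, not something stated in this thread); count_links_by_category is a made-up name:

```shell
#!/bin/sh
# Print "count category" pairs for every top-level category, largest first.
# Assumes one <topic>Top/...</topic> line per link in the RDF file.
count_links_by_category() {
    # Keep only the first path component after "Top/", then tally it.
    sed -n 's|.*<topic>Top/\([^/<]*\).*|\1|p' "$1" |
        sort | uniq -c | sort -rn
}
```

Calling count_links_by_category on the path to the uncompressed content.rdf prints one line per category. On a 700 MB file this reads the whole dump once, so expect it to take a while.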

It was surprisingly quick....I imported 64,000 links and 4,000 categories in about 15 minutes and built the directory from telnet in about 10 minutes.

Everything seems to work well.......nothing has slowed down noticeably.

Paul Wilson.
(Don't blame me if I'm wrong!)
Re: HELP: Want to import ALL dmoz data
I may whack the adult stuff; if it's bloat, it's gone. It would be nice to have a large database like that. We've got a database server set up, so the large DB can rest there.

My version is 2.0 SQL and uses nph-import.cgi not the rdf_parse.

Back to my original question!

Will someone provide me with the exact command line to import everything?

Much appreciated.

Re: HELP: Want to import ALL dmoz data
I already told you....

gzip -d /path/to/content.rdf.gz

THAT UNZIPS IT TO 700MB

Then DOWNLOAD Parse_RDF.pl and use

perl Parse_RDF.pl

Make sure you have configured the variables in Parse_RDF.pl

YOU NEED PARSE_RDF.PL TO IMPORT FROM DMOZ......nph-import.cgi ISN'T for importing from DMOZ
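The two steps above can be sketched as a small script. This is only an illustration: unpack_dump is a made-up helper, the paths are placeholders, and Parse_RDF.pl is run with no arguments because, as described above, its settings live in variables inside the script. Unlike plain gzip -d, this version keeps the original .gz file in place.

```shell
#!/bin/sh
# Decompress SRC into DEST without deleting SRC.
# Fails early if SRC is not a valid gzip file.
unpack_dump() {
    src=$1
    dest=$2
    gzip -t "$src" || { echo "$src is not a valid gzip file" >&2; return 1; }
    gzip -dc "$src" > "$dest"
}

# Example run (placeholder paths; Parse_RDF.pl must already be configured):
# unpack_dump /path/to/content.rdf.gz /path/to/content.rdf &&
#     perl Parse_RDF.pl
```

Keeping the compressed copy costs disk space but saves a second download if the import has to be repeated.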

Paul Wilson.
NEW WEBSITE COMING SOON!!
Re: HELP: Want to import ALL dmoz data
Man, I must be an idiot! I cannot find the file >>> Parse_RDF.pl <<< anywhere for download.

Will someone be kind enough to send me the URL?

Re: HELP: Want to import ALL dmoz data
I think you have to log in to the licensed users section and download it there.

Robert Blackstone
Webmaster of Scato Search
http://www.scato.com
Re: HELP: Want to import ALL dmoz data
Yes, that's correct.

A copy also comes with Links SQL 1.13, but I have had a few problems with it.

Paul Wilson.
http://www.wiredon.net