Gossamer Forum

Split the DMOZ data?

Guys,

I have downloaded the DMOZ data (content.rdf.u8) and would like to split the file into smaller pieces, one file per top category, e.g. Adult, Arts, Business, Computers, etc. Any ideas on how this should be done?

Best, Soobe



Re: Split the DMOZ data?
Which categories do you want/need?

To do it yourself you can use nph-import.cgi

If you run it from telnet it will show you the commands needed.

There are also examples around this forum somewhere. Search for "dmoz rdf".

Mods:http://wiredon.net/gt/download.shtml
Installations:http://wiredon.net/gt/
Re: Split the DMOZ data?
Actually, I don't think you can import just a section of the DMOZ file. If I remember correctly, the options are to import everything or to update the current data. You will need a custom-made script to split the sections into smaller chunks.


________________________
Eraser
Insight Eye
http://www.insighteye.com
Re: Split the DMOZ data?
Actually, Eraser, you can import any category you want. You don't have to import it all - that would be pointless for most.

Mods:http://wiredon.net/gt/download.shtml
Installations:http://wiredon.net/gt/
Re: Split the DMOZ data?
Well then I guess I didn't remember correctly.

This is what he needs to import a specific category:

--rdf-category="Top/Category/Name"


Still needs a separate script to break up sections into smaller pieces though which was the gist of the original post. PugDog has such a script, perhaps he would be willing to sell a copy.



________________________
Eraser
Insight Eye
http://www.insighteye.com
Re: Split the DMOZ data?
In Reply To:
Still needs a separate script to break up sections into smaller pieces though which was the gist of the original post. PugDog has such a script, perhaps he would be willing to sell a copy.
No you don't. You can use nph-import.cgi to import a specific category from the FULL gzipped or uncompressed file. I've done it myself.

Mods:http://wiredon.net/gt/download.shtml
Installations:http://wiredon.net/gt/
Re: Split the DMOZ data?
Splitting the main file into separate rdf.gz files, for which a script will be needed, would be the way to go.

Original post:

"I would like to split the file in smaller pieces, one file per top category"

:-P



________________________
Eraser
Insight Eye
http://www.insighteye.com
Re: Split the DMOZ data?
Hmm... one of us is getting confused. I still stand by the fact that a separate script isn't needed to split the file. nph-import will do everything soobe requires.


Mods:http://wiredon.net/gt/download.shtml
Installations:http://wiredon.net/gt/
Re: Split the DMOZ data?
We await his reply...




________________________
Eraser
Insight Eye
http://www.insighteye.com
Re: Split the DMOZ data?
Hi,

No, you can import a specific category. However, it currently parses everything and isn't the quickest import script, so if you want World/something you may be waiting a while unless you can cut out the part of the dump you want.

I just used vi, took about 10 minutes to do (7 minutes to load the file, 1 minute to find the beginning, 1 minute to find the end, and a minute to save).

Cheers,

Alex

--
Gossamer Threads Inc.
Re: Split the DMOZ data?
As for the program I use, it's a hack (and a bad one!). It's pretty stupid and really just brute force opens the file, looks for certain strings, and writes out the file. I have to run it multiple times, when really I should be able to run it once, and generate all the cuts, but never had the time to do it.

I do have cuts available, where I've taken the top sections and put them into separate files, with the proper headers. Don't know how important that is, but it works. I try to generate them once a month, so my new sites have a fresher copy of the DMOZ.

Part of the problem for most people is simply the size of the files. They don't have enough disk space, or RAM to deal with them. The smaller "cuts" work better. They work better for general use as well.

As for vi, or any of the other editors, if you can open the file in a "read only" mode, so the program generates lower overhead, you might be able to mark the part you want and save to another file (I know joe and EMACS can do that, I've never gotten the hang of vi).

It would really be a good job for a summer intern (or fall intern) to take the import program, mix it with the parse routines, and be able to pre-parse, take out a cut of DMOZ, and then import that cut. It would be nice to do it all at once, but doing it in several passes (like old-style compilers) uses fewer resources, and allows setting up categories, import locations, and the like, which would require loads of RAM and linked lists to do in a single pass.
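
The pre-parse pass described above could be sketched roughly as follows. This is only an illustration, not the program mentioned in this post. It assumes that everything before the first Topic line is the header, that the dump ends with a closing RDF tag, and that all entries of one top category sit together in the file. The function name split_dmoz and the cut_*.rdf.u8 output names are made up for the example.

```shell
#!/bin/sh
# Sketch: split a DMOZ dump into one file per top-level category in one pass.
# Assumptions (not from the thread): lines before the first <Topic> line are
# the header, the dump ends with </RDF>, and each category is contiguous.
# split_dmoz and the cut_*.rdf.u8 names are hypothetical.
split_dmoz() {
    awk '
    # Collect the header until the first Topic line is seen.
    !seen && !/<Topic r:id="/ { header = header $0 "\n"; next }
    /<Topic r:id="/ { seen = 1 }
    # A Topic under Top/ may start a new top-level category.
    /<Topic r:id="Top\// {
        name = $0
        sub(/.*<Topic r:id="Top\//, "", name)  # drop everything up to Top/
        sub("[/\"].*", "", name)               # keep the first path segment
        gsub(/ /, "_", name)
        if (name != current) {
            if (out != "") { print "</RDF>" > out; close(out) }
            current = name
            out = "cut_" name ".rdf.u8"
            printf "%s", header > out          # each cut gets the header
        }
    }
    /^<\/RDF>/ { next }                        # footer is re-added per file
    out != "" { print > out }                  # entries before Top/... are skipped
    END { if (out != "") print "</RDF>" > out }
    ' "$1"
}
```

Running split_dmoz content.rdf.u8 in an empty directory would write files such as cut_Arts.rdf.u8 and cut_World.rdf.u8 there. It reads the dump once and keeps only the header in memory, which fits the point about limited RAM.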

Anyway....

PUGDOG® Enterprises, Inc.
FAQ:http://LinkSQL.com/FAQ
Plugins:http://LinkSQL.com/plugin
Re: Split the DMOZ data?
Thanks for all replies!

The main reason for me to split the file is to get the import done quicker. Since the complete file is parsed, it takes a huge amount of time to import categories at the end of the file, such as the World category. Using the complete file for an import of the World category takes up unnecessary time and resources.

I solved it this way. It may not be the smartest way, but it works perfectly :-)

1. To find the start and stop line of each category, I run this shell script:

#!/bin/sh
# Append the line number and text of each top-level Topic line to categories.id.
# grep -n prints "LINE:text", so the separate cat -n is not needed.
for topic in Adult Arts Business Computers Games Health Home "Kids and Teens" \
    Netscape News Recreation Reference Regional Science Shopping Society \
    Sports World
do
    grep -n "<Topic r:id=\"Top/$topic\">" content.rdf.iso >>categories.id
done

Then I open the RDF file in vi and save the parts of the file between the line numbers recorded in categories.id:

vi content.rdf.u8
:100,200w newfilename

This will save lines 100-200 to a new file called newfilename.
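
The same line-range copy can probably be done without loading the file into an editor. A minimal sketch using sed, assuming the start and end line numbers have already been read from categories.id (save_lines is a made-up helper name, not part of any tool mentioned in this thread):

```shell
#!/bin/sh
# Sketch: copy lines START..END of a file without opening an editor.
# save_lines is a hypothetical helper name.
save_lines() {
    # usage: save_lines START END INFILE OUTFILE
    # -n suppresses sed's default output, "p" prints the range, and "q"
    # stops reading at END so the rest of a large dump is not scanned.
    sed -n "${1},${2}p;${2}q" "$3" > "$4"
}
```

For example, save_lines 100 200 content.rdf.u8 newfilename should write the same lines as the vi command above.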

Best, Sobbe

Re: [soobe] Split the DMOZ data?
Hi,
I've tried to split a DMOZ category:
cat -n content.rdf.u8|grep '<Topic r:id="Top/World">' >>World.rdf.u8

but it does not split anything...
Why?


Thanks in Advance
Bye From Italy
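
A likely explanation: grep prints only the lines that match, so the command above appends a single numbered Topic line to World.rdf.u8 rather than the whole section. The line number is only the starting point for the cut. A minimal sketch that cuts from that line to the end of the file, assuming World is the last top category in the dump as mentioned earlier in the thread (cut_from_topic is a made-up helper name):

```shell
#!/bin/sh
# Sketch: write everything from the first matching Topic line to the end of
# the file. cut_from_topic is a hypothetical helper name.
cut_from_topic() {
    # usage: cut_from_topic CATEGORY INFILE OUTFILE
    # grep -n prefixes each match with "LINE:"; keep only the first number.
    start=$(grep -n "<Topic r:id=\"$1\">" "$2" | head -n 1 | cut -d: -f1)
    [ -n "$start" ] || return 1      # category not found
    # Print from the start line to the last line ($) of the input.
    sed -n "${start},\$p" "$2" > "$3"
}
```

For example: cut_from_topic Top/World content.rdf.u8 World.rdf.u8. The result starts at the Topic line, so it does not carry the RDF header lines from the top of the dump.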
Re: [fabio] Split the DMOZ data?
Or you could just write a script, like me, Paul, and a couple of other people have done :-P That way you can set it on cron: it will download the file, unzip it, cut it into the appropriate slices, and then email you when it's complete, and all of this is done at server speed :-)

Andy (mod)
andy@ultranerds.co.uk