Gossamer Forum

DMOZ Splitter

Quote Reply
DMOZ Splitter
I had a bit of spare time before dinner, so I wrote a script to split the 1GB DMOZ data file into separate files. It will parse any category you choose and dump it into a separate file.

It should be run from the command prompt (if using windows) or a telnet/ssh window (if using *nix).

Usage is pretty simple....

Code:
perl rdf_parse.pl --rdf=/path/to/content.rdf.u8 --out=/path/to/new.file --cat=Top/Arts

Just do:

perl rdf_parse.pl

...and you'll get a summary of options.

Hope someone finds it useful.
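
For anyone curious how a splitter like this can work, here is a minimal Python sketch. It is not the rdf_parse.pl script from this post (its source is not shown in the thread). It assumes the dump stores each category as a <Topic r:id="..."> block and each link as an <ExternalPage> block containing a <topic>...</topic> line, with the opening and closing tags on their own lines. The names extract_category and in_category are made up for the illustration.

```python
import re

# Assumed record layout: <Topic r:id="Top/Arts"> ... </Topic> and
# <ExternalPage about="..."> ... <topic>Top/Arts</topic> ... </ExternalPage>
TOPIC_ID = re.compile(r'<Topic r:id="([^"]*)"')
PAGE_TOPIC = re.compile(r'<topic>([^<]*)</topic>')


def in_category(name, category):
    """True if name is the category itself or one of its subcategories."""
    return name == category or name.startswith(category + "/")


def extract_category(lines, category):
    """Yield the raw lines of every record that belongs to category.

    lines can be any iterable of text lines, so a large file is
    processed one line at a time instead of being loaded whole.
    """
    category = category.rstrip("/")
    record = None   # None while we are outside any record
    keep = False    # set once the current record names a matching category
    for line in lines:
        stripped = line.strip()
        if record is None:
            if not stripped.startswith(("<Topic ", "<ExternalPage ")):
                continue  # header, footer, or anything between records
            record, keep = [], False
        record.append(line)
        match = TOPIC_ID.search(line) or PAGE_TOPIC.search(line)
        if match and in_category(match.group(1), category):
            keep = True
        if stripped in ("</Topic>", "</ExternalPage>"):
            if keep:
                yield from record
            record = None
```

To run it against a real dump you would pass an open file instead of a list, for example extract_category(open("content.rdf.u8", encoding="utf-8"), "Top/Arts"), and write the yielded lines to the output file. Reading line by line keeps memory use flat, which matters for a 1GB file.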
Quote Reply
Re: [Paul] DMOZ Splitter In reply to
Thanks a lot for your script.

my MSN: perlchina_at_hotmail.com
Quote Reply
Re: [Paul] DMOZ Splitter In reply to
After some serious blood, sweat, tears and bandwidth usage, I've managed to write a script that will fetch the DMOZ content and extract *all* categories hierarchically. This means that it extracts into the following structure....

Code:
Top/
Top/Adult.part.gz # All Top/Adult data.
Top/Adult/Arts.part.gz # All Top/Adult/Arts data
Top/Adult/Business.part.gz # All Top/Adult/Business data

...and so on.
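
As a rough illustration of that layout, the Python sketch below maps a category name to a .part.gz path and appends text to it. It is an assumption about how such a tree could be built, not the script described in this post; part_path and append_record are hypothetical names, and the sketch writes a record only to its own category's file (the thread does not say whether a parent file also repeats its subcategories' data).

```python
import gzip
import os


def part_path(root, category):
    """Map 'Top/Adult/Arts' to '<root>/Top/Adult/Arts.part.gz'."""
    parts = category.strip("/").split("/")
    return os.path.join(root, *parts) + ".part.gz"


def append_record(root, category, text):
    """Append text to the gzip part file for category, creating
    the parent directories on first use."""
    path = part_path(root, category)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    # Mode "at" adds a new gzip member each time; gzip readers
    # treat the concatenated members as one continuous stream.
    with gzip.open(path, "at", encoding="utf-8") as fh:
        fh.write(text)
```

Opening and closing the file for every record is slow on a dump of this size; a real run would probably keep a limited pool of open handles, but that is left out to keep the sketch short.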

For an example of what I mean, go here....

http://dmoz.perlwhirl.com/

Is anyone interested in me using this script to provide weekly updated content?

Last edited by Paul: Jun 26, 2003, 6:36 AM
Quote Reply
Re: [Paul] DMOZ Splitter In reply to
Paul;

Sounds good!

How do we test this, and how do we add these up to date DMOZ links into an existing Links SQL database without damaging what is there, assuming Links SQL 2.12?

That's the million dollar question, if you ask me!
Quote Reply
Re: [webslicer] DMOZ Splitter In reply to
Quote:
how do we add these up to date DMOZ links into an existing Links SQL database without damaging what is there, assuming Links SQL 2.12

You add the data using nph-import.cgi and use --rdf-update so the data appends to your database.
Quote Reply
Re: [Paul] DMOZ Splitter In reply to
I would like to try that out B-)

Can you give an exact example for updating from DMOZ to an existing database, please?

Pick any of the slices that you have here:
http://dmoz.perlwhirl.com/Top/Adult
Quote Reply
Re: [Paul] DMOZ Splitter In reply to
It's a great idea.

Thanks, Paul.

Quote Reply
Re: [tsingson] DMOZ Splitter In reply to
I've password protected the directory now as I'm going to do a full parse of the content file into thousands of categories.

The data itself is free; however, some of the files are fairly large, so I need to cover my bandwidth costs.

So, I'm thinking of setting up a webpage where people can select how long they want access to the data, and an appropriate cost will be calculated (it will be about $10/week). A username and password will be generated automatically and you can then access the data.

Does that sound reasonable?

I'll also create a page with full instructions on how to import the data into Links SQL.
Quote Reply
Re: [Paul] DMOZ Splitter In reply to
That sounds excellent.

Paul, we have all needed this for a long, long time.

$10 per week is more than reasonable.

Do you take PayPal, or something else?
Sign me up NOW!
Quote Reply
Re: [webslicer] DMOZ Splitter In reply to
Here is the home page with import and payment details =)

http://dmoz.perlwhirl.com/
Quote Reply
Re: [Paul] DMOZ Splitter In reply to
OK, did it! B-)

I'm signed up for a month through PayPal for $40
Quote Reply
Re: [webslicer] DMOZ Splitter In reply to
Merci beaucoup :) ...it will go towards the repairs for my smashed-up car :(

Last edited by Paul: Jun 27, 2003, 9:07 AM
Quote Reply
Re: [Paul] DMOZ Splitter In reply to
No problem. Just don't smash the fixed one, OK?

Nice instruction page - looks easy.

2 quick q's:

What does the --rdf-update flag do when the URL already exists in the database? Overwrite? Skip (I expect)? Is the behaviour configurable?

Question 2:
Can we import to a different tree root directory and then move the links over? How? Is there a mass move utility?
Quote Reply
Re: [webslicer] DMOZ Splitter In reply to
--rdf-update will append all data to the end of your database. It doesn't handle duplicates, but in the admin panel there is a "Duplicates" link letting you identify and remove duplicates from your database.

To import into a different root category, use:

--rdf-destination="Foo/Bar"

To move links from there use the category browser in the admin panel.
Quote Reply
Re: [webslicer] DMOZ Splitter In reply to
Quote:
--rdf-update will append all data to the end of your database. It doesn't handle duplicates, but in the admin panel there is a "Duplicates" link letting you identify and remove duplicates from your database.

Just so you know, this puts the import time up a lot, and also slows down the server more, cos it has to access the database and check for a duplicate in the category. Just letting you know, in case you are on a shared or a low-resource server :P

Cheers

Andy (mod)
andy@ultranerds.co.uk
Want to give me something back for my help? Please see my Amazon Wish List
GLinks ULTRA Package | GLinks ULTRA Package PRO
Links SQL Plugins | Website Design and SEO | UltraNerds | ULTRAGLobals Plugin | Pre-Made Template Sets | FREE GLinks Plugins!
Quote Reply
Re: [Andy] DMOZ Splitter In reply to
Andy;

Good point, and thanks, but no problem there. :)
We would wait for hours if needed - as the functionality is required, not optional!

Hardware-wise, we are fast enough. Fortunately, we are on a dedicated dual-processor rackmount Dell server with SCSI 160 drives running at 10k and 15k rpm.
And we're running FreeBSD, not a Windows variant, and have a pretty recently updated MySQL.

You're on Windows, right? :P
Quote Reply
Re: [Paul] DMOZ Splitter In reply to
Paul;

Sorry, but that's not a good or usable option.

Duplicate removal of a couple thousand links is not practical.

Inserting a couple of thousand links into a different category also is not practical as there is no way of doing more than 10 or so multiple inserts/updates at once (unless I'm missing something). Each does have a checkbox, but checking each one at a time and going from page to page is far too much typing!


We really need a routine or plugin that skips duplicates! Can you make it?
Quote Reply
Re: [webslicer] DMOZ Splitter In reply to
Quote:
Inserting a couple of thousand links into a different category also is not practical as there is no way of doing more than 10 or so multiple inserts/updates at once (unless I'm missing something). Each does have a checkbox, but checking each one at a time and going from page to page is far too much typing!

The category browser lets you move thousands of links at once AFAIK.

I can write a small script to remove duplicate links.
Quote Reply
Re: [Paul] DMOZ Splitter In reply to
Paul;

It really is needed, to make your service usable for updating, so if you can write the script, that would be fantastic. As I just subscribed to your DMOZ slice site, I definitely need it! ;)

Move Links?
Uh, I said "Insert", not "Move".

The category move will not put the links into an existing category. It will only create a sub-category, and you still have to move the links one at a time.

Links should have an insert function or a move-links function. What it has is really a move-category function ;)
Quote Reply
Re: [webslicer] DMOZ Splitter In reply to
I've opened a SourceForge project for DMOZ and parsing the RDF file:

https://sourceforge.net/projects/dmoz-rdf/

I'll be (hopefully) adding scripts to perform these tasks, and other tasks (such as extraction by keyword) as time goes by.

Last edited by Paul: Jun 27, 2003, 1:39 PM
Quote Reply
DMOZ Splitter Duplicate Checker In reply to
Hi, Paul;


The DMOZ split data downloads went well from your site. I have to say that it is very convenient to have all the sub-categories available the way you have done it.

Now, in order for us to use these, I need to ask if you got a chance to code the duplicate import checker. If not, can you work on this?

I hope that you can make the checker avoid/skip similar (using LIKE, I guess) links, as many of our links have an affiliate code attached that would not be in DMOZ.

Ideally, we would get a "review" (overwrite/skip) choice when a duplicate comes up, showing the existing and replacement descriptions, titles and URLs. This would only apply if "overwrite all" or "skip all" is not selected.
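
To make the request concrete, here is a small Python sketch of the behaviour described above. It is only an illustration under stated assumptions: it is not part of Links SQL or nph-import.cgi, the links are plain dicts with url, title and description keys, and url_key and merge_links are hypothetical names. Instead of an SQL LIKE comparison it treats two links as "similar" when they share host and path, ignoring the query string where an affiliate code would usually sit.

```python
from urllib.parse import urlsplit


def url_key(url):
    """Reduce a URL to host + path so that an affiliate query string,
    a leading 'www.' or a trailing slash does not hide a duplicate."""
    parts = urlsplit(url.strip())
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    return host + parts.path.rstrip("/")


def merge_links(existing, incoming, policy="skip"):
    """Merge incoming links into existing ones.

    policy is "skip" (keep the existing link), "overwrite" (take the
    new title and description but keep the existing URL, so an
    affiliate code survives) or "review" (leave the link alone and
    report the pair). Returns (merged_links, pairs_to_review).
    """
    by_key = {url_key(link["url"]): link for link in existing}
    review = []
    for link in incoming:
        key = url_key(link["url"])
        if key not in by_key:
            by_key[key] = link                     # genuinely new link
        elif policy == "overwrite":
            by_key[key] = {**link, "url": by_key[key]["url"]}
        elif policy == "review":
            review.append((by_key[key], link))     # show both to the user
        # policy == "skip": nothing to do, the existing link stays
    return list(by_key.values()), review
```

With policy="review" the returned pairs hold the existing and replacement records side by side, which is what a review screen would need to display.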
Quote Reply
Re: [Paul] DMOZ Splitter In reply to
hello paul,

I get the following error using this script:

Name "main::DATA" used only once: possible typo at c:\programme\apache2\cgi-bin\rdf_parse.pl line 68

Can you help, please?

Best regards from
Bremen/Germany

Lothar
Quote Reply
Re: [webslicer] DMOZ Splitter Duplicate Checker In reply to
^ bump!

... I need the scripts so that I can import your DMOZ slices I subscribed to B-)
Quote Reply
Re: [eljot] DMOZ Splitter In reply to
>>> I get the following error using this script:

Name "main::DATA" used only once: possible typo at c:\programme\apache2\cgi-bin\rdf_parse.pl line 68 <<<

What version of Perl are you using?

Cheers

Andy (mod)
andy@ultranerds.co.uk
Quote Reply
Re: [Andy] DMOZ Splitter In reply to
Hello Andy,

I use ActivePerl under Windows XP.

It's not my web server, it's a local test system.

Best regards from
Bremen/Germany

Lothar