
DMOZ Category Slices

I can download and gunzip the DMOZ RDF file in less than 20 minutes. I've written a (very) small Perl CGI script (run locally for now; it will move to the site's web server once testing is complete) that slices the RDF file into 18 separate files, one per top-level category. Splitting into subcategories is just as easy and will be available on request. The script takes about 5 minutes to run, so the whole job, from start to finish, takes less than half an hour.

I'll be making the top-level category files available in the next week or so (pending completion of other, higher-priority projects). The files will be updated automatically every week. There will be a fee (payable via PayPal), but given the ease of the job it will be nominal, mainly to offset bandwidth costs and to compensate for the time involved. I'll announce where and when the files can be downloaded in a reply to this message.
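For readers who want to picture the slicing step, here is a minimal sketch in Python. It is not the Perl script described in this post, which isn't shown in the thread. It assumes that each category block in the dump starts with a line like `<Topic r:id="Top/Arts/...">` and that everything up to the next such line belongs to that category. The function name `slice_by_top_category` and the `.rdf` output names are made up for illustration.

```python
import re
from pathlib import Path

# Assumption: a category block starts with a line such as
#   <Topic r:id="Top/Arts/Music">
# and the text after "Top/" up to the next "/" or quote is the top category.
TOPIC_RE = re.compile(r'<Topic r:id="Top/([^/"]+)')


def slice_by_top_category(lines, out_dir):
    """Stream lines and append each one to the file of the current top category.

    Lines seen before the first matching <Topic> line (the file header) are
    skipped. Returns the sorted list of top-category names that were written.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    handles = {}    # top-category name -> open output file
    current = None  # file for the category block being read
    try:
        for line in lines:
            match = TOPIC_RE.search(line)
            if match:
                name = match.group(1)
                if name not in handles:
                    handles[name] = open(out_dir / f"{name}.rdf", "w", encoding="utf-8")
                current = handles[name]
            if current is not None:
                current.write(line)
    finally:
        for handle in handles.values():
            handle.close()
    return sorted(handles)
```

Because it reads one line at a time, memory use stays flat no matter how large the dump is. To try it, open the gunzipped dump as a text file and pass the file object as `lines`, along with an output directory. Note that each slice is a raw fragment, not a well-formed RDF document on its own.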



Dan 8-)

----
Cheers,

Dan
Founder and CEO

LionsGate Creative
GoodPassRobot
Magelln

Re: [dan] DMOZ Category Slices
So you're not gonna give us a peek at your code? ;)

I'll show you mine if you show me yours :D

(10 lines)

Last edited by RedRum: Jan 10, 2002, 6:24 PM

Re: [RedRum] DMOZ Category Slices
Sounds like our scripts are more or less identical in terms of algorithm and code. Mine's 15 lines, including several output lines since it runs from the browser, but some simple obfuscation and trimming (e.g., removing a few of the output lines) could whittle it down further. Remarkably simple and fast scripts, eh!

I tried the VEDIT program, but what a nightmare. I've only got 256 MB of RAM (not much; I need to expand to 1 GB), and VEDIT chewed through it really quickly when I loaded the RDF file and tried to navigate it in order to parse it (although, in all fairness, my memory load is usually around 60-65% to begin with).



Cheers - Dan 8-)

Re: [dan] DMOZ Category Slices
Yeah, I tried opening the RDF file in EditPlus and it crashed :)

>>Remarkably simple and fast scripts, eh!<<

Yup... it was easier than I thought. I'm using some slightly altered code for my own script, which extracts 100 files from one big compressed file (not tar, but my own compression O:-) )

It's pretty crap compression, I might add, but I'm proud that it works. I got a 420 KB file down to 400 KB, lol

Re: [RedRum] DMOZ Category Slices
5% compression is not the best (of course, it depends on the type of file you are compressing), but it is a start, and at the very least an excellent and entertaining exercise as you revise and tweak your compression algorithm. Good work!

I was considering selling not the sliced files per se, but rather bandwidth access to them (X dollars per X MB of downloads, via the web or via Telnet using wget). That would be more flexible and less likely to be abused. For example, say it was $10 per 100 MB (a figure pulled out of the air, not a proposed or projected fee schedule). Person A purchases $10, or 100 MB, worth of downloads. Person A could then download one 100 MB file, or ten different 10 MB files, or any combination that adds up to the amount purchased (in this example, 100 MB). Or they could download the same 10 MB file ten times, once a month for ten months, to pick up updates (so there would be no time restriction on a download account). Now the question is how to restrict access on the basis of bandwidth (much as 'free' web hosts do), preferably without crunching server logs.
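One way to avoid log crunching, sketched here only as an idea, is to route every download through a script that debits a prepaid byte balance before it serves the file. The Python below shows just the accounting. The class name `DownloadAccount`, the `try_download` method, and the choice of 1 MB = 1,000,000 bytes are assumptions made for illustration; a real setup would also need persistent storage and a login.

```python
MB = 1_000_000  # assumption: decimal megabytes


class DownloadAccount:
    """Prepaid download balance, tracked in bytes, with no expiry date."""

    def __init__(self, purchased_bytes):
        self.remaining_bytes = purchased_bytes

    def try_download(self, file_size_bytes):
        """Debit the balance and return True if the file fits, else return False."""
        if file_size_bytes <= 0:
            raise ValueError("file size must be positive")
        if file_size_bytes > self.remaining_bytes:
            return False  # not enough credit left; the file is not served
        self.remaining_bytes -= file_size_bytes
        return True


# Person A from the example: 100 MB purchased, spent on ten 10 MB downloads.
account = DownloadAccount(100 * MB)
results = [account.try_download(10 * MB) for _ in range(10)]
eleventh = account.try_download(10 * MB)  # refused: the balance is used up
```

The download script would call `try_download` with the file's size on disk and send the file only when it returns True, so the balance itself is the access control.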

Cheers - Dan 8-)

Re: [dan] DMOZ Category Slices
Update: http://www.monster-submit.com/scripts/dmoz/

Hope to have DMOZ category cuts available by week's end.



Cheers - Dan 8-)
