Gossamer Forum
DMOZ Dump Importer...
I'm currently writing a plugin (which I'm very proud of so far) that will automate the process of doing a DMOZ dump. If you do a couple of DMOZ dumps a week, you will know what a pain in the arse it can be.

Basically, the features are as follows:

+ It downloads the content.rdf.u8.gz file and decompresses it.
+ It slices up the RDF file into smaller pieces.
+ You can run either a complete DMOZ dump (choosing which main categories to include) or just specific categories, and you can set the location of the main RDF content slices.
+ It will email you after each section to let you know how it went. This is optional.
+ Option to clean out the database completely at the start of the dump.
+ Option to import a SQL dump from another server (or a location on your server) and insert it before starting the import process.
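
For anyone curious what the decompress-and-slice steps above might look like, here is a minimal sketch in Python. It is not the plugin's code. It assumes the dump has already been downloaded, and it assumes that each category in the dump is opened by a line like <Topic r:id="Top/Arts/Music">. The names slice_rdf and TOPIC_RE, and the one-file-per-top-level-category layout, are made up for illustration.

```python
import gzip
import re
from pathlib import Path

# Assumption: a line such as <Topic r:id="Top/Arts/Music"> opens each category.
TOPIC_RE = re.compile(r'<Topic r:id="Top/([^/"]+)')


def slice_rdf(gz_path, out_dir):
    """Stream a gzipped RDF dump and write one slice per top-level category."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    handles = {}         # top-level category name -> open output file
    current = "_header"  # bucket for lines seen before the first Topic marker
    try:
        # gzip.open decompresses on the fly, so the whole dump is never held in memory.
        with gzip.open(gz_path, "rt", encoding="utf-8") as src:
            for line in src:
                match = TOPIC_RE.search(line)
                if match:
                    current = match.group(1)
                if current not in handles:
                    handles[current] = open(
                        out_dir / f"{current}.rdf", "w", encoding="utf-8"
                    )
                handles[current].write(line)
    finally:
        for fh in handles.values():
            fh.close()
    return sorted(handles)
```

Calling slice_rdf("content.rdf.u8.gz", "slices") would leave files such as slices/Arts.rdf behind, assuming the marker format above holds. Each slice is a fragment rather than a complete RDF document, so a real importer would still need to cope with that.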

What does everyone think about this? Price range will be about $150, because it's obviously a niche market. Also, I don't want to sell too many copies, or I'll lose a lot of import jobs.

If no-one is interested, that's cool. I just wanted to know if anyone was interested, and if so, I can start writing some documentation and make it a bit more portable.

Cheers

Andy (mod)
andy@ultranerds.co.uk
Want to give me something back for my help? Please see my Amazon Wish List
GLinks ULTRA Package | GLinks ULTRA Package PRO
Links SQL Plugins | Website Design and SEO | UltraNerds | ULTRAGLobals Plugin | Pre-Made Template Sets | FREE GLinks Plugins!
Re: [Andy] DMOZ Dump Importer... In reply to
Andy;

Now THAT's a great idea for a plugin!

$150? Pricewise, I'd buy it at $75-89, doubtful at over $100. That would be too steep. I don't think you will lose business on the import/slicing side if you don't price it high enough! You will gain business because more people will realize that you offer DMOZ import/slice services!

Features:

- There should be an option for full manual approval of each new link.

- Can you make it so that it does an intelligent add to the present Links Database, without requiring a SQL dump?

- For full usability it would check each entry to see if the same or a SIMILAR url is in the database (this is important often) and ask if it should add, update (add in any missing fields like email) or replace while showing Title, Description, category, email, etc.

That way, you could do a quick weekly check to see if there are new listings.
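
To make the add/update/skip idea above concrete, here is a rough Python sketch. It is only an illustration, not how Links SQL or nph-import.cgi works. "Similar" is assumed to mean the same host (ignoring case and a leading "www.") and the same path (ignoring a trailing slash). The names normalize_url and decide and the field names are hypothetical, and the sketch proposes an action rather than prompting for one.

```python
from urllib.parse import urlsplit


def normalize_url(url):
    """Reduce a URL to a comparison key: lower-case host, no 'www.', no trailing slash."""
    parts = urlsplit(url.strip())
    host = (parts.hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    return host + parts.path.rstrip("/")


def decide(new_link, existing):
    """Propose 'add', 'update' or 'skip' for a new link.

    existing maps normalize_url(url) -> dict of the fields already stored.
    """
    stored = existing.get(normalize_url(new_link["url"]))
    if stored is None:
        return "add"
    # Fields the new record supplies that the stored record lacks (e.g. email).
    missing = [name for name, value in new_link.items() if value and not stored.get(name)]
    return "update" if missing else "skip"
```

Because existing is keyed by the normalized URL, each check is a single dictionary lookup rather than a scan of every stored link.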
Re: [webslicer] DMOZ Dump Importer... In reply to
On a slightly related note, I was actually thinking of writing a script to do something similar, but not quite the same.

I'm going to set up a script to download the DMOZ content file weekly and cut it up into all the separate categories, and will then make a publicly accessible domain where people can download any of the categories they want.
Re: [webslicer] DMOZ Dump Importer... In reply to
I suppose $80 would be OK. I'll see at the end what kinda price range it is worth.

>>>- Can you make it so that it does an intelligent add to the present Links Database, without requiring a SQL dump? <<<

I'm only utilising nph-import.cgi, but my plugin allows you to define any number of categories, and it will then do all the required prep stuff to the database, install any existing data you want in it, and more :)

>>>- For full usability it would check each entry to see if the same or a SIMILAR url is in the database (this is important often) and ask if it should add, update (add in any missing fields like email) or replace while showing Title, Description, category, email, etc. <<<

nph-import.cgi checks to see if a link exists in the same category (when defining --nph-update), so this shouldn't be a feature that needs to be added.

I'll keep people updated on the progress.

Cheers

Andy (mod)
Re: [Andy] DMOZ Dump Importer... In reply to
$180? Are you sure? LSQL is only $450..
Re: [xpert] DMOZ Dump Importer... In reply to
>>>$180? Are you sure? LSQL is only $450.. <<<

What's that got to do with the price of fish? GT's PPC plugin is $1500, which is more than 3 times the price of LSQL. It's all about supply and demand. If I make it cheap, I may well make more sales. If I make it expensive, I may make fewer sales but possibly more money. It's a balance between me needing to earn enough money and making it available to the customer at a reasonable rate. I think $80 is well within the ballpark for something like this (considering that a full DMOZ dump can easily cost over $150).

Cheers

Andy (mod)
Re: [Andy] DMOZ Dump Importer... In reply to
Hi, Andy;

But you have to do a SQL dump before each import the way you have it set, correct?

Can you import multiple categories at once?

>nph-import.cgi checks to see if a link exists in the same category (when defining --nph-update), so this shouldn't be a feature that needs to be added.

What does it do if the link exists? I'm saying that when a link exists, it would be very useful to have the choice: add or skip (kinda like FTP).
Re: [webslicer] DMOZ Dump Importer... In reply to
>>>But you have to do a SQL dump before each import the way you have it set, correct? <<<

It's an option. If you don't want to import existing SQL dump data into the database, you can leave it blank and it won't do anything.


>>>Can you import multiple categories at once? <<<

Sure. That's one of the main things this plugin does (see attached image for an example of the GUI).

Cheers

Andy (mod)
Re: [Andy] DMOZ Dump Importer... In reply to
That portion looks good, Andy;

Is there a SQL Dump entry field there?

Again, what does nph-import do if the link exists? I'm saying that when a link exists, it would be very useful to have the choice: add or skip (kinda like FTP).
Re: [Paul] DMOZ Dump Importer... In reply to
Paul;

I've been trying to reach you. Can you check your PMs?
Re: [webslicer] DMOZ Dump Importer... In reply to
Quote:
Is there a SQL Dump entry field there?

Yeah, it got cut off in the screenshot. There is a 'file path' and 'file URL' option (don't need to use either, but one or the other can be chosen)

Quote:
Again, What does nph-import do if the link exists? I'm saying that when a link exists, it would be very useful to have the choice: add or skip (kinda like FTP).

Yeah, that would take a long time to do though :( Imagine you are importing 1,000 links, and each link needs to be checked against every other one. That's about 1,000,000 checks!

Cheers

Andy (mod)
Re: [Andy] DMOZ Dump Importer... In reply to
Andy;

No, don't think so.

Only the DMOZ data is slow to search, because there is no index.

In this case, you would be doing a database lookup or search on the Links SQL side!

Note: A basic binary search of 1,000 sorted items should take at most 10 comparisons per lookup (2 to the 10th is 1,024), not 1,000.
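
The arithmetic can be checked with a small Python sketch. It counts the probes a textbook binary search makes over 1,000 sorted URLs. The URLs are invented test data and binary_search_steps is an illustrative name. It says nothing about how Links SQL or nph-import.cgi actually looks up links.

```python
def binary_search_steps(sorted_items, target):
    """Return (found, steps), where steps counts the items probed."""
    lo, hi = 0, len(sorted_items) - 1
    steps = 0
    while lo <= hi:
        mid = (lo + hi) // 2
        steps += 1
        if sorted_items[mid] == target:
            return True, steps
        if sorted_items[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return False, steps


# Invented data: 1,000 distinct URLs, sorted so binary search is valid.
urls = sorted(f"http://site{i:04d}.example/" for i in range(1000))

# Worst case over every stored URL, and the total cost of 1,000 lookups.
worst = max(binary_search_steps(urls, u)[1] for u in urls)
total = sum(binary_search_steps(urls, u)[1] for u in urls)
```

Here worst is no more than 10, so checking 1,000 incoming links against 1,000 stored ones costs at most 10,000 probes rather than 1,000,000, provided the stored side is sorted or indexed.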
Re: [webslicer] DMOZ Dump Importer... In reply to
Trust me, it would be slow. I have already tried doing something similar by modifying the .pm for nph-import.cgi so that it did a check before importing. It took 3 hours to do a small category of 4,000 links.

The current version is gonna be ready for beta testing within the next day/day-and-a-half. I'm just putting finishing touches to it, fixing a few known bugs, and then writing some documentation.

Cheers

Andy (mod)
Re: [Paul] DMOZ Dump Importer... In reply to
Paul,

I have something like that. It was a modification of a simple script posted here years ago.

It's brain dead, in that it does no error checking, and relies on a known topology of the main categories for its checking.

The part I never got to, and which seems to be the sticking point, is cutting up sub categories, finding subcategories or links via keywords, *before* import, and duplicate checking (but updating descriptions).

I've been toying with something that would search for keywords, and cut out links that matched in some way, but it did not preserve any form of categories, it just cut out the data segments. (a full-text search with extract).

I'm *really* surprised there is no group or open source project for doing this, such as on SourceForge. I check every few months, and with the DMOZ .rdf project getting so cumbersome, it's always surprised me that no project for managing the .rdf dump was ever undertaken.

From a programming point of view, it's a _prime_ project, since it's extremely well defined, it uses AWESOME amounts of time and computing power in the raw form, and only tiny portions are considered "relevant" at any given time. It's a hacker's dream project <G>


PUGDOG Enterprises, Inc.

The best way to contact me is to NOT use Email.
Please leave a PM here.
Re: [pugdog] DMOZ Dump Importer... In reply to
You can see my DMOZ Splitter thread for more details on this. It extracts the category data in its proper hierarchy, so you end up with a gzipped file for each top-level category containing all data for that category, then a gzipped file for each of its sub-categories containing that specific tree, and so on.
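
One way to picture that hierarchy: each category's data ends up in the archive for its own tree and in the archive of every ancestor tree. The Python sketch below just lists those archives for a given category path. The archives_for name, the "Top/" prefix handling and the ".gz" file naming are assumptions for illustration, not the splitter's real layout.

```python
def archives_for(category_path):
    """List the archives (top level first) whose tree contains this category."""
    parts = category_path.strip("/").split("/")
    # Assumption: category paths start with a "Top" root that is not archived itself.
    if parts and parts[0] == "Top":
        parts = parts[1:]
    return ["/".join(parts[:depth]) + ".gz" for depth in range(1, len(parts) + 1)]


# A link filed under Top/Arts/Music/Jazz would be written to three archives.
jazz_archives = archives_for("Top/Arts/Music/Jazz")
```

Writing every record once per level of depth is part of why a full split takes up a lot of disk space.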

I'm actually really pleased with the outcome :)

It's been running about 6 hours today and is nearly done. It uses up about 5GB of disk space.

Here's a capture...

Regional.part is not gzipped as it's currently being parsed....
Re: [pugdog] DMOZ Dump Importer... In reply to
Quote:
I'm *really* surprised there is no group or open source project for doing this, such as on SourceForge.

There is now =)
Re: [Andy] DMOZ Dump Importer... In reply to
I'm looking for one more person to test the DMOZ_Wizard Plugin. Please be aware that you will need high server resources, or a dedicated server, to be able to run this plugin. Please PM/Email me if interested.

Cheers

Andy (mod)