Gossamer Forum

Importing DMOZ into LinksSQL using custom application

I developed my own Python app to parse the DMOZ content RDF file and write all the links to a CSV file.

I'm curious: how do I go about getting this into the correct MySQL tables used by Links SQL?

Note that I'm not talking about how to get the contents of the CSV file into MySQL; that part is a piece of cake, and I've already loaded all 3 million plus links into a table I designed in approximately 4 minutes.

What I need to know is which tables are involved, and whether it's even possible. I really don't have time to pick apart import.cgi, but perhaps that's what I'll have to do.

TIA!

Side note, one tip for the newbies here: if at all possible, never rely on a program issuing INSERT statements to populate a database. Always write to a CSV file and use the database's bulk-load facility to load it; it is several times faster than individual inserts. A Java app I wrote to parse and insert URLs from the DMOZ RDF file maxed out at 10 inserts per second. Do the math: at that rate, 3 million links would take over 3 days, so loading the entire DMOZ in about 4 minutes blows that away.
Re: [takacsj] Importing DMOZ into LinksSQL using custom application In reply to
Is it really worth writing your own application? It requires writing to lsql_Links, lsql_CatLinks and lsql_Category. Surely it would be easier just to set cron to run nph-import.cgi? The way I do it is to slice up the original content.rdf.u8 file into 17 smaller files, then use a Perl script to run the import command for each of the appropriate categories. Something like:

Code:
#!/usr/bin/perl

print "Content-type: text/html\n\n";

# helps us catch nasty errors
use CGI::Carp qw(fatalsToBrowser);

my @cats = qw(Top_Arts Top_Business Top_Computers Top_Games Top_Health Top_News Top_Reference Top_Regional Top_Science Top_Shopping Top_Sports Top_Home);

my $counting = 0;
foreach my $var (@cats) {

    $counting++;

    # derive the real category path and the destination name from the entry
    my ($real, $des) = ($var, $var);
    $real =~ s,_,/,g;
    $des  =~ s,Top,,;
    print qq|Running 'perl ./admin/nph-import.cgi --import=RDF --source=science.dump.slice --destination=./admin/defs --rdf-category="$real" --rdf-add-date="2001-01-01" --rdf-destination="$des" --rdf-update' \n\n|;
    `perl ./admin/nph-import.cgi --import=RDF --source=science.dump.slice --destination=./admin/defs --rdf-category="$real" --rdf-add-date="2001-01-01" --rdf-destination="$des" --rdf-update`;
    print "done $counting\n";

}

That will go through and run each command separately (which keeps the overall CPU and especially memory usage down). To slice up the content.rdf.u8 file, I use the following script:

Code:
#!/usr/bin/perl

print "Content-type: text/html \n\n";

# helps us catch nasty errors
use CGI::Carp qw(fatalsToBrowser);

$full = 1; # if only wanting everything bar regional and world...use this!

######################################################
# GET THE DUMP FILE STARTS HERE ######################
######################################################

# get rid of the old file... #

# unlink "content.rdf.u8";

# $main_rdf_start_time = time;

# `wget --no-directories http://dmoz.org/rdf/content.rdf.u8.gz`;

# `gzip -d content.rdf.u8.gz`; # finished with rdf.u8.gz, so delete now...keep space!

# unlink "content.rdf.u8.gz";

#$main_rdf_end_time = time;

#$main_rdf_total_time = $main_rdf_end_time - $main_rdf_start_time;

# open(MAIL,"|/usr/sbin/sendmail -t") || die &error("Unable to open Sendmail. Reason: $!");
# $webmaster = 'webmaster@ace-installer.com';
# print MAIL "To: $webmaster \n";
# print MAIL "From: $webmaster \n";
# print MAIL "Reply-to: $webmaster \n";
# print MAIL "Subject: RE Dump... \n\n";
# print MAIL "content.rdf.u8.gz has successfully been downloaded and decompressed. Took $main_rdf_total_time\n";
# print MAIL "\n \n Thanks";
# print MAIL "\n";
# print MAIL "A.J.Newby \n";
# print MAIL "Ace Installer \n";
# close(MAIL);

###################################################
### END THE GETTING OF THE MAIN DUMP FILE #########

###################################################

##################################################
### CUT THE DUMP INTO 17 SMALLER CATEGORIES ######
##################################################

$categories = "Top\/Adult::Top\/Arts";
$categories .= "~Top\/Arts::Top\/Business";
$categories .= "~Top\/Business::Top\/Computers";
$categories .= "~Top\/Computers::Top\/Games";
$categories .= "~Top\/Games::Top\/Health";
$categories .= "~Top\/Health::Top\/Home";
$categories .= "~Top\/News::Top\/Recreation";
$categories .= "~Top\/Reference::Top\/Regional";
$categories .= "~Top\/Regional::Top\/Science";
$categories .= "~Top\/Science::Top\/Shopping";
$categories .= "~Top\/Shopping::Top\/Society";
$categories .= "~Top\/Sports::Top\/World";
$categories .= "~Top\/Home::Top\/Kids_and_Teens";

@categories = split("~", $categories); # now loop through them all....

foreach (@categories) {
@aaa = split("::", $_);
$start_line = $aaa[0];
$end_line = $aaa[1];
$file_save = lc($start_line);
$file_save =~ s/Top//i; # open up the main dmoz dump u8 file

# open up the main dmoz dump u8 file
open(DMOZ, "./content.rdf.u8") || &error("Unable to read dump file. Reason: $!");
# ">" truncates the slice file, so it starts out blank
open(DUMP_FILE, ">./$file_save.dump.slice") or &error("cant do it: $! : ./$file_save.dump.slice");

# start a while..not closed til right near the end...
$do = 0;
while (<DMOZ>) {
# doing the arts category only needs this...then if the lines matches the regex we are moved onto the next category..
# check to see when we wanna start, otherwise use next;
if ($start_line) {
if ($_ =~ /<Topic r:id=\"$start_line\">/) { $do = 1; }
}
if ($_ =~ /<Topic r:id=\"$end_line\">/) { close(DUMP_FILE); &import_done_email($start_line); last; }
else { if ($do) { print DUMP_FILE $_; } } # $_ already ends in a newline
} # end the while

close(DMOZ); # close up the main file...

} # end the foreach


sub import_done_email {

my $cat = shift;
open(MAIL,"|/usr/sbin/sendmail -t") || die &error("Unable to open Sendmail. Reason: $!");
$webmaster = 'webmaster@ace-installer.com';
print MAIL "To: $webmaster \n";
print MAIL "From: $webmaster \n";
print MAIL "Reply-to: $webmaster \n";
print MAIL "Subject: RE Main $cat Dump... \n\n";
print MAIL "$cat has now been imported into the SQL database.... \n";
print MAIL "\n \n Thanks";
print MAIL "\n";
print MAIL "A.J.Newby \n";
print MAIL "Ace Installer \n";
close(MAIL);
}


# error sub in case stuff goes wrong...
sub error {
my ($error) = shift;
print $error; exit;
}

It's pretty customized for my own use, but it should give you the right idea (it's also pretty old code, but it still slices the whole RDF file into 17 smaller files in about 10-15 minutes).
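Since takacsj mentioned possibly redoing this in Python, here's a rough Python equivalent of the slicing loop above. The topic markers follow the `<Topic r:id="...">` format the Perl regex matches; treat this as a sketch rather than a drop-in, and note the category and file names in the usage comment are just illustrative:

```python
# Sketch: slice a DMOZ-style RDF dump between two <Topic r:id="..."> markers,
# mirroring the Perl while-loop above.
def slice_topics(lines, start_topic, end_topic):
    """Yield lines from the first <Topic r:id="start_topic"> marker up to
    (but not including) the <Topic r:id="end_topic"> marker."""
    copying = False
    for line in lines:
        if f'<Topic r:id="{end_topic}">' in line:
            break                  # hit the next section: stop copying
        if f'<Topic r:id="{start_topic}">' in line:
            copying = True         # found the section we want
        if copying:
            yield line

# Usage (file names are illustrative):
# with open("content.rdf.u8") as src, open("arts.dump.slice", "w") as out:
#     out.writelines(slice_topics(src, "Top/Arts", "Top/Business"))
```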

Just a suggestion, cos my idea has been tried and tested, and it seems to work very well :)

Cheers

Andy (mod)
andy@ultranerds.co.uk
Want to give me something back for my help? Please see my Amazon Wish List
GLinks ULTRA Package | GLinks ULTRA Package PRO
Links SQL Plugins | Website Design and SEO | UltraNerds | ULTRAGLobals Plugin | Pre-Made Template Sets | FREE GLinks Plugins!
Re: [Andy] Importing DMOZ into LinksSQL using custom application In reply to
Andy,

Thanks a bunch for sharing your code. By the way, your Perl looks too perfect to be Perl! ;)

I'll give this more thought and see what I decide. If I rewrite your code in Python, I'll post it back here for the benefit of the community.

John

One question: if I insert the appropriate data into the tables you mention, does that take care of everything? In other words, does nph-import.cgi do any processing beyond simply inserting data into those tables? That is, if I insert the data into those tables with my Python app and then look at my Links SQL index page, will I see all the categories, and be able to drill down to all the subcategories, URLs, etc.?
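For context, the kind of direct insertion I have in mind looks roughly like this (sqlite3 stands in for MySQL just to demo it; the lsql_* column names are my guesses at the schema, not checked against an actual install, so verify against your own Links.def/Category.def):

```python
# Sketch of the direct-insert idea: one row in lsql_Links, one in
# lsql_Category, and a lsql_CatLinks row tying them together.
# Table/column names here are assumptions about the Links SQL schema.
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE lsql_Links    (ID INTEGER PRIMARY KEY, Title TEXT, URL TEXT, Description TEXT);
CREATE TABLE lsql_Category (ID INTEGER PRIMARY KEY, Full_Name TEXT, Name TEXT, FatherID INTEGER);
CREATE TABLE lsql_CatLinks (LinkID INTEGER, CategoryID INTEGER);
""")

def add_link(cur, title, url, desc, cat_id):
    """Insert one link and map it to a category via lsql_CatLinks."""
    cur.execute("INSERT INTO lsql_Links (Title, URL, Description) VALUES (?, ?, ?)",
                (title, url, desc))
    link_id = cur.lastrowid
    cur.execute("INSERT INTO lsql_CatLinks (LinkID, CategoryID) VALUES (?, ?)",
                (link_id, cat_id))
    return link_id

cur.execute("INSERT INTO lsql_Category (Full_Name, Name, FatherID) VALUES (?, ?, ?)",
            ("Arts/Music", "Music", 0))
cat_id = cur.lastrowid
add_link(cur, "Example Site", "http://example.com", "A demo entry", cat_id)
```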

I recall I had a problem getting the number of links to display when I attempted to do this about a year ago. But that was with nph-import.cgi. Took forever too.

Last edited by:

takacsj: May 8, 2003, 10:15 AM
Re: [takacsj] Importing DMOZ into LinksSQL using custom application In reply to
Quote:
By the way, your perl looks too perfect to be perl!

:) ... you gotta be joking, right? (it's about a year old)

Quote:
I recall I had a problem getting the number of links to display when I attempted to do this about a year ago. But that was with nph-import.cgi. Took forever too.

That will be because you need to do a 'repair tables' after the import. That should redo all your category counts.
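In other words, after a bulk load the per-category totals are stale until something recounts them. A rough sketch of that recount (sqlite3 standing in for MySQL; `Number_of_Links` and the table layout are assumptions about the schema, so check your own install):

```python
# Sketch: recompute per-category link counts after a bulk import,
# which is roughly what the post-import repair step achieves.
# Table/column names are assumptions about the Links SQL schema.
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE lsql_Category (ID INTEGER PRIMARY KEY, Name TEXT, Number_of_Links INTEGER DEFAULT 0);
CREATE TABLE lsql_CatLinks (LinkID INTEGER, CategoryID INTEGER);
INSERT INTO lsql_Category (ID, Name) VALUES (1, 'Arts'), (2, 'Science');
INSERT INTO lsql_CatLinks VALUES (10, 1), (11, 1), (12, 2);
""")

# Recompute every category's count from the link/category mapping table.
cur.execute("""
UPDATE lsql_Category
SET Number_of_Links = (SELECT COUNT(*) FROM lsql_CatLinks
                       WHERE CategoryID = lsql_Category.ID)
""")
```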

Andy (mod)