Is it really worth writing your own application? It requires writing to lsql_Links, lsql_CatLinks and lsql_Category. Surely it would be easier to just set cron to run nph-import.cgi? The way I do it is to slice up the original content.rdf.u8 file into 17 smaller files. Then I write a Perl script to execute each command for the appropriate categories. Something like:
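The command script itself is not reproduced in this post. As a rough sketch of the idea only, assuming slices named like the ones the slicing script further down produces, and with a placeholder `echo` standing in for the real import command (the `@slices` list, `build_command` and `run_slices` are hypothetical names, and nothing here reflects nph-import.cgi's actual options), it could look like this:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical slice names; use whatever your own slicing step produced.
my @slices = qw(adult arts business);

# Build the command for one slice.
# ASSUMPTION: 'echo' is a stand-in. Substitute the real import
# invocation for your install here.
sub build_command {
    my ($slice) = @_;
    return ('echo', "importing ./$slice.dump.slice");
}

# Run the commands one after another and return the slices that failed.
sub run_slices {
    my (@names) = @_;
    my @failed;
    foreach my $slice (@names) {
        # system() blocks until the command exits,
        # so only one import is running at any time.
        my $status = system(build_command($slice));
        push @failed, $slice if $status != 0;
    }
    return @failed;
}

my @failed = run_slices(@slices);
print "failed: @failed\n" if @failed;
```

Because each command has to finish before the next one starts, memory use stays at roughly one import's worth rather than the whole dump's.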
That will go through and run each command separately (which is easier on server speed, overall CPU, and especially memory). To slice up the content.rdf.u8 file, I use the following script:
Code:
#!/usr/bin/perl
print "Content-type: text/html \n\n";
# helps us catch nasty errors
use CGI::Carp qw(fatalsToBrowser);
$full = 1; # if only wanting everything bar regional and world...use this!
######################################################
# GET THE DUMP FILE STARTS HERE ######################
######################################################
# get rid of the old file... #
# unlink "content.rdf.u8";
# $main_rdf_start_time = time;
# `wget --no-directories http://dmoz.org/rdf/content.rdf.u8.gz`;
# `gzip -d content.rdf.u8.gz`;
# finished with rdf.u8.gz, so delete now...keep space!
# unlink "content.rdf.u8.gz";
#$main_rdf_end_time = time;
#$main_rdf_total_time = $main_rdf_end_time - $main_rdf_start_time;
# open(MAIL,"|/usr/sbin/sendmail -t") || die &error("Unable to open Sendmail. Reason: $!");
# $webmaster = 'webmaster@ace-installer.com';
# print MAIL "To: $webmaster \n";
# print MAIL "From: $webmaster \n";
# print MAIL "Reply-to: $webmaster \n";
# print MAIL "Subject: RE Dump... \n\n";
# print MAIL "content.rdf.u8.gz has successfully been downloaded and decompressed. Took $main_rdf_total_time\n";
# print MAIL "\n \n Thanks";
# print MAIL "\n";
# print MAIL "A.J.Newby \n";
# print MAIL "Ace Installer \n";
# close(MAIL);
###################################################
### END THE GETTING OF THE MAIN DUMP FILE #########
###################################################
##################################################
### CUT THE DUMP INTO 17 SMALLER CATEGORIES ######
##################################################
$categories = "Top\/Adult::Top\/Arts";
$categories .= "~Top\/Arts::Top\/Business";
$categories .= "~Top\/Business::Top\/Computers";
$categories .= "~Top\/Computers::Top\/Games";
$categories .= "~Top\/Games::Top\/Health";
$categories .= "~Top\/Health::Top\/Home";
$categories .= "~Top\/News::Top\/Recreation";
$categories .= "~Top\/Reference::Top\/Regional";
$categories .= "~Top\/Regional::Top\/Science";
$categories .= "~Top\/Science::Top\/Shopping";
$categories .= "~Top\/Shopping::Top\/Society";
$categories .= "~Top\/Sports::Top\/World";
$categories .= "~Top\/Home::Top\/Kids_and_Teens";
@categories = split("~", $categories);
# now loop through them all....
foreach my $pair (@categories) {
    # a named loop variable, so the while (<DMOZ>) below cannot clobber the list entries via $_
    @aaa = split("::", $pair);
    $start_line = $aaa[0];
    $end_line = $aaa[1];
    $file_save = lc($start_line);
    $file_save =~ s/Top//i;
    # open up the main dmoz dump u8 file
    open(DMOZ, "./content.rdf.u8") || &error("Unable to read dump file. Reason: $!"); # category
    open(CLEAN_DUMP, ">./$file_save.dump.slice");
    print CLEAN_DUMP ""; close(CLEAN_DUMP); # to make the file blank...
    open(DUMP_FILE, ">>./$file_save.dump.slice") or &error("cant do it: $! : ./$file_save.dump.slice"); # open ready for input....
    # start a while..not closed til right near the end...
    $do = 0;
    while (<DMOZ>) {
        # doing the arts category only needs this...then if the line matches the end regex we move on to the next category..
        # check to see when we wanna start; until then nothing is written
        if ($start_line) {
            if ($_ =~ /<Topic r:id=\"$start_line\">/) { $do = 1; }
        }
        if ($_ =~ /<Topic r:id=\"$end_line\">/) { close(DUMP_FILE); &import_done_email($start_line); last; }
        else { if ($do) { print DUMP_FILE $_; } } # $_ already ends with a newline, so don't add another
    } # end the while
    close(DMOZ); # close up the main file...
} # end the foreach
sub import_done_email {
    my $cat = shift;
    open(MAIL,"|/usr/sbin/sendmail -t") || die &error("Unable to open Sendmail. Reason: $!");
    $webmaster = 'webmaster@ace-installer.com';
    print MAIL "To: $webmaster \n";
    print MAIL "From: $webmaster \n";
    print MAIL "Reply-to: $webmaster \n";
    print MAIL "Subject: RE Main $cat Dump... \n\n";
    print MAIL "$cat has now been imported into the SQL database.... \n";
    print MAIL "\n \n Thanks";
    print MAIL "\n";
    print MAIL "A.J.Newby \n";
    print MAIL "Ace Installer \n";
    close(MAIL);
}
# error in case stuff goes wrong...
sub error {
    my ($error) = shift;
    print $error; exit;
}
It's pretty customized for my own use, but it should give the right idea (also pretty old code... but it still slices the whole RDF file into 17 smaller files in about 10-15 mins).
Just a suggestion, cos my idea has been tried and tested, and it seems to work very well :)
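The script above re-reads content.rdf.u8 from the top for every category pair. If that ever becomes the slow part, the same slicing could probably be done in a single pass. This is only an untested-on-the-real-dump sketch, assuming each topic's entries directly follow its `<Topic r:id="Top/...">` line (the same ordering the script above relies on); `slice_dump`, `$in_path` and `$out_dir` are made-up names for illustration:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Split a dump into one slice per top-level category in a single pass.
# Returns the sorted list of slice names that were written.
sub slice_dump {
    my ($in_path, $out_dir) = @_;
    open(my $in, '<', $in_path) or die "Unable to read $in_path: $!";
    my ($out, $current, %opened);
    while (my $line = <$in>) {
        # e.g. <Topic r:id="Top/Arts/Movies"> belongs to the "arts" slice
        if ($line =~ m{<Topic r:id="Top/([^/"]+)}) {
            my $name = lc $1;
            if (!defined $current || $name ne $current) {
                close($out) if $out;
                # Start a fresh file the first time a category is seen,
                # append if it shows up again later in the dump.
                my $mode = $opened{$name}++ ? '>>' : '>';
                open($out, $mode, "$out_dir/$name.dump.slice")
                    or die "Unable to write slice for $name: $!";
                $current = $name;
            }
        }
        # Lines before the first top-level category are skipped.
        print {$out} $line if $out;
    }
    close($out) if $out;
    close($in);
    my @names = sort keys %opened;
    return @names;
}

# Usage: perl slice.pl content.rdf.u8 ./slices
if (@ARGV == 2) {
    print "wrote: $_\n" for slice_dump(@ARGV);
}
```

Unlike the script above, this writes a slice for every top-level category it meets, so skipping Regional or World would need an extra check on `$name`.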