Gossamer Forum

Re: [Andy] DMOZ import question

I downloaded the copy from my server. Here it is:

#!/usr/bin/perl

print "Content-type: text/html \n\n";

# helps us catch nasty errors
use CGI::Carp qw(fatalsToBrowser);

$full = 1; # if only wanting everything bar regional and world...use this!

######################################################
# GET THE DUMP FILE STARTS HERE #####################
######################################################

# get rid of the old file... #

# unlink "content.rdf.u8";

# $main_rdf_start_time = time;

# `wget --no-directories http://dmoz.org/rdf/content.rdf.u8.gz`;

# `gzip -d content.rdf.u8.gz`;

# finished with rdf.u8.gz, so delete now...keep space!
# unlink "content.rdf.u8.gz";

#$main_rdf_end_time = time;

#$main_rdf_total_time = $main_rdf_end_time - $main_rdf_start_time;

# open(MAIL,"|/usr/sbin/sendmail -t") || die &error("Unable to open Sendmail. Reason: $!");
# $webmaster = 'webmaster@assistantdirectors.com';
# print MAIL "To: $webmaster \n";
# print MAIL "From: $webmaster \n";
# print MAIL "Reply-to: $webmaster \n";
# print MAIL "Subject: RE Dump... \n\n";
# print MAIL "content.rdf.u8.gz has successfully been downloaded and decompressed. Took $main_rdf_total_time\n";
# print MAIL "\n \n Thanks";
# print MAIL "\n";
# print MAIL "A.J.Newby \n";
# print MAIL "Ace Installer \n";
# close(MAIL);

###################################################
### END THE GETTING OF THE MAIN DUMP FILE #########
###################################################

##################################################
### CUT THE DUMP INTO 17 SMALLER CATEGORIES ######
##################################################

$categories = "Top\/Adult::Top\/Arts";
$categories .= "~Top\/Arts::Top\/Business";
$categories .= "~Top\/Business::Top\/Computers";
$categories .= "~Top\/Computers::Top\/Games";
$categories .= "~Top\/Games::Top\/Health";
$categories .= "~Top\/Health::Top\/Home";
$categories .= "~Top\/News::Top\/Recreation";
$categories .= "~Top\/Reference::Top\/Regional";
$categories .= "~Top\/Regional::Top\/Science";
$categories .= "~Top\/Science::Top\/Shopping";
$categories .= "~Top\/Shopping::Top\/Society";
$categories .= "~Top\/Sports::Top\/World";
$categories .= "~Top\/Home::Top\/Kids_and_Teens";

@categories = split("~", $categories); # now loop through them all....

foreach (@categories) {
    @aaa = split("::", $_);
    $start_line = $aaa[0];
    $end_line = $aaa[1];
    $file_save = lc($start_line);
    $file_save =~ s/Top//i;

    # open up the main dmoz dump u8 file
    open(DMOZ, "./content.rdf.u8") || &error("Unable to read dump file. Reason: $!");

    # category
    open(CLEAN_DUMP, ">./$file_save.dump.slice");
    print CLEAN_DUMP ""; close(CLEAN_DUMP); # to make the file blank...

    open(DUMP_FILE, ">>./$file_save.dump.slice") or &error("cant do it: $! : ./$file_save.dump.slice"); # open ready for input....

    # start a while..not closed til right near the end...
    $do = 0;
    while (<DMOZ>) {
        # doing the arts category only needs this...then if the line matches the regex we move on to the next category..
        # check to see when we wanna start, otherwise use next;
        if ($start_line) {
            if ($_ =~ /<Topic r:id=\"$start_line\">/) { $do = 1; }
        }
        if ($_ =~ /<Topic r:id=\"$end_line\">/) { close(DUMP_FILE); &import_done_email($start_line); last; }
        # $_ still has its own newline, so each line written is followed by a blank one
        else { if ($do) { print DUMP_FILE "$_\n"; } }
    } # end the while

    close(DMOZ); # close up the main file...

} # end the foreach


sub import_done_email {

    my $cat = shift;
    open(MAIL,"|/usr/sbin/sendmail -t") || die &error("Unable to open Sendmail. Reason: $!");
    $webmaster = 'webmaster@assistantdirectors.com';
    print MAIL "To: $webmaster \n";
    print MAIL "From: $webmaster \n";
    print MAIL "Reply-to: $webmaster \n";
    print MAIL "Subject: RE Main $cat Dump... \n\n";
    print MAIL "$cat has now been imported into the SQL database.... \n";
    print MAIL "\n \n Thanks";
    print MAIL "\n";
    print MAIL "A.J.Newby \n";
    print MAIL "Ace Installer \n";
    close(MAIL);
}


# error in case stuff goes wrong...
sub error {
    my ($error) = shift;
    print $error; exit;
}
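
For anyone adapting this, the slicing step on its own could probably be written along the lines below. This is only a rough sketch, not the script as it runs on my server: the `slice_dump` name and its four arguments are made up for illustration, and it assumes the same `<Topic r:id="...">` markers the loop above looks for. It differs from the script in three ways: it uses `strict`, `warnings` and lexical filehandles, it quotes the topic ids with `\Q...\E` before putting them in the regex, and it writes each line unchanged rather than appending an extra newline.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the original script).
# Copies the lines from <Topic r:id="$start"> up to, but not including,
# <Topic r:id="$end"> out of $dump_path into $out_path.
# Returns the number of lines written.
sub slice_dump {
    my ($dump_path, $start, $end, $out_path) = @_;

    open(my $in,  '<', $dump_path) or die "Unable to read $dump_path: $!";
    open(my $out, '>', $out_path)  or die "Unable to write $out_path: $!";

    my ($copying, $written) = (0, 0);
    while (my $line = <$in>) {
        # \Q...\E quotes any regex metacharacters in the topic ids
        last if $line =~ /<Topic r:id="\Q$end\E">/;
        $copying = 1 if $line =~ /<Topic r:id="\Q$start\E">/;
        next unless $copying;
        print {$out} $line;    # $line already ends in a newline
        $written++;
    }

    close($out) or die "Unable to close $out_path: $!";
    close($in);
    return $written;
}
```

Called as `slice_dump('./content.rdf.u8', 'Top/Arts', 'Top/Business', './arts.dump.slice')`, it should produce roughly what the Arts pass of the loop produces, minus the blank lines. It still re-reads the dump from the top for every category, same as the original.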