Gossamer Forum

DMOZ import question

For my directory, I would like to have only the subcategories of DMOZ's Top/Arts, Top/Health, Top/Recreation and Top/Society. Is there an easier way to import these than importing the entire database and then trimming the unwanted categories?

Thanks in advance.
Re: [lennie] DMOZ import question
Yep, cut the content file into chunks and import only the chunks you need.
Re: [Paul] DMOZ import question
Cool, I have a Perl script that Andy wrote (http://www.gossamer-threads.com/...=dmoz%20perl;#242443) to cut the content file into 17 pieces. How do I combine the data? Doesn't each import erase the prior one?
Re: [lennie] DMOZ import question
Add --rdf-update to the command and it won't overwrite.
Re: [Paul] DMOZ import question
Thanks, I will give it a try.
Re: [lennie] DMOZ import question
This is what I use:

perl ./admin/nph-import.cgi --import=RDF --source=science.dump.slice --destination=./admin/defs --rdf-category="Top/Science" --rdf-add-date="2001-01-01" --rdf-destination="Science" --rdf-update

Cheers

Andy (mod)
andy@ultranerds.co.uk
Want to give me something back for my help? Please see my Amazon Wish List
GLinks ULTRA Package | GLinks ULTRA Package PRO
Links SQL Plugins | Website Design and SEO | UltraNerds | ULTRAGLobals Plugin | Pre-Made Template Sets | FREE GLinks Plugins!
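
For several slices, the same command can be repeated once per category. The following is only a rough sketch: it reuses the flags from the command above unchanged, and it assumes that each slice is named after its lowercased category (arts.dump.slice and so on) and that the destination category carries the same name as the DMOZ one. The function name build_import_cmd and the category list are illustrative, not part of Gossamer Links.

```shell
#!/bin/sh
# Sketch: print one import command per slice so the lines can be reviewed
# before anything touches the database.
# Assumptions: slices are named <lowercased category>.dump.slice, and the
# destination category has the same name as the DMOZ category.
build_import_cmd() {
    cat_name=$1    # e.g. "Arts"
    slice=$(printf '%s' "$cat_name" | tr '[:upper:]' '[:lower:]').dump.slice
    printf 'perl ./admin/nph-import.cgi --import=RDF --source=%s --destination=./admin/defs --rdf-category="Top/%s" --rdf-add-date="2001-01-01" --rdf-destination="%s" --rdf-update\n' \
        "$slice" "$cat_name" "$cat_name"
}

# The four categories asked about at the top of the thread.
for cat_name in Arts Health Recreation Society; do
    build_import_cmd "$cat_name"
done
```

Once the printed lines look right, they can be pasted into the SSH session one at a time, so that a failed import is noticed before the next one starts.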
Re: [Andy] DMOZ import question
Andy,
I am trying to get your Perl script to run to slice the content.rdf.u8 file. The content file is already on my server, unzipped. Help please. . .
Re: [lennie] DMOZ import question
How are you running it, and what errors are you getting?

Cheers

Andy (mod)
Re: [Andy] DMOZ import question
I named it dmoz.cgi and just tried running it through my browser. I haven't tried to run it via telnet. When running it through the browser I get an internal server error (the file is chmod 755). I also tried removing the part of the script that has the server get and unzip the file, since it is already on my server (saved as dmoz2.cgi).

Thanks in advance!
Re: [lennie] DMOZ import question
You *have* to run it via Telnet/SSH. I purposely didn't put a 'header' in it, because grabbing a 900 MB file (content.rdf.u8.gz) is obviously going to make the script time out in a browser.

Cheers

Andy (mod)
Re: [Andy] DMOZ import question
Andy,

I already have the content.rdf.u8 file unzipped and on my server.

This is the command I tried to run: perl /home/virtual/site2/fst/var/www/cgi-bin/dmoz.cgi

And these are the error messages: Backslash found where operator expected at /home/virtual/site2/fst/var/www/cgi-bin/dmoz.cgi line 4, near "html \"
(Do you need to predeclare html?)
Backslash found where operator expected at /home/virtual/site2/fst/var/www/cgi-bin/dmoz.cgi line 4, near "n\"
Operator or semicolon missing before &quot at /home/virtual/site2/fst/var/www/cgi-bin/dmoz.cgi line 4.
Ambiguous use of & resolved as operator & at /home/virtual/site2/fst/var/www/cgi-bin/dmoz.cgi line 4.
Scalar found where operator expected at /home/virtual/site2/fst/var/www/cgi-bin/dmoz.cgi line 8, near "$full"
(Missing semicolon on previous line?)
Semicolon seems to be missing at /home/virtual/site2/fst/var/www/cgi-bin/dmoz.cgi line 31.
Precedence problem: open Sendmail should be open(Sendmail) at /home/virtual/site2/fst/var/www/cgi-bin/dmoz.cgi line 32.
Operator or semicolon missing before &quot at /home/virtual/site2/fst/var/www/cgi-bin/dmoz.cgi line 32.
Ambiguous use of & resolved as operator & at /home/virtual/site2/fst/var/www/cgi-bin/dmoz.cgi line 32.
syntax error at /home/virtual/site2/fst/var/www/cgi-bin/dmoz.cgi line 3, near "br>"
syntax error at /home/virtual/site2/fst/var/www/cgi-bin/dmoz.cgi line 4, near "type:"
syntax error at /home/virtual/site2/fst/var/www/cgi-bin/dmoz.cgi line 5, near "br>"
syntax error at /home/virtual/site2/fst/var/www/cgi-bin/dmoz.cgi line 13, near "br>"

Not sure what it all means. . .

Last edited by lennie: Jun 3, 2003, 9:53 AM
Re: [lennie] DMOZ import question
Could you upload the file you are using? There shouldn't be any errors in it.

Cheers

Andy (mod)
Re: [Andy] DMOZ import question
Andy,

I checked my script and it had errors when I saved it (&quot; instead of ", oops). So I re-uploaded it without the errors and ran it again.

This command: perl /home/virtual/site2/fst/var/www/cgi-bin/dmoz.cgi
Here are the new errors: Backslash found where operator expected at /home/virtual/site2/fst/var/www/cgi-bin/dmoz.cgi line 4, near "html \"
(Do you need to predeclare html?)
Backslash found where operator expected at /home/virtual/site2/fst/var/www/cgi-bin/dmoz.cgi line 4, near "n\"
Operator or semicolon missing before &quot at /home/virtual/site2/fst/var/www/cgi-bin/dmoz.cgi line 4.
Ambiguous use of & resolved as operator & at /home/virtual/site2/fst/var/www/cgi-bin/dmoz.cgi line 4.
Scalar found where operator expected at /home/virtual/site2/fst/var/www/cgi-bin/dmoz.cgi line 8, near "$full"
(Missing semicolon on previous line?)
Semicolon seems to be missing at /home/virtual/site2/fst/var/www/cgi-bin/dmoz.cgi line 31.
Precedence problem: open Sendmail should be open(Sendmail) at /home/virtual/site2/fst/var/www/cgi-bin/dmoz.cgi line 32.
Operator or semicolon missing before &quot at /home/virtual/site2/fst/var/www/cgi-bin/dmoz.cgi line 32.
Ambiguous use of & resolved as operator & at /home/virtual/site2/fst/var/www/cgi-bin/dmoz.cgi line 32.
syntax error at /home/virtual/site2/fst/var/www/cgi-bin/dmoz.cgi line 3, near "br>"
syntax error at /home/virtual/site2/fst/var/www/cgi-bin/dmoz.cgi line 4, near "type:"
syntax error at /home/virtual/site2/fst/var/www/cgi-bin/dmoz.cgi line 5, near "br>"
syntax error at /home/virtual/site2/fst/var/www/cgi-bin/dmoz.cgi line 13, near "br>"
Execution of /home/virtual/site2/fst/var/www/cgi-bin/dmoz.cgi aborted due to compilation errors.

Here is the script:

#!/usr/bin/perl

print "Content-type: text/html \n\n";

# helps us catch nasty errors use CGI::Carp qw(fatalsToBrowser);

$full = 1; # if only wanting everything bar regional and world...use this!

######################################################
# GET THE DUMP FILE STYARTS HERE #####################
######################################################

# get rid of the old file... #

# unlink "content.rdf.u8";

# $main_rdf_start_time = time;

# `wget --no-directories http://dmoz.org/rdf/content.rdf.u8.gz`;

# `gzip -d content.rdf.u8.gz`; # finished with raf.u8.gz, so delete now...keep space!

# unlink "content.rdf.u8.gz";

#$main_rdf_end_time = time;

#$main_rdf_total_time = $main_rdf_end_time - $main_rdf_start_time;

# open(MAIL,"|/usr/sbin/sendmail -t") || die &error("Unable to open Sendmail. Reason: $!");
# $webmaster = 'webmaster@assistantdirectors.com';
# print MAIL "To: $webmaster \n";
# print MAIL "From: $webmaster \n";
# print MAIL "Reply-to: $webmaster \n";
# print MAIL "Subject: RE Dump... \n\n";
# print MAIL "content.rdf.u8.gz has successfully been downloaded and decompressed. Took $main_rdf_total_time\n";
# print MAIL "\n \n Thanks";
# print MAIL "\n";
# print MAIL "A.J.Newby \n";
# print MAIL "Ace Installer \n";
# close(MAIL);

###################################################
### END THE GETTING OF THE MAIN DUMP FILE #########

###################################################

##################################################
### CUT THE DUMP INTO 17 SMALLER CATEGORIES ######
##################################################

$categories = "Top\/Adult::Top\/Arts";
$categories .= "~Top\/Arts::Top\/Business";
$categories .= "~Top\/Business::Top\/Computers";
$categories .= "~Top\/Computers::Top\/Games";
$categories .= "~Top\/Games::Top\/Health";
$categories .= "~Top\/Health::Top\/Home";
$categories .= "~Top\/News::Top\/Recreation";
$categories .= "~Top\/Reference::Top\/Regional";
$categories .= "~Top\/Regional::Top\/Science";
$categories .= "~Top\/Science::Top\/Shopping";
$categories .= "~Top\/Shopping::Top\/Society";
$categories .= "~Top\/Sports::Top\/World";
$categories .= "~Top\/Home::Top\/Kids_and_Teens";

@categories = split("~", $categories); # now loop through them all....

foreach (@categories) {
@aaa = split("::", $_);
$start_line = $aaa[0];
$end_line = $aaa[1];
$file_save = lc($start_line);
$file_save =~ s/Top//i; # open up the main dmoz dump u8 file

open(DMOZ, "./content.rdf.u8") || &error("Unable to read dump file. Reason: $!"); # category
open(CLEAN_DUMP, ">./$file_save.dump.slice");
print CLEAN_DUMP ""; close(CLEAN_DUMP); # to make the file blank...
open(DUMP_FILE, ">>./$file_save.dump.slice") or &error("cant do it: $! : ./$file_save.dump.slice"); # open ready for input....

# start a while..not closed til right near the end...
$do = 0;
while (<DMOZ>) {
# doing the arts category only needs this...then if the lines matches the regex we are moved onto the next category..
# check to see when we wanna start, otherwise use next;
if ($start_line) {
if ($_ =~ /<Topic r:id=\"$start_line\">/) { $do = 1; }
}
if ($_ =~ /<Topic r:id=\"$end_line\">/) { close(DUMP_FILE); &import_done_email($start_line); last; }
else { if ($do) { print DUMP_FILE "$_\n"; } }
} # end the while

close(DMOZ); # close up the main file...

} # end the foreach


sub import_done_email {

my $cat = shift;
open(MAIL,"|/usr/sbin/sendmail -t") || die &error("Unable to open Sendmail. Reason: $!");
$webmaster = 'webmaster@assistantdirectors.com';
print MAIL "To: $webmaster \n";
print MAIL "From: $webmaster \n";
print MAIL "Reply-to: $webmaster \n";
print MAIL "Subject: RE Main $cat Dump... \n\n";
print MAIL "$cat has now been inported into the SQL database.... \n";
print MAIL "\n \n Thanks";
print MAIL "\n";
print MAIL "A.J.Newby \n";
print MAIL "Ace Installer \n";
close(MAIL);
}


# error incase stuff goes wrong...
sub error {
my ($error) = shift;
print $error; exit;
}

Could the problem be that I already have the unzipped content.rdf.u8 on my server?

Thanks

Lennie
Re: [lennie] DMOZ import question
Are you sure you uploaded it OK? I don't see any reason for the script to fail with the errors you pasted.

Cheers

Andy (mod)
Re: [Andy] DMOZ import question
I downloaded the copy from my server. Here it is:

<br>
#!/usr/bin/perl <br>
<br>
print &quot;Content-type: text/html \n\n&quot;; <br>
<br>
# helps us catch nasty errors use CGI::Carp qw(fatalsToBrowser); <br>
<br>
$full = 1; # if only wanting everything bar regional and world...use this! <br>
<br>
###################################################### <br>
# GET THE DUMP FILE STYARTS HERE ##################### <br>
###################################################### <br>
<br>
# get rid of the old file... # <br>
<br>
# unlink &quot;content.rdf.u8&quot;; <br>
<br>
# $main_rdf_start_time = time; <br>
<br>
# `wget --no-directories http://dmoz.org/rdf/content.rdf.u8.gz`; <br>
<br>
# `gzip -d content.rdf.u8.gz`; # finished with raf.u8.gz, so delete now...keep
space! <br>
<br>
# unlink &quot;content.rdf.u8.gz&quot;; <br>
<br>
#$main_rdf_end_time = time; <br>
<br>
#$main_rdf_total_time = $main_rdf_end_time - $main_rdf_start_time; <br>
<br>
# open(MAIL,&quot;|/usr/sbin/sendmail -t&quot;) || die &amp;error(&quot;Unable
to open Sendmail. Reason: $!&quot;); <br>
# $webmaster = 'webmaster@assistantdirectors.com'; <br>
# print MAIL &quot;To: $webmaster \n&quot;; <br>
# print MAIL &quot;From: $webmaster \n&quot;; <br>
# print MAIL &quot;Reply-to: $webmaster \n&quot;; <br>
# print MAIL &quot;Subject: RE Dump... \n\n&quot;; <br>
# print MAIL &quot;content.rdf.u8.gz has successfully been downloaded and decompressed.
Took $main_rdf_total_time\n&quot;; <br>
# print MAIL &quot;\n \n Thanks&quot;; <br>
# print MAIL &quot;\n&quot;; <br>
# print MAIL &quot;A.J.Newby \n&quot;; <br>
# print MAIL &quot;Ace Installer \n&quot;; <br>
# close(MAIL); <br>
<br>
################################################### <br>
### END THE GETTING OF THE MAIN DUMP FILE ######### <br>
<br>
################################################### <br>
<br>
################################################## <br>
### CUT THE DUMP INTO 17 SMALLER CATEGORIES ###### <br>
################################################## <br>
<br>
$categories = &quot;Top\/Adult::Top\/Arts&quot;; <br>
$categories .= &quot;~Top\/Arts::Top\/Business&quot;; <br>
$categories .= &quot;~Top\/Business::Top\/Computers&quot;; <br>
$categories .= &quot;~Top\/Computers::Top\/Games&quot;; <br>
$categories .= &quot;~Top\/Games::Top\/Health&quot;; <br>
$categories .= &quot;~Top\/Health::Top\/Home&quot;; <br>
$categories .= &quot;~Top\/News::Top\/Recreation&quot;; <br>
$categories .= &quot;~Top\/Reference::Top\/Regional&quot;; <br>
$categories .= &quot;~Top\/Regional::Top\/Science&quot;; <br>
$categories .= &quot;~Top\/Science::Top\/Shopping&quot;; <br>
$categories .= &quot;~Top\/Shopping::Top\/Society&quot;; <br>
$categories .= &quot;~Top\/Sports::Top\/World&quot;; <br>
$categories .= &quot;~Top\/Home::Top\/Kids_and_Teens&quot;; <br>
<br>
@categories = split(&quot;~&quot;, $categories); # now loop through them all....
<br>
<br>
foreach (@categories) { <br>
@aaa = split(&quot;::&quot;, $_); <br>
$start_line = $aaa[0]; <br>
$end_line = $aaa[1]; <br>
$file_save = lc($start_line); <br>
$file_save =~ s/Top//i; # open up the main dmoz dump u8 file <br>
<br>
open(DMOZ, &quot;./content.rdf.u8&quot;) || &amp;error(&quot;Unable to read dump
file. Reason: $!&quot;); # category <br>
open(CLEAN_DUMP, &quot;&gt;./$file_save.dump.slice&quot;); <br>
print CLEAN_DUMP &quot;&quot;; close(CLEAN_DUMP); # to make the file blank...
<br>
open(DUMP_FILE, &quot;&gt;&gt;./$file_save.dump.slice&quot;) or &amp;error(&quot;cant
do it: $! : ./$file_save.dump.slice&quot;); # open ready for input.... <br>
<br>
# start a while..not closed til right near the end... <br>
$do = 0; <br>
while (&lt;DMOZ&gt;) { <br>
# doing the arts category only needs this...then if the lines matches the regex
we are moved onto the next category.. <br>
# check to see when we wanna start, otherwise use next; <br>
if ($start_line) { <br>
if ($_ =~ /&lt;Topic r:id=\&quot;$start_line\&quot;&gt;/) { $do = 1; } <br>
} <br>
if ($_ =~ /&lt;Topic r:id=\&quot;$end_line\&quot;&gt;/) { close(DUMP_FILE); &amp;import_done_email($start_line);
last; } <br>
else { if ($do) { print DUMP_FILE &quot;$_\n&quot;; } } <br>
} # end the while <br>
<br>
close(DMOZ); # close up the main file... <br>
<br>
} # end the foreach <br>
<br>
<br>
sub import_done_email { <br>
<br>
my $cat = shift; <br>
open(MAIL,&quot;|/usr/sbin/sendmail -t&quot;) || die &amp;error(&quot;Unable to
open Sendmail. Reason: $!&quot;); <br>
$webmaster = 'webmaster@assistantdirectors'; <br>
print MAIL &quot;To: $webmaster \n&quot;; <br>
print MAIL &quot;From: $webmaster \n&quot;; <br>
print MAIL &quot;Reply-to: $webmaster \n&quot;; <br>
print MAIL &quot;Subject: RE Main $cat Dump... \n\n&quot;; <br>
print MAIL &quot;$cat has now been inported into the SQL database.... \n&quot;;
<br>
print MAIL &quot;\n \n Thanks&quot;; <br>
print MAIL &quot;\n&quot;; <br>
print MAIL &quot;A.J.Newby \n&quot;; <br>
print MAIL &quot;Ace Installer \n&quot;; <br>
close(MAIL); <br>
} <br>
<br>
<br>
# error incase stuff goes wrong... <br>
sub error { <br>
my ($error) = shift; <br>
print $error; exit; <br>
}
Re: [lennie] DMOZ import question
Well, that will be why: it has loads of HTML tags etc. in it. Use the attached one.

Cheers

Andy (mod)
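
If the clean attachment is not to hand, the damage visible in the downloaded copy (a trailing <br> on each line, plus &quot;, &lt;, &gt; and &amp; in place of the real characters) can be undone mechanically. This is a minimal sketch, assuming those are the only changes. It does not rejoin lines that were wrapped in the middle of a comment or string, so re-uploading the original as plain text remains the safer route. The function name clean_script and the output file name are hypothetical.

```shell
#!/bin/sh
# Sketch: strip the trailing <br> tags and decode the four HTML entities
# seen in the mangled copy. Reads stdin, writes stdout.
# &amp; is decoded last so that "&amp;quot;" is not decoded twice.
clean_script() {
    sed -e 's/ *<br> *$//' \
        -e 's/&quot;/"/g' \
        -e 's/&lt;/</g' \
        -e 's/&gt;/>/g' \
        -e 's/&amp;/\&/g'
}

# Example use (file names are placeholders):
#   clean_script < dmoz.cgi > dmoz.clean.cgi
#   perl -c dmoz.clean.cgi    # syntax check only, runs nothing
```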
Re: [Andy] DMOZ import question
Yeah, I saw that too. Cleaned it up and reloaded and now I am here.

Command: perl /home/virtual/site2/fst/var/www/cgi-bin/dmoz.cgi
response: Content-type: text/html

Unable to read dump file. Reason: No such file or directory

thanks for all your help!
Re: [lennie] DMOZ import question
Have you 'unzipped' the content.rdf.u8.gz file yet? If not, type:

gzip -d content.rdf.u8.gz

That should create a file called content.rdf.u8 in the current directory. Also, when running the script, make sure it is in the same directory as the content.rdf.u8 file.

Cheers

Andy (mod)
Re: [Andy] DMOZ import question
It is unzipped, and I have the dmoz.cgi file in the same folder as content.rdf.u8.

Lennie
Re: [lennie] DMOZ import question
What happens if you remove all the instances of ./ in the 'open' commands?

Cheers

Andy (mod)
Re: [Andy] DMOZ import question
I removed the "./" instances from here:

open(DMOZ, "./content.rdf.u8") || &error("Unable to read dump file. Reason: $!"); # category
open(CLEAN_DUMP, ">./$file_save.dump.slice");
print CLEAN_DUMP ""; close(CLEAN_DUMP); # to make the file blank...
open(DUMP_FILE, ">>./$file_save.dump.slice") or &error("cant do it: $! : ./$file_save.dump.slice"); # open ready for input....

It is now like this:

open(DMOZ, "content.rdf.u8") || &error("Unable to read dump file. Reason: $!"); # category
open(CLEAN_DUMP, ">$file_save.dump.slice");
print CLEAN_DUMP ""; close(CLEAN_DUMP); # to make the file blank...
open(DUMP_FILE, ">>$file_save.dump.slice") or &error("cant do it: $! : $file_save.dump.slice"); # open ready for input....

and I still get this: Unable to read dump file. Reason: No such file or directory
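
One thing worth ruling out here: "./content.rdf.u8" and "content.rdf.u8" are both relative paths, and relative paths are resolved against the shell's current directory, not against the directory that holds the script. That would be consistent with removing "./" making no difference when the script is started by its full path from somewhere else. The sketch below only checks where the dump is visible from. The helper name dump_visible_from is made up for illustration, and the script path is the one quoted in the command above.

```shell
#!/bin/sh
# Sketch: report whether content.rdf.u8 exists in a given directory.
dump_visible_from() {
    if [ -f "$1/content.rdf.u8" ]; then
        echo "found in $1"
    else
        echo "missing in $1"
    fi
}

script=/home/virtual/site2/fst/var/www/cgi-bin/dmoz.cgi   # path from the post

dump_visible_from "$(pwd)"                 # where relative opens resolve now
dump_visible_from "$(dirname "$script")"   # where the script file lives

# If only the second check reports "found", change directory first:
#   cd "$(dirname "$script")" && perl ./dmoz.cgi
```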
Re: [lennie] DMOZ import question
I'm off for the night now. If you want to send over FTP details, and SSH if possible, I'll have a look at it tomorrow for you.

Cheers

Andy (mod)
Re: [Andy] DMOZ import question
Hi,

I was trying to use the dmoz.cgi script you kindly provided to help break up the main DMOZ RDF file. This is a great help, thanks! But when I ran it, the script would only create slices for certain categories. I couldn't get it to include certain categories I need, such as Top/Computers, Top/Regional, or Top/Society. The slice files for some of these were created (e.g. computers.dump.slice), but they were empty. I don't think the process was stopping, because some of the empty files are in the middle and not at the end. Do you know if the structure of the RDF file has changed since this script was written, or could the problem be something else? Thanks in advance for any insights you might have.

--FrankM
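
Going by the script posted earlier in the thread, a slice ends up empty when the end marker line is reached before the start marker has matched, because the loop exits on the end marker before anything is written. A quick way to test that theory is to look up where each marker first appears in the dump. This is a hedged diagnostic sketch, not a fix. The helper name first_topic_line is hypothetical, and the Top/Computers and Top/Games pair is taken from the script's category list.

```shell
#!/bin/sh
# Sketch: print the line number of the first <Topic r:id="..."> line for a
# category, or nothing if the exact marker never occurs in the file.
first_topic_line() {
    grep -n -F "<Topic r:id=\"$2\">" "$1" | head -n 1 | cut -d: -f1
}

# Only run against the real dump if it is present in this directory.
if [ -f content.rdf.u8 ]; then
    echo "start (Top/Computers): $(first_topic_line content.rdf.u8 Top/Computers)"
    echo "end   (Top/Games):     $(first_topic_line content.rdf.u8 Top/Games)"
fi
```

An empty start value, or an end line number smaller than the start one, would match the empty computers.dump.slice described above. On the full dump each lookup may take a while, since grep has to scan the file.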
Re: [FrankM] DMOZ import question
Hi-

I dug around some more, and found a post with a great script that makes it easy to create a slice of the specific DMOZ categories or subcategories that you want:

http://www.gossamer-threads.com/...i?post=246319#246319

If you want to import just part of the DMOZ file into your Links SQL, but don't want to wait hours and hours while it goes through the entire content, this works really well.

--Frank
Re: [FrankM] DMOZ import question
*advert* http://new.linkssql.net/...amp;page=DMOZ_Wizard *advert*

Cheers

Andy (mod)