Gossamer Forum

DMOZ Statistics........

I know that a lot of people are interested in importing from DMOZ and want to know how much server space is needed and how long it takes, so I thought I'd share some information for those interested.

Over the last two days I have been importing from DMOZ. Below are the steps you need to take to do the same, along with some other useful information.

First, you obviously need to have Links SQL installed. The next step is to get the content.rdf file. This file can be found at the DMOZ website and is 139MB gzipped and 700MB unzipped, so make sure you have enough space on your server before you begin.
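A quick way to check the free space before downloading is sketched below. The function name has_free_mb is made up for illustration, and the 839MB in the usage comment is simply 139MB + 700MB from the figures above, assuming both the gzipped and unzipped copies sit on the same disk for a while.

```shell
#!/bin/sh
# has_free_mb DIR NEEDED_MB   (hypothetical helper, not part of Links SQL)
# Succeeds (exit status 0) if the filesystem holding DIR has at least
# NEEDED_MB megabytes available.
has_free_mb() {
    dir=$1
    needed_mb=$2
    # df -P prints one line per filesystem; with -k, column 4 is the
    # available space in kilobytes.
    avail_kb=$(df -Pk "$dir" | awk 'NR==2 {print $4}')
    [ "$avail_kb" -ge $((needed_mb * 1024)) ]
}

# Usage (assumed figure: 139MB gzipped + 700MB unzipped):
# has_free_mb . 839 && echo "enough space" || echo "not enough space"
```

Adjust the number if the file has grown since these figures were posted.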

To get the content.rdf file onto your server, log in to your telnet account and type:

wget ftp://ftp.dmoz.org/rdf/content.rdf.u8.gz

After about 20 minutes (depending on your line speed), you should have the 139MB file on your server.

Next you need to type:

gzip -d content.rdf.u8.gz

This will unzip the file to its full 700MB size. You don't have to unzip it, as Parse_RDF.pl will do it for you if you specify this within the script.
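If you only want to look inside the gzipped file without using the extra disk space, you can read it as a stream. The sketch below counts the lines containing a <Topic r:id=...> tag; count_topics is a made-up helper name, and it assumes the tags look like the ones used elsewhere in this thread.

```shell
#!/bin/sh
# count_topics FILE.gz   (hypothetical helper)
# Prints the number of lines containing a <Topic r:id=...> tag.
# gzip -dc writes the decompressed data to stdout, so the unzipped copy
# is never written to disk.
count_topics() {
    gzip -dc "$1" | grep -c '<Topic r:id='
}

# Usage:
# count_topics content.rdf.u8.gz
```

On the full file this still has to decompress everything, so expect it to take a while.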

Next you need to upload Parse_RDF.pl to your server, chmod it to 755, and edit the variables inside the script, such as which category you want to import and the path to your content.rdf file.

Back in telnet type:

perl Parse_RDF.pl

This will begin the import.

When complete, you will need to re-index your directory from the admin area (this only takes 30 seconds), then rebuild from telnet using:

perl nph-build.cgi

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
~ SOME STATS
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

I imported the first two categories: Adult and Arts.

The Adult category took 10 minutes to import and 15 to build.

The Arts category took 60 minutes to import and 30 to build.

Both categories combined total 30,155 categories and 300,000 links.
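For a rough idea of throughput, the figures above work out as follows. This is only arithmetic on the numbers from this one import (10 + 60 = 70 minutes of import time for 300,000 links), not a prediction for other servers, and links_per_minute is a made-up helper name.

```shell
#!/bin/sh
# links_per_minute LINKS MINUTES   (hypothetical helper)
# Integer division: shell arithmetic has no floating point, so the
# result is rounded down.
links_per_minute() {
    echo $(( $1 / $2 ))
}

# 300,000 links over 70 minutes of import time:
links_per_minute 300000 70    # prints 4285
```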

Total space needed for the database is around 100MB.

Including the content.rdf file you will need 800MB of free space, but once you have imported you can delete the 700MB file.

Hope this helps you all.

Paul Wilson.
NEW http://www.wiredon.net
Re: [RedRum] DMOZ Statistics........ In reply to
Just a small update. The unzipped copy is now 900+MB, not 700MB as was stated quite a while ago (just in case you only have 1GB to play with) ;)

Andy (mod)
andy@ultranerds.co.uk
Want to give me something back for my help? Please see my Amazon Wish List
GLinks ULTRA Package | GLinks ULTRA Package PRO
Links SQL Plugins | Website Design and SEO | UltraNerds | ULTRAGLobals Plugin | Pre-Made Template Sets | FREE GLinks Plugins!
Re: [AndyNewby] DMOZ Statistics........ In reply to
...and it doesn't need to be unzipped in the first place :)
Re: [RedRum] DMOZ Statistics........ In reply to
Yeah, that too ;)

Andy

BTW: are you going to reply to my PMs? :P


Last edited by: AndyNewby: Dec 29, 2001, 4:43 AM
Re: [AndyNewby] DMOZ Statistics........ In reply to
>>BTW: you gonna reply to my PM's?<<

I haven't received any.
Re: [RedRum] DMOZ Statistics........ In reply to
Really? The last one I sent was at Dec 28, 2001, 9:47 AM.

Bug? :/

Andy (mod)
Re: [AndyNewby] DMOZ Statistics........ In reply to
OK, I'll just email you what I sent via PM ;)

Andy (mod)
Re: [AndyNewby] DMOZ Statistics........ In reply to
Just a quick question: if I wanted to insert "SUB1" into the description table while importing, which command would I use in conjunction with the rest of the telnet commands? (Or into any other area, such as meta description or meta keywords?)

Thanks

</not a clue>
Re: [Kilroy] DMOZ Statistics........ In reply to
You'd do it after the import from the SQL monitor.

UPDATE prefix_Links SET Description = 'SUB1'


Last edited by: RedRum: Dec 29, 2001, 1:36 PM
Re: [RedRum] DMOZ Statistics........ In reply to
Thanks,

I actually made a change to the table properties and set the default to "SUB1", so that every category will have "SUB1" entered automatically. The only thing is that I had to change the properties from "text" to "char". Is setting it up this way going to be a problem?

</not a clue>
Re: [Kilroy] DMOZ Statistics........ In reply to
Hi,

Can someone tell me where I can get the file Parse_RDF.pl?

Many thanks.


Webmaster
http://www.e-bannerx.com
Re: [fulcan] DMOZ Statistics........ In reply to
You want to use nph-import.cgi (in 2.0.5 and 2.1.0). I think Parse_RDF.pl was from version 1 (not definite, though) :P

Andy (mod)
Re: [Andy] DMOZ Statistics........ In reply to
From that 900+ MB DMOZ file, how can you add just selected categories to your database?

wsjb78

----------------------------------------------

The third principle of sentient life is the capacity for self-sacrifice, the conscious ability to override evolution and self-preservation for a cause, a friend, a loved one.

ICQ: 171751720
Re: [wsjb78] DMOZ Statistics........ In reply to
I haven't found a way to do it yet. What you need to do is import the entire database and then delete the links from it. I have a copy of this, with pretty much the most up-to-date category list. PM/email me if you would like to get hold of a copy, rather than having to do it yourself :P

Cheers

Andy (mod)
Re: [Andy] DMOZ Statistics........ In reply to
Guys,



Find the lines where a category starts and stops, then use vi (Unix) to cut out all the lines belonging to the one or more categories you would like to import.

Use this command in vi:

:x,y w filename

where x and y are the start and stop line numbers; w saves those lines to a new file, "filename".
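The same thing can be done without opening the file in vi. The sketch below is only an illustration: the function names are made up, and it assumes a category's entries run from its own <Topic r:id="..."> line up to the line just before the next category's <Topic> line, which you should check against your copy of the file.

```shell
#!/bin/sh
# topic_line FILE TOPIC   (hypothetical helper)
# Prints the number of the first line containing <Topic r:id="TOPIC">.
topic_line() {
    grep -n -F "<Topic r:id=\"$2\">" "$1" | head -n 1 | cut -d: -f1
}

# extract_lines START END INFILE OUTFILE   (hypothetical helper)
# Copies lines START..END (inclusive) to OUTFILE, like ":x,y w filename"
# in vi. The trailing "q" stops sed once END has been printed.
extract_lines() {
    sed -n "${1},${2}p;${2}q" "$3" > "$4"
}

# extract_category FILE TOPIC NEXT_TOPIC OUTFILE   (hypothetical helper)
# Assumption: TOPIC ends on the line before NEXT_TOPIC starts.
extract_category() {
    start=$(topic_line "$1" "$2")
    next=$(topic_line "$1" "$3")
    # Give up if either topic line was not found.
    [ -n "$start" ] && [ -n "$next" ] || return 1
    extract_lines "$start" $((next - 1)) "$1" "$4"
}

# Usage:
# extract_category content.rdf.u8 Top/Games Top/Health games.rdf
```

The extracted file will not contain the header and closing lines of the original RDF file, so the importer may need those added back by hand.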

Regards, Tomas

Here are some commands that may be useful:

#!/bin/sh
wget http://dmoz.org/rdf/content.rdf.u8.gz
gunzip content.rdf.u8.gz
split -l 2785080 content.rdf.u8 split
perl 1.pl
perl 2.pl
perl 3.pl
perl 4.pl
perl 5.pl
perl 6.pl
perl 7.pl
cat split1.iso >> content.rdf.iso
cat split2.iso >> content.rdf.iso
cat split3.iso >> content.rdf.iso
cat split4.iso >> content.rdf.iso
cat split5.iso >> content.rdf.iso
cat split6.iso >> content.rdf.iso
cat split7.iso >> content.rdf.iso
cat -n content.rdf.iso|grep '<Topic r:id="Top/Games">' >> line.nr
cat -n content.rdf.iso|grep '<Topic r:id="Top/Health">' >> line.nr
cat -n content.rdf.iso|grep '<Topic r:id="Top/World">' >> line.nr
cat -n content.rdf.iso|grep '<Topic r:id="Top/World/Català">' >> line.nr
cat -n content.rdf.iso|grep '<Topic r:id="Top/World/Chinese_Simplified">' >> line.nr
# remove Chinese_Simplified
cat -n content.rdf.iso|grep '<Topic r:id="Top/World/Dansk">' >> line.nr
cat -n content.rdf.iso|grep '<Topic r:id="Top/World/Deutsch">' >> line.nr
cat -n content.rdf.iso|grep '<Topic r:id="Top/World/Eesti">' >> line.nr
#Eesti
cat -n content.rdf.iso|grep '<Topic r:id="Top/World/Español">' >> line.nr
cat -n content.rdf.iso|grep '<Topic r:id="Top/World/Esperanto">' >> line.nr
# remove Esperanto
cat -n content.rdf.iso|grep '<Topic r:id="Top/World/Euskara">' >> line.nr
cat -n content.rdf.iso|grep '<Topic r:id="Top/World/Faroese">' >> line.nr
cat -n content.rdf.iso|grep '<Topic r:id="Top/World/Farsi">' >> line.nr
# remove Farsi
cat -n content.rdf.iso|grep '<Topic r:id="Top/World/Français">' >> line.nr
cat -n content.rdf.iso|grep '<Topic r:id="Top/World/Frysk">' >> line.nr
#Frysk
cat -n content.rdf.iso|grep '<Topic r:id="Top/World/Galego">' >> line.nr
cat -n content.rdf.iso|grep '<Topic r:id="Top/World/Greek">' >> line.nr
# remove Greek
cat -n content.rdf.iso|grep '<Topic r:id="Top/World/Italiano">' >> line.nr
cat -n content.rdf.iso|grep '<Topic r:id="Top/World/Japanese">' >> line.nr
# remove Japanese
cat -n content.rdf.iso|grep '<Topic r:id="Top/World/Nederlands">' >> line.nr
cat -n content.rdf.iso|grep '<Topic r:id="Top/World/Norsk">' >> line.nr
cat -n content.rdf.iso|grep '<Topic r:id="Top/World/Polska">' >> line.nr
# remove Polska
cat -n content.rdf.iso|grep '<Topic r:id="Top/World/Suomi">' >> line.nr
cat -n content.rdf.iso|grep '<Topic r:id="Top/World/Svenska">' >> line.nr
cat -n content.rdf.iso|grep '<Topic r:id="Top/World/Tagalog">' >> line.nr
# remove Tagalog