Gossamer Forum
DMOZ RDF
Has anyone tried to parse the whole DMOZ RDF database into LinksSQL? I'm running a Pentium 750 with 256MB RAM and a 9GB SCSI hard disk. It has taken roughly 30 hours and has just completed 2 million links, 80,000 categories and 17,646 CategoryHierarchy entries. How many links should I expect from a full dump of the DMOZ directory, and how large would the database be at this stage? I will check the MySQL files once it has completed. The DMOZ home page states 2,346,124 sites and 339,107 categories, and I don't think I'll be hitting the 339,107 categories. Or maybe there is a problem with the Parse_RDF.pl script; I have the latest release of this script, not the one that came with the distribution. Well, it's taken 20 hours to parse this RDF file. I wonder how long it will take to build this database :)
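As a rough sanity check on the figures in this post, the import rate and the share of the advertised totals can be worked out directly. The sketch below uses only the numbers quoted above (2 million links and 80,000 categories after roughly 30 hours, against the 2,346,124 sites and 339,107 categories stated on the DMOZ home page); the function and variable names are just for illustration.

```python
def progress(done, total):
    """Return the completed share of a total as a percentage."""
    return 100.0 * done / total


HOURS = 30
LINKS_DONE, LINKS_TOTAL = 2_000_000, 2_346_124
CATS_DONE, CATS_TOTAL = 80_000, 339_107

# Average import rate over the whole run so far.
links_per_second = LINKS_DONE / (HOURS * 3600)
link_pct = progress(LINKS_DONE, LINKS_TOTAL)
cat_pct = progress(CATS_DONE, CATS_TOTAL)

print(f"{links_per_second:.1f} links/s")
print(f"links: {link_pct:.1f}% of advertised total")
print(f"categories: {cat_pct:.1f}% of advertised total")
```

On these figures the links are about 85% done while the categories sit below 24%, which fits the suspicion that the category import stalled rather than merely lagging.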

I'd better check my settings before I try this; I wouldn't want to make many mistakes. Also, do I have to index the database so I can use page.cgi to view the directory? I can't remember whether indexing was what was needed, or whether I actually had to build the database, after which any changes I made to the templates would be reflected through the page.cgi script. I think it's the latter, but anyway someone will confirm this for me.

Thanks
Jason Xuereb

Re: DMOZ RDF
OMG you are either brave or crazy........

With 2 million links your computer is going to die soon.......lol (possibly not).

Are you sure LinksSQL can handle that many links?...You may have just wasted 30 hrs



Paul Wilson.
(Don't blame me if I'm wrong!)
Re: DMOZ RDF
I'm just crazy :P My computer won't die just yet. If I could parse the whole database, it would be a lot easier for other people to import, as I'd just do a database dump that people could download and upload if need be. But it has stopped at 80,000 categories, which isn't very good. The links are still going. Does anyone know if I can get the structure RDF, flush my categories table and just import the categories again? Or will this ruin which categories the links are in?

Any help at this stage would be good.

Jason Xuereb

Re: DMOZ RDF
How did you get the content file to your server? It is 131MB. What is the telnet command to FTP the file from DMOZ straight to your server?

Paul Wilson.
(Don't blame me if I'm wrong!)
Re: DMOZ RDF
wget http://url

:)

Re: DMOZ RDF
:) Cheers....

Paul Wilson.
(Don't blame me if I'm wrong!)
Re: DMOZ RDF
Whoa... this is scary...

I've downloaded 15MB of the content file so far, going at a steady pace of about 80-100Kbps.

Where does wget place the file when done?

Do you know how to import only certain categories?

Paul Wilson.
(Don't blame me if I'm wrong!)
Re: DMOZ RDF
In the directory you ran the command from :)
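For anyone without wget, roughly the same behaviour can be sketched in Python: the file name is taken from the last path segment of the URL and the file is written to the current working directory unless another directory is given. The URL in the usage comment is a placeholder, just like `http://url` above, and `local_target` and `download` are illustrative names, not part of any tool mentioned in this thread.

```python
import os
import shutil
import urllib.request
from urllib.parse import urlsplit


def local_target(url, directory=None):
    """Path a download would be saved to: <directory>/<last URL path segment>."""
    name = os.path.basename(urlsplit(url).path) or "index.html"
    return os.path.join(directory or os.getcwd(), name)


def download(url, directory=None, chunk_size=64 * 1024):
    """Stream the URL to disk in chunks so a large file never sits in memory."""
    target = local_target(url, directory)
    with urllib.request.urlopen(url) as response, open(target, "wb") as out:
        shutil.copyfileobj(response, out, chunk_size)
    return target

# Usage (placeholder URL): download("http://example.com/content.rdf.u8.gz")
```

Copying in chunks matters here because the content file is far larger than a typical page; reading it in one go would need the whole 131MB in memory.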

If you parse the database into MySQL, can you do a dump, gzip it and give me the URL please?

Re: DMOZ RDF
OK... I thought that's what you just spent 30 hrs doing?

Do you need to import the structure RDF first, or can you just import the content file?



Paul Wilson.
(Don't blame me if I'm wrong!)
Re: DMOZ RDF
I imported the content RDF. The links are still importing, but the categories stopped at 80,000, which isn't a good thing.

Re: DMOZ RDF
Why did they stop?

I'm up to 85MB.

Paul Wilson.
(Don't blame me if I'm wrong!)
Re: DMOZ RDF
Not sure. I'm using phpMyAdmin to browse through it; the links are increasing but the categories have stopped. I don't know why this is, but I do have 2,029,872 links now :)

Have you got ICQ?

Re: DMOZ RDF
No, I haven't got ICQ :(

Jeez... I don't think I can import the whole thing. The server only has 128MB RAM, so I don't think it will take the whole shebang!

110MB :)



Paul Wilson.
(Don't blame me if I'm wrong!)
Re: DMOZ RDF
Well, after you uncompress the RDF it gets to 700MB :P

good luck

Re: DMOZ RDF
I ain't uncompressing it... the parser allows you to leave it gzipped.

I have 20GB free anyway :)
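Reading a gzipped dump without unpacking it to disk is straightforward in most languages. As an illustration only (this is not how Parse_RDF.pl does it; that script is not shown in this thread), here is a Python sketch that streams a .gz file and reports its uncompressed size, which is one way to check the 700MB figure mentioned above without spending 700MB of disk. `uncompressed_size` is a made-up helper name.

```python
import gzip


def uncompressed_size(path, chunk_size=1024 * 1024):
    """Count uncompressed bytes by streaming; nothing is written to disk."""
    total = 0
    with gzip.open(path, "rb") as stream:
        while True:
            chunk = stream.read(chunk_size)
            if not chunk:
                break
            total += len(chunk)
    return total
```

Only one chunk is held in memory at a time, so this should be usable even on a machine with 128MB RAM.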

Paul Wilson.
(Don't blame me if I'm wrong!)
Re: DMOZ RDF
Parse it into a raw database, check the file size of this database, and let me know how big it gets.

Jason Xuereb

Re: DMOZ RDF
Coincidentally I just made a new database called linkssql where all the stuff (technical term) is gonna go....

Paul Wilson.
(Don't blame me if I'm wrong!)
Re: DMOZ RDF
Hey Paul, how's it going?

Re: DMOZ RDF
Well, I tried to import and I keep getting errors saying "duplicate key", and it keeps going on about DBI::st and says "broken pipe".
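A "duplicate key" error generally means a row with the same primary or unique key is already in the table; one common cause is re-running an import over rows left behind by an earlier, partial run. The sketch below reproduces the situation with Python's built-in sqlite3 module as a stand-in for MySQL. The `links` table and its columns are invented for the example and are not the LinksSQL schema. SQLite's `INSERT OR IGNORE` plays the role that `INSERT IGNORE` plays in MySQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table; the real LinksSQL schema is not shown in this thread.
conn.execute("CREATE TABLE links (id INTEGER PRIMARY KEY, url TEXT)")
conn.execute("INSERT INTO links (id, url) VALUES (1, 'http://example.com/')")

# Re-inserting the same key fails, much like MySQL's duplicate key error.
try:
    conn.execute("INSERT INTO links (id, url) VALUES (1, 'http://example.com/')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

# INSERT OR IGNORE skips rows whose key already exists instead of failing.
rows = [(1, "http://example.com/"), (2, "http://example.org/")]
conn.executemany("INSERT OR IGNORE INTO links (id, url) VALUES (?, ?)", rows)
row_count = conn.execute("SELECT COUNT(*) FROM links").fetchone()[0]
print(duplicate_rejected, row_count)
```

Whether the import script can be told to skip existing rows is not something this thread answers; emptying the tables before re-running the import is the other obvious thing to try.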

Paul Wilson.
(Don't blame me if I'm wrong!)
Re: DMOZ RDF
Have you had any luck importing ALL the data? I am attempting to do the same thing.
How big did your database get after full import?
How does it run, like a dog?

Can you please supply your command line for importing all data?

Much appreciated.

Re: DMOZ RDF
For those interested...

I imported the whole of the Adult category into LinksSQL, not because I'm a pervert but just as a test, and it happened to be the first category in the content.rdf file.

64,000 Links
4500 Categories

We have a Pentium III 350MHz with 128MB RAM.

It took 10 mins to import and 15 mins to build

Everything is running quite smoothly......

I will do a mysql dump of the 60,000 links for anyone who wants it.....

paul@audio-grabber.com

For some reason, when I search I always get "No matching links". Any ideas? I have re-indexed.

Paul Wilson.
(Don't blame me if I'm wrong!)
Re: DMOZ RDF
Hey Paul,

My whole DMOZ import crashed. Mind you, I do also have another live site on this server. I've just imported the Games and Computers sections from DMOZ. I wish that, when skipping through the RDF, I didn't have to wait for all those Adult categories :P

Damn porno

Anyway, I'm doing the Business category now and it's skipping through the Arts section yet again :)

Re: DMOZ RDF
lol.......

Yeah it is REALLY annoying having to skip past all the other categories.
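One possible way around the long skip is to cut the dump down to the wanted branch once, in a single streaming pass, and keep the smaller file. The Python sketch below does that over the gzipped file. It assumes a simplified layout in which each `<Topic r:id="...">` or `<ExternalPage ...>` record starts on its own line, ends with its closing tag on its own line, and each ExternalPage carries a `<topic>` line naming its category; check this against the real dump before relying on it. It does not copy the XML header or the closing root tag, so the output is a fragment, and whether Parse_RDF.pl accepts such a file is untested. `in_branch` and `filter_branch` are illustrative names.

```python
import gzip
import re

TOPIC_START = re.compile(r'<Topic r:id="([^"]*)"')
PAGE_TOPIC = re.compile(r"<topic>([^<]*)</topic>")


def in_branch(category, branch):
    """True for the branch itself and anything below it."""
    return category == branch or category.startswith(branch + "/")


def filter_branch(src_path, dst_path, branch):
    """Copy only records belonging to `branch`; returns the number kept."""
    kept = 0
    record = []      # lines of the record currently being read
    category = None  # category of that record, once known
    with gzip.open(src_path, "rt", encoding="utf-8") as src, \
         gzip.open(dst_path, "wt", encoding="utf-8") as dst:
        for line in src:
            stripped = line.strip()
            if stripped.startswith(("<Topic ", "<ExternalPage ")):
                record, category = [], None
                match = TOPIC_START.search(stripped)
                if match:
                    category = match.group(1)
            elif not record:
                continue  # header or other text outside any record
            record.append(line)
            match = PAGE_TOPIC.search(stripped)
            if match:
                category = match.group(1)
            if stripped in ("</Topic>", "</ExternalPage>"):
                if category is not None and in_branch(category, branch):
                    dst.writelines(record)
                    kept += 1
                record = []
    return kept
```

A call such as `filter_branch("content.rdf.u8.gz", "business.rdf.gz", "Top/Business")` would then keep only the Business branch; both file names here are examples.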

I imported 300,000 links and 30,000 categories but deleted them again ......lol

Paul Wilson.
http://www.wiredon.net
Re: DMOZ RDF
lol, I waited 2 days for 2 million links before I deleted them again after realising it had stopped doing the categories.

What would be good is if users could download dumps of all the categories.

I might look at doing this after I finish importing everything I want.

Re: DMOZ RDF
I may just beat you to it..........

I am going to do a dump of the first three categories to start with and see how it goes.

Paul Wilson.
http://www.wiredon.net