Gossamer Forum

Now I'm excited!!

Hi,

I've just written a script that will split up content.rdf.u8 (currently over 900MB) into its individual categories, e.g.:

Adult.part
Arts.part

...this will save people tons of time doing imports, I hope 8-)

I'm sure some people have written similar scripts but it was such a cool moment when I executed the script and a nice 28MB Adult file was sitting on my E: drive :)

It is a fairly short script too.

Last edited by:

RedRum: Feb 14, 2002, 8:30 AM
Re: [RedRum] Now I'm excited!! In reply to
Sounds great... Are you planning on selling it?
Re: [SeanP] Now I'm excited!! In reply to
I just need to do some thorough testing and fix up a few little inconsistencies then I'll have to decide what I'm going to do about releasing it.
Re: [RedRum] Now I'm excited!! In reply to
Lol, sounds like that script I wrote a couple of months ago ;) Good job though. It saves loads of time... you can even get the complete Regional category done in less than a day (on my slow server, that is)!

Andy (mod)
andy@ultranerds.co.uk
Want to give me something back for my help? Please see my Amazon Wish List
GLinks ULTRA Package | GLinks ULTRA Package PRO
Links SQL Plugins | Website Design and SEO | UltraNerds | ULTRAGLobals Plugin | Pre-Made Template Sets | FREE GLinks Plugins!
Re: [AndyNewby] Now I'm excited!! In reply to
>>
you can even get the complete regional category done in less than a day!
<<

Hmm mind if I see your code?

FWIW mine takes about 5 minutes to do the whole content file.
Re: [RedRum] Now I'm excited!! In reply to
Quote:
Hmm mind if I see your code?

Yup, I do :P It's very messy, and you would probably moan at me about every other line ;)

Oh, and mine only takes 5 mins to decompress the tar.gz file and then cut it into the smaller sections. Then it runs nph-import.cgi to import the categories into the database, does a database dump into a file, and tar.gz's it up, and then it's ready to be downloaded :) All without me even having to do anything :)

Andy (mod)
Re: [AndyNewby] Now I'm excited!! In reply to
Welp, it's up to you, but I'd really be interested to see how you did it. At the end of the day, if you got it to work, the code must be half decent, as it isn't a simple thing to get to work.
Re: [RedRum] Now I'm excited!! In reply to
Only if you answer my IM messages and help me test my PayPal script (I need someone to make a test purchase, obviously refundable), cos I'm not able to test it with the same account as I'm sending the money from :P

Andy (mod)
Re: [AndyNewby] Now I'm excited!! In reply to
That's blackmail. I'm not that desperate :)
Re: [RedRum] Now I'm excited!! In reply to
That's OK. I'll get someone else to test it ;)

Anyway, as for it being hard... it's not really ;) All you need to do is search for the top category, like:

Code:
# check to see when we wanna start, otherwise use next;
if ($_ =~ /<Topic r:id=\"Top\/Recreation\">/) { $do = 1; }

# we must have got to the end of this category if we are getting this message...
if ($_ =~ /<Topic r:id=\"Top\/Reference\">/) { $line_count++; close(DUMP_ARTS); &mail_admin("$line_count"); exit; }
else
{ if ($do) { print DUMP_ARTS "$_\n"; $line_count++; print "print to recreation.... Line: $line_count\n"; } }

Like I said, messy code, but it works :)

Andy (mod)
Re: [RedRum] Now I'm excited!! In reply to
This sounds like a case of "I'll show you mine if you show me yours..." :O
Re: [AndyNewby] Now I'm excited!! In reply to
OK, this isn't a criticism, just so you know, but $_ =~ can be omitted from regexes.

So instead of

if ($_ =~ /<Topic r:id=\"Top\/Recreation\">/) { $do = 1; }

you can do:

if (/<Topic r:id=\"Top\/Recreation\">/) { $do = 1; }

Also, double quotes don't need to be escaped, and you can use m,, (commas as the delimiters) to avoid escaping forward slashes, so you'd end up with:

if (m,<Topic r:id="Top/Recreation">,) { $do = 1; }

_Not_ a criticism. You've got to admit it looks neater.
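
For anyone following along, here is a minimal sketch showing that the three spellings above behave the same. The sample line is made up for illustration and is not taken from the real dump.

```perl
use strict;
use warnings;

# Made-up sample line in the style of the RDF dump.
my $line = '<Topic r:id="Top/Recreation">';

my @results;
for ($line) {    # aliases $_ to $line for the implicit matches below
    # Explicit $_ with escaped quotes and slashes (the original style).
    push @results, ($_ =~ /<Topic r:id=\"Top\/Recreation\">/) ? 1 : 0;
    # Implicit $_, quotes left unescaped, slash still escaped.
    push @results, (/<Topic r:id="Top\/Recreation">/) ? 1 : 0;
    # Comma delimiters, so nothing needs escaping.
    push @results, (m,<Topic r:id="Top/Recreation">,) ? 1 : 0;
}

print "match results: @results\n";
```

Comma delimiters only work while the pattern itself contains no comma; m{...} is a common alternative when it might.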

Last edited by:

RedRum: Feb 14, 2002, 11:06 AM
Re: [RedRum] Now I'm excited!! In reply to
Yup, s'pose so. Bad habits have a bad habit of dying hard with me ;) Same with using 'my'. I know I should use it, but I just always forget :P

Andy (mod)
Re: [AndyNewby] Now I'm excited!! In reply to
You should not have to hardcode the category names in the script. Use a regex to extract them and their content. This is what I do: the code is cleaner (a few dozen lines, including fluff lines like print statements), and the bases are covered if a DMOZ category is added, removed or renamed (albeit unlikely).

BTW, my script takes about 5 minutes to execute, but that is when I run it locally from the browser. If I run it on our web server from the web, it takes a long time. You may want to set up Apache and ActiveState Perl on your local computer and run the script locally. It should take a fraction of the time for the script to complete.
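
A rough sketch of the regex-driven split described above, for illustration only. It is not either poster's actual script. The split_rdf helper, the in-memory sample data and the temp directory are all assumptions, and it assumes every line after a Topic line belongs to that topic's top-level category until the next one starts.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);
use File::Spec;

# Hypothetical helper: write each top-level category to "<Category>.part".
# $in is any readable filehandle; $outdir must already exist.
sub split_rdf {
    my ($in, $outdir) = @_;
    my (%count, $out, $current);

    while (my $line = <$in>) {
        # Top/Arts/Music belongs to top-level category "Arts".
        if ($line =~ m{<Topic r:id="Top/([^/"]+)}) {
            if (!defined $current || $1 ne $current) {
                $current = $1;
                close $out if $out;
                my $path = File::Spec->catfile($outdir, "$current.part");
                # Append, in case a category shows up again later in the file.
                open $out, '>>', $path or die "Cannot open $path: $!";
            }
        }
        next unless $out;    # skip header lines before the first category
        print {$out} $line;
        $count{$current}++;
    }
    close $out if $out;
    return \%count;
}

# Tiny made-up sample standing in for content.rdf.u8.
my $sample = <<'RDF';
<?xml version="1.0" encoding="UTF-8"?>
<Topic r:id="Top/Arts">
  <link r:resource="http://example.com/a"/>
</Topic>
<Topic r:id="Top/Arts/Music">
</Topic>
<Topic r:id="Top/Business">
</Topic>
RDF

open my $in, '<', \$sample or die "Cannot read sample: $!";
my $outdir = tempdir(CLEANUP => 1);
my $count  = split_rdf($in, $outdir);
close $in;

print "$_.part: $count->{$_} lines\n" for sort keys %$count;
```

Reading line by line keeps memory use flat no matter how big the input is, which matters for a file approaching a gigabyte. No category name appears anywhere in the code, so a new or renamed category just produces a new .part file.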



Cheers - Dan Cool

----
Cheers,

Dan
Founder and CEO

LionsGate Creative
GoodPassRobot
Magelln
Re: [dan] Now I'm excited!! In reply to
Quote:
It should take a fraction of the time for the script to complete.
Maybe, but getting the content.rdf.u8.gz file will take 10 weeks on my cruddy connection ;)

Andy (mod)
Re: [dan] Now I'm excited!! In reply to
>>
You should not have to hardcode the category names in the script.
<<

That's what I avoided too.

Code:
if (/<Topic r:id="([^"]+)">/) {

or I guess you could use:

Code:
if (/<Topic r:id="(.+?)">/) {
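
A small sketch comparing the two patterns above. On a plain Topic line they capture the same thing. The second sample line, with an extra attribute, is purely hypothetical and only there to show where they could differ.

```perl
use strict;
use warnings;

my $line = '<Topic r:id="Top/Arts/Music">';      # made-up sample line
my $odd  = '<Topic r:id="Top/Arts" extra="x">';  # hypothetical extra attribute

# On an ordinary line both patterns capture the full category path.
my ($negated) = $line =~ /<Topic r:id="([^"]+)">/;
my ($lazy)    = $line =~ /<Topic r:id="(.+?)">/;
print "negated class: $negated\n";
print "lazy match:    $lazy\n";

# On the odd line, [^"]+ cannot cross a quote, so it fails to match,
# while .+? keeps expanding until it finds a quote followed by >.
my ($negated_odd) = $odd =~ /<Topic r:id="([^"]+)">/;
my ($lazy_odd)    = $odd =~ /<Topic r:id="(.+?)">/;
print "negated class on odd line: ",
    (defined $negated_odd ? $negated_odd : '(no match)'), "\n";
print "lazy match on odd line:    $lazy_odd\n";
```

The negated class is the stricter of the two: it either captures a clean id or nothing at all, whereas the lazy version can swallow a stray quote if the line is not shaped as expected.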

Last edited by:

RedRum: Feb 14, 2002, 11:29 AM
Re: [AndyNewby] Now I'm excited!! In reply to
Ahh, that's right. You are on a slow dial-up connection. Maybe you can have DMOZ snail mail you the file ;) Do you not have access to broadband cable or ADSL Internet access where you live? With what you do (work the Net), you should have a faster connection.



Cheers - Dan Cool

Re: [AndyNewby] Now I'm excited!! In reply to
Andy, from now on, when you write Perl, add:

use strict;

to the top :)
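
A minimal sketch of what that buys you. The snippet is compiled inside a string eval only so the error can be caught and printed; in a real script strict would simply stop compilation. The variable names are borrowed from the fragment posted earlier in the thread.

```perl
use strict;
use warnings;

# Compile a snippet at runtime so the strict error can be shown, not fatal.
my $ok = eval q{
    use strict;
    $do = 1;    # never declared with 'my', so strict rejects it
    1;
};
my $error = $@;
print "strict refused it: $error" unless $ok;

# Declared with 'my', the same kind of variable is accepted.
my $line_count = 0;
$line_count++;
print "line_count = $line_count\n";
```

The payoff is that a mistyped variable name becomes a compile-time error instead of a silently empty value.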
Re: [dan] Now I'm excited!! In reply to
Quote:
Ahh, that's right. You are on a slow dial-up connection. Maybe you can have DMOZ snail mail you the file Do you not have access to broadband cable or ADSL Internet access where you live? With what you do (work the Net), you should have a faster connection.
Please don't taunt me. I'd love a faster connection ;) Thing is, I live in a smallish village about 35 miles from London, but we don't have access to broadband services here yet. The best I would be able to get is ISDN (128 kbps), which is still slow compared to ADSL etc., and still very expensive :(

Andy (mod)
Re: [AndyNewby] Now I'm excited!! In reply to
Egad, I would have thought GB was more wired than that in terms of broadband access. Just 35 miles from London, and you don't have access to cable or ADSL. Canada has a long way to go in terms of providing Internet access to remote areas (but when I say remote, I mean remote, like hundreds of kilometers out in the sticks), but cable (and ADSL) access is a gimme near larger population centres.



Cheers - Dan Cool

Re: [RedRum] Now I'm excited!! In reply to
Yep, that would be it. Much cleaner and more flexible code.



Cheers - Dan Cool

Re: [dan] Now I'm excited!! In reply to
Wooooooo

I just finished touching up the code and started it running.

I am just sitting here watching the files appear :)


Adult.part
Arts.part
Business.part


...ahhh, isn't it just so satisfying O:)

I'm up to Society in under 4 minutes.

Last edited by:

RedRum: Feb 14, 2002, 12:02 PM
Re: [RedRum] Now I'm excited!! In reply to
It is fun to watch - especially when you consider the size of the file (almost a GB) it is working on. And with broadband access coupled with Apache / WGET installed locally, the whole process from start to finish is amazingly fast.



Cheers - Dan Cool

Re: [dan] Now I'm excited!! In reply to
Sure is.

Can I see your code and I'll send you mine?....just out of curiosity (in private bahaha).

I'm going to work on a shell version now.

Last edited by:

RedRum: Feb 14, 2002, 12:12 PM
Re: [RedRum] Now I'm excited!! In reply to
Ok,

I have a browser version and a shell version (telnet/ssh) completed.

I've tested both and they do what they are meant to :)

With the ssh version you need to specify the path to the unzipped rdf file and the path to output the files. An example would be:

perl parse_rdf_shell.cgi --rdf-inpath=/path/to/content.rdf.u8 --rdf-outpath=/path/to/any/directory

In the normal version (not as highly recommended due to timeouts) you will need to edit the 2 variables at the top of the file specifying the paths.
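
For illustration, here is one way the two options shown in the example command could be read, using the core Getopt::Long module. This is a guess at the approach, not the actual parse_rdf_shell.cgi code. The parse_args helper and the hash keys are made-up names.

```perl
use strict;
use warnings;
use Getopt::Long qw(GetOptionsFromArray);

# Hypothetical helper: parse the two path options from a list of arguments.
sub parse_args {
    my @args = @_;
    my %opt;
    GetOptionsFromArray(
        \@args,
        'rdf-inpath=s'  => \$opt{inpath},     # path to the unzipped RDF file
        'rdf-outpath=s' => \$opt{outpath},    # directory for the output files
    ) or die "Usage: $0 --rdf-inpath=FILE --rdf-outpath=DIR\n";

    defined $opt{$_} or die "Missing required option for $_\n"
        for qw(inpath outpath);
    return \%opt;
}

# Same arguments as the example command line.
my $opt = parse_args(
    '--rdf-inpath=/path/to/content.rdf.u8',
    '--rdf-outpath=/path/to/any/directory',
);
print "reading from $opt->{inpath}\n";
print "writing to   $opt->{outpath}\n";
```

In a real script you would pass @ARGV to parse_args rather than a hardcoded list.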

Is anyone interested in the scripts themselves, or would it be easier for me to offer the split-up sections? I can update them easily every week.

Any thoughts?

Last edited by:

RedRum: Feb 14, 2002, 1:10 PM