Gossamer Forum

Speed Theory...

I'm just wondering something. Does anyone know / has anyone done any benchmarking on this ...

I have to use the HTML from a routine in more than one place per script install, which means I'm passing the HTML to and from routines quite a lot.

Something like this:

Code:
#!/usr/bin/perl

use strict;
use LWP::Simple;

sub _test1 {
# --------------------------------------

    # note: LWP::Simple's get() doesn't set $!, so die with our own message
    my $got = get('http://www.perl.com') or die "Couldn't fetch page";

    _test2($got);
}

sub _test2 {
# --------------------------------------

    my $got = $_[0];

    # do something to $got here, e.g. strip <b> tags
    $got =~ s/<b>(.+?)<\/b>/$1/gi;

    print "Content-type: text/html\n\n";
    print $got;
}

Basically, I want to know whether it would be better (for large pages) to pass the URL into the routine and simply call get() again to grab the page...

...or...

...keep passing the HTML through to the routines (i.e. I'm not sure whether it's quicker to hold the page in the server's memory and pass it around, or to pass the URL and re-grab the page each time before processing).

TIA

Andy (mod)
andy@ultranerds.co.uk
Re: [Andy] Speed Theory...
I would think that calling get every time you need the HTML would not be a good idea. I would set it up so that you pass the URL around and only grab the HTML once, and only if it's actually needed:

Code:
#!/usr/bin/perl -w
use strict;
use LWP::Simple;

use vars qw/%HTML/;

sub _test1 {
# --------------------------------------

    my $url = 'http://www.perl.com';

    _test2($url);
}

sub _test2 {
# --------------------------------------

    my $url = $_[0];

    my $got = get_page($url);

    # do something to $got here
    $got =~ s/<b>(.+?)<\/b>/$1/gi;

    print "Content-type: text/html\n\n";
    print $got;
}

sub get_page {
    my $url = shift;

    # Send back the cached HTML if we already have it
    return $HTML{$url} if exists $HTML{$url};

    # otherwise grab it (get() doesn't set $!, so use our own message)
    $HTML{$url} = get($url) or die "Couldn't fetch $url";

    return $HTML{$url};
}

That will give you a few benefits:

1) You don't have to pass big chunks of data around
2) You only get it when and if you need it
3) You only get it once
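
For example (using the get_page sub above), repeated calls for the same URL only hit the network once:

Code:
my $first  = get_page('http://www.perl.com'); # fetched and cached
my $second = get_page('http://www.perl.com'); # served from %HTML, no new request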

~Charlie
Re: [Chaz] Speed Theory...
I'm planning on using a foreach with a list of URLs... wouldn't a hash gobble up a lot of memory? I know when I was working on a CSV import script for a travel site, it was giving me errors about running out of memory, and that was only with about 20,000 shortish descriptions.

Thanks for your reply so far :)

Cheers

Andy (mod)
andy@ultranerds.co.uk
Re: [Andy] Speed Theory...
LWP::Parallel is your friend.
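
Its usage looks roughly like this (a sketch based on the module's synopsis rather than anything I've run here, so check the docs):

Code:
#!/usr/bin/perl
use strict;
use LWP::Parallel::UserAgent;
use HTTP::Request;

my @urls = qw(http://www.perl.com http://www.cpan.org);

# Register all the requests up front; the agent fetches them in parallel
my $pua = LWP::Parallel::UserAgent->new();
foreach my $url (@urls) {
    if (my $error = $pua->register(HTTP::Request->new(GET => $url))) {
        warn $error->error_as_HTML;
    }
}

# wait() blocks until every request has completed (or timed out)
my $entries = $pua->wait();
foreach my $key (keys %$entries) {
    my $res = $entries->{$key}->response;
    print $res->request->url, ': ', $res->code, "\n";
}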

Also, yes, invest in some memory. Lots of memory.

- wil
Re: [Andy] Speed Theory...
Yeah, that would probably eat up lots of RAM. You may be stuck with passing around the HTML. I'd still avoid calling get every time you need the HTML, though; that's going to be time-consuming.

~Charlie
Re: [Chaz] Speed Theory...
For, say, 100,000 HTML pages held in memory (i.e. already grabbed)... do you have any idea how much memory that would use? If I have enough RAM, I may just try running it this way. I have 1 GB.

TIA

Andy (mod)
andy@ultranerds.co.uk
Re: [Andy] Speed Theory...
It will depend greatly on the size of the pages, but at a 50 KB average page size, 100,000 pages works out to roughly 5 GB if you keep all of the HTML in memory.

Is this script going to be running as a cron job or part of an interactive script? If response time isn't an issue, maybe get won't be all that bad.

Have you looked into LWP::Parallel like Wil suggested? I'm not familiar with it, so I don't know what it will do for you.

~Charlie
Re: [Chaz] Speed Theory...
It's a user-based thing, *but* it only runs 5 links per page... when it needs to handle more than 5 per page, it's run via SSH, where it should be able to get through at least 100,000 URLs to check :(

Cheers

Andy (mod)
andy@ultranerds.co.uk
Re: [Andy] Speed Theory...
I'm not really great at Perl, but isn't there a way to use a while loop to process them one at a time, instead of reading them all into memory?
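
Something like this would do it, I think (a rough sketch; the urls.txt file with one URL per line is made up for the example):

Code:
#!/usr/bin/perl
use strict;
use LWP::Simple;

# Handle one URL at a time so only a single page's HTML
# is ever held in memory
open my $fh, '<', 'urls.txt' or die "Can't open urls.txt: $!";
while (my $url = <$fh>) {
    chomp $url;
    next unless $url;
    my $html = get($url);
    unless (defined $html) {
        warn "Couldn't fetch $url\n";
        next;
    }
    # ... do something with $html here ...
}
close $fh;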
Re: [Andy] Speed Theory...
Also, you can pass by reference rather than copying the HTML when you call a sub.

i.e.:

Code:
my $foo = "some very large string";
process(\$foo);

sub process {
    my $foo = shift;
    print $$foo; # Dereference it to access it
}

Cheers,

Alex
--
Gossamer Threads Inc.
Re: [Alex] Speed Theory...
Thanks, but that just went right over my head :(

Cheers

Andy (mod)
andy@ultranerds.co.uk
Re: [Andy] Speed Theory...
Consider:

Code:
sub process1 {
    my $string = shift;        # $string is a copy of the caller's value
    $string =~ s/large/small/; # only the copy is changed
}

sub process2 {
    my $string = shift;         # $string is a reference to the caller's variable
    $$string =~ s/large/small/; # changes the original through the reference
}

my $foo = "large string\n";
process1($foo);
print $foo;

my $foo2 = "large string\n";
process2(\$foo2);
print $foo2;

will output:

Code:
large string
small string

In the first function, you make a copy of $foo, so when you change it, you don't change the original. In the second case, you pass a reference to $foo2, so when you change it, you are changing the original variable. So if you are dealing with very large strings, it's always better to pass a reference than to make a copy each time you call a subroutine (unless, of course, you actually want to work on a copy and leave the original unchanged).

For more info, see:

http://perldoc.com/...ml#Pass-by-Reference

Cheers,

Alex
--
Gossamer Threads Inc.
Re: [Alex] Speed Theory...
Aaah... that makes more sense. Thanks :)

Andy (mod)
andy@ultranerds.co.uk
Re: [Andy] Speed Theory...
Within a limited context, a globally scoped variable would serve as well...

Code:
my $got;

sub _test1 {
# --------------------------------------

    # no "my" here: that would create a new lexical that shadows
    # the global instead of filling it in
    $got = get('http://www.perl.com') or die "Couldn't fetch page";
}

sub _test2 {
# --------------------------------------

    # do something to $got here
    $got =~ s/<b>(.+?)<\/b>/$1/gi;

    print "Content-type: text/html\n\n";
    print $got;
}

Here $got is visible to both routines. Might not be best for large scripts though.