Gossamer Forum

Speed Theory...

I'm just wondering something. Does anyone know / has anyone done any benchmarking on this ...

I have to use the HTML from a routine in more than one place per script install, which means I'm passing the HTML to and from routines quite a lot.

Something like this:

Code:
#!/usr/bin/perl

use strict;
use LWP::Simple;

sub _test1 {
# --------------------------------------

    # note: LWP::Simple's get() doesn't set $!, so die with our own message
    my $got = get('http://www.perl.com') or die "Couldn't fetch page";

    _test2($got);
}

sub _test2 {
# --------------------------------------

    my $got = $_[0];

    # do something to $got here, e.g. strip <b> tags
    $got =~ s/<b>(.+?)<\/b>/$1/gi;

    print "Content-type: text/html\n\n";
    print $got;
}

Basically, I want to know whether it would be better (for large pages) to pass the URL into the routine and simply call get() again to grab the page...

...or...

...keep passing the HTML through to the routines (i.e. I'm not sure whether it's quicker to hold the page in the server's memory and pass it around, or to pass the URL and re-grab the page each time before processing).

TIA

Andy (mod)
andy@ultranerds.co.uk
Re: [Andy] Speed Theory...
I would think that calling get every time you need the HTML would not be a good idea. I would set it up so that you pass the URL around and only grab the HTML once, and only if it's actually needed:

Code:
#!/usr/bin/perl -w
use strict;
use LWP::Simple;

use vars qw/%HTML/;

sub _test1 {
# --------------------------------------

    my $url = 'http://www.perl.com';

    _test2($url);
}

sub _test2 {
# --------------------------------------

    my $url = $_[0];

    my $got = get_page($url);

    # do something to $got here
    $got =~ s/<b>(.+?)<\/b>/$1/gi;

    print "Content-type: text/html\n\n";
    print $got;
}

sub get_page {
    my $url = shift;

    # Send back the cached HTML if we already have it
    return $HTML{$url} if exists $HTML{$url};

    # otherwise grab it (get() doesn't set $!, so use our own message)
    $HTML{$url} = get($url) or die "Couldn't fetch $url";

    return $HTML{$url};
}

That will give you a few benefits:

1) You don't have to pass big chunks of data around
2) You only get it when and if you need it
3) You only get it once
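
For example (using the get_page sub above), repeated calls for the same URL only hit the network once:

Code:
my $first  = get_page('http://www.perl.com'); # fetched and cached
my $second = get_page('http://www.perl.com'); # served from %HTML, no new request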

~Charlie
Re: [Chaz] Speed Theory...
I'm planning on using a foreach with a list of URLs... wouldn't a hash gobble up a lot of memory? I know when I was working on a CSV import script for a travel site, it was giving me errors about running out of memory, and that was only with about 20,000 shortish descriptions.

Thanks for your reply so far :)

Cheers

Andy (mod)
andy@ultranerds.co.uk
Re: [Andy] Speed Theory...
LWP::Parallel is your friend.
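
Its usage looks roughly like this (a sketch based on the module's synopsis rather than anything I've run here, so check the docs):

Code:
#!/usr/bin/perl
use strict;
use LWP::Parallel::UserAgent;
use HTTP::Request;

my @urls = qw(http://www.perl.com http://www.cpan.org);

# Register all the requests up front; the agent fetches them in parallel
my $pua = LWP::Parallel::UserAgent->new();
foreach my $url (@urls) {
    if (my $error = $pua->register(HTTP::Request->new(GET => $url))) {
        warn $error->error_as_HTML;
    }
}

# wait() blocks until every request has completed (or timed out)
my $entries = $pua->wait();
foreach my $key (keys %$entries) {
    my $res = $entries->{$key}->response;
    print $res->request->url, ': ', $res->code, "\n";
}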

Also, yes, invest in some memory. Lots of memory.

- wil
Re: [Andy] Speed Theory...
Yeah, that would probably eat up lots of RAM. You may be stuck with passing around the HTML. I'd still avoid calling get every time you need the HTML, though; that's going to be time-consuming.

~Charlie
Re: [Chaz] Speed Theory...
For, say, 100,000 HTML pages held in memory (i.e. already grabbed)... do you have any idea how much memory that would use? If I have enough RAM, I may just try running it this way. I have 1 GB.

TIA

Andy (mod)
andy@ultranerds.co.uk
Re: [Andy] Speed Theory...
It will depend greatly on the size of the pages, but at a 50 KB average page size, 100,000 pages works out to roughly 5 GB if you keep all of the HTML in memory.

Is this script going to be running as a cron job or part of an interactive script? If response time isn't an issue, maybe get won't be all that bad.

Have you looked into LWP::Parallel like Wil suggested? I'm not familiar with it, so I don't know what it will do for you.

~Charlie
Re: [Chaz] Speed Theory...
It's a user-based thing, *but* it only runs 5 links per page... when it needs to handle more than 5 per page, it's run via SSH, where it should be able to get through at least 100,000 URLs to check :(

Cheers

Andy (mod)
andy@ultranerds.co.uk
Re: [Andy] Speed Theory...
I'm not really great at Perl, but isn't there a way to use a while loop to process them one at a time, instead of reading them all into memory?
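
Something like this would do it, I think (a rough sketch; the urls.txt file with one URL per line is made up for the example):

Code:
#!/usr/bin/perl
use strict;
use LWP::Simple;

# Handle one URL at a time so only a single page's HTML
# is ever held in memory
open my $fh, '<', 'urls.txt' or die "Can't open urls.txt: $!";
while (my $url = <$fh>) {
    chomp $url;
    next unless $url;
    my $html = get($url);
    unless (defined $html) {
        warn "Couldn't fetch $url\n";
        next;
    }
    # ... do something with $html here ...
}
close $fh;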
Re: [Andy] Speed Theory...
Also, you can pass by reference rather than copying the HTML when you call a sub.

i.e.:

Code:
my $foo = "some very large string";
process(\$foo);

sub process {
    my $foo = shift;
    print $$foo; # Dereference it to access it
}

Cheers,

Alex
--
Gossamer Threads Inc.
Re: [Alex] Speed Theory...
Thanks, but that just went right over my head :(

Cheers

Andy (mod)
andy@ultranerds.co.uk
Re: [Andy] Speed Theory...
Consider:

Code:
sub process1 {
    my $string = shift;        # $string is a copy of the caller's value
    $string =~ s/large/small/; # only the copy is changed
}

sub process2 {
    my $string = shift;         # $string is a reference to the caller's variable
    $$string =~ s/large/small/; # changes the original through the reference
}

my $foo = "large string\n";
process1($foo);
print $foo;

my $foo2 = "large string\n";
process2(\$foo2);
print $foo2;

will output:

Code:
large string
small string

In the first function, you make a copy of $foo, so when you change it, you don't change the original. In the second case, you pass a reference to $foo2, so when you change it, you are changing the original variable. So if you are dealing with very large strings, it's always better to pass a reference than to make a copy each time you call a subroutine (unless, of course, you actually want to work on a copy and leave the original unchanged).

For more info, see:

http://perldoc.com/...ml#Pass-by-Reference

Cheers,

Alex
--
Gossamer Threads Inc.
Re: [Alex] Speed Theory...
Aaah... that makes more sense. Thanks :)

Andy (mod)
andy@ultranerds.co.uk
Re: [Andy] Speed Theory...
Within a limited context, a globally scoped variable would serve as well...

Code:
my $got;

sub _test1 {
# --------------------------------------

    # no "my" here: that would create a new lexical that shadows
    # the global instead of filling it in
    $got = get('http://www.perl.com') or die "Couldn't fetch page";
}

sub _test2 {
# --------------------------------------

    # do something to $got here
    $got =~ s/<b>(.+?)<\/b>/$1/gi;

    print "Content-type: text/html\n\n";
    print $got;
}

Here $got is visible to both routines. Might not be best for large scripts though.