
Spidering a Google query...

Hi All,

I'm trying to fetch a Google search results page so that I can parse it. Is this possible? Here's a snippet of code to demonstrate my problem:

Code:
use strict;
use warnings;
use LWP::Simple;

my $url = qq|http://www.google.co.uk/|; ## This works okay...

#my $url = qq|http://www.google.co.uk/search?q=cartoons|; ## This doesn't!

my $file = "/_google/google.htm";

getfile($url, $file);

exit;

# Mirror $url to $file. Returns 1 on success or "not modified", 0 on error.
sub getfile {
    my ($url, $file) = @_;
    print "Grab: $url\n";
    my $rc = mirror($url, $file);
    if ($rc == 304) {
        print "Up To Date.\n";
    } elsif (!is_success($rc)) {
        warn "Error: $rc ", status_message($rc), " ($url)\n";
        return 0;
    }
    return 1;
}

The problem is that the search URL returns a 403 error. My guess is that I'm going to have to send a spoofed set of headers so the request looks like IE to Google. Or am I heading in totally the wrong direction and wasting my time?


moog
-- I've spent most of my money on beer and women... the rest I just wasted.
Re: [moog] Spidering a Google query...
In one respect you are wasting your time, because Google will block your IP address for grabbing their content.

Anyhow, I'd do it like this:

Code:
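use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Request::Common qw(GET);

my $fe = qq{/path/to/file};
my $ur = qq{http://www.google.co.uk/search?q=cartoons};
my $ua = LWP::UserAgent->new( agent => 'Paul/1.0', timeout => 30 );
my $re = $ua->request( GET $ur );

if ($re->is_success) {
    # Write the response body to the output file.
    open my $fh, '>', $fe or die qq{Can't open $fe: $!};
    print $fh $re->content;
    close $fh or die qq{Can't close $fe: $!};
}
else {
    # Report the HTTP status line, e.g. a 403 if the request is refused.
    warn "Request failed: ", $re->status_line, "\n";
}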
Re: [Paul] Spidering a Google query...
Thanks Paul,

I was toying with the idea of going down the LWP::UserAgent route...

If my requests were paced to be kind to the server, and not sent on a fixed time cycle, would they still spot me?
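
To make the pacing idea concrete, here is a rough sketch of what I mean by requests that are not on a fixed cycle. The helper names `jittered_delay` and `fetch_politely` are made up for illustration, and the fetch itself is passed in as a code reference, so nothing here is tied to LWP. It only spaces requests out; I'm not assuming it changes how a site treats an automated client.

```perl
use strict;
use warnings;
use Time::HiRes qw(sleep);    # core module; allows fractional-second sleeps

# Hypothetical helper: pick a delay of at least $base seconds and less than
# $base + $jitter seconds. rand(0) behaves like rand(1), so guard that case.
sub jittered_delay {
    my ($base, $jitter) = @_;
    return $base + ($jitter > 0 ? rand($jitter) : 0);
}

# Hypothetical helper: call $fetch->($url) for each URL in order, pausing
# for a randomised delay between calls (no pause after the last one).
sub fetch_politely {
    my ($fetch, $base, $jitter, @urls) = @_;
    my @results;
    for my $i (0 .. $#urls) {
        push @results, $fetch->($urls[$i]);
        sleep(jittered_delay($base, $jitter)) if $i < $#urls;
    }
    return @results;
}
```

Calling `fetch_politely(\&some_fetch, 5, 10, @urls)` would wait somewhere between 5 and 15 seconds between requests, where `some_fetch` stands in for whatever routine actually retrieves a page.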


Re: [moog] Spidering a Google query...
Perhaps not. I got caught, though, just testing code like I showed you above, and they blocked me.
Re: [Paul] Spidering a Google query...
Oops... I think I just sold my soul as well. I'm getting a 'Forbidden Client' response. I even tried:

Code:
agent => 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)'
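
For anyone trying to see exactly what comes back in a case like this, a small logging helper along these lines might help. This is only a sketch: `describe_response` is a made-up name, and it assumes you already have an `HTTP::Response` object, which is what `LWP::UserAgent` returns from `request`. It reports what the server said and does nothing to get around a refusal.

```perl
use strict;
use warnings;
use HTTP::Response;

# Hypothetical helper: build a one-line summary of an HTTP::Response,
# suitable for a log file or a warn() call.
sub describe_response {
    my ($res) = @_;

    # The Server header may be absent, so fall back to a placeholder.
    my $server = $res->header('Server');
    $server = 'unknown' unless defined $server;

    my $kind = $res->is_success ? 'ok'
             : $res->is_error   ? 'error'
             :                    'other';

    return sprintf '%s [%s] server=%s', $res->status_line, $kind, $server;
}
```

With the response object from the earlier code, `warn describe_response($re), "\n";` would print the status line together with the server name, which makes it easier to tell a 403 from a timeout or a redirect.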

