Gossamer Forum

Suggestions for reading Web Page (Scraping)?

I need to get text from a website, but have never really thought about it before...

Basically I need to compare data from a government website in an automated way. I need a way to "get" a web page and then "save" its contents. (From there I can use pattern matching to compare the numbers and find what I need.)

Someone suggested a "Screen Scraper"... In the forums here I've seen a couple of references - does anybody have any comments, suggestions?

I think for my purpose having the data saved as a text-based file is the best solution. I've done similar things with Net::FTP, but never thought about trying it via HTTP.

Any comments are welcome. Is this legal, ethical, etc.?
Re: [Watts] Suggestions for reading Web Page (Scraping)?
You'd probably want to look at using LWP::UserAgent and HTTP::Request.
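
A rough sketch of how those two modules fit together, in case it helps. The sub name save_page, the agent string and the 30 second timeout are made up for illustration; the script only does anything when it is given a URL and an output file on the command line.

```perl
#!/usr/bin/perl
use strict;
use warnings;

use LWP::UserAgent;
use HTTP::Request;

# Fetch $url and write the response body to $path.
# Returns the number of characters written; dies on any failure.
# (save_page is a hypothetical helper name, not part of LWP.)
sub save_page {
    my ($url, $path) = @_;

    my $ua = LWP::UserAgent->new(timeout => 30);   # timeout is an assumption
    $ua->agent('page-saver/0.1');                  # hypothetical agent string

    my $request  = HTTP::Request->new(GET => $url);
    my $response = $ua->request($request);

    die "Can't fetch $url: " . $response->status_line . "\n"
        unless $response->is_success;

    # decoded_content returns undef if decoding fails; fall back to raw bytes.
    my $content = $response->decoded_content;
    $content = $response->content unless defined $content;

    open(my $out, '>:encoding(UTF-8)', $path)
        or die "Can't write $path: $!\n";
    print {$out} $content;
    close($out) or die "Can't close $path: $!\n";

    return length $content;
}

# Run only when a URL and an output file are given on the command line.
if (@ARGV == 2) {
    my ($url, $path) = @ARGV;
    my $chars = save_page($url, $path);
    print "Saved $chars characters to $path\n";
}
```

Compared with a one-line fetch, going through the response object lets you report the HTTP status line when the request fails instead of silently writing an empty file.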

Philip
------------------
Limecat is not pleased.
Re: [Watts] Suggestions for reading Web Page (Scraping)?
Something like this should work:

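Code:
#!/usr/bin/perl

use strict;
use warnings;

use LWP::Simple;

my $url        = 'http://www.domain.com/something/foo/bar.html';
my $write_path = 'file1.txt';

# get() returns the page body as a single string, or undef on failure.
my $page = get($url);
die "Can't fetch $url" unless defined $page;

# Three-argument open with a lexical filehandle; write the page as UTF-8.
open(my $out, '>:encoding(UTF-8)', $write_path)
    || die "Can't write $write_path. Reason: $!";
print {$out} $page;
close($out);

# CGI header, then a short confirmation.
print "Content-type: text/html\n\n";
print "Done!";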

Hope that helps.

Cheers

Andy (mod)
andy@ultranerds.co.uk
Re: [Andy] Suggestions for reading Web Page (Scraping)?
Thanks for the feedback guys!

Andy's example works. I think I'll be able to utilize it to start with.
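
For the comparison step from the original question, here is a minimal sketch of what the pattern matching might look like once two copies of the page have been saved. It assumes the values of interest sit on lines shaped like "Label: 123.45", which is purely an assumption about the data; a real page will need its own pattern. The sub names parse_values, changed_labels and slurp are made up for this example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Build a label => number map from lines such as "Rate: 4.25".
# The "Label: number" layout is an assumption, not the real page format.
sub parse_values {
    my ($text) = @_;
    my %value;
    while ($text =~ /^\s*([A-Za-z][\w ]*?)\s*:\s*(-?\d+(?:\.\d+)?)\s*$/mg) {
        $value{$1} = $2;
    }
    return %value;
}

# Return, sorted, the labels in the new snapshot that are either
# missing from the old snapshot or carry a different number.
sub changed_labels {
    my ($old_text, $new_text) = @_;
    my %old = parse_values($old_text);
    my %new = parse_values($new_text);

    my @changed = grep { !exists $old{$_} || $old{$_} != $new{$_} } keys %new;
    return sort @changed;
}

# Read a whole file into one string.
sub slurp {
    my ($path) = @_;
    open(my $fh, '<', $path) or die "Can't read $path: $!\n";
    local $/;
    my $text = <$fh>;
    close($fh);
    return defined $text ? $text : '';
}

# Run only when two saved files are given on the command line.
if (@ARGV == 2) {
    my ($old_file, $new_file) = @ARGV;
    print "Changed: $_\n"
        for changed_labels(slurp($old_file), slurp($new_file));
}
```

The values are compared numerically, so "4.50" and "4.5" count as the same number; switch to a string comparison (ne) if formatting changes should be reported as well.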