Gossamer Forum

Wget and Meta refresh

Hi everybody,

I have a problem with Wget and I hope you can help me.

I am trying to fetch the HTML of a results page on a biology website for a keyword search.
My problem is that the results page is displayed after a redirect page, so when I fetch the URL with Wget I get the HTML of the redirect page and not that of the results page. The redirect page contains a META refresh tag which points to the same URL as the results page.

Results page: "http://panther.appliedbiosystems.com/genes/geneList.do?searchType=basic&loggedPrivilege=false&fieldName=all&fieldValue=PDGFRB&listType=1&organism=&dataset=NCBI%3A+H.+sapiens"
Refresh page:
<META http-equiv="refresh" content="0; URL=/genes/geneList.do?searchType=basic&amp;loggedPrivilege=false&amp;fieldName=all&amp;fieldValue=PDGFRB&amp;listType=1&amp;organism=&amp;dataset=NCBI%3A+H.+sapiens">


Can anybody help me get around this redirect page?

Thanx in advance for your help.

BloodyMary
Re: [BloodyMary] Wget and Meta refresh
Hi,

Your best bet is to use get() to grab the page first, and look for the meta-refresh tag. For example:

Code:
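use strict;
use warnings;
use LWP::Simple;
use URI;

my $url  = 'http://www.domain.com/somepage.html';
my $page = get($url);
die "Could not fetch $url\n" unless defined $page;

# Capture the target of a META refresh tag, if the page has one.
my $newurl;
if ($page =~ m/<META\s+http-equiv="refresh"\s+content="[^"]*?URL=([^"]*)"/i) {
    $newurl = $1;
    $newurl =~ s/&amp;/&/g;                            # undo HTML entity encoding
    $newurl = URI->new_abs($newurl, $url)->as_string;  # resolve relative URLs
}

if (defined $newurl) {
    # go to the $newurl value, and not the normal one
} else {
    # process normal link
}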
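
Since the original question was about Wget, the same idea can also be sketched in shell. This is only a rough sketch under some assumptions: the helper names extract_refresh_url, resolve_url and fetch_following_refresh are made up for illustration, the META tag is assumed to sit on one line with a double-quoted content attribute (as in the first post), and only one refresh is followed.

```shell
#!/bin/sh
# Hypothetical helper: read HTML on stdin and print the URL named in the
# first META refresh tag, with &amp; decoded to &. Prints nothing if the
# page has no such tag.
extract_refresh_url() {
    grep -i 'http-equiv="refresh"' |
        sed -n 's/.*[Uu][Rr][Ll]=\([^">]*\).*/\1/p' |
        head -n 1 |
        sed 's/&amp;/\&/g'
}

# Hypothetical helper: turn the extracted target ($2) into an absolute URL
# using the page it came from ($1). Only absolute and root-relative
# targets are handled in this sketch.
resolve_url() {
    case $2 in
        http://*|https://*) printf '%s\n' "$2" ;;
        /*) printf '%s%s\n' \
                "$(printf '%s\n' "$1" | sed 's|^\([a-z]*://[^/]*\).*|\1|')" "$2" ;;
        *)  echo "resolve_url: unsupported target: $2" >&2; return 1 ;;
    esac
}

# Fetch a page; if it carries a META refresh, fetch the target instead.
fetch_following_refresh() {
    page=$(wget -q -O - "$1") || return 1
    target=$(printf '%s\n' "$page" | extract_refresh_url)
    if [ -n "$target" ]; then
        wget -q -O - "$(resolve_url "$1" "$target")"
    else
        printf '%s\n' "$page"
    fi
}
```

It would be called as fetch_following_refresh 'http://www.domain.com/somepage.html' > result.html, with the URL in single quotes.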

Hope that helps.

Cheers

Andy (mod)
andy@ultranerds.co.uk
Re: [Andy] Wget and Meta refresh
Hi,

Thanx for your reply.

The problem is that the META refresh URL is the same as the URL of the redirect page itself (see my first post).
So following it just returns the redirect page again, and the problem starts over and over.
I have never seen this before, so I can't think of a solution.

BloodyMary
Re: [BloodyMary] Wget and Meta refresh
You're right. The refresh URL is the same, but in a browser we can see the https protocol. I think they check the User-Agent. Did you try with LWP::UserAgent?
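
The User-Agent idea can also be tried without leaving wget, using its --user-agent option rather than LWP::UserAgent. A minimal sketch: the helper name fetch_as_browser and the exact browser string are assumptions for illustration, and whether the site really checks the User-Agent is only a guess.

```shell
#!/bin/sh
# Assumed browser-like identification string; any current browser's
# User-Agent value could be used here instead.
BROWSER_UA='Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0'

# Hypothetical helper: fetch a URL to standard output while announcing a
# browser User-Agent instead of wget's default "Wget/<version>".
fetch_as_browser() {
    wget -q -O - --user-agent="$BROWSER_UA" "$1"
}
```

For example: fetch_as_browser 'http://www.domain.com/somepage.html' > result.html. If the output is still the refresh page, the User-Agent is probably not what the site checks.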

Cheers,

Dat

Re: [BloodyMary] Wget and Meta refresh
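Using single quotes around the URL should allow wget to follow the redirect, e.g. wget 'http://foo.com'. Without the quotes, the shell treats each & in the URL as a command separator, so wget only receives the part of the URL before the first &.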
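
The effect of the quotes can be checked without touching the network. In this small sketch, show_arg is a made-up stand-in for wget that just prints the argument it receives, and the foo.com URL is a placeholder. The unquoted case is simulated with eval.

```shell
#!/bin/sh
# Hypothetical stand-in for wget: prints the single argument it was given.
show_arg() {
    printf '%s\n' "$1"
}

# Quoted: the command receives the whole URL, ampersands included.
quoted=$(show_arg 'http://foo.com/search?a=1&b=2')

# Unquoted (simulated with eval so this script itself stays valid): the
# shell splits the line at "&", runs the first part in the background,
# and treats "b=2" as a plain variable assignment.
unquoted=$(eval "show_arg http://foo.com/search?a=1&b=2")

printf 'quoted:   %s\n' "$quoted"
printf 'unquoted: %s\n' "$unquoted"
```

The second line of output shows the URL cut off after a=1, which is what wget would be asked to fetch if the quotes were left out.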
Re: [Hargreaves] Wget and Meta refresh
Hi,

Thanx for your help.

I couldn't try your suggestions for a couple of days, because the site was down (eeeeeeeeeh, not because of me I hope?? hehe). So I just tried the single quote thing: no difference unfortunately...

I still have to look into the UserAgent module. I have never used it before, so I'm curious how it works.
I'm beginning to think that it's faster to look for another site with the same information....

But if there are more suggestions: please let me know!

BloodyMary