Gossamer Forum

Wget and Meta refresh

Hi everybody,

I have a problem with Wget and I hope you can help me.

I am trying to fetch the HTML of a page on a biology website that shows the results of a keyword search.
My problem is that the results page is only displayed after a redirect page, so when I run Wget on the URL I get the HTML of the redirect page instead of the results page. The redirect page contains a META refresh tag that points to the same URL as the results page.

Results page: "http://panther.appliedbiosystems.com/genes/geneList.do?searchType=basic&loggedPrivilege=false&fieldName=all&fieldValue=PDGFRB&listType=1&organism=&dataset=NCBI%3A+H.+sapiens"
Refresh page:
<META http-equiv="refresh" content="0; URL=/genes/geneList.do?searchType=basic&amp;loggedPrivilege=false&amp;fieldName=all&amp;fieldValue=PDGFRB&amp;listType=1&amp;organism=&amp;dataset=NCBI%3A+H.+sapiens">


Can anybody help me get around this redirect page?

Thanks in advance for your help.

BloodyMary
Re: [BloodyMary] Wget and Meta refresh
Hi,

Your best bet is to use get() from LWP::Simple to grab the page first, and look for the meta-refresh tag. For example:

Code:
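use strict;
use warnings;
use LWP::Simple;

my $url  = 'http://www.domain.com/somepage.html';
my $page = get($url);
die "Could not fetch $url\n" unless defined $page;

# $2 is whatever follows URL= in the meta refresh tag, if there is one
my $newurl;
$page =~ m/\Q<META http-equiv="refresh" content="\E(.*?)URL=(.*?)\Q">\E/i and $newurl = $2;

if (defined $newurl) {
    # go to the $newurl value, and not the normal one
    # (it may be relative and contain &amp;, so decode and resolve it first)
} else {
    # process normal link
}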

Hope that helps.
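
To make the idea concrete for the tag in the first post, here is a minimal sketch. The helper name meta_refresh_target is made up for illustration, the tag is shortened to two query parameters, and the URL resolution is deliberately simplified (it only handles targets that start with "/"), so treat it as a starting point rather than a complete solution.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): return the target of a
# <meta http-equiv="refresh"> tag as an absolute URL, or undef if none.
sub meta_refresh_target {
    my ($html, $base) = @_;

    # Capture whatever follows "URL=" inside the content attribute.
    my ($target) = $html =~ m{<meta\s+http-equiv="refresh"\s+content="[^"]*?URL=([^"]*)"}i
        or return;

    $target =~ s/&amp;/&/g;    # the tag is HTML, so "&" arrives as "&amp;"

    return $target if $target =~ m{^https?://}i;    # already absolute

    # Simplified resolution: only handles targets that start with "/".
    my ($origin) = $base =~ m{^(https?://[^/]+)}i or return;
    return $origin . $target if $target =~ m{^/};
    return;
}

# Shortened version of the tag from the first post (assumption: two parameters only).
my $html = '<META http-equiv="refresh" content="0; URL=/genes/geneList.do?searchType=basic&amp;fieldValue=PDGFRB">';
my $base = 'http://panther.appliedbiosystems.com/genes/geneList.do';

print meta_refresh_target($html, $base), "\n";
```

Note that the target in the post is relative (no "http" in it), so it has to be joined with the host of the page it came from before it can be fetched.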

Cheers

Andy (mod)
andy@ultranerds.co.uk


Re: [Andy] Wget and Meta refresh
Hi,

Thanks for your reply.

The problem is that the META refresh URL is the same as the URL of the redirect page (see my first post), so following it just returns the redirect page again, over and over.
I have never seen this before, so I can't think of a solution.

BloodyMary
Re: [BloodyMary] Wget and Meta refresh
You're right. The refresh URL is the same, but in a browser we can see that it uses the https protocol. I think they check the User-Agent. Did you try LWP::UserAgent?
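
A rough sketch of what that could look like is below. It assumes the site decides what to serve based on the User-Agent header or a cookie, which is only a guess. The agent string, the sub names and the hop limit of 5 are placeholders, not anything the site is known to require.

```perl
use strict;
use warnings;

# Pull the URL= target out of a meta refresh tag and undo &amp; encoding.
sub refresh_target {
    my ($html) = @_;
    my ($target) = $html =~ m{<meta\s+http-equiv="refresh"\s+content="[^"]*?URL=([^"]*)"}i
        or return;
    $target =~ s/&amp;/&/g;
    return $target;
}

# Fetch $url, following at most $max_hops meta refreshes.
sub fetch_following_refresh {
    my ($url, $max_hops) = @_;

    # Loaded here so that refresh_target() can be used without LWP installed.
    require LWP::UserAgent;
    require URI;

    my $ua = LWP::UserAgent->new(
        agent      => 'Mozilla/5.0 (compatible; example-fetcher)',    # placeholder
        cookie_jar => {},    # keep cookies between requests
    );

    for my $hop (0 .. $max_hops) {
        my $res = $ua->get($url);
        die "GET $url failed: ", $res->status_line, "\n" unless $res->is_success;

        my $html   = $res->decoded_content // $res->content;
        my $target = refresh_target($html);
        return $html unless defined $target;    # no refresh tag: this is the real page

        # Resolve a relative target against the URL that was actually served.
        $url = URI->new_abs($target, $res->base)->as_string;
    }
    die "Still being redirected after $max_hops meta refreshes\n";
}

# Only touch the network when a URL is given on the command line.
print fetch_following_refresh($ARGV[0], 5) if @ARGV;
```

The hop limit is there because the refresh points back at the same URL. If the site keeps answering with the redirect page even with cookies and a browser-like agent, the script stops instead of looping forever.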

Cheers,

Dat

Re: [BloodyMary] Wget and Meta refresh
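Using single quotes around the URL should allow wget to follow the redirect, because otherwise the shell treats each & in the URL as a command separator and wget never sees the full query string, e.g. wget 'http://foo.com'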
Re: [Hargreaves] Wget and Meta refresh
Hi,

Thanks for your help.

I couldn't try your suggestions for a couple of days, because the site was down (eh, not because of me, I hope? hehe). I have just tried the single quotes: no difference, unfortunately...

I still have to look into the LWP::UserAgent module. I have never used it before, so I'm curious how it works.
I'm beginning to think that it's faster to look for another site with the same information...

But if there are more suggestions: please let me know!

BloodyMary