Gossamer Forum

Problem retrieving Meta-Tag from page

Hi,

I am using a page-retrieval script based on the "add spider" code from Links-2.

If the page contains the normal Meta-Tag such as:

<META name="description" content="here are some words">

then everything is OK and it works to retrieve the Tag.

But I run into pages that have the attributes in the opposite order, as follows:

<META content="here are some words" name=description>

For the above, nothing is retrieved, because the regex expects name before content (and expects the name value to be quoted).

This is the snippet of code I use from add_Spider to retrieve the Tag:

if ($t =~ m!.*?<META([^\>]*?)(NAME|HTTP-EQUIV)="description"([^\>]*?)(CONTENT|VALUE)="([^\"]+)"!i){$Description = ' '.$5;}

I really don't have a clue how to change the code so that it also retrieves the second, reversed form of the Meta-Tag.

Thanks and Regards,

sanuk

Re: [sanuk] Problem retrieving Meta-Tag from page
use HTML::TokeParser;
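
For illustration, here is a minimal sketch of how HTML::TokeParser could be used for this. It is not the spider's own code; the helper name meta_description is made up for the example. The parser returns each tag's attributes as a hash with lowercased keys, so the order of name and content, and whether the values are quoted, no longer matters.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical helper: return the content of the "description" META tag,
# or nothing if the document has none.
sub meta_description {
    my ($html) = @_;
    my $parser = HTML::TokeParser->new(\$html)
        or die "Cannot set up parser";

    # get_tag returns [$tag, \%attr, \@attrseq, $text] for each start tag.
    while (my $tag = $parser->get_tag('meta')) {
        my $attr = $tag->[1];    # hash ref, attribute names lowercased
        my $name = $attr->{name} // $attr->{'http-equiv'} // '';
        return $attr->{content} if lc $name eq 'description';
    }
    return;
}

my $normal   = '<META name="description" content="here are some words">';
my $reversed = '<META content="here are some words" name=description>';

# Both forms should yield the same description text.
print meta_description($_), "\n" for $normal, $reversed;
```
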

- wil

Last edited by Wil: May 17, 2003, 12:54 PM

Re: [sanuk] Problem retrieving Meta-Tag from page
You should use LWP::UserAgent as it retrieves page headers including the document title and description.

Attached is a demo script I just made (and tested).

I've left the error checking for you to add =)
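
Some background on where those headers come from, as a sketch rather than the demo script itself: by default LWP::UserAgent passes the document head through HTML::HeadParser, which copies <meta name="..."> values into X-Meta-... headers and the <title> text into a header called Title. The example below calls HTML::HeadParser directly on a made-up page so it runs without a network connection.

```perl
use strict;
use warnings;
use HTML::HeadParser;

# Made-up sample page; note the reversed, unquoted description tag.
my $html = <<'HTML';
<html><head>
<title>Example page</title>
<META content="here are some words" name=description>
<meta name="keywords" content="perl, lwp, meta">
</head><body><p>Body text</p></body></html>
HTML

my $head = HTML::HeadParser->new;
$head->parse($html);

# The <title> element is exposed as "Title", not as "X-Meta-Title".
my $title       = $head->header('Title');
my $description = $head->header('X-Meta-Description');
my $keywords    = $head->header('X-Meta-Keywords');

print "title:       $title\n";
print "description: $description\n";
print "keywords:    $keywords\n";
```

With a response object from LWP::UserAgent, the same names should be readable through $response->header('Title') and so on.
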

Last edited by Paul: May 17, 2003, 2:20 PM

Re: [Paul] Problem retrieving Meta-Tag from page
Hi,
Thanks for the responses.
I tried out the script and it works fine for the description.
After changing description to:
$re->headers->{'x-meta-keywords'};
I also get the keywords,
but $re->headers->{'x-meta-title'}; does not work ???

If I start using this good code of yours, I will also need to retrieve the body text of the page.
And to make even more problems for you:

I need to put each of them in a string,
such as:
such as:
$title = "x-meta-title";
$description = "x-meta-description";
$keywords = "x-meta-keywords";
$content = "content of Page";

The reason is that I have to clean strange language characters out of every string
and then cut each one down to a certain maximum length.

Any further help would be appreciated, if it is not too much to ask of your valuable time.

Thanks and regards,
Sanuk

Re: [sanuk] Problem retrieving Meta-Tag from page
http://www.gossamer-threads.com/...i?post=200193#200193

- wil

Re: [Wil] Problem retrieving Meta-Tag from page
Hi,

Thanks for the response, Wil.

But I don't understand what you mean by sending me to this URL.

Thanks and regards,

Sanuk

Re: [sanuk] Problem retrieving Meta-Tag from page
It's a code snippet to demonstrate the simple use of HTML::TokeParser.

- wil

Re: [sanuk] Problem retrieving Meta-Tag from page
Hi,

What you require is pretty simple once you know how. You can do everything with LWP::UserAgent, so you don't need an HTML parsing module. Try the attached script.
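
The attachment is not shown here, so the following is not that script, only a rough sketch of the approach described. It assumes "cleaning" means dropping everything outside printable ASCII, and it uses a crude regex tag stripper for the body text. The names clean, body_text, page_fields and the 200-character limit are all made up for the example.

```perl
use strict;
use warnings;
use LWP::UserAgent;

my $MAX_LEN = 200;    # hypothetical maximum length for every string

# Replace non-ASCII and control characters, collapse whitespace, truncate.
sub clean {
    my ($text, $max) = @_;
    $text = '' unless defined $text;
    $text =~ s/[^\x20-\x7E]+/ /g;    # anything outside printable ASCII
    $text =~ s/\s+/ /g;
    $text =~ s/^ | $//g;
    return substr($text, 0, $max);
}

# Crude body-text extraction: keep what is inside <body>, drop all tags.
sub body_text {
    my ($html) = @_;
    $html =~ s/^.*?<body\b[^>]*>//is;
    $html =~ s/<\/body>.*$//is;
    $html =~ s/<(script|style)\b.*?<\/\1>//gis;
    $html =~ s/<[^>]*>/ /g;
    return $html;
}

# Build the four cleaned strings from an HTTP::Response object.
sub page_fields {
    my ($res) = @_;
    my $html = $res->decoded_content // $res->content;
    return (
        title       => clean(scalar $res->header('Title'), $MAX_LEN),
        description => clean(scalar $res->header('X-Meta-Description'), $MAX_LEN),
        keywords    => clean(scalar $res->header('X-Meta-Keywords'), $MAX_LEN),
        content     => clean(body_text($html), $MAX_LEN),
    );
}

# Fetch only when a URL is given on the command line.
if (@ARGV) {
    my $ua  = LWP::UserAgent->new(timeout => 30);
    my $res = $ua->get($ARGV[0]);
    die "Fetch failed: ", $res->status_line, "\n" unless $res->is_success;
    my %page = page_fields($res);
    print "$_: $page{$_}\n" for qw(title description keywords content);
}
```

Run it with a URL as the only argument. Note that the title comes from the Title header; there is no X-Meta-Title header unless the page happens to have a META tag with that name.
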

Last edited by Paul: May 18, 2003, 3:15 AM

Re: [Paul] Problem retrieving Meta-Tag from page
Hi,

Yes, thanks Paul... it works perfectly.

Sorry for the late reply, but I have difficulty getting on the net with our prehistoric dial-up connection here.

By the way, what do you think of using

use LWP::RobotUA;

instead of: use LWP::UserAgent;

Everything seems to be the same, except that LWP::RobotUA obeys the robots.txt file.

I just don't know whether it also gives a warning when the requested file is disallowed by the robots file.

Thanks and Regards,

Sanuk

Re: [sanuk] Problem retrieving Meta-Tag from page
I've not used LWP::RobotUA before, but I just took a look at the POD. It seems that you could use it if you are writing a robot. LWP::UserAgent is more of a generic user agent class, whereas RobotUA (as the name suggests) is specifically for robots that need to adhere to robots.txt files.
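
On the warning question, going by the POD: as far as I can tell, a disallowed URL does not raise a warning. The robot simply hands back a 403 response with the message "Forbidden by robots.txt" and never contacts the server for that page. A minimal sketch follows, with a made-up robot name, contact address and rules. The rules are preloaded by hand so the example runs offline; normally the robot fetches /robots.txt itself on the first request to a host.

```perl
use strict;
use warnings;
use LWP::RobotUA;

# Hypothetical robot name and contact address; both are required.
my $ua = LWP::RobotUA->new('ExampleBot/0.1', 'webmaster@example.com');

# Delay between requests to the same host, in minutes (the default is 1).
$ua->delay(1 / 60);

# Preload made-up rules for example.com instead of fetching them.
$ua->rules->parse(
    'http://example.com/robots.txt',
    "User-agent: *\nDisallow: /private/\n",
);

# This URL is disallowed, so the response is generated locally.
my $res = $ua->get('http://example.com/private/page.html');

if ($res->code == 403 && $res->message =~ /robots\.txt/) {
    print "Blocked: ", $res->status_line, "\n";
}
```

So the check to add to a script is on the response code and message, the same way you would check any other failed request.
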