Gossamer Forum: General: Perl Programming: Problem retrieving Meta-Tag from page

May 17, 2003, 12:24 PM

sanuk

User (103 posts)

Post #1 of 10

Problem retrieving Meta-Tag from page

Hi,

I am using a page retrieval script based on the "add spider" code from links-2.

If the page contains the normal Meta-Tag such as:

<META name="description" content="here are some words">

then everything is OK and it works to retrieve the Tag.

But I run into pages that have turned the Tag around as follows:

<META content="here are some words" name=description>

and for the above nothing is retrieved, because the attributes of the Tag are turned around.

This is the snippet of code I use from add_Spider to retrieve the Tag:

if ($t =~ m!.*?<META([^\>]*?)(NAME|HTTP-EQUIV)="description"([^\>]*?)(CONTENT|VALUE)="([^\"]+)"!i){$Description = ' '.$5;}

I really don't have a clue how to change the code so that it also retrieves the second, turned-around form of the Meta-Tag.

Thanks and Regards,

sanuk

May 17, 2003, 12:53 PM

Wil

Veteran / Moderator (4108 posts)

Post #2 of 10

Re: [sanuk] Problem retrieving Meta-Tag from page

use HTML::TokeParser;
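
A minimal sketch of what that could look like, assuming the page source is already in a string. The helper name meta_content is made up for illustration. HTML::TokeParser hands back each tag's attributes as a hash with lowercased names, so the order of name and content, and whether the values are quoted, should no longer matter.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical helper: return the content of the first <meta> tag whose
# name (or http-equiv) matches $wanted, or undef if there is none.
sub meta_content {
    my ($html, $wanted) = @_;
    my $p = HTML::TokeParser->new(\$html) or return;
    while (my $tag = $p->get_tag('meta')) {
        my $attr = $tag->[1];    # hash ref of attributes, keys lowercased
        my $name = $attr->{name} // $attr->{'http-equiv'};
        next unless defined $name && lc $name eq lc $wanted;
        return $attr->{content} // $attr->{value};
    }
    return;
}
```

With the variables from the original snippet, this could be called as `$Description = ' ' . (meta_content($t, 'description') // '');`.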

- wil

Last edited by:

Wil: May 17, 2003, 12:54 PM

May 17, 2003, 2:19 PM

Paul

Veteran (19537 posts)

Post #3 of 10

Re: [sanuk] Problem retrieving Meta-Tag from page

You should use LWP::UserAgent: it parses the <head> of the HTML pages it fetches and exposes the document title and meta tags such as the description as response headers.

Attached is a demo script I just made (and tested).

I've left the error checking for you to add =)
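
As a rough sketch of this approach (not the attached script itself): with head parsing enabled, which is the default, LWP::UserAgent copies each `<meta name="...">` tag into an `X-Meta-...` response header. The helper name fetch_description and the command-line handling are assumptions made for illustration.

```perl
use strict;
use warnings;
use LWP::UserAgent;

# Hypothetical helper: fetch $url with the given user agent and return the
# meta description, or undef if the request failed or no tag was found.
sub fetch_description {
    my ($ua, $url) = @_;
    my $res = $ua->get($url);
    return unless $res->is_success;
    return scalar $res->header('X-Meta-Description');
}

# Usage: perl script.pl http://www.example.com/
if (@ARGV) {
    my $ua = LWP::UserAgent->new(timeout => 30);
    $ua->parse_head(1);    # already the default, stated here for clarity
    my $desc = fetch_description($ua, $ARGV[0]);
    print defined $desc ? "$desc\n" : "no description found\n";
}
```

Because the head is parsed rather than pattern-matched, the order of the name and content attributes in the page should not matter.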

Last edited by:

Paul: May 17, 2003, 2:20 PM

May 17, 2003, 2:51 PM

sanuk

User (103 posts)

Post #4 of 10

Re: [Paul] Problem retrieving Meta-Tag from page

Hi,
Thanks for the responses.
I tried out the script and it works for the description.
By changing description to:
$re->headers->{'x-meta-keywords'};
I also get the keywords,
but $re->headers->{'x-meta-title'}; does not work ???

If I start using this good code of yours, I will also need to retrieve the body text of the page.
And to make even more problems for you:

I need to put each of them in a string,
such as:
$title = "x-meta-title";
$description = "x-meta-description";
$keywords = "x-meta-keywords";
$content = "content of Page";

The reason is that I have to clean out strange language characters from every string
and then shorten each one to a certain maximum length.

Any further help would be appreciated, if it is not too much to ask of your valuable time.

Thanks and regards,
Sanuk

May 17, 2003, 4:06 PM

Wil

Veteran / Moderator (4108 posts)

Post #5 of 10

Re: [sanuk] Problem retrieving Meta-Tag from page

http://www.gossamer-threads.com/...i?post=200193#200193

- wil

May 17, 2003, 8:34 PM

sanuk

User (103 posts)

Post #6 of 10

Re: [Wil] Problem retrieving Meta-Tag from page

Hi,

Thanks for the response, Wil.

But I don't understand what you mean by sending me to this URL.

Thanks and regards,

Sanuk

May 18, 2003, 2:16 AM

Wil

Veteran / Moderator (4108 posts)

Post #7 of 10

Re: [sanuk] Problem retrieving Meta-Tag from page

It's a code snippet to demonstrate the simple use of HTML::TokeParser.

- wil

May 18, 2003, 3:03 AM

Paul

Veteran (19537 posts)

Post #8 of 10

Re: [sanuk] Problem retrieving Meta-Tag from page

Hi,

What you require is pretty simple once you know how. You can do everything with LWP::UserAgent; you don't need an HTML parsing module. Try the attached script.
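
A sketch of how the four strings could be collected (again not the attached script). The document title is not a meta tag, so LWP exposes it under the `Title` header rather than `X-Meta-Title`, which is why the `x-meta-title` lookup returned nothing. The helper names clean and page_fields, the 200-character limit, and the reading of "strange language characters" as anything outside printable ASCII are all assumptions made for illustration.

```perl
use strict;
use warnings;
use LWP::UserAgent;

# Hypothetical helper: keep printable ASCII only, collapse whitespace,
# and cut the result to at most $max characters.
sub clean {
    my ($text, $max) = @_;
    $text = '' unless defined $text;
    $text =~ s/[^\x20-\x7E]+/ /g;    # assumption: "strange" means non-ASCII
    $text =~ s/\s+/ /g;
    $text =~ s/^ | $//g;
    return substr($text, 0, $max);
}

# Hypothetical helper: build the four strings from an HTTP::Response.
sub page_fields {
    my ($res, $max) = @_;
    my $body = $res->decoded_content;
    $body = $res->content unless defined $body;
    $body =~ s/<(script|style)\b.*?<\/\1>/ /gis;    # drop script and style blocks
    $body =~ s/<[^>]*>/ /g;                         # crude tag stripping, not a parser
    return (
        title       => clean(scalar $res->header('Title'), $max),
        description => clean(scalar $res->header('X-Meta-Description'), $max),
        keywords    => clean(scalar $res->header('X-Meta-Keywords'), $max),
        content     => clean($body, $max),
    );
}

# Usage: perl script.pl http://www.example.com/
if (@ARGV) {
    my $ua  = LWP::UserAgent->new(timeout => 30);
    my $res = $ua->get($ARGV[0]);
    die 'Fetch failed: ' . $res->status_line . "\n" unless $res->is_success;
    my %page = page_fields($res, 200);    # 200 is an arbitrary maximum length
    print "$_: $page{$_}\n" for sort keys %page;
}
```

The regex-based tag stripping keeps to the "no parsing module" idea, but it is only an approximation and may leave debris on pages with unusual markup.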

Last edited by:

Paul: May 18, 2003, 3:15 AM

May 18, 2003, 7:07 AM

sanuk

User (103 posts)

Post #9 of 10

Re: [Paul] Problem retrieving Meta-Tag from page

Hi,

Yes, thanks Paul... works perfectly.

Sorry for the late reply, but I have difficulty getting on the net with our prehistoric dial-up connection here.

By the way, what is your opinion on using

use LWP::RobotUA;

instead of: use LWP::UserAgent;

It seems everything is the same, but only the former obeys the robots.txt file.

I just don't know if it also gives a warning when the requested file is excluded by the robots.txt file.

Thanks and Regards,

Sanuk

May 18, 2003, 8:23 AM

Paul

Veteran (19537 posts)

Post #10 of 10

Re: [sanuk] Problem retrieving Meta-Tag from page

I've not used LWP::RobotUA before but I just took a look at the pod. It seems that you could use it if you are writing a robot. LWP::UserAgent is probably more a generic user agent class, whereas RobotUA (as the name suggests) is specifically for robots that need to adhere to robots.txt files.
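
For what it's worth, a sketch of how the switch might look, going by the LWP::RobotUA documentation. The agent name, contact address, and the helper blocked_by_robots are placeholders made up for illustration. As far as I can tell, a URL excluded by robots.txt is not fetched at all; the module hands back a 403 response with a "Forbidden by robots.txt" message, so that is what the helper checks for.

```perl
use strict;
use warnings;
use LWP::RobotUA;

# Hypothetical helper: true if the response looks like a robots.txt refusal
# generated by LWP::RobotUA (assumed to be a 403 mentioning robots.txt).
sub blocked_by_robots {
    my ($res) = @_;
    return $res->code == 403 && $res->message =~ /robots\.txt/i ? 1 : 0;
}

# Usage: perl script.pl http://www.example.com/
if (@ARGV) {
    # Agent name and contact address are placeholders.
    my $ua = LWP::RobotUA->new('my-spider/0.1', 'me@example.com');
    $ua->delay(1 / 60);    # delay is given in minutes; this is about one second
    my $res = $ua->get($ARGV[0]);
    if (blocked_by_robots($res)) {
        print "Skipped: excluded by robots.txt\n";
    }
    elsif ($res->is_success) {
        print 'Description: ', ($res->header('X-Meta-Description') // ''), "\n";
    }
    else {
        print 'Fetch failed: ', $res->status_line, "\n";
    }
}
```

Since LWP::RobotUA is a subclass of LWP::UserAgent, the rest of a script written for LWP::UserAgent should carry over unchanged.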
