Discerning the title from a html string

A question most likely for Paul.

I have read an HTML file into a string. I need to extract from this string a title for the record. The title needs to be the text between the first <h*>...</h*> tags, or perhaps the text inside a <p><b></b></p> or <p></p> pair.

The format of the documents varies, in that some begin with a <h1> tag, whilst others begin with a <p>.

Any help would be very much appreciated.

I had tried using:
Code:
$title = $string;
$title =~ s/<h1>*{1,200}</h1>//i;
$title = $';
------------------
Dorg Hurgler Van Schongleur,
NordHein Van Resetelem, Belgium.
Eck SchekeBuugler Technologies.

Last edited by searchposts: May 18, 2002, 3:40 AM

Re: [searchposts] Discerning the title from a html string
Is the title to be removed or extracted? That is, are you trying to nuke it, or store it in a variable?

Is there more than one on a page, or does the title appear just once? And where on the page does it appear?

Re: [Paul] Discerning the title from a html string
In Reply To:
Is the title to be removed or extracted? That is, are you trying to nuke it, or store it in a variable?

Is there more than one on a page, or does the title appear just once? And where on the page does it appear?


1) The title needs to be extracted and placed into a $calar variable.

2) The title appears normally within the first few lines of the document. Sometimes the format is as such:



5th May 2005

N.E. One-Surname.

Title of the Document

-------------------------

whilst at other times this is the format:

Title of the Document

Body text here here



---------------------------

and in some cases there is no title. The HTML document never contains the kind of heading one would find on a normal web page, such as a logo, because it is saved from a Word document. In some cases the Word document has no title, so the script needs to determine one. The regex or sub needs to take the string and extract a logical title from it. This automates the process of going through each document and choosing a title by hand. Of course, in some cases that will still need to be done, but for large numbers of HTML documents automation is preferable.

I hope that this explains the problem slightly more coherently!

Thanks for your time, Paul.

Re: [searchposts] Discerning the title from a html string
$title =~ m,<h1>(.+?)</h1>,;
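
As written, the match leaves the heading text in $1 rather than in $title. A minimal sketch of one way to land it in a variable, assuming the HTML is in $string as in the first post. The sample markup and the /i and /s flags are additions for illustration, not part of the suggestion above.

```perl
use strict;
use warnings;

# Sample input; hypothetical, stands in for the file read into $string.
my $string = "<html><body>\n<H1>Title of the\nDocument</H1>\n<p>Body text</p></body></html>";

# In list context the match returns the captured group, so $title is
# undef when the string contains no <h1> element at all.
# /i also matches <H1>; /s lets . cross line breaks inside the heading.
my ($title) = $string =~ m,<h1>(.+?)</h1>,is;

if (defined $title) {
    $title =~ s/\s+/ /g;    # fold the line break inside the heading
    print "$title\n";       # prints "Title of the Document"
}
```

Note that a literal <h1> will not match an opening tag that carries attributes, which exported HTML may well do.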

--Philip
Links 2.0 moderator

Re: [sponge] Discerning the title from a html string
Hi Philip

What an odd choice of delimiter! :-)

- wil

Re: [sponge] Discerning the title from a html string
That won't work with <p><b>Title</b></p> though, will it? :)
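
One way to cover the <p><b>Title</b></p> case is to try several patterns in priority order. What follows is only a sketch: guess_title is a made-up name, the attribute-tolerant patterns are an assumption about what the exported HTML looks like, and regexes are a heuristic here rather than a real HTML parser.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the first title-like text found, or undef.
sub guess_title {
    my ($html) = @_;

    # Tried in this order: any <h1>..<h6>, then <p><b>..</b></p>, then a plain <p>.
    # [^>]* tolerates attributes inside the opening tag; \b keeps <p from
    # matching <pre> and <b from matching <body> or <br>.
    my @patterns = (
        qr{<h[1-6]\b[^>]*>(.*?)</h[1-6]\s*>}is,
        qr{<p\b[^>]*>\s*<b\b[^>]*>(.*?)</b\s*>\s*</p\s*>}is,
        qr{<p\b[^>]*>(.*?)</p\s*>}is,
    );

    for my $re (@patterns) {
        next unless $html =~ $re;
        my $text = $1;
        $text =~ s/<[^>]*>//g;       # drop any nested tags
        $text =~ s/\s+/ /g;          # fold runs of whitespace
        $text =~ s/^\s+|\s+$//g;     # trim both ends
        return $text if length $text;
    }
    return;    # nothing usable found
}

print guess_title("<p><b>Title of the Document</b></p><p>Body text</p>"), "\n";
# prints "Title of the Document"
```

Because headings are tried first, a date or author paragraph that precedes an <h*> title is skipped. With no heading at all, the first matching paragraph wins, which may be the date line rather than the title.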

Last edited by Paul: May 18, 2002, 8:30 AM

Re: [Wil] Discerning the title from a html string
>>What an odd choice of delimiter! :-) <<

Not really, seeing as m// would have spewed an error.

Re: [Paul] Discerning the title from a html string
Yeah, but a comma "," is really ugly when it comes at the end of the line and you put a semi-colon after it. I would have used something like | or ! myself. More 'standardized' if you ask me. I don't often see scripts flying around with that notation.
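
For comparison, a small sketch of the same match written with the delimiters mentioned in this thread, plus braces. The sample string and variable names are assumptions for illustration; all five forms capture the same text.

```perl
use strict;
use warnings;

my $string = "<h1>Title of the Document</h1>";    # sample input (assumed)

# With / as the delimiter, the slash in </h1> has to be escaped,
# otherwise it ends the pattern early.
my ($t1) = $string =~ /<h1>(.+?)<\/h1>/;

# Any of these delimiters leaves the pattern itself untouched.
my ($t2) = $string =~ m,<h1>(.+?)</h1>,;
my ($t3) = $string =~ m|<h1>(.+?)</h1>|;
my ($t4) = $string =~ m!<h1>(.+?)</h1>!;
my ($t5) = $string =~ m{<h1>(.+?)</h1>};

print "$_\n" for $t1, $t2, $t3, $t4, $t5;    # the same title, five times
```

The one catch with | is that the pattern can then no longer use an unescaped | for alternation.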

- wil

Re: [sponge] Discerning the title from a html string
Thanks for the suggestion. The solution was not particularly well suited to my situation, so I settled on this:

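if ($title =~ /^\s*$/) {                  # title is empty or only whitespace

    $title = $text;
    $title =~ s/<[^>]*>//g;               # strip the tags
    $title = substr($title, 0, 255);      # cap the length
    $title =~ s/^\s+//;                   # drop leading whitespace
    $location = index($title, "\n");      # end of the first line, or -1
    # Without this check, a missing newline (-1) would chop the last character.
    $title = substr($title, 0, $location) if $location >= 0;

}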

Any improvements appreciated.

Best Wishes,

searchposts.