Gossamer Forum

HTML parsing

Hi,
I am a newbie to Perl and I need some help parsing some HTML text.

My HTML tree looks like this:

html
head
body
h1
pre
....
h3......some text...</h3>...some more
text....<br>...more text.....<p><img.....>
h3
p



I am basically trying to extract the text from the first h3 tag up to the second h3 tag. (I don't need the <p> <img..> either.)

I am able to get the text enclosed within the <h3>...</h3> tags, but I am having trouble with the text outside them.

thanks in advance

kiran
Re: [hakiran] HTML parsing
Quote:
I am basically trying to extract text inside the first h3 tag

Quote:
I am able to get the text enclosed within the <h3>...</h3> tags

I'm not sure what the problem is?
Re: [Paul] HTML parsing
I guess the indentation was lost when I posted it...

The tree starts with body as the root tag, and has as its child elements

h1, pre, h3, h3...

I want to extract the text starting from the first h3 tag up to the beginning of the second h3 tag.

<body>

<h1>

<pre>

<h3>......text.....</h3>.......some more text.....<br>.....more text.....<p><img...> (want to extract text in this line)

<h3>....

</body>



thanks
Re: [hakiran] HTML parsing
I'm still not 100% clear - is this any closer?

my ($extracted) = $html =~ m|</h3>(.*?)<h3>|si;
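
For anyone following along, here is a minimal, self-contained sketch of how that one-liner behaves. The sample markup in $html is made up to match the structure described in the thread, and the tag-stripping step at the end is an assumed extra, not part of the suggestion above.

```perl
use strict;
use warnings;

# Hypothetical sample input, shaped like the tree described in the thread.
my $html = <<'HTML';
<body>
<h1>Title</h1>
<pre>preformatted</pre>
<h3>first heading</h3>some more text<br>more text<p><img src="x.png">
<h3>second heading</h3>
</body>
HTML

# Non-greedy capture of everything between the first </h3> and the next <h3>.
# /s lets "." match newlines; /i makes the tag match case-insensitive.
my ($extracted) = $html =~ m|</h3>(.*?)<h3>|si;
die "pattern did not match\n" unless defined $extracted;

# Assumed extra step: replace leftover tags (<br>, <p>, <img>) with a space,
# then collapse and trim whitespace.
(my $text = $extracted) =~ s/<[^>]+>/ /g;
$text =~ s/\s+/ /g;
$text =~ s/^\s+|\s+$//g;

print "$text\n";
```

Note that the capture starts after the closing </h3>, so the heading text itself is not included. A regex like this also only holds up while the markup stays this regular.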
Re: [Paul] HTML parsing
Sorry about that... I forgot to mention that I want to extract the text using the HTML::TreeBuilder module.
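
One way this could be approached with HTML::TreeBuilder is to find the first h3 and then walk its right-hand siblings until the next h3 turns up. The following is only a sketch: the sample markup is hypothetical, and it assumes TreeBuilder's default settings, under which loose text in the body is kept as plain string children of body rather than being wrapped in an implicit p element.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical sample input, shaped like the tree described in the thread.
my $html = <<'HTML';
<html><head><title>sample</title></head><body>
<h1>Title</h1>
<pre>preformatted</pre>
<h3>first heading</h3>some more text<br>more text<p><img src="x.png">
<h3>second heading</h3>
<p>trailing paragraph</p>
</body></html>
HTML

my $tree = HTML::TreeBuilder->new_from_content($html);

# In scalar context look_down returns the first matching element.
my $first_h3 = $tree->look_down(_tag => 'h3')
    or die "no <h3> found\n";

# Start with the heading's own text.
my @parts = ($first_h3->as_text);

# In list context right() returns the following siblings in document order.
# Text nodes are plain strings; elements are HTML::Element objects.
for my $node ($first_h3->right) {
    if (ref $node) {
        last if $node->tag eq 'h3';          # stop at the second h3
        next if $node->tag =~ /^(?:p|img)$/; # the <p><img ...> part is not wanted
        my $inner = $node->as_text;          # empty for <br>
        push @parts, $inner if length $inner;
    }
    else {
        push @parts, $node;
    }
}

# Join the pieces and tidy up whitespace.
my $text = join ' ', @parts;
$text =~ s/\s+/ /g;
$text =~ s/^\s+|\s+$//g;

print "$text\n";

$tree->delete;    # free the tree explicitly
```

Skipping every p element is a choice made for this sample. If some paragraphs between the two headings do carry wanted text, that test would need to be narrowed, for example to paragraphs that contain only an img.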