Gossamer Forum

HTML::Parse Depreciated

I just wrote a bit of code to grab all the links off web sites. When I finished writing it, I found that one of the modules I used (HTML::Parse) is now deprecated! :(

So I went to have a look at HTML::Parser, which seems to be the replacement, but it looks so different (to an amateur like me at least).

Has anyone used HTML::Parser, and could you show me what the replacement for this would be? (I have tried a few things, and none worked.)

#----------------------------------
# Required for this part:
#use LWP::Simple;    # provides get()
#use HTML::Parse;
#use HTML::Element;
#use URI::URL;
my $html = get $url;
my $parsed_html = HTML::Parse::parse_html($html);

for (@{ $parsed_html->extract_links() }) {
    my $link     = $_->[0];                  # the raw link as found in the page
    my $urlb     = URI::URL->new($link);
    my $full_url = $urlb->abs($url);         # resolve relative links against the page URL
    print qq~$full_url<br>~;
}

#----------------------------------


http://www.iuni.com/...tware/web/index.html
Links Plugins
Re: [Ian] HTML::Parse Depreciated In reply to
Ah, the beauty of writing your own parser :)

Have you looked here?

http://www.perldoc.com/...lib/HTML/Parser.html
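
For the link-grabbing case specifically, the HTML-Parser distribution also ships HTML::LinkExtor, which may be closer to what the old extract_links() call did. A minimal sketch, assuming HTML::LinkExtor and URI are installed; the extract_links helper name, the sample HTML and the example.com URLs are made up for illustration:

```perl
use strict;
use warnings;
use HTML::LinkExtor;

# Return the absolute href of every <a> tag in $html.
# $base is the URL the page was fetched from; HTML::LinkExtor uses it
# to turn relative links into absolute ones.
sub extract_links {
    my ($html, $base) = @_;
    my @found;
    my $parser = HTML::LinkExtor->new(
        sub {
            my ($tag, %attr) = @_;
            # The callback fires for every link-bearing tag (img, link, ...),
            # so keep only anchors that actually carry an href.
            return unless $tag eq 'a' && defined $attr{href};
            push @found, "$attr{href}";    # stringify the URI object
        },
        $base,
    );
    $parser->parse($html);
    $parser->eof;
    return @found;
}

# Hypothetical sample page, standing in for the result of get($url).
my $html = '<a href="/about.html">About</a> <img src="logo.png"> '
         . '<a href="http://example.org/x">X</a>';
print "$_<br>\n" for extract_links($html, 'http://example.com/dir/index.html');
```

In the original script, $html would come from get($url) and $url would be passed as the base.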
Re: [Paul] HTML::Parse Depreciated In reply to
Yes I have thanks Paul.

And I think I am going to agree with you. I am going to have to write my own :). Regex lesson 101 time has finally come!

It is what goes between the ^ and the $ that does the magic, eh.

I need to pull out anything starting with http://, but finding the end of it is the trick. I guess I should look for href="**HERE**".


Re: [Ian] HTML::Parse Depreciated In reply to
Code:
$string =~ s#(http://[^"'>]+)#print $1#egi;

Last edited by:

Paul: Jun 13, 2002, 11:22 AM
Re: [Paul] HTML::Parse Depreciated In reply to
Quote:
$string =~ s#(http://[^"'>]+)#print $1#egi;


Paul, you make it seem so easy.

I will have to analyse this statement, so I can learn it myself!

My font has gone big again... either I am just having trouble with my posts today, or something weird is going on.



Re: [Ian] HTML::Parse Depreciated In reply to
That's just a quick example, but it should pick out most well-formed links.

$string would just contain the HTML, e.g.:

my $string = join("", get("http://www.wiredon.net/"));
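
Note that the s///e one-liner rewrites $string as it goes, because each URL is replaced by the return value of print. If the HTML is needed afterwards, a non-destructive variant could look like the sketch below. The find_urls name and the sample string are made up, and \s has been added to the character class as an assumption so that a bare URL in running text also ends at whitespace:

```perl
use strict;
use warnings;

# Collect every http:// URL in a string without modifying the string.
# In list context, m//g returns all captured matches at once.
sub find_urls {
    my ($html) = @_;
    return $html =~ m{(http://[^"'>\s]+)}gi;
}

# Hypothetical sample input, standing in for the fetched page.
my $string = q{<a href="http://example.com/one">1</a> <a href='http://example.com/two'>2</a>};
print "$_\n" for find_urls($string);
```

Like the original pattern, this only sees absolute http:// links; relative links such as href="page.html" are missed.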
Re: [Paul] HTML::Parse Depreciated In reply to
So who needs HTML::Parser anyways!

I guess I might if I can't get LWP::Simple to return the page status (200, 302 etc).... I thought that head($url) would have done this, but it just returns other stuff, which is not particularly useful in my case.
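
For what it is worth, head() in list context returns the content type, document length, modification time, expiry time and server, not the status code. LWP::Simple's getstore() and getprint() do return the code, and LWP::UserAgent exposes it on the response object. A minimal sketch of the latter, assuming LWP::UserAgent is installed; the http_status name is made up for illustration:

```perl
use strict;
use warnings;
use LWP::UserAgent;

# Return the numeric HTTP status (200, 302, 404, ...) for a URL.
# max_redirect => 0 stops LWP from following redirects, so a 302 is
# reported as 302 instead of the status of the page it points to.
sub http_status {
    my ($url) = @_;
    my $ua = LWP::UserAgent->new(timeout => 10, max_redirect => 0);
    my $response = $ua->head($url);
    return $response->code;
}

# Only touches the network when URLs are given on the command line.
printf "%s => %d\n", $_, http_status($_) for @ARGV;
```

When the connection itself fails, LWP builds an internal error response, so the function still returns a number (500) rather than dying.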

It seems the fewer modules you have to use, the more portable your code is, and the less likely it is to become outdated or deprecated, as the case may be.

Edit: I think my font issues stem from the horizontal lines created by quote, code or just a <hr>. If I backspace over one or remove a quote in a certain way, my formatting seems to change... hmmm.



Last edited by:

Ian: Jun 13, 2002, 11:44 AM
Re: [Ian] HTML::Parse Depreciated In reply to
>>
It seems the fewer modules you have to use, the more portable your code is, and the less likely it is to become outdated or deprecated, as the case may be.
<<

Unless you write your own like GT.
Re: [Paul] HTML::Parse Depreciated In reply to
Yes, I am not at that stage yet... still grasping the basics. But I can see how advantageous that would be.

By the way... my host should talk to you about a decent support ticket system.


Re: [Ian] HTML::Parse Depreciated In reply to
>>By the way... my host should talk to you about a decent support ticket system. <<

Is that a coincidence or have you been spying on me? 8) ...that's what I've been working on for the last 4-5 days (and what my parser is for).

I've just been ripping my hair out with the parser... you don't know how difficult it is mixing real Perl code with what looks like Perl code but isn't until it is compiled :) ...that made no sense, I bet, so here's an example:

Code:
sub loop_tag {
#----------------------------------------------------------
# This code is compiled later on.
# Not everything is as it appears..bahah!

    return qq|
        if (exists \$TAGS->{$_[1]}) {
            print \$TAGS->{$_[1]};
        }
        elsif (exists \$GLB->{$_[1]}) {
            if (substr(\$GLB->{$_[1]}, 0, 5) eq 'sub {') {
                local \$SIG{__DIE__};
                my \$code = eval \$GLB->{$_[1]};
                print ref \$code eq 'CODE' ? \$code->($_[2]) : \$@;
            }
            else {
                print \$GLB->{$_[1]};
            }
        }
        else {
            print sprintf(\$ERRORS->{UNKNOWN}, q{$_[1]});
        }
    |;
}

That code checks whether an encountered tag is a template tag, a global, or a global code reference, meaning that in my globals file I can use:

some_tag => 'sub { some code }'

....basically in the same way GT::Template::Parser does. Anyway sorry I'm going off topic.
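
Stripped of the code-generation layer, the lookup order described above could be sketched roughly like this. It is not the actual parser: %TAGS, %GLB, resolve_tag and the sample entries are hypothetical stand-ins, and the result is returned rather than printed so it is easier to test.

```perl
use strict;
use warnings;

# Hypothetical data: per-template tags and a globals table.
my %TAGS = (title => 'My Page');
my %GLB  = (
    site  => 'Example Site',
    shout => 'sub { uc shift }',    # stored as a string, compiled on demand
);

# Resolve a tag name: template tags first, then globals, then an error.
sub resolve_tag {
    my ($name, $arg) = @_;
    return $TAGS{$name} if exists $TAGS{$name};

    if (exists $GLB{$name}) {
        my $value = $GLB{$name};
        # Plain globals are returned as they are.
        return $value unless substr($value, 0, 5) eq 'sub {';

        # Globals that look like 'sub { ... }' are compiled with eval
        # and called; on a compile error the eval message is returned.
        local $SIG{__DIE__};
        my $code = eval $value;
        return ref($code) eq 'CODE' ? $code->($arg) : $@;
    }
    return sprintf 'Unknown tag: %s', $name;
}

print resolve_tag($_, 'hello'), "\n" for qw(title site shout missing);
```

Running eval on strings from a globals file means that file has to be trusted as much as the script itself.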
Re: [Paul] HTML::Parse Depreciated In reply to
Paul, it looks very complex. One chunk at a time is all the human brain could deal with, I bet!

Parsing out HTML tags from web pages looks like baby steps compared to your parser.

Actually, you showed me your ticket system, and it looks great! Gosh, if I get a few more customers, I might need one ;)


Re: [Ian] HTML::Parse Depreciated In reply to
>>
Actually, you showed me your ticket system, and it looks great! Gosh, if I get a few more customers, I might need one
<<

Was that a while ago? If it was longer than a week, then that one hit the recycle bin 8)

I started a new one a few days ago that is so much better.
Re: [Paul] HTML::Parse Depreciated In reply to
Really, can I peek? (Yes, it was over a week ago - I think... my brain seems to only retain code these days, and not things like what I was doing a few days ago, or putting the garbage out!)


Re: [Ian] HTML::Parse Depreciated In reply to
Wrong. The more (decent) modules you use the MORE portable your code will become by a very very very long shot.

Tried messing around with user input lately? How about line feeds on your friend's Mac system compared to your Linux box compared to your office XP machine? Fun, huh? :-)
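
To make the line-feed point concrete, this is roughly the kind of normalisation you end up hand-rolling for user input when no module does it for you. A minimal sketch in core Perl; the normalize_newlines name and the sample text are made up:

```perl
use strict;
use warnings;

# Fold Windows (CRLF) and classic Mac (CR) line endings to a plain LF.
sub normalize_newlines {
    my ($text) = @_;
    $text =~ s/\r\n?/\n/g;    # CRLF or a lone CR becomes LF
    return $text;
}

# Hypothetical input mixing all three conventions.
my $input = "line one\r\nline two\rline three\n";
print normalize_newlines($input);
```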

- wil
Re: [Wil] HTML::Parse Depreciated In reply to
So how common would something like HTML::Parser be?


Re: [Ian] HTML::Parse Depreciated In reply to
I'm sorry, Wil is incorrect. Using standard modules is OK, but once you start using non-standard modules, that will affect how many people use your script. You wouldn't believe how many hosts kick up a fuss about installing modules (or charge you for it), so people will just avoid the hassle and look elsewhere for another script. I'm talking from first-hand experience. I'm not saying that is always the case, but it certainly happens a lot.

I believe this is also one of the main reasons that GT have their own library, including tweaked standard modules like GT::Dumper instead of Data::Dumper: you can't rely on other servers to have everything you need, and if you intend to sell your script then this is a large consideration.

Last edited by:

Paul: Jun 14, 2002, 1:47 AM
Re: [Paul] HTML::Parse Depreciated In reply to
OK, I can see I need to choose one method or the other. Nothing personal to anyone, but for the sake of learning how to write my own library, I am going to attempt to write my own method. If that does not work out, then use these modules I will.

Thanks Wil and Paul. I have a little better appreciation of pros and cons of using these library modules.


Re: [Ian] HTML::Parse Depreciated In reply to
I think the best approach, and I think Paul will agree with me here, is that if you have the time and the energy to learn properly, go ahead and write your own module. You will learn a lot from it. But, at the same time, download a module that does the same job from CPAN. Check your code against the CPAN module and see how its author does things. Maybe you'll disagree with some points and do it your own way. Maybe you'll find a way of improving the code and end up contributing a patch to an existing module (this often happens). Maybe you'll even 'borrow' some functions for the more difficult parts that you don't quite understand yet. This is also often fine under various licences; you'll notice that GT do this in a few of their own modules (but not for the same reason as stated here).

Rgds

- wil
Re: [Wil] HTML::Parse Depreciated In reply to
Hi Wil,

A most excellent idea. I was wondering about looking at the modules, actually. I think you told me at least once in other posts that CPAN is my friend. Well, the time has come, and I think it is.

Thanks!

