Gossamer Forum: General: Perl Programming: Parsing an XML feed

Oct 24, 2001, 7:15 AM

qango

Enthusiast (706 posts)

Post #1 of 17

Parsing an XML feed

Hi,

I'm being offered a third-party XML feed to supplement my own directory results (Links SQL V1.13). (Maybe some of you already do this?)

I'll obviously need to modify my search.cgi so that when no 'local' results are found it defaults to the XML feed and puts those results in the template, and possibly offers a 'More Results' button or link.

However, I'm not familiar with XML and the third-party tech guy mentioned running the feed through an XML parser to display the results - what's an XML parser? Is it a separate script that extracts the data from the XML? Would I write something into search.cgi to parse the results or is it an Apache server module, or part of CGI.pm?

Any advice or assistance appreciated :)

All the best
Shaun

Oct 25, 2001, 1:12 AM

Wil

Veteran / Moderator (4108 posts)

Post #2 of 17

Re: [qango] Parsing an XML feed

Hi

You probably need the Perl XML::Parser written by the infamous Larry Wall. You can learn more here:

http://wwwx.netheaven.com/...xmlparser/intro.html
http://www.xml.com/.../98/09/xml-perl.html

And you can download here:

ftp://wall.org/pub/larry/
http://www.cpan.org/modules/by-module/XML/

It should be very simple to do this, you know. I just don't have a clue how to go about it! In theory you should receive a file of data that's marked up with XML tags everywhere, and you use the parser to break down the file and extract the useful bits using the XML tags.

The 'official' Perl/XML FAQ is here, although it doesn't help you a lot:

http://www.perlxml.com/faq/perl-xml-faq.html

Your best bet to get started is probably the O'Reilly Network XML site, which is located at:

http://www.xml.com

Good luck! And please let me know how you get on. I'm very interested in this at the moment.

Cheers

Wil

- wil

Nov 20, 2001, 4:13 AM

qango

Enthusiast (706 posts)

Post #3 of 17

Re: [Wil] Parsing an XML feed

Wil,

Thanks for the resources. I've now installed 'expat' and the XML::Parser module on my server. I've also modified the LinksSQL search.cgi to run a 'sub' if it cannot find any results. Now I just need to parse the XML and reformat it into actual links (the hard part, I think!).

My problem is that I keep getting errors from XML::Parser and I'm having difficulty finding any sort of technical documentation about it.

Does anyone have any perl snippets that I could examine or use that parse XML content, or additional advice/resources about the in depth use of XML::Parser with perl?

All the best
Shaun

Nov 20, 2001, 6:36 AM

Mark Badolato

Veteran (1509 posts)

Post #4 of 17

Re: [qango] Parsing an XML feed

Did you already go through the documentation that comes with the module?

Nov 20, 2001, 7:11 AM

Wil

Veteran / Moderator (4108 posts)

Post #5 of 17

Re: [qango] Parsing an XML feed

Sorry Shaun, I can't help you at all! I want to learn XML and everything about it but simply don't have the time at the moment. Maybe over the Christmas holidays <g>.

Have you looked for any books relating to this issue? Maybe the XML::Parser man pages, .pod pages or any documentation that comes with it? Surely there's a mailing list for it somewhere?

Cheers

- wil

Nov 20, 2001, 10:10 AM

qango

Enthusiast (706 posts)

Post #6 of 17

Re: [Wil] Parsing an XML feed

Hi,

I'm going to play around with it for a while and try to get a handle on how it all fits together.

I did find some references at Apache and I think I'll experiment with a few different XML feeds to get the hang of really basic parsing before I start trying to do anything fancy with it ... small steps ... :-)

I'll probably come back to this when I know a little more ... thanks for the help.

All the best
Shaun

Nov 20, 2001, 10:02 PM

Matt G

User (155 posts)

Post #7 of 17

Re: [qango] Parsing an XML feed

Hi,

You might have a look at the BigWhat plugin... I believe that parses an XML feed.

--
Matt G

Nov 20, 2001, 11:24 PM

Alex

Administrator (9387 posts)

Post #8 of 17

Re: [qango] Parsing an XML feed

Hi,

Unless the XML is really complex, save yourself the grief and parse it in plain Perl with regular expressions. If you look at the BigWhat plugin as Matt suggests, you'll see it's two regexes (pretty simple ones at that).
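
For a feed with a flat, predictable layout, the regex approach might look something like the sketch below. This is not the BigWhat plugin's code; the sample data and the item/title/url element names are assumptions made up for illustration. It will not cope with attributes, CDATA sections, nested items or character entities.

```perl
use strict;
use warnings;

# Made-up sample data standing in for a fetched feed.
my $feed = <<'XML';
<items>
  <item><title>First site</title><url>http://www.example.com/1</url></item>
  <item><title>Second site</title><url>http://www.example.com/2</url></item>
</items>
XML

my @links;
# First regex: pull out each <item>...</item> block.
while ($feed =~ m{<item>(.*?)</item>}sg) {
    my $item = $1;
    # Second regex: collect every <tag>text</tag> pair in that block
    # into a hash keyed by tag name.
    my %field = $item =~ m{<(\w+)>(.*?)</\1>}sg;
    push @links, \%field;
}

print "$_->{title} => $_->{url}\n" for @links;
```

If the feed ever grows beyond this kind of flat structure, a real parser is the safer choice.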

Cheers,

Alex
--
Gossamer Threads Inc.

Nov 21, 2001, 1:14 AM

qango

Enthusiast (706 posts)

Post #9 of 17

Re: [Alex] Parsing an XML feed

Thanks for the suggestion Matt & Alex, I'll take a look.

I've discovered a problem with my installation of 'expat' that's been stalling my progress: it's only reading the first two characters of the XML encoding string, so UTF-8 becomes 'ut', ISO-8859-1 becomes 'is', etc., and because it's configured to die when it encounters any problems I haven't, as yet, been able to parse any XML at all :(

I'll get it fixed up and then take a look at BigWhat.

My idea has always been to simply parse the XML, 'pull' the minimum link data, format it, and then add it to the search results template.

If I do get it working I'll post details here about how to set it up and post the perl code for search.cgi in case anyone else wants to use it.

Thanks again.

All the best
Shaun

Nov 21, 2001, 12:56 PM

scottward

Novice (30 posts)

Post #10 of 17

Re: [qango] Parsing an XML feed

I can tell you that the easiest way is to modify the BigWhat plugin and make your own out of it...

That's what I did, it works great...

I get results from 4 different sources...

Let me know if you need any help...

-Scott

Nov 21, 2001, 3:38 PM

socrates

Enthusiast (832 posts)

Post #11 of 17

Re: [scottward] Parsing an XML feed

If anyone figured out a way to do this, I would like to know - please. I was contacted by another site to include their xml results but I don't know how to do it.

By the way, where is that bigwhat plugin available - site url, please?

thanks

Nov 21, 2001, 4:04 PM

scottward

Novice (30 posts)

Post #12 of 17

Re: [socrates] Parsing an XML feed

The Bigwhat Plugin is available at GT.

Just add the plugin through your plugin manager gallery...

It is fairly easy after you look at the script. After I properly package my plugin, I will make it available...

-Scott

Nov 22, 2001, 1:16 AM

qango

Enthusiast (706 posts)

Post #13 of 17

Re: [scottward] Parsing an XML feed

I'm still using LinksSQL 1.13 and assume the plugin is for 2.x? If so, can I get a copy of the relevant parsing section of the code?

All the best
Shaun

Nov 22, 2001, 6:29 AM

scottward

Novice (30 posts)

Post #14 of 17

Re: [qango] Parsing an XML feed

It is for Links SQL, but here is the parsing code for the mamma.com search engine...

Code:
# Parse Mamma HTML results.
    my @querylist = split /\+/, $query;
    # Strip the list markers, keeping what lies between them.
    $results =~ s,<!-- START LIST -->(.*?)<!-- END LIST -->,$1,s;
    my $link_results = '';
    my $link_count   = 0;
    while ($results =~ m,<!-- START ITEM -->(.*?)<!-- END ITEM -->,sg) {
        my $link = $1;
        my $bid  = "";
        # Capture in list context so a failed match gives undef, not a stale $1.
        my ($num)         = $link =~ m,000063>(.*?)</FONT>,;
        my ($description) = $link =~ m,<br>(.*?)<br>,;
        my ($redirect)    = $link =~ m,href="(.*?)">,;
        my ($title)       = $link =~ m,">(.*?)</a></b>,;
        my ($url)         = $link =~ m,1>(.*?)</font>,i;
        # Bold each query word in the displayed fields.
        # \Q...\E stops characters in the query from acting as regex syntax.
        foreach my $word (@querylist) {
            foreach my $field ($description, $title, $url) {
                $field =~ s/(\Q$word\E)/<b>$1<\/b>/gi if defined $field;
            }
        }
        # (excerpt ends here; the rest of the loop body was not posted)
    }

Last edited by:

scottward: Nov 22, 2001, 6:30 AM

Nov 26, 2001, 9:06 AM

qango

Enthusiast (706 posts)

Post #15 of 17

Re: [scottward] Parsing an XML feed

Hi,

I've made a start: installed XML::Parser, upgraded to perl 5.6.1 (to get expat.pm working) and cobbled together the first bit of code to fetch an XML feed and parse it:

Quote:

use LWP::Simple;
use XML::Parser;

my $websearch = get("http://p.moreover.com/cgi-local/page?index_webdeveloper+xml");

if ($websearch) {
    my $xmlparse = XML::Parser->new(ErrorContext => 2);
    $xmlparse->parse($websearch);
    print $xmlparse;
}

When printed, $xmlparse is as follows:
XML::Parser=HASH(0x83639d0)

I assumed that by using XML::Parser it would print out the data from each element within the XML document? I wasn't expecting the 'HASH' thingy!!

To really show my ignorance, what is a HASH and how do you convert it/break it down into readable/useable data strings, or even print out its contents? (I did search but gave up after a couple hours!)

I'm thinking that a sub routine to format the xml data would be useful and flexible for a variety of different feeds, and could be used to trigger formatting based on the xml tag that is being looked at.

I suppose what I'm really looking for is a simple breakdown of the xml data into its 'tag' and 'data', so I can format my output based on the items in the feed, e.g.;

if ($xml_tag eq "<url>") {$xml_results .= $xml_data formatted as a URL}
if ($xml_tag eq "<title>") {$xml_results .= $xml_data formatted for link title}

This way, if I change XML feeds I can simply update the 'triggers' to suit the new feed and the actual output I produce from the feed will remain the same.
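
One way to sketch those 'triggers' is a hash that maps each tag name to a small formatting routine, so that switching feeds only means editing the hash. The tag names, the HTML produced and the format_field helper below are all hypothetical, and the tag/data pairs are made up rather than taken from a real feed.

```perl
use strict;
use warnings;

# Tag name => formatting routine. Edit this table to suit a different feed.
my %format_for = (
    url   => sub { my ($data) = @_; qq{<a href="$data">$data</a>} },
    title => sub { my ($data) = @_; "<b>$data</b>" },
);

# Hypothetical helper: format one tag/data pair, or return '' for tags we ignore.
sub format_field {
    my ($xml_tag, $xml_data) = @_;
    my $formatter = $format_for{$xml_tag} or return '';
    return $formatter->($xml_data);
}

my $xml_results = '';
# Pairs as they might arrive from a parser; made up for the example.
for my $pair ([title => 'Example'], [url => 'http://www.example.com/'], [date => '2001-11-26']) {
    my $formatted = format_field(@$pair);
    $xml_results .= "$formatted\n" if length $formatted;
}
print $xml_results;
```

The 'date' pair is silently skipped because the table has no entry for it, which is the behaviour you would want for elements a feed carries but the template does not use.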

Here's a basic low-down on what I'm trying to do (if it helps visualise my idea):

• get xml feed (LWP::Simple)
• parse xml feed (XML::Parser)
• run through xml feed data and extract+format all or selected data elements (sub xml_extract)
• put formatted elements into output and return ($output)
• display results (xmlresults -> $output)
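
The steps above could be sketched roughly as follows, using XML::Parser's Start/Char/End handlers. The xml_extract name comes from the list; everything else (the item/title/url element names, the output markup, the sample data) is an assumption, and a literal string stands in for the LWP::Simple fetch so the example runs without a network connection.

```perl
use strict;
use warnings;
use XML::Parser;

# Parse a feed and return formatted HTML for the elements we care about.
sub xml_extract {
    my ($xml) = @_;
    my (@items, $current, $tag);

    my $parser = XML::Parser->new(
        ErrorContext => 2,
        Handlers     => {
            # Called for every opening tag.
            Start => sub {
                my ($expat, $name) = @_;
                $current = {} if $name eq 'item';    # a new record begins
                $tag = $name;
            },
            # Called for text between tags.
            Char => sub {
                my ($expat, $text) = @_;
                # Text can arrive in pieces, so append rather than assign.
                $current->{$tag} .= $text if $current && defined $tag;
            },
            # Called for every closing tag.
            End => sub {
                my ($expat, $name) = @_;
                if ($name eq 'item') {
                    push @items, $current;           # record is complete
                    $current = undef;
                }
                $tag = undef;
            },
        },
    );
    $parser->parse($xml);

    my $output = '';
    for my $item (@items) {
        $output .= qq{<a href="$item->{url}">$item->{title}</a><br>\n};
    }
    return $output;
}

# In real use this string would come from LWP::Simple's get().
my $websearch = <<'XML';
<feed>
  <item><title>Example One</title><url>http://www.example.com/one</url></item>
  <item><title>Example Two</title><url>http://www.example.com/two</url></item>
</feed>
XML

my $output = xml_extract($websearch);
print $output;
```

In a live search.cgi the title and URL values would also need HTML-escaping before going into the template.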

I hope this makes some sense, and any help would be much appreciated.

Thanks.

All the best
Shaun

Nov 26, 2001, 9:16 AM

Wil

Veteran / Moderator (4108 posts)

Post #16 of 17

Re: [qango] Parsing an XML feed

Do these help?

http://wwwx.netheaven.com/...mlparser/Parser.html
http://www.xml.com/.../98/09/xml-perl.html

Cheers

- wil

Nov 26, 2001, 9:17 AM

Paul

Veteran (19537 posts)

Post #17 of 17

Re: [qango] Parsing an XML feed

>>To really show my ignorance, what is a HASH and how do you convert it/break it down into readable/useable data strings, or even print out its contents? (I did search but gave up after a couple hours!)
<<

HASH means that $xmlparse is a hash reference (a blessed one here: it is the XML::Parser object itself rather than the parsed data).

Instead of print $xmlparse; try.....

print $xmlparse->{$_} foreach keys %{$xmlparse};
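
To make the hash-reference idea concrete, here is a small sketch that uses a made-up hash rather than the parser object. It shows what printing a reference looks like and two ways to see what is inside; Data::Dumper ships with Perl.

```perl
use strict;
use warnings;
use Data::Dumper;

# A made-up hash reference standing in for any object or data structure.
my $link = { title => 'Example', url => 'http://www.example.com/' };

# Printing the reference itself gives something like HASH(0x83639d0).
print "$link\n";

# Walk the keys to print each key/value pair.
my @lines = map { "$_ = $link->{$_}" } sort keys %$link;
print "$_\n" for @lines;

# Or dump the whole structure, nested data included.
$Data::Dumper::Sortkeys = 1;
print Dumper($link);
```

Dumping $xmlparse the same way should show the parser's own settings, not the feed. As far as I know, the feed's contents only reach your code through handlers or a Style option passed to XML::Parser->new.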

Last edited by:

PaulW: Nov 26, 2001, 9:18 AM