Gossamer Forum

Parsing an XML feed

Hi,

I'm being offered a third-party XML feed to supplement my own directory results (Links SQL V1.13). (Maybe some of you already do this?)

I'll obviously need to modify my search.cgi so that when no 'local' results are found it falls back to the XML feed and puts those results in the template, possibly with a 'More Results' button or link.

However, I'm not familiar with XML and the third-party tech guy mentioned running the feed through an XML parser to display the results - what's an XML parser? Is it a separate script that extracts the data from the XML? Would I write something into search.cgi to parse the results or is it an Apache server module, or part of CGI.pm?

Any advice or assistance appreciated Smile

All the best
Shaun
Re: [qango] Parsing an XML feed In reply to
Hi

You probably need the Perl module XML::Parser, written by the infamous Larry Wall. You can learn more here:

http://wwwx.netheaven.com/...xmlparser/intro.html
http://www.xml.com/.../98/09/xml-perl.html

And you can download here:

ftp://wall.org/pub/larry/
http://www.cpan.org/modules/by-module/XML/

It should be very simple to do this, you know. I just don't have a clue how to go about it! In theory you should receive a file of data that's marked up with XML tags everywhere, and you use the parser to break down the file and extract the useful bits using the XML tags.

The 'official' Perl/XML FAQ is here, although it doesn't help you a lot:

http://www.perlxml.com/faq/perl-xml-faq.html

Your best bet to get started is probably the O'Reilly Network XML site, which is located at:

http://www.xml.com

Good luck! And please let me know how you get on. I'm very interested in this at the moment.

Cheers

Wil




- wil
Re: [Wil] Parsing an XML feed In reply to
Wil,

Thanks for the resources. I've now installed 'expat' and the XML::Parser module on my server. I've also modified the LinksSQL search.cgi to run a 'sub' if it cannot find any results. Now I just need to parse the XML and reformat it into actual links (the hard part, I think!).

My problem is that I keep getting errors from XML::Parser and I'm having difficulty finding any sort of technical documentation about it.

Does anyone have any perl snippets that I could examine or use that parse XML content, or additional advice/resources about the in depth use of XML::Parser with perl?

All the best
Shaun
Re: [qango] Parsing an XML feed In reply to
Did you already go through the documentation that comes with the module?
Re: [qango] Parsing an XML feed In reply to
Sorry Shaun, I can't help you at all! I want to learn XML and everything about it but simply don't have the time at the moment. Maybe over the Christmas holidays <g>.

Have you looked for any books relating to this issue? Maybe the XML::Parser man pages, .pod pages or any documentation that comes with it? Surely there's a mailing list for it somewhere?

Cheers

- wil
Re: [Wil] Parsing an XML feed In reply to
Hi,

I'm going to play around with it for a while and try to get a handle on how it all fits together.

I did find some references at Apache and I think I'll experiment with a few different XML feeds to get the hang of really basic parsing before I start trying to do anything fancy with it ... small steps ... :-)

I'll probably come back to this when I know a little more ... thanks for the help.

All the best
Shaun
Re: [qango] Parsing an XML feed In reply to
Hi,

You might have a look at the BigWhat plugin... I believe that parses an XML feed.

--
Matt G
Re: [qango] Parsing an XML feed In reply to
Hi,

Unless the XML is really complex, save yourself the grief and parse it in plain Perl with regular expressions. If you look at the BigWhat plugin as Matt suggests, you'll see it's two regexes (pretty simple ones at that).
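
For anyone following along, here is a rough sketch of what a two-regex approach can look like. It is not the BigWhat plugin's code: the feed layout and the element names (<item>, <title>, <url>) are made up for illustration, and the helper name extract_links is hypothetical. It only copes with flat, predictable XML (no attributes, CDATA, entities or nesting).

```perl
use strict;
use warnings;

# Hypothetical feed; the element names are assumptions, not a real feed's layout.
my $xml = <<'XML';
<results>
  <item><title>Perl.com</title><url>http://www.perl.com/</url></item>
  <item><title>CPAN</title><url>http://www.cpan.org/</url></item>
</results>
XML

# The first regex pulls out each <item> block,
# the second pulls the title and url fields out of that block.
sub extract_links {
    my ($feed) = @_;
    my @links;
    while ($feed =~ m{<item>(.*?)</item>}sg) {
        my $item  = $1;
        # In list context with /g this returns (name, value, name, value, ...).
        my %field = $item =~ m{<(title|url)>(.*?)</\1>}sg;
        push @links, \%field
            if defined $field{title} && defined $field{url};
    }
    return @links;
}

my @links = extract_links($xml);
print qq{<a href="$_->{url}">$_->{title}</a>\n} for @links;
```

If the feed ever starts using attributes, CDATA sections or entities such as &amp;, this is the point where a real parser becomes the safer choice.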

Cheers,

Alex
--
Gossamer Threads Inc.
Re: [Alex] Parsing an XML feed In reply to
Thanks for the suggestion Matt & Alex, I'll take a look.

I've discovered a problem with my installation of 'expat' that's been stalling my progress: it only reads the first two characters of the XML encoding string, so UTF-8 becomes 'ut', ISO-8859-1 becomes 'is', etc. Because it's configured to die when it encounters any problem, I haven't yet been able to parse any XML at all :(

I'll get it fixed up and then take a look at BigWhat.

My idea has always been to simply parse the XML, 'pull' the minimum link data, format it, and then add it to the search results template.

If I do get it working I'll post details here about how to set it up and post the perl code for search.cgi in case anyone else wants to use it.

Thanks again.

All the best
Shaun
Re: [qango] Parsing an XML feed In reply to
I can tell you that the easiest way is to modify the BigWhat plugin and make your own out of it...

That's what I did, it works great...

I get results from 4 different sources...

Let me know if you need any help...

-Scott
Re: [scottward] Parsing an XML feed In reply to
If anyone figured out a way to do this, I would like to know - please. I was contacted by another site to include their xml results but I don't know how to do it.

By the way, where is the BigWhat plugin available? Site URL, please?

thanks
Re: [socrates] Parsing an XML feed In reply to
The BigWhat plugin is available at GT.

Just add the plugin through your plugin manager gallery...

It is fairly easy after you look at the script. After I properly package my plugin, I will make it available...

-Scott
Re: [scottward] Parsing an XML feed In reply to
I'm still using LinksSQL 1.13 and assume the plugin is for 2.x? If so, can I get a copy of the relevant parsing section of the code?

All the best
Shaun
Re: [qango] Parsing an XML feed In reply to
It is for Links SQL, but here is the parsing code for the mamma.com search engine:

Code:
# Parse Mamma HTML results.
# $query and $results are set earlier in the plugin.
my @querylist = split(/\+/, $query);
# Keep only the markup between the list markers.
$results =~ s,<!-- START LIST -->(.*?)<!-- END LIST -->,$1,s;
my $link_results = '';
my $link_count = 0;
while ($results =~ m,<!-- START ITEM -->(.*?)<!-- END ITEM -->,sg) {
    my $link = $1;
    my $bid = "";
    # Capture in list context so a failed match gives undef, not a stale $1.
    my ($num)         = $link =~ m,000063>(.*?)</FONT>,;
    my ($description) = $link =~ m,<br>(.*?)<br>,;
    my ($redirect)    = $link =~ m,href="(.*?)">,;
    my ($title)       = $link =~ m,">(.*?)</a></b>,;
    my ($url)         = $link =~ m,1>(.*?)</font>,i;
    # Bold each query word in the displayed fields.
    # \Q...\E stops regex metacharacters in the query from being interpreted.
    foreach my $word (@querylist) {
        foreach my $field ($description, $title, $url) {
            next unless defined $field;
            $field =~ s{(\Q$word\E)}{<b>$1</b>}gi;
        }
    }
    # (the post ends here; the rest of the loop body was not included)
}


Last edited by:

scottward: Nov 22, 2001, 6:30 AM
Re: [scottward] Parsing an XML feed In reply to
Hi,

I've made a start: installed XML::Parser, upgraded to perl 5.6.1 (to get expat.pm working) and cobbled together the first bit of code to fetch an XML feed and parse it:

Quote:
use LWP::Simple;
use XML::Parser;

my $websearch = get("http://p.moreover.com/cgi-local/page?index_webdeveloper+xml");

if ($websearch) {
    my $xmlparse = XML::Parser->new(ErrorContext => 2);
    $xmlparse->parse($websearch);
    print $xmlparse;
}

When printed, $xmlparse is as follows:
XML::Parser=HASH(0x83639d0)

I assumed that by using XML::Parser it would print out the data from each element within the XML document? I wasn't expecting the 'HASH' thingy!!

To really show my ignorance, what is a HASH and how do you convert it/break it down into readable/usable data strings, or even print out its contents? (I did search but gave up after a couple of hours!)

I'm thinking that a sub routine to format the xml data would be useful and flexible for a variety of different feeds, and could be used to trigger formatting based on the xml tag that is being looked at.

I suppose what I'm really looking for is a simple breakdown of the xml data into its 'tag' and 'data', so I can format my output based on the items in the feed, e.g.;

if ($xml_tag eq "url") {$xml_results .= $xml_data formatted as a URL}
if ($xml_tag eq "title") {$xml_results .= $xml_data formatted for link title}

This way, if I change XML feeds I can simply update the 'triggers' to suit the new feed and the actual output I produce from the feed will remain the same.
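
One way to sketch these 'triggers' in Perl is a hash that maps each tag name to a small formatting routine. This is only an illustration: the tag names, the HTML produced and the names %format_for and format_element are all assumptions. Note that string comparison in Perl uses eq, since a single = assigns rather than compares; a hash lookup avoids the chain of comparisons altogether.

```perl
use strict;
use warnings;

# Hypothetical 'triggers': feed tag name => formatting routine.
# Switching to a different feed should only mean editing this table.
my %format_for = (
    url   => sub { qq{<a href="$_[0]">$_[0]</a>} },
    title => sub { "<b>$_[0]</b>" },
);

# Format one tag/data pair; tags without a trigger produce nothing.
sub format_element {
    my ($tag, $data) = @_;
    my $formatter = $format_for{$tag} or return '';
    return $formatter->($data);
}

my $xml_results = '';
$xml_results .= format_element(title => 'Perl.com');
$xml_results .= format_element(url   => 'http://www.perl.com/');
$xml_results .= format_element(date  => '2001-11-26');   # no trigger, ignored
print "$xml_results\n";
```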

Here's a basic low-down on what I'm trying to do (if it helps visualise my idea):

• get xml feed (LWP::Simple)
• parse xml feed (XML::Parser)
• run through xml feed data and extract+format all or selected data elements (sub xml_extract)
• put formatted elements into output and return ($output)
• display results (xmlresults -> $output)
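
The steps above can be sketched with XML::Parser's event handlers (Start, Char and End), which the parser calls as it walks through the document. This is a minimal sketch rather than a finished search.cgi routine: the XML is an inline stand-in for whatever LWP::Simple's get() would return, and the element names (<item>, <title>, <url>) are assumptions about the feed, not its real layout.

```perl
use strict;
use warnings;
use XML::Parser;

# Stand-in for the string that LWP::Simple's get() would return.
# The element names here are assumptions, not the real feed's layout.
my $websearch = <<'XML';
<?xml version="1.0"?>
<results>
  <item><title>Perl.com</title><url>http://www.perl.com/</url></item>
  <item><title>CPAN</title><url>http://www.cpan.org/</url></item>
</results>
XML

# Parse the feed and return one hash reference per <item>.
sub xml_extract {
    my ($xml) = @_;
    my (@items, %current, $tag);

    my $parser = XML::Parser->new(
        ErrorContext => 2,
        Handlers     => {
            # Called for every opening tag.
            Start => sub {
                my ($expat, $element) = @_;
                $tag = $element;
                %current = () if $element eq 'item';
            },
            # Called for text; may fire more than once per element,
            # so the text is appended rather than assigned.
            Char => sub {
                my ($expat, $text) = @_;
                $current{$tag} .= $text
                    if defined $tag && ($tag eq 'title' || $tag eq 'url');
            },
            # Called for every closing tag.
            End => sub {
                my ($expat, $element) = @_;
                push @items, {%current} if $element eq 'item';
                undef $tag;
            },
        },
    );
    $parser->parse($xml);
    return @items;
}

# Format the extracted items and collect them in $output.
my $output = join '',
    map { qq{<a href="$_->{url}">$_->{title}</a><br>\n} } xml_extract($websearch);
print $output;
```

The parser decodes entities such as &amp; while reading, so real feed data would need HTML-escaping again before it goes into the results template.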

I hope this makes some sense, and any help would be much appreciated.

Thanks.

All the best
Shaun
Re: [qango] Parsing an XML feed In reply to
Do these help?

http://wwwx.netheaven.com/...mlparser/Parser.html
http://www.xml.com/.../98/09/xml-perl.html

Cheers

- wil
Re: [qango] Parsing an XML feed In reply to
>>To really show my ignorance, what is a HASH and how do you convert it/break it down into readable/usable data strings, or even print out its contents? (I did search but gave up after a couple of hours!)
<<

HASH means that $xmlparse is a hash reference.

Instead of print $xmlparse; try.....

print $xmlparse->{$_} foreach keys %{$xmlparse};
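
To illustrate what a hash reference is, here is a small standalone sketch using a made-up hash rather than the parser object (the data and the helper name dump_hashref are hypothetical):

```perl
use strict;
use warnings;

# A hash reference: a scalar that points at a hash.
my $link = { title => 'Perl.com', url => 'http://www.perl.com/' };

print "$link\n";             # prints something like HASH(0x...)
print "$link->{title}\n";    # one value, reached through the arrow

# Every key/value pair, sorted so the order is predictable.
sub dump_hashref {
    my ($ref) = @_;
    return join '', map { "$_ => $ref->{$_}\n" } sort keys %{$ref};
}
print dump_hashref($link);
```

Bear in mind that an XML::Parser object holds the parser's own settings, so dumping it shows those rather than the contents of the feed. The feed data is normally collected through handlers, or by creating the parser with Style => 'Tree' so that parse() returns the document as a nested structure.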

Last edited by:

PaulW: Nov 26, 2001, 9:18 AM