Gossamer Forum

Extracting text from a page

As I loop through a text file I need to grab some text and assign its value to a variable. How do I tell Perl to perform the following task:

<tr><td>Monday</td><td>1.500</td></tr>
<tr><td>Tuesday</td><td>2.870</td></tr>
<tr><td>Wednesday</td><td>3.600</td></tr>
<tr><td>Thursday</td><td>4.590</td></tr>
<tr><td>Friday</td><td>3.125</td></tr>

If line contains Wednesday, then go to the next <td> tag on that line and pull the number (or text) until the next closing </td> tag (or some other type of "delimiting" factor - maybe another word or bit of text).

This is what I'm thinking it should be ($Line being the current line in the foreach or while loop):

if ($Line =~ /wednesday/i) {
$Line ????? $1;
$wedRate = $1;
}

(where "print $wedRate" would result in 3.600)

It's the ???? that I'm not sure of - I've seen some examples but I did not understand them.

If anyone can give an example (and explain each step of what it's doing) I'd appreciate the help.

Re: [Watts] Extracting text from a page
Hi,

Something like this possibly?

Code:
my $text = qq{
<tr><td>Monday</td><td>1.500</td></tr>
<tr><td>Tuesday</td><td>2.870</td></tr>
<tr><td>Wednesday</td><td>3.600</td></tr>
<tr><td>Thursday</td><td>4.590</td></tr>
<tr><td>Friday</td><td>3.125</td></tr>
};

my $values;
# /g makes each pass continue from where the last match ended;
# without it the loop would match the first row forever
while ($text =~ m/\Q<tr><td>\E(.*?)\Q<\/td><td>\E(.*?)\Q<\/td><\/tr>\E\n/gi) {
$values->{lc($1)} = $2; # lowercase the day name so the lookup below finds it
}

print $values->{thursday};

That *should* work (untested!).
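
For the line-by-line loop in the question, something along these lines might fill in the ????. This is only a sketch: `rate_for_day` is a made-up helper name, and it assumes the number sits in the `<td>` cell directly after the day name on the same line.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the text of the <td> cell that follows $day
# on a single line of input, or undef if the line does not mention $day.
sub rate_for_day {
    my ($line, $day) = @_;
    # \Q$day\E  treat the day name as literal text
    # <\/td>    the closing tag of the day cell
    # \s*<td>   optional whitespace, then the opening tag of the next cell
    # (.*?)     capture as little as possible into $1 ...
    # <\/td>    ... up to the next closing tag
    # /i        ignore case, so "wednesday" matches "Wednesday"
    if ($line =~ m/\Q$day\E<\/td>\s*<td>(.*?)<\/td>/i) {
        return $1;
    }
    return;
}

my @lines = (
    '<tr><td>Tuesday</td><td>2.870</td></tr>',
    '<tr><td>Wednesday</td><td>3.600</td></tr>',
);

my $wedRate;
foreach my $Line (@lines) {
    my $rate = rate_for_day($Line, 'wednesday');
    $wedRate = $rate if defined $rate;
}
print "$wedRate\n";   # prints 3.600
```

The parentheses are what fill `$1`; everything outside them only has to match, it is not captured.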

Hope that helps.

Cheers

Andy (mod)
andy@ultranerds.co.uk
Want to give me something back for my help? Please see my Amazon Wish List
GLinks ULTRA Package | GLinks ULTRA Package PRO
Links SQL Plugins | Website Design and SEO | UltraNerds | ULTRAGLobals Plugin | Pre-Made Template Sets | FREE GLinks Plugins!

Re: [Andy] Extracting text from a page
Thanks for the help! I've hacked it a bit to suit my needs and have a simple question.

I have this bit of Perl (which returns the following: "RealEstateJournal | Key Rates")

Code:
open(RATES, "<file1.txt") or die;

while (<RATES>) {
$Line = $_;

$Line =~ s/\n//g; #strip out new lines

if ($Line =~ m/\Q<head>\E(.*?)\Q<title>\E(.*?)\Q<\/title>\E/i) {
$monRate = $2;
print $monRate;
}
}

This sample works (no hard return)
-------------- file1.txt ---------
<html><head><title>RealEstateJournal | Key Rates</title>
-------------- file1.txt ---------

This sample does not work (hard return)
-------------- file1.txt ---------
<html><head>
<title>RealEstateJournal | Key Rates</title>
-------------- file1.txt ---------

However if I don't open the text file and simply plug in the text like this:

Code:
my $Line = qq{
<html><head>
<title>RealEstateJournal | Key Rates</title>
};

$Line =~ s/\n//g;

if ($Line =~ m/\Q<head>\E(.*?)\Q<title>\E(.*?)\Q<\/title>\E/i) {
$monRate = $2;
print $monRate;
}

Then the new lines are stripped out properly and the hard return is ignored.

I think I've narrowed the problem down to this line (when reading the file):

$Line = $_;

If you do a "print $Line" (after stripping out the returns) then the display shows everything on one line, but the perl code above doesn't work.

Any suggestions?

Re: [Watts] Extracting text from a page
Hi,

You could try changing:

if ($Line =~ m/\Q<head>\E(.*?)\Q<title>\E(.*?)\Q<\/title>\E/i) {

to:

if ($Line =~ m/\Q<head>\E\s*(.*?)\Q<title>\E\s*(.*?)\s*\Q<\/title>\E/i) {

..or even better, without the <head> part at all (the title is then in $1 rather than $2):

if ($Line =~ m/\Q<title>\E\s*(.*?)\s*\Q<\/title>\E/i) {

..which should work with;

Code:
<html><head>
<title>RealEstateJournal | Key Rates</title>

..and;

Code:
<html><head>
<title>
RealEstateJournal | Key Rates
</title>
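
One thing that may be worth checking: `while (<RATES>)` hands over one line at a time, so a pattern that needs both `<head>` and `<title>` can only match if they sit on the same line of the file. The sketch below shows that effect using an in-memory filehandle in place of file1.txt; `lines_matching` is a made-up helper name.

```perl
use strict;
use warnings;

my $data    = "<html><head>\n<title>RealEstateJournal | Key Rates</title>\n";
my $pattern = qr{\Q<head>\E(.*?)\Q<title>\E(.*?)\Q</title>\E}i;

# Hypothetical helper: counts how many individual lines match $re on their own.
sub lines_matching {
    my ($text, $re) = @_;
    # Opening a reference to a scalar reads the string as if it were a file.
    open(my $fh, '<', \$text) or die "Cannot open in-memory file: $!";
    my $count = 0;
    while (my $line = <$fh>) {
        $line =~ s/\n//g;              # strip the newline, as in the loop above
        $count++ if $line =~ $re;
    }
    close($fh);
    return $count;
}

my $count = lines_matching($data, $pattern);
print "$count\n";   # prints 0: no single line contains both tags

# Joining the lines first gives the pattern everything it needs at once.
(my $joined = $data) =~ s/\n//g;
print "matched: $2\n" if $joined =~ $pattern;   # prints matched: RealEstateJournal | Key Rates
```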

Hope that helps =)

Cheers

Andy (mod)

Re: [Andy] Extracting text from a page
Thanks a million! You got me pointed in the right direction. After thinking about it for a bit I figured it out.

This works:
Code:
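open(RATES, "<file1.txt") or die "Cannot open file1.txt: $!";

while (<RATES>) {
$Line .= $_; # append each line, so $Line builds up the whole file
$Line =~ s/\n//g; # strip out new lines

if ($Line =~ m/\Q<head>\E(.*?)\Q<title>\E(.*?)\Q<\/title>\E/i) {
$monRate = $2;
}
}
close(RATES);
print $monRate;
(changes from my earlier version: added the period before the = so each line is appended, and moved the print statement outside the loop)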

This will allow me to read a file and pick out bits of it in between self-defined "delimiters".

Thanks for all your help!
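
A possible variation on the same idea, for anyone finding this later: read the whole file in one go and wrap the "text between two delimiters" match in a small sub. This is a sketch, not tested against the real rates page; `extract_between` and `slurp` are made-up names, and the delimiters are assumed to be plain literal strings.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the text between the first $start and the
# next $end in $text (surrounding whitespace trimmed), or undef if not found.
sub extract_between {
    my ($text, $start, $end) = @_;
    # \Q...\E quotes the delimiters so characters like | or . are literal.
    # /s lets . match newlines, so the delimiters may be on different lines.
    if ($text =~ m/\Q$start\E\s*(.*?)\s*\Q$end\E/s) {
        return $1;
    }
    return;
}

# Hypothetical helper: reads a whole file into one string.
sub slurp {
    my ($path) = @_;
    open(my $fh, '<', $path) or die "Cannot open $path: $!";
    local $/;              # no record separator: <$fh> returns the entire file
    my $content = <$fh>;
    close($fh);
    return $content;
}

my $page  = "<html><head>\n<title>\nRealEstateJournal | Key Rates\n</title>\n";
my $title = extract_between($page, '<title>', '</title>');
print "$title\n";   # prints RealEstateJournal | Key Rates
```

With a real file the call would be something like `extract_between(slurp('file1.txt'), '<title>', '</title>')`. Reading the file once also avoids re-running the match on every pass through the loop.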