Gossamer Forum

Extracting text from a page

As I loop through a text file I need to grab some text and assign its value to a variable. How do I tell Perl to perform the following task:

<tr><td>Monday</td><td>1.500</td></tr>
<tr><td>Tuesday</td><td>2.870</td></tr>
<tr><td>Wednesday</td><td>3.600</td></tr>
<tr><td>Thursday</td><td>4.590</td></tr>
<tr><td>Friday</td><td>3.125</td></tr>

If line contains Wednesday, then go to the next <td> tag on that line and pull the number (or text) until the next closing </td> tag (or some other type of "delimiting" factor - maybe another word or bit of text).

This is what I'm thinking it should be ($Line being the loop variable of the foreach or while statement):

if ($Line =~ /wednesday/i) {
$Line ????? $1;
$wedRate = $1;
}

(where "print $wedRate" would result in 3.600)

It's the ???? that I'm not sure of - I've seen some examples but I did not understand them.

If anyone can give an example (and explain each step of what it's doing) I'd appreciate the help.
Re: [Watts] Extracting text from a page
Hi,

Something like this possibly?

Code:
my $text = qq{
<tr><td>Monday</td><td>1.500</td></tr>
<tr><td>Tuesday</td><td>2.870</td></tr>
<tr><td>Wednesday</td><td>3.600</td></tr>
<tr><td>Thursday</td><td>4.590</td></tr>
<tr><td>Friday</td><td>3.125</td></tr>
};

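my $values;
# /g makes each pass resume after the previous match; without it the loop never ends
while ($text =~ m/\Q<tr><td>\E(.*?)\Q<\/td><td>\E(.*?)\Q<\/td><\/tr>\E\n/gi) {
$values->{lc($1)} = $2; # lowercase the day name so the lookup below finds it
}

print $values->{thursday};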

That *should* work (untested!).
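
For the line-by-line approach in the original question, a minimal sketch might look like the following. It assumes each table row sits on a single line, as in the sample; the `@rows` array is only a stand-in for lines read from the file, and the pattern is one possible way to fill in the `?????`, not the only one.

```perl
use strict;
use warnings;

# Stand-in for the lines read from the file (assumption: one table row per line).
my @rows = (
    '<tr><td>Monday</td><td>1.500</td></tr>',
    '<tr><td>Tuesday</td><td>2.870</td></tr>',
    '<tr><td>Wednesday</td><td>3.600</td></tr>',
    '<tr><td>Thursday</td><td>4.590</td></tr>',
);

my $wedRate;
foreach my $Line (@rows) {
    # wednesday</td>  the day name and the tag that closes its cell (/i ignores case)
    # \s*<td>         optional whitespace, then the tag that opens the next cell
    # (.*?)           capture as little as possible ...
    # </td>           ... up to the first closing tag; the capture lands in $1
    if ($Line =~ m{wednesday</td>\s*<td>(.*?)</td>}i) {
        $wedRate = $1;
        last;    # stop at the first matching row
    }
}

print "$wedRate\n";    # prints 3.600
```

The parentheses are what make `$1` available: whatever the part of the pattern inside them matched is copied into `$1` when the whole match succeeds.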

Hope that helps.

Cheers

Andy (mod)
andy@ultranerds.co.uk


IMPORTANT: I've now moved to ultranerds.co.uk, and the .com will no longer work!
Want to give me something back for my help? Please see my Amazon Wish List
GLinks ULTRA Package (plugins total "value" $3,325 & rising, for just $350)| GLinks ULTRA Package PRO (plugins total "value" $5,625 & rising, for just $500)
Support Forum | Links SQL Plugins | DMOZ Dumps | UltraNerds | ULTRAGLobals Plugin | Pre-Made Template Sets | FREE GLinks Plugins!
Compare our different Plugin packages *new* Free CSS Templates
Re: [Andy] Extracting text from a page
Thanks for the help! I've hacked it a bit to suit my needs and have a simple question.

I have this bit of Perl (which returns the following: "RealEstateJournal | Key Rates")

Code:
open(RATES, "<file1.txt") or die;

while (<RATES>) {
$Line = $_;

$Line =~ s/\n//g; #strip out new lines

if ($Line =~ m/\Q<head>\E(.*?)\Q<title>\E(.*?)\Q<\/title>\E/i) {
$monRate = $2;
print $monRate;
}
}

This sample works (no hard return)
-------------- file1.txt ---------
<html><head><title>RealEstateJournal | Key Rates</title>
-------------- file1.txt ---------

This sample does not work (hard return)
-------------- file1.txt ---------
<html><head>
<title>RealEstateJournal | Key Rates</title>
-------------- file1.txt ---------

However if I don't open the text file and simply plug in the text like this:

Code:
my $Line = qq{
<html><head>
<title>RealEstateJournal | Key Rates</title>
};

$Line =~ s/\n//g;

if ($Line =~ m/\Q<head>\E(.*?)\Q<title>\E(.*?)\Q<\/title>\E/i) {
$monRate = $2;
print $monRate;
}

Then the new lines are stripped out properly and the hard return is ignored.

I think I've narrowed the problem down to this line (when reading the file):

$Line = $_;

If you do a "print $Line" (after stripping out the returns) then the display shows everything on one line, but the perl code above doesn't work.

Any suggestions?
Re: [Watts] Extracting text from a page
Hi,

You could try changing:

if ($Line =~ m/\Q<head>\E(.*?)\Q<title>\E(.*?)\Q<\/title>\E/i) {

to:

if ($Line =~ m/\Q<head>\E\n*(.*?)\Q<title>\E\n*(.*?)\Q<\/title>\E/i) {

..or even better, without the <head> part at all (the title is then in $1 rather than $2):

if ($Line =~ m/\Q<title>\E\n*(.*?)\n*\Q<\/title>\E/i) {

..which should work with:

Code:
<html><head>
<title>RealEstateJournal | Key Rates</title>

..and;

Code:
<html><head>
<title>
RealEstateJournal | Key Rates
</title>
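
Both layouts can be checked against a single pattern once the text is held in one string. The sketch below is a variation rather than the exact pattern from this post: it puts `\s*` on either side of the capture so that line breaks around the title are optional, and the `title_of` helper name is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: return the text between <title> and </title>, or undef.
sub title_of {
    my ($html) = @_;
    # \s* swallows optional whitespace (including newlines) next to the tags;
    # /s lets . match newlines in case the title itself spans lines.
    return $html =~ m{<title>\s*(.*?)\s*</title>}is ? $1 : undef;
}

my $one_line   = "<html><head>\n<title>RealEstateJournal | Key Rates</title>\n";
my $multi_line = "<html><head>\n<title>\nRealEstateJournal | Key Rates\n</title>\n";

print title_of($one_line),   "\n";
print title_of($multi_line), "\n";
```

Note that this only helps when `<title>` and `</title>` are in the same string; a loop that tests one line of the file at a time never sees both tags together in the second layout.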

Hope that helps =)

Cheers

Andy (mod)
andy@ultranerds.co.uk


Re: [Andy] Extracting text from a page
Thanks a million! You got me pointed in the right direction. After thinking about it for a bit I figured it out.

This works:
Code:
open(RATES, "<file1.txt") or die;

while (<RATES>) {
$Line .= $_;
$Line =~ s/\n//g;

if ($Line =~ m/\Q<head>\E(.*?)\Q<title>\E(.*?)\Q<\/title>\E/i) {
$monRate = $2;
}
}
print $monRate;
(changes: added the period in ".=" and moved the print statement outside the loop)

This will allow me to read a file and pick out bits of it in between self-defined "delimiters".
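
The loop above re-runs the substitution and the match on an ever-growing string for every line read. An alternative is to read the whole file in one go and match once. The sketch below is one way to do that, under a few assumptions: the `extract_between` helper and its arguments are hypothetical names, the delimiters are treated as literal text, and the sample is read from an in-memory filehandle so the example is self-contained (in practice the handle would be opened on file1.txt).

```perl
use strict;
use warnings;

# Hypothetical helper: return the text between two literal delimiters, or undef.
sub extract_between {
    my ($text, $start, $end) = @_;
    # \Q...\E quotes the delimiters so characters such as | or . match literally;
    # /s lets . cross line breaks, /i ignores case.
    return $text =~ m/\Q$start\E\s*(.*?)\s*\Q$end\E/is ? $1 : undef;
}

# Sample content standing in for file1.txt.
my $sample = "<html><head>\n<title>RealEstateJournal | Key Rates</title>\n";
open(my $fh, '<', \$sample) or die "Cannot open sample: $!";

# Undefining $/ (the input record separator) makes <$fh> return the whole file at once.
my $content = do { local $/; <$fh> };
close($fh);

my $monRate = extract_between($content, '<title>', '</title>');
print "$monRate\n";    # prints RealEstateJournal | Key Rates
```

Because the match runs once on the complete text, there is no need to strip newlines first or to accumulate lines with `.=`.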

Thanks for all your help!