Gossamer Forum: General: Perl Programming: Extracting text from a page

Jun 6, 2005, 9:32 AM

Watts

Veteran (1141 posts)

Post #1 of 5

Extracting text from a page

As I loop through a text file I need to grab some text and assign its value to a variable. How do I tell Perl to perform the following task:

<tr><td>Monday</td><td>1.500</td></tr>
<tr><td>Tuesday</td><td>2.870</td></tr>
<tr><td>Wednesday</td><td>3.600</td></tr>
<tr><td>Thursday</td><td>4.590</td></tr>
<tr><td>Friday</td><td>3.125</td></tr>

If the line contains Wednesday, then go to the next <td> tag on that line and pull the number (or text) up to the next closing </td> tag (or some other "delimiting" factor, maybe another word or bit of text).

This is what I'm thinking it should be ($Line being part of the foreach or while statement):

if ($Line =~ /wednesday/i) {
$Line ????? $1;
$wedRate = $1;
}

(where "print $wedRate" would result in 3.600)

It's the ???? that I'm not sure of - I've seen some examples but I did not understand them.

If anyone can give an example (and explain each step of what it's doing) I'd appreciate the help.

Jun 7, 2005, 2:08 AM

Andy

Veteran / Moderator (18441 posts)

Post #2 of 5

Re: [Watts] Extracting text from a page

Hi,

Something like this possibly?

Code:
my $text = qq{
<tr><td>Monday</td><td>1.500</td></tr>
<tr><td>Tuesday</td><td>2.870</td></tr>
<tr><td>Wednesday</td><td>3.600</td></tr>
<tr><td>Thursday</td><td>4.590</td></tr>
<tr><td>Friday</td><td>3.125</td></tr>
};

my $values;
# /g makes each pass continue from where the last match ended;
# without it the loop would match the first row forever.
while ($text =~ m/\Q<tr><td>\E(.*?)\Q<\/td><td>\E(.*?)\Q<\/td><\/tr>\E/gi) {
  $values->{lc $1} = $2;   # store under the lowercased day name
}

print $values->{thursday};   # 4.590

That *should* work (untested!).
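
For the per-line form in the original question, something along these lines might do. It is only a sketch: it assumes each table row sits on its own line as in the sample, and the @lines array is stand-in data rather than the real file.

```perl
use strict;
use warnings;

# Stand-in for the lines read from the file (assumption: one table row per line).
my @lines = (
    '<tr><td>Tuesday</td><td>2.870</td></tr>',
    '<tr><td>Wednesday</td><td>3.600</td></tr>',
    '<tr><td>Thursday</td><td>4.590</td></tr>',
);

my $wedRate;
foreach my $Line (@lines) {
    # Wednesday</td>  the day name and the tag that closes its cell
    # \s*<td>         optional whitespace, then the next opening tag
    # ([^<]*)         capture everything up to the next "<" into $1
    # </td>           the closing tag that ends the capture
    if ($Line =~ m{Wednesday</td>\s*<td>([^<]*)</td>}i) {
        $wedRate = $1;
    }
}

print "$wedRate\n";   # prints 3.600
```

Using m{...} instead of m/.../ means the slashes in the closing tags do not need escaping, and the /i flag keeps the day name case-insensitive as in the question.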

Hope that helps.

Cheers

Andy (mod)
andy@ultranerds.co.uk

Jun 8, 2005, 9:07 AM

Watts

Veteran (1141 posts)

Post #3 of 5

Re: [Andy] Extracting text from a page

Thanks for the help! I've hacked it a bit to suit my needs and have a simple question.

I have this bit of Perl (which returns the following: "RealEstateJournal | Key Rates")

Code:
open(RATES, "<file1.txt") or die;

while (<RATES>) {
  $Line = $_;

  $Line =~ s/\n//g;  #strip out new lines

  if ($Line =~ m/\Q<head>\E(.*?)\Q<title>\E(.*?)\Q<\/title>\E/i) {
    $monRate = $2;
    print $monRate;
  }
}

This sample works (no hard return)
-------------- file1.txt ---------
<html><head><title>RealEstateJournal | Key Rates</title>
-------------- file1.txt ---------

This sample does not work (hard return)
-------------- file1.txt ---------
<html><head>
<title>RealEstateJournal | Key Rates</title>
-------------- file1.txt ---------

However if I don't open the text file and simply plug in the text like this:

Code:
my $Line = qq{
<html><head>
<title>RealEstateJournal | Key Rates</title>
};

$Line =~ s/\n//g;

if ($Line =~ m/\Q<head>\E(.*?)\Q<title>\E(.*?)\Q<\/title>\E/i) {
  $monRate = $2;
  print $monRate;
}

Then the new lines are stripped out properly and the hard return is ignored.

I think I've narrowed the problem down to this line (when reading the file):

$Line = $_;

If you do a "print $Line" (after stripping out the returns) then the display shows everything on one line, but the Perl code above still doesn't match.

Any suggestions?

Jun 8, 2005, 9:15 AM

Andy

Veteran / Moderator (18441 posts)

Post #4 of 5

Re: [Watts] Extracting text from a page

Hi,

You could try changing:

if ($Line =~ m/\Q<head>\E(.*?)\Q<title>\E(.*?)\Q<\/title>\E/i) {

to:

if ($Line =~ m/\Q<head>\E\s*(.*?)\Q<title>\E\s*(.*?)\s*\Q<\/title>\E/is) {

..or even better, without the <head> part at all:

if ($Line =~ m/(.*?)\Q<title>\E\s*(.*?)\s*\Q<\/title>\E/is) {

(\s* allows optional whitespace, including new lines, and /s lets . match across lines; the title is still in $2.)

..which should work with:

Code:
<html><head>  
<title>RealEstateJournal | Key Rates</title>

..and:

Code:
<html><head>  
<title> 
  RealEstateJournal | Key Rates 
</title>
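
A quick way to try a pattern against both layouts is sketched below. It assumes the whole text is already in one string, and title_of is a made-up helper name used only for illustration.

```perl
use strict;
use warnings;

# The two layouts from this post, each held in a single string.
my $one_line = "<html><head>\n<title>RealEstateJournal | Key Rates</title>\n";
my $multi    = "<html><head>\n<title>\n  RealEstateJournal | Key Rates\n</title>\n";

# Hypothetical helper: returns the title text, or undef if nothing matches.
sub title_of {
    my ($html) = @_;
    # /s lets "." cross newlines; \s* soaks up whitespace around the text.
    return $html =~ m{<title>\s*(.*?)\s*</title>}is ? $1 : undef;
}

print "[", title_of($one_line), "]\n";   # [RealEstateJournal | Key Rates]
print "[", title_of($multi), "]\n";      # [RealEstateJournal | Key Rates]
```

The brackets in the output make any stray leading or trailing whitespace in the captured text easy to spot.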

Hope that helps =)

Cheers

Andy (mod)
andy@ultranerds.co.uk

Jun 8, 2005, 10:37 AM

Watts

Veteran (1141 posts)

Post #5 of 5

Re: [Andy] Extracting text from a page

Thanks a million! You got me pointed in the right direction. After thinking about it for a bit I figured it out.

This works:

Code:
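open(RATES, "<file1.txt") or die "Cannot open file1.txt: $!";

while (<RATES>) {
  $Line .= $_;        # append each line so $Line builds up the whole file
  $Line =~ s/\n//g;   # strip out new lines

  if ($Line =~ m/\Q<head>\E(.*?)\Q<title>\E(.*?)\Q<\/title>\E/i) {
    $monRate = $2;    # text between <title> and </title>
  }
}
close(RATES);
print $monRate;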

(Changes from my earlier code: added a period, so "$Line = $_" became "$Line .= $_", and moved the print statement to after the loop.)
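
The reason the period matters can be seen in a small side-by-side sketch. An in-memory filehandle stands in for file1.txt here; that is an assumption made only so the snippet runs on its own.

```perl
use strict;
use warnings;

# Stand-in for file1.txt: an in-memory filehandle (assumption for the sketch).
my $contents = "<html><head>\n<title>RealEstateJournal | Key Rates</title>\n";
my $pattern  = qr{<head>(.*?)<title>(.*?)</title>}i;

# With "=", $Line only ever holds the current line, so <head> and <title>
# are never in the same string.
my $found_assign = 0;
open(my $fh, '<', \$contents) or die "Cannot open in-memory file: $!";
while (<$fh>) {
    my $Line = $_;
    $Line =~ s/\n//g;
    $found_assign = 1 if $Line =~ $pattern;
}
close($fh);

# With ".=", $Line grows by one line per pass and eventually holds both tags.
my ($Line, $monRate) = ('');
open($fh, '<', \$contents) or die "Cannot open in-memory file: $!";
while (<$fh>) {
    $Line .= $_;
    $Line =~ s/\n//g;
    $monRate = $2 if $Line =~ $pattern;
}
close($fh);

print "assign matched: $found_assign\n";   # assign matched: 0
print "append matched: $monRate\n";        # append matched: RealEstateJournal | Key Rates
```

Since while (<FILE>) hands over one line per pass, stripping the new line from that single line cannot join it to the next one; appending is what puts both tags in the same string.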

This will allow me to read a file and pick out bits of it in between self-defined "delimiters".
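
One possible way to generalise this is sketched below. It rests on assumptions: between is a made-up helper name, and the in-memory filehandle again stands in for file1.txt. It reads the whole file in one go instead of appending line by line.

```perl
use strict;
use warnings;

# Hypothetical helper: return the text between two literal delimiters,
# or undef if the pair is not found.
sub between {
    my ($text, $start, $end) = @_;
    # \Q...\E quotes the delimiters so characters like "|" or "." match literally;
    # /s lets the captured text span line breaks.
    return $text =~ m/\Q$start\E\s*(.*?)\s*\Q$end\E/is ? $1 : undef;
}

# Read everything at once by temporarily undefining the input record separator.
my $contents = "<html><head>\n<title>RealEstateJournal | Key Rates</title>\n";
open(my $fh, '<', \$contents) or die "Cannot open in-memory file: $!";
my $page = do { local $/; <$fh> };
close($fh);

my $monRate = between($page, '<title>', '</title>');
print "$monRate\n";   # RealEstateJournal | Key Rates
```

Reading the file once and matching once avoids re-running the pattern on every line, which may matter for larger pages.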

Thanks for all your help!