Gossamer Forum

return range of text around search term

What I'm trying to do is:

1) extract the text from between the HTML <BODY> tags,
2a) bold the text between anchor tags if the search term(s) appear in the URL
2b) bold any other text containing the search term(s)
3) remove all HTML and convert the placeholder bold markup to HTML <b> tags
4) return ranges of text surrounding the search term in an array

I'm certain I need to use splice(); I'm just not sure how to go about selecting the ranges.

Here is what I have so far:

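sub highlight {
    my ($str, $q) = @_;

    # one capturing alternation of the escaped search terms
    # (split ' ' skips runs of spaces, so no empty alternative is produced)
    my $re = '(' . ( join '|', map { quotemeta } split ' ', $q ) . ')';

    # split into tag-wrapped chunks (e.g. <a href="...">text</a>) and the text around them
    my @parts = split /((?:<[^>]*>[^<]*<[^>]*>))/, $str;

    foreach my $part (@parts) {
        if ($part =~ /$re/i) {
            if ($part =~ /^<a\b/i) {
                # highlight the link description
                $part =~ s,(<[^>]*>([^<]*)<[^>]*>),\[b\]$2\[\/b\],;
            }
            else {
                # highlight keywords
                $part =~ s,$re,\[b\]$1\[\/b\],gi;
            }
        }
    }

    $str = join "", @parts;

    # strip the remaining HTML, then turn the [b] placeholders into real tags
    $str =~ s/<[^>]*>//g;
    $str =~ s/\[b\]/<b>/g;
    $str =~ s/\[\/b\]/<\/b>/g;

    return $str;
}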

If there's a module on CPAN that does this that would be cool too!

Limecat is not pleased.
Re: [fuzzy logic] return range of text around search term
I'm not sure why you want an array...
There are modules on CPAN for parsing HTML, which seems to be roughly what you want.
What I would probably do, although maybe this isn't what you want, is something like:
  1. Look at the title first (of course), and then perhaps keywords and description
  2. Remove everything before (and after) the body, if possible
  3. Remove all tags (perhaps replace <p> and <br> with \n and then replace \n in the results with <br>)
  4. Determine a selection range based on the size of the match
  5. If there's a match, bold the search query
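Steps 2 and 3 of the list above could look roughly like this. It is only a regex-based sketch: the helper name html_to_text is made up for illustration, and it assumes reasonably well-formed HTML. A real parser from CPAN would cope better with messy pages.

```perl
use strict;
use warnings;

# Hypothetical helper: reduce an HTML page to plain text (steps 2 and 3).
sub html_to_text {
    my ($html) = @_;

    # Step 2: keep only what sits between <body ...> and </body>, if present.
    my $text = $html =~ m{<body\b[^>]*>(.*?)</body\s*>}is ? $1 : $html;

    # Drop script/style blocks and comments so their contents are not kept as text.
    $text =~ s{<(script|style)\b[^>]*>.*?</\1\s*>}{}gis;
    $text =~ s{<!--.*?-->}{}gs;

    # Step 3: turn <p>, </p> and <br> into newlines, then strip the remaining tags.
    $text =~ s{<\s*(?:br|/?p)\b[^>]*>}{\n}gi;
    $text =~ s{<[^>]*>}{}g;

    # Tidy whitespace: squeeze spaces/tabs, collapse blank lines, trim both ends.
    $text =~ s/[ \t]+/ /g;
    $text =~ s/ ?\n ?/\n/g;
    $text =~ s/\n{2,}/\n/g;
    $text =~ s/\A\s+|\s+\z//g;

    return $text;
}

my $page = '<html><head><title>T</title></head>'
         . '<body><p>Hello <b>big</b> world.</p>Second line<br>third</body></html>';
print html_to_text($page), "\n";
```

Doing the newline replacement before the generic tag strip is what preserves paragraph boundaries; swapping the two would glue adjacent paragraphs together.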
## assume $query contains the query and $sel contains the amount of text to
## return
## assume $text contains the text we are searching in
my $outer = int( ($sel - length($query)) / 2 ); # Select before and after
$outer = 0 if $outer < 0;
if ( my ($match) = $text =~ /(.{0,$outer}\Q$query\E.{0,$outer})/s ) { # /o?
    $match =~ s{\Q$query\E}{<b>$query</b>}; # make it bold
}
Of course, this could probably be made more efficient using 'index' and 'substr' instead.
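For what it's worth, an index/substr version might look something like the sketch below. The helper name snippets is hypothetical, and it returns one range of roughly $sel characters per match as a list, which also covers the "array of ranges" part of the original question. It assumes lc() does not change the length of the string, which holds for ASCII text.

```perl
use strict;
use warnings;

# Hypothetical helper: return one snippet of about $sel characters per match.
sub snippets {
    my ($text, $query, $sel) = @_;
    return () unless length $query;

    my $outer = int( ($sel - length($query)) / 2 );   # context on each side
    $outer = 0 if $outer < 0;

    # Search lowercased copies for a case-insensitive match
    # (assumes lc() keeps the string length unchanged, true for ASCII).
    my ($lc_text, $lc_query) = (lc $text, lc $query);

    my @found;
    my $pos = 0;
    while ( ( my $hit = index($lc_text, $lc_query, $pos) ) >= 0 ) {
        my $start = $hit - $outer;
        $start = 0 if $start < 0;
        my $end = $hit + length($query) + $outer;      # substr clips at end of string
        push @found, substr($text, $start, $end - $start);
        $pos = $end;   # resume after this snippet so ranges do not overlap
    }
    return @found;
}

my $text  = 'The quick brown fox jumps over the lazy dog. A second fox appears later.';
my $query = 'fox';
my @snips = snippets($text, $query, 15);
for my $snip (@snips) {
    $snip =~ s{(\Q$query\E)}{<b>$1</b>}gi;   # bold the match, keeping its original case
    print "...$snip...\n";
}
```

Resuming the search at the end of the previous snippet means a second hit that falls inside an already returned range is not reported again; whether that is what you want depends on how the results are displayed.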
Re: [mkp] return range of text around search term
Actually, I do want ALL HTML removed (except for what I add). What I was hoping to accomplish was something similar to the search results you get when you search at Google Groups. I wanted the HTML stripped out because all I need is the text, which is being pulled off other web sites. Preserving the formatting isn't a priority, given that the end result is limited to just 255 characters of actual text.

I'm taking a completely different approach now, though. Since this is for a Pingback implementation I'm writing for a Gossamer Links blogging plugin, I'm instead lifting article descriptions directly from the Trackback RDF embedded in the page. It would be up to the author on the receiving end of the Pingback to visit the site manually if the RDF extraction failed or if they want to add context to the clip of text shown in their Pingback/Trackback section.
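A rough sketch of that kind of extraction follows. It is not the plugin's actual code: the helper name rdf_description is hypothetical, and it assumes the description is carried as a double-quoted dc:description attribute inside an <rdf:RDF> block, which is how such blocks are commonly embedded. A proper XML parser would be more robust than these regexes.

```perl
use strict;
use warnings;

# Hypothetical helper: pull dc:description out of embedded Trackback RDF.
# Returns undef when no RDF block or description attribute is found.
sub rdf_description {
    my ($html, $max) = @_;
    $max = 255 unless defined $max;

    # The RDF is usually wrapped in an HTML comment; match the element itself.
    my ($rdf)  = $html =~ m{(<rdf:RDF\b.*?</rdf:RDF>)}is           or return;
    my ($desc) = $rdf  =~ m{\bdc:description\s*=\s*"([^"]*)"}is     or return;

    # Decode the basic XML entities, handling &amp; last to avoid double decoding.
    $desc =~ s/&lt;/</g;
    $desc =~ s/&gt;/>/g;
    $desc =~ s/&quot;/"/g;
    $desc =~ s/&amp;/&/g;

    # Collapse whitespace and cap the clip at $max characters.
    $desc =~ s/\s+/ /g;
    $desc =~ s/\A | \z//g;
    return substr($desc, 0, $max);
}

my $page = <<'HTML';
<p>Some article text.</p>
<!--
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
<rdf:Description
    dc:title="Example post"
    dc:description="A short &amp; sweet
    summary of the post." />
</rdf:RDF>
-->
HTML

my $clip = rdf_description($page);
print defined $clip ? "$clip\n" : "no description found\n";
```

Returning undef on failure keeps the "visit the site manually" fallback explicit: the caller can tell a missing description apart from an empty one.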

Limecat is not pleased.