
Mailing List Archive: Lucene: Java-User

HTML tags and Lucene highlighting

 

 



bodymoves at gmail

Apr 5, 2012, 10:34 AM

Post #1 of 6 (293 views)
HTML tags and Lucene highlighting

Hello,

I currently use Lucene version 3.0 (I probably need to upgrade to a more
current version soon).
The problem is that when I test a search for an HTML tag (e.g.
<strong>), Lucene returns
the highlighted HTML tag, which is what I DO NOT want. Is there a way to
"filter" HTML tags?
I have read up on HTMLStripCharFilter (packaged with Solr) and wondered if
this is the way to go?

Any help will be greatly appreciated,
Thanks


sarowe at syr

Apr 5, 2012, 12:24 PM

Post #2 of 6 (287 views)
RE: HTML tags and Lucene highlighting

Hi okayndc,

What *do* you want?

Steve

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe [at] lucene
For additional commands, e-mail: java-user-help [at] lucene


bodymoves at gmail

Apr 5, 2012, 12:36 PM

Post #3 of 6 (284 views)
Re: HTML tags and Lucene highlighting

Hello,

I want to ignore HTML tags within a search. I should not be able to
search for an HTML tag (e.g. <strong>) and get back the highlighted HTML tag
(e.g. <span class="highlighted"><strong></span>) in a result set.

Thanks




sarowe at syr

Apr 5, 2012, 12:44 PM

Post #4 of 6 (286 views)
RE: HTML tags and Lucene highlighting

okayndc,

A field configured to use HTMLStripCharFilter as part of its index-time analyzer will strip out HTML tags before index terms are created by the tokenizer, so HTML tags will not be put into the index. As a result, queries for HTML tags cannot match the original documents' HTML tags (in the field configured to use HTMLStripCharFilter, anyway).

So HTMLStripCharFilter should do what you want.

Steve
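
To make the point above concrete, here is a minimal sketch in plain Java of what stripping tags before tokenization achieves. It is a simplified stand-in, not Lucene's or Solr's HTMLStripCharFilter: the class TagStripSketch, its Term type, and the terms method are hypothetical names, and the parsing ignores entities, comments, and scripts. It shows two things: text between '<' and '>' never becomes a term, and the offsets of the remaining terms still point into the original HTML, which is the property a highlighter would need in order to mark up the stored, formatted text.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Simplified illustration only; this is NOT Lucene's HTMLStripCharFilter. */
final class TagStripSketch {

    /** A term plus its start/end offsets in the ORIGINAL (unstripped) text. */
    static final class Term {
        final String text;
        final int start;
        final int end;

        Term(String text, int start, int end) {
            this.text = text;
            this.start = start;
            this.end = end;
        }
    }

    /**
     * Produces lowercase terms from runs of letters and digits, skipping
     * everything between '<' and '>'. Assumes well-formed tags; an unclosed
     * '<' swallows the rest of the input.
     */
    static List<Term> terms(String html) {
        List<Term> out = new ArrayList<>();
        boolean inTag = false;
        int start = -1; // start offset of the term being collected, or -1
        for (int i = 0; i <= html.length(); i++) {
            // A trailing space acts as a sentinel that flushes the last term.
            char c = i < html.length() ? html.charAt(i) : ' ';
            if (c == '<') {
                inTag = true;
            }
            boolean tokenChar = !inTag && Character.isLetterOrDigit(c);
            if (tokenChar && start < 0) {
                start = i;
            }
            if (!tokenChar && start >= 0) {
                String text = html.substring(start, i).toLowerCase(Locale.ROOT);
                out.add(new Term(text, start, i));
                start = -1;
            }
            if (c == '>') {
                inTag = false;
            }
        }
        return out;
    }
}
```

With the input <strong>Hello</strong> world, the sketch yields the terms hello (offsets 8 to 13) and world (offsets 23 to 28). A query for strong finds nothing, because the tag was never turned into a term.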



bodymoves at gmail

Apr 5, 2012, 1:34 PM

Post #5 of 6 (287 views)
Re: HTML tags and Lucene highlighting

I want to retain the formatted HTML in a result, but ignore (or
filter out) HTML tags in a search, if that makes sense.



koji at r

Apr 5, 2012, 2:52 PM

Post #6 of 6 (288 views)
Re: HTML tags and Lucene highlighting


There is a way to encode HTML tags:

https://builds.apache.org/job/Lucene-3.x/javadoc/contrib-highlighter/org/apache/lucene/search/highlight/SimpleHTMLEncoder.html

koji
--
Query Log Visualizer for Apache Solr
http://soleami.com/
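
As a rough illustration of what "encoding" HTML tags means here, the following plain-Java sketch escapes markup characters in the text before highlight markup is wrapped around a match. It is not the SimpleHTMLEncoder implementation and does not call the Lucene highlighter; HtmlEncodeSketch, encode, and highlight are hypothetical names, and the set of escaped characters is an assumption. The span class is the one quoted earlier in the thread.

```java
/** Simplified illustration only; this is NOT Lucene's SimpleHTMLEncoder. */
final class HtmlEncodeSketch {

    /** Escapes the characters that would otherwise be read as markup. */
    static String encode(String text) {
        StringBuilder sb = new StringBuilder(text.length() + 16);
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '&': sb.append("&amp;"); break;
                case '<': sb.append("&lt;"); break;
                case '>': sb.append("&gt;"); break;
                case '"': sb.append("&quot;"); break;
                default:  sb.append(c);
            }
        }
        return sb.toString();
    }

    /**
     * Wraps text[start, end) in highlight markup. Each piece is escaped
     * separately so that only the wrapper itself is real markup.
     */
    static String highlight(String text, int start, int end) {
        return encode(text.substring(0, start))
                + "<span class=\"highlighted\">"
                + encode(text.substring(start, end))
                + "</span>"
                + encode(text.substring(end));
    }
}
```

Calling HtmlEncodeSketch.highlight("<strong>", 0, 8) gives <span class="highlighted">&lt;strong&gt;</span>, so a browser shows the tag as literal text instead of interpreting it. Note the trade-off: encoding a fragment this way escapes all of its markup, so the original formatting is displayed as text rather than rendered.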


 
 

