Login | Register For Free | Help
Search for: (Advanced)

Mailing List Archive: Lucene: Java-Dev

[jira] Updated: (LUCENE-1522) another highlighter

 

 

Lucene java-dev RSS feed   Index | Next | Previous | View Threaded


jira at apache

Jun 28, 2009, 4:20 AM

Post #1 of 5 (329 views)
Permalink
[jira] Updated: (LUCENE-1522) another highlighter

[ https://issues.apache.org/jira/browse/LUCENE-1522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Koji Sekiguchi updated LUCENE-1522:
-----------------------------------

Attachment: LUCENE-1522.patch

Added more comment and index time synonym support and its test cases.

> another highlighter
> -------------------
>
> Key: LUCENE-1522
> URL: https://issues.apache.org/jira/browse/LUCENE-1522
> Project: Lucene - Java
> Issue Type: Improvement
> Components: contrib/highlighter
> Reporter: Koji Sekiguchi
> Assignee: Michael McCandless
> Priority: Minor
> Fix For: 2.9
>
> Attachments: colored-tag-sample.png, LUCENE-1522.patch, LUCENE-1522.patch, LUCENE-1522.patch
>
>
> I've written this highlighter for my project to support bi-gram token stream (general token stream (e.g. WhitespaceTokenizer) also supported. see test code in patch). The idea was inherited from my previous project with my colleague and LUCENE-644. This approach needs highlight fields to be TermVector.WITH_POSITIONS_OFFSETS, but is fast and can support N-grams. This depends on LUCENE-1448 to get refined term offsets.
> usage:
> {code:java}
> TopDocs docs = searcher.search( query, 10 );
> Highlighter h = new Highlighter();
> FieldQuery fq = h.getFieldQuery( query );
> for( ScoreDoc scoreDoc : docs.scoreDocs ){
> // fieldName="content", fragCharSize=100, numFragments=3
> String[] fragments = h.getBestFragments( fq, reader, scoreDoc.doc, "content", 100, 3 );
> if( fragments != null ){
> for( String fragment : fragments )
> System.out.println( fragment );
> }
> }
> {code}
> features:
> - fast for large docs
> - supports not only whitespace-based token stream, but also "fixed size" N-gram (e.g. (2,2), not (1,3)) (can solve LUCENE-1489)
> - supports PhraseQuery, phrase-unit highlighting with slops
> {noformat}
> q="w1 w2"
> <b>w1 w2</b>
> ---------------
> q="w1 w2"~1
> <b>w1</b> w3 <b>w2</b> w3 <b>w1 w2</b>
> {noformat}
> - highlight fields need to be TermVector.WITH_POSITIONS_OFFSETS
> - easy to apply patch due to independent package (contrib/highlighter2)
> - uses Java 1.5
> - looks query boost to score fragments (currently doesn't see idf, but it should be possible)
> - pluggable FragListBuilder
> - pluggable FragmentsBuilder
> to do:
> - term positions can be unnecessary when phraseHighlight==false
> - collects performance numbers

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe[at]lucene.apache.org
For additional commands, e-mail: java-dev-help[at]lucene.apache.org


jira at apache

Jun 30, 2009, 10:08 AM

Post #2 of 5 (284 views)
Permalink
[jira] Updated: (LUCENE-1522) another highlighter [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-1522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Koji Sekiguchi updated LUCENE-1522:
-----------------------------------

Attachment: LUCENE-1522.patch

I added package.html to explain the algorithm of H2. :)

> another highlighter
> -------------------
>
> Key: LUCENE-1522
> URL: https://issues.apache.org/jira/browse/LUCENE-1522
> Project: Lucene - Java
> Issue Type: Improvement
> Components: contrib/highlighter
> Reporter: Koji Sekiguchi
> Assignee: Michael McCandless
> Priority: Minor
> Fix For: 2.9
>
> Attachments: colored-tag-sample.png, LUCENE-1522.patch, LUCENE-1522.patch, LUCENE-1522.patch, LUCENE-1522.patch
>
>
> I've written this highlighter for my project to support bi-gram token stream (general token stream (e.g. WhitespaceTokenizer) also supported. see test code in patch). The idea was inherited from my previous project with my colleague and LUCENE-644. This approach needs highlight fields to be TermVector.WITH_POSITIONS_OFFSETS, but is fast and can support N-grams. This depends on LUCENE-1448 to get refined term offsets.
> usage:
> {code:java}
> TopDocs docs = searcher.search( query, 10 );
> Highlighter h = new Highlighter();
> FieldQuery fq = h.getFieldQuery( query );
> for( ScoreDoc scoreDoc : docs.scoreDocs ){
> // fieldName="content", fragCharSize=100, numFragments=3
> String[] fragments = h.getBestFragments( fq, reader, scoreDoc.doc, "content", 100, 3 );
> if( fragments != null ){
> for( String fragment : fragments )
> System.out.println( fragment );
> }
> }
> {code}
> features:
> - fast for large docs
> - supports not only whitespace-based token stream, but also "fixed size" N-gram (e.g. (2,2), not (1,3)) (can solve LUCENE-1489)
> - supports PhraseQuery, phrase-unit highlighting with slops
> {noformat}
> q="w1 w2"
> <b>w1 w2</b>
> ---------------
> q="w1 w2"~1
> <b>w1</b> w3 <b>w2</b> w3 <b>w1 w2</b>
> {noformat}
> - highlight fields need to be TermVector.WITH_POSITIONS_OFFSETS
> - easy to apply patch due to independent package (contrib/highlighter2)
> - uses Java 1.5
> - looks query boost to score fragments (currently doesn't see idf, but it should be possible)
> - pluggable FragListBuilder
> - pluggable FragmentsBuilder
> to do:
> - term positions can be unnecessary when phraseHighlight==false
> - collects performance numbers

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe[at]lucene.apache.org
For additional commands, e-mail: java-dev-help[at]lucene.apache.org


jira at apache

Jul 6, 2009, 6:43 PM

Post #3 of 5 (245 views)
Permalink
[jira] Updated: (LUCENE-1522) another highlighter [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-1522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Koji Sekiguchi updated LUCENE-1522:
-----------------------------------

Attachment: LUCENE-1522.patch

Thank you for your advice, Michael.

bq. because they test multi-valued fields

Right.

I commentted out the part of LUCENE-1448 dependencies, ie, test cases for multi valued:

{code}
/*
* ----------------------------------
* THIS TEST DEPENDS ON LUCENE-1448
* UNCOMMENT WHEN IT IS COMMITTED.
* ----------------------------------
public void testXXXXXXXMVB() throws Exception {
:
}
*/
{code}

All tests pass without LUCENE-1448.

> another highlighter
> -------------------
>
> Key: LUCENE-1522
> URL: https://issues.apache.org/jira/browse/LUCENE-1522
> Project: Lucene - Java
> Issue Type: Improvement
> Components: contrib/highlighter
> Reporter: Koji Sekiguchi
> Assignee: Michael McCandless
> Priority: Minor
> Fix For: 2.9
>
> Attachments: colored-tag-sample.png, LUCENE-1522.patch, LUCENE-1522.patch, LUCENE-1522.patch, LUCENE-1522.patch, LUCENE-1522.patch
>
>
> I've written this highlighter for my project to support bi-gram token stream (general token stream (e.g. WhitespaceTokenizer) also supported. see test code in patch). The idea was inherited from my previous project with my colleague and LUCENE-644. This approach needs highlight fields to be TermVector.WITH_POSITIONS_OFFSETS, but is fast and can support N-grams. This depends on LUCENE-1448 to get refined term offsets.
> usage:
> {code:java}
> TopDocs docs = searcher.search( query, 10 );
> Highlighter h = new Highlighter();
> FieldQuery fq = h.getFieldQuery( query );
> for( ScoreDoc scoreDoc : docs.scoreDocs ){
> // fieldName="content", fragCharSize=100, numFragments=3
> String[] fragments = h.getBestFragments( fq, reader, scoreDoc.doc, "content", 100, 3 );
> if( fragments != null ){
> for( String fragment : fragments )
> System.out.println( fragment );
> }
> }
> {code}
> features:
> - fast for large docs
> - supports not only whitespace-based token stream, but also "fixed size" N-gram (e.g. (2,2), not (1,3)) (can solve LUCENE-1489)
> - supports PhraseQuery, phrase-unit highlighting with slops
> {noformat}
> q="w1 w2"
> <b>w1 w2</b>
> ---------------
> q="w1 w2"~1
> <b>w1</b> w3 <b>w2</b> w3 <b>w1 w2</b>
> {noformat}
> - highlight fields need to be TermVector.WITH_POSITIONS_OFFSETS
> - easy to apply patch due to independent package (contrib/highlighter2)
> - uses Java 1.5
> - looks query boost to score fragments (currently doesn't see idf, but it should be possible)
> - pluggable FragListBuilder
> - pluggable FragmentsBuilder
> to do:
> - term positions can be unnecessary when phraseHighlight==false
> - collects performance numbers

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe[at]lucene.apache.org
For additional commands, e-mail: java-dev-help[at]lucene.apache.org


jira at apache

Jul 7, 2009, 11:48 AM

Post #4 of 5 (244 views)
Permalink
[jira] Updated: (LUCENE-1522) another highlighter [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-1522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-1522:
---------------------------------------

Attachment: LUCENE-1522.patch

Patch looks good, thanks Koji! All tests now pass for me.

I attached a new patch, that updates the website w/ references to highlighter2. I plan to commit in a day or two.

> another highlighter
> -------------------
>
> Key: LUCENE-1522
> URL: https://issues.apache.org/jira/browse/LUCENE-1522
> Project: Lucene - Java
> Issue Type: Improvement
> Components: contrib/highlighter
> Reporter: Koji Sekiguchi
> Assignee: Michael McCandless
> Priority: Minor
> Fix For: 2.9
>
> Attachments: colored-tag-sample.png, LUCENE-1522.patch, LUCENE-1522.patch, LUCENE-1522.patch, LUCENE-1522.patch, LUCENE-1522.patch, LUCENE-1522.patch
>
>
> I've written this highlighter for my project to support bi-gram token stream (general token stream (e.g. WhitespaceTokenizer) also supported. see test code in patch). The idea was inherited from my previous project with my colleague and LUCENE-644. This approach needs highlight fields to be TermVector.WITH_POSITIONS_OFFSETS, but is fast and can support N-grams. This depends on LUCENE-1448 to get refined term offsets.
> usage:
> {code:java}
> TopDocs docs = searcher.search( query, 10 );
> Highlighter h = new Highlighter();
> FieldQuery fq = h.getFieldQuery( query );
> for( ScoreDoc scoreDoc : docs.scoreDocs ){
> // fieldName="content", fragCharSize=100, numFragments=3
> String[] fragments = h.getBestFragments( fq, reader, scoreDoc.doc, "content", 100, 3 );
> if( fragments != null ){
> for( String fragment : fragments )
> System.out.println( fragment );
> }
> }
> {code}
> features:
> - fast for large docs
> - supports not only whitespace-based token stream, but also "fixed size" N-gram (e.g. (2,2), not (1,3)) (can solve LUCENE-1489)
> - supports PhraseQuery, phrase-unit highlighting with slops
> {noformat}
> q="w1 w2"
> <b>w1 w2</b>
> ---------------
> q="w1 w2"~1
> <b>w1</b> w3 <b>w2</b> w3 <b>w1 w2</b>
> {noformat}
> - highlight fields need to be TermVector.WITH_POSITIONS_OFFSETS
> - easy to apply patch due to independent package (contrib/highlighter2)
> - uses Java 1.5
> - looks query boost to score fragments (currently doesn't see idf, but it should be possible)
> - pluggable FragListBuilder
> - pluggable FragmentsBuilder
> to do:
> - term positions can be unnecessary when phraseHighlight==false
> - collects performance numbers

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe[at]lucene.apache.org
For additional commands, e-mail: java-dev-help[at]lucene.apache.org


jira at apache

Jul 9, 2009, 5:03 AM

Post #5 of 5 (221 views)
Permalink
[jira] Updated: (LUCENE-1522) another highlighter [In reply to]

[ https://issues.apache.org/jira/browse/LUCENE-1522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Koji Sekiguchi updated LUCENE-1522:
-----------------------------------

Attachment: LUCENE-1522.patch

renamed Highlighter2 to FastVectorHighlighter :)

> another highlighter
> -------------------
>
> Key: LUCENE-1522
> URL: https://issues.apache.org/jira/browse/LUCENE-1522
> Project: Lucene - Java
> Issue Type: Improvement
> Components: contrib/highlighter
> Reporter: Koji Sekiguchi
> Assignee: Michael McCandless
> Priority: Minor
> Fix For: 2.9
>
> Attachments: colored-tag-sample.png, LUCENE-1522.patch, LUCENE-1522.patch, LUCENE-1522.patch, LUCENE-1522.patch, LUCENE-1522.patch, LUCENE-1522.patch, LUCENE-1522.patch
>
>
> I've written this highlighter for my project to support bi-gram token stream (general token stream (e.g. WhitespaceTokenizer) also supported. see test code in patch). The idea was inherited from my previous project with my colleague and LUCENE-644. This approach needs highlight fields to be TermVector.WITH_POSITIONS_OFFSETS, but is fast and can support N-grams. This depends on LUCENE-1448 to get refined term offsets.
> usage:
> {code:java}
> TopDocs docs = searcher.search( query, 10 );
> Highlighter h = new Highlighter();
> FieldQuery fq = h.getFieldQuery( query );
> for( ScoreDoc scoreDoc : docs.scoreDocs ){
> // fieldName="content", fragCharSize=100, numFragments=3
> String[] fragments = h.getBestFragments( fq, reader, scoreDoc.doc, "content", 100, 3 );
> if( fragments != null ){
> for( String fragment : fragments )
> System.out.println( fragment );
> }
> }
> {code}
> features:
> - fast for large docs
> - supports not only whitespace-based token stream, but also "fixed size" N-gram (e.g. (2,2), not (1,3)) (can solve LUCENE-1489)
> - supports PhraseQuery, phrase-unit highlighting with slops
> {noformat}
> q="w1 w2"
> <b>w1 w2</b>
> ---------------
> q="w1 w2"~1
> <b>w1</b> w3 <b>w2</b> w3 <b>w1 w2</b>
> {noformat}
> - highlight fields need to be TermVector.WITH_POSITIONS_OFFSETS
> - easy to apply patch due to independent package (contrib/highlighter2)
> - uses Java 1.5
> - looks query boost to score fragments (currently doesn't see idf, but it should be possible)
> - pluggable FragListBuilder
> - pluggable FragmentsBuilder
> to do:
> - term positions can be unnecessary when phraseHighlight==false
> - collects performance numbers

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe[at]lucene.apache.org
For additional commands, e-mail: java-dev-help[at]lucene.apache.org

Lucene java-dev RSS feed   Index | Next | Previous | View Threaded
 
 


Interested in having your list archived? Contact lists@gossamer-threads.com
 
  Web Applications & Managed Hosting Powered by Gossamer Threads Inc.