
Mailing List Archive: Lucene: Java-User

Stemming - limited index expansion

 

 



paul at metajure

Jun 12, 2012, 12:07 PM

Post #1 of 4 (420 views)
Stemming - limited index expansion

As others have previously proposed on this list, I am interested in inserting a second token at some positions in my index. I'll call this Limited Index Expansion.
I want to retain the original token, so that I can score an original word that matches in a text better than just any synonym/stem etc. Maybe I'll even do this with payloads (on the 2nd token?).
If I didn't keep the original word all I would be doing is a limited index time "reduction". Saving the original word and sometimes a lemma/stem (or something else), means I anticipate at most two tokens at a position in the index.
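To make the idea concrete, here is a minimal plain-Java sketch of the token layout described above. It is deliberately independent of the Lucene API: the `Tok` class, the `expand` method, and the toy `stripPluralS` stemmer are hypothetical names made up for illustration. Each word keeps its original term with a position increment of 1, and a second term with increment 0 is added only when the stemmer actually changes the word, so there are never more than two tokens at a position.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.UnaryOperator;

// Plain-Java model of "limited index expansion"; this is not the Lucene API.
final class LimitedExpansion {

    // One emitted token: the term text plus its position increment
    // (1 = new position, 0 = stacked on the previous token's position).
    static final class Tok {
        final String term;
        final int posInc;

        Tok(String term, int posInc) {
            this.term = term;
            this.posInc = posInc;
        }

        @Override
        public String toString() {
            return term + "(+" + posInc + ")";
        }
    }

    // Hypothetical toy stemmer: strips one trailing "s" from words longer than 3 chars.
    static String stripPluralS(String word) {
        return (word.length() > 3 && word.endsWith("s"))
                ? word.substring(0, word.length() - 1)
                : word;
    }

    // Keeps every original word; adds the stem at the same position only if it differs.
    static List<Tok> expand(List<String> words, UnaryOperator<String> stemmer) {
        List<Tok> out = new ArrayList<>();
        for (String word : words) {
            out.add(new Tok(word, 1));
            String stem = stemmer.apply(word);
            if (!stem.equals(word)) {
                out.add(new Tok(stem, 0));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(expand(Arrays.asList("fast", "cars", "win"),
                LimitedExpansion::stripPluralS));
        // prints [fast(+1), cars(+1), car(+0), win(+1)]
    }
}
```

In a real analyzer the same decision would presumably live inside a TokenFilter, with the position increment carried by the token's position-increment attribute.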

I couldn't find a nearly-right high-level Filter that I could use to add logic to call a stemmer and conditionally add another token. Any suggestions?
One idea I had is that adding a second token is much like what a SynonymFilter does, but yikes I was starting to grok PendingInputs, PendingOutputs,
but wasn't getting very far reading through SynonymMap and its BytesRefHash etc. Obviously it is written to be very good with memory and very fast, but it looks a bit tricky to extend for other sources of "synonyms". It is too bad that the get-synonym part of the operation is not encapsulated in something pluggable or overridable, so I could just return an appropriate array of CharsRefs. The SynonymFilter is final anyway.
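The kind of pluggable lookup wished for here can be sketched in plain Java. This is only a model of the shape, not SynonymFilter and not any Lucene class: the `AlternateSource` interface, the `ExpandingIterator` class, and the "term/posInc" string encoding are all assumptions made for illustration. The iterator buffers the alternates for the current word in a pending queue and drains that queue before pulling the next input word, which is roughly the state a streaming filter has to keep between calls.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;

// Plain-Java model of a filter with a pluggable "get alternates" step; not the Lucene API.
final class PluggableExpansion {

    // Hypothetical plug-in point: return zero or more extra terms for a word.
    interface AlternateSource {
        List<String> alternatesFor(String term);
    }

    // Streams "term/posInc" strings; alternates are queued and emitted with posInc 0.
    static final class ExpandingIterator implements Iterator<String> {
        private final Iterator<String> input;
        private final AlternateSource source;
        private final Deque<String> pending = new ArrayDeque<>();

        ExpandingIterator(Iterator<String> input, AlternateSource source) {
            this.input = input;
            this.source = source;
        }

        @Override
        public boolean hasNext() {
            return !pending.isEmpty() || input.hasNext();
        }

        @Override
        public String next() {
            if (!pending.isEmpty()) {
                return pending.poll() + "/0";   // stacked on the previous position
            }
            String term = input.next();
            pending.addAll(source.alternatesFor(term));
            return term + "/1";                 // the original advances the position
        }
    }

    // Convenience helper that drains the iterator into a list.
    static List<String> run(List<String> words, AlternateSource source) {
        List<String> out = new ArrayList<>();
        Iterator<String> it = new ExpandingIterator(words.iterator(), source);
        while (it.hasNext()) {
            out.add(it.next());
        }
        return out;
    }

    public static void main(String[] args) {
        AlternateSource lemmas = term -> term.equals("mice")
                ? Collections.singletonList("mouse")
                : Collections.<String>emptyList();
        System.out.println(run(Arrays.asList("two", "mice", "ran"), lemmas));
        // prints [two/1, mice/1, mouse/0, ran/1]
    }
}
```

Keeping the lookup behind a one-method interface means a stemmer, a lemma dictionary, or a synonym table could be swapped in without touching the buffering logic.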

Can anyone point me toward any existing high-level filter that I could use by sub-classing, modifying, plugging, or just as a good example to which I might add my additional code to add another token?
Building Filters is new to me, but right now nothing is jumping out at me as a basis for such a Filter. Any suggestions? Did I miss something in core or contrib?
Is there some other combination of buffering, copying, sinking, etc. filters that I'm missing and should use to build a filter chain that would aid this process?

-Paul


jack at basetechnology

Jun 12, 2012, 1:14 PM

Post #2 of 4 (410 views)
Re: Stemming - limited index expansion

I don't completely follow precisely what you want to do, but the
WordDelimiterFilter is an example of a token filter that outputs an extra
token at the same position, such as with its CATENATE_ALL/WORDS/NUMBERS
options.

https://builds.apache.org/job/Lucene-trunk/javadoc/analyzers-common/org/apache/lucene/analysis/miscellaneous/WordDelimiterFilter.html

For example, given the input "wi-fi", it would output "wi" with position 0,
"fi" with position 1, and "wifi" also with position 0.

Or, with its PRESERVE_ORIGINAL option, that same input would output "wi" at
0, "fi" at 1, and "wi-fi" at 0.
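The two outputs above can be reproduced with a small plain-Java model. This is not WordDelimiterFilter and does not use its flags; the `DelimiterModel` class, the `split` method, and the "term@position" encoding are hypothetical and exist only to show which tokens end up sharing a position.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java model of the outputs described above; not WordDelimiterFilter itself.
final class DelimiterModel {

    // Splits on '-' and returns "term@position" entries. Subwords take
    // consecutive positions; the optional extra tokens share position 0.
    static List<String> split(String input, boolean catenateAll, boolean preserveOriginal) {
        String[] parts = input.split("-");
        List<String> out = new ArrayList<>();
        for (int i = 0; i < parts.length; i++) {
            out.add(parts[i] + "@" + i);
        }
        if (parts.length > 1) {
            if (catenateAll) {
                out.add(String.join("", parts) + "@0");   // e.g. "wifi"
            }
            if (preserveOriginal) {
                out.add(input + "@0");                    // e.g. "wi-fi"
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(split("wi-fi", true, false));  // prints [wi@0, fi@1, wifi@0]
        System.out.println(split("wi-fi", false, true));  // prints [wi@0, fi@1, wi-fi@0]
    }
}
```

The model lists the extra token last only to mirror the wording above. A real token stream expresses positions as non-negative increments, so tokens at position 0 would have to be emitted before "fi".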

That said, maybe you could clarify your specific intent with an example.
Maybe you simply want to internally call some existing stemmer filter and
output both the original and stemmed term at the same location?

-- Jack Krupansky



---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe [at] lucene
For additional commands, e-mail: java-user-help [at] lucene


paul at metajure

Jun 12, 2012, 4:43 PM

Post #3 of 4 (407 views)
RE: Stemming - limited index expansion

Thanks for the reply.

> -----Original Message-----
> From: Jack Krupansky [mailto:jack [at] basetechnology]
> Sent: Tuesday, June 12, 2012 1:14 PM
> To: java-user [at] lucene
> Subject: Re: Stemming - limited index expansion
>
> I don't completely follow precisely what you want to do, but the WordDelimiterFilter is an example of a
> token filter that outputs an extra token at the same position, such as with its
> CATENATE_ALL/WORDS/NUMBERS options.

Thanks for directing me to that. I'm currently using 3.4, and WordDelimiterFilter doesn't appear in the 3.6 code base.
If it doesn't show up until 4.0+ (your link is actually 5.0!), I know that
"Terms are no longer required to be character based. Lucene views a term as an arbitrary byte[]"
-- https://builds.apache.org/job/Lucene-trunk/javadoc/changes/Changes.html#4.0.0-alpha.api_changes
But hopefully it is at the right level to suggest how this would be done using the old CharsRef instead of whatever the new stuff uses (BytesRef?).
I'll take a look.

> Maybe you simply want to internally call some existing stemmer filter and output both the original and
> stemmed term at the same location?

Yes, that is very close to what I want to do, possibly with the one addition of stemming only a limited set of words (but more than just plurals).
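A minimal plain-Java sketch of that restriction, assuming the limited set is expressed as a lookup table from word to lemma. Nothing here is Lucene API: the `SelectiveStemming` class, the `Tok` class with its `original` flag, and the sample lemma map are hypothetical. The flag stands in for whatever marker (a payload, for instance) would later let an exact match on the original word score higher than a match on the derived form.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java model of stemming only a limited set of words; not the Lucene API.
final class SelectiveStemming {

    // A token with its absolute position and a flag telling original from derived.
    static final class Tok {
        final String term;
        final int position;
        final boolean original;

        Tok(String term, int position, boolean original) {
            this.term = term;
            this.position = position;
            this.original = original;
        }

        @Override
        public String toString() {
            return term + "@" + position + (original ? "" : "*");
        }
    }

    // Every word is kept; a lemma is added at the same position only for
    // words present in the (hypothetical) lemma table.
    static List<Tok> analyze(List<String> words, Map<String, String> lemmas) {
        List<Tok> out = new ArrayList<>();
        for (int pos = 0; pos < words.size(); pos++) {
            String word = words.get(pos);
            out.add(new Tok(word, pos, true));
            String lemma = lemmas.get(word);
            if (lemma != null && !lemma.equals(word)) {
                out.add(new Tok(lemma, pos, false));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> lemmas = new HashMap<>();
        lemmas.put("children", "child");
        lemmas.put("ran", "run");
        System.out.println(analyze(Arrays.asList("children", "ran", "cars"), lemmas));
        // prints [children@0, child@0*, ran@1, run@1*, cars@2]
    }
}
```

Here "cars" is left alone because it is not in the table, which is the "limited" part: words outside the set produce exactly one token.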

-Paul



jack at basetechnology

Jun 12, 2012, 4:57 PM

Post #4 of 4 (410 views)
Re: Stemming - limited index expansion

I forgot about the Solr/Lucene code shuffling. Back in 3.4, WDF was in Solr
rather than Lucene. Here's the code:

http://svn.apache.org/viewvc/lucene/dev/branches/lucene_solr_3_4/solr/core/src/java/org/apache/solr/analysis/WordDelimiterFilter.java?revision=1166268&view=markup

-- Jack Krupansky

