Mailing List Archive: Lucene: Java-User

Payloads, Tokenizers, and Filters. Oh My!

 

 


pgwillia at uwaterloo

Nov 16, 2007, 7:37 PM

Post #1 of 6
Payloads, Tokenizers, and Filters. Oh My!

Hi All,

I'll explain what I'm working on, and then I'll ask my two questions.

I'm working on the issue
https://issues.apache.org/jira/browse/SOLR-380, a feature request to
index a "Structured Document" (anything that can be represented as XML)
in order to provide more context for hits in the result set. This lets
us query the index for "Canada" and report not only that the query
matched a document titled "Some Nonsense" but also that the query term
appeared on page 7 of chapter 1. We can then take this one step further
and mark up/highlight the image of that page based on our OCR and the
position of the hit.

For example:

<book title='Some Nonsense'><chapter title='One'><page name='1'>Some
text from page one of a book.</page><page name='7'>Some more text from
page seven of a book. Oh and I'm from Canada.</page></chapter></book>

I accomplished this by creating a custom Tokenizer which strips the
XML elements and stores them as a Payload on each of the Tokens created
from the character data in the input. The payload is the string that
describes the XPath at that location. So for <Canada> the payload is
"/book[title='Some Nonsense']/chapter[title='One']/page[name='7']".
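
The idea described above can be sketched roughly as follows. This is a stand-alone illustration only, not the SOLR-380 tokenizer: it uses the JDK's StAX reader rather than Lucene's Tokenizer API, it splits on whitespace only, and the names XPathPayloadSketch, PathToken and tokenize are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

/** Hypothetical sketch: pairs each whitespace-separated term with the path of open elements. */
final class XPathPayloadSketch {

    /** A term plus the path string that would be stored as its payload. */
    static final class PathToken {
        final String term;
        final String path;
        PathToken(String term, String path) { this.term = term; this.path = path; }
    }

    static List<PathToken> tokenize(String xml) {
        List<PathToken> tokens = new ArrayList<>();
        Deque<String> open = new ArrayDeque<>();   // one path segment per open element
        try {
            XMLInputFactory factory = XMLInputFactory.newInstance();
            // Deliver each text run as a single CHARACTERS event.
            factory.setProperty(XMLInputFactory.IS_COALESCING, Boolean.TRUE);
            XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(xml));
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    StringBuilder segment = new StringBuilder("/").append(reader.getLocalName());
                    for (int i = 0; i < reader.getAttributeCount(); i++) {
                        segment.append('[').append(reader.getAttributeLocalName(i))
                               .append("='").append(reader.getAttributeValue(i)).append("']");
                    }
                    open.addLast(segment.toString());
                } else if (event == XMLStreamConstants.END_ELEMENT) {
                    open.removeLast();
                } else if (event == XMLStreamConstants.CHARACTERS) {
                    String path = String.join("", open);   // outermost element first
                    for (String term : reader.getText().trim().split("\\s+")) {
                        if (!term.isEmpty()) {
                            tokens.add(new PathToken(term, path));
                        }
                    }
                }
            }
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("malformed XML", e);
        }
        return tokens;
    }
}
```

In a real Tokenizer the path string would be converted to bytes and attached to the Token as its payload; here it is simply kept next to the term.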

The other part of this work is the SolrHighlighter, which is less
important to this list. I retrieve the TermPositions for the Query's
Terms and use them to get back the payload for each hit, then build
output which shows hit positions grouped by the payload they are
associated with.

QUESTION 1: Applying TokenFilters to my Tokenizer creates some strange
(in my opinion) behavior. First of all the TermPositions change and
second the Payload is removed. Is this the expected behavior, or is
this a bug? With the Payload being an "experimental feature" I can
understand if this persistence just hasn't been implemented yet. But is
it, or will it be?

In the following example I will denote a token by {pos,<term
text>,<payload>}:

input: <class name='mammalia'>Dog, and Cat</class>

XmlPayloadTokenizer:
{1,<Dog,>,</class[name='mammalia'][startPos='0']>},{2,<and>,</class[name='mammalia'][startPos='0']>},{3,<Cat>,</class[name='mammalia'][startPos='0']>}

StopFilter:
{1,<Dog,>,</class[name='mammalia'][startPos='0']>},{2,<Cat>,</class[name='mammalia'][startPos='0']>}

WordDelimiterFilter:
{1,<Dog>,<>},{2,<Cat>,</class[name='mammalia'][startPos='0']>}

LowerCaseFilter:
{1,<dog>,<>},{2,<cat>,</class[name='mammalia'][startPos='0']>}
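
One possible reading of the position change in the StopFilter step, offered as a sketch and not confirmed anywhere in this thread: Lucene derives a token's position by summing positionIncrement values, so a filter that drops a token without handing its increment to the next surviving token shifts every later position down. The classes below are hypothetical stand-ins, not Lucene's Token or StopFilter.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Hypothetical model of how dropped tokens affect positions. */
final class PositionIncrementSketch {

    static final class Tok {
        final String term;
        final int positionIncrement;
        Tok(String term, int positionIncrement) {
            this.term = term;
            this.positionIncrement = positionIncrement;
        }
    }

    /** Positions are the running sum of increments (1-based, to match the notation above). */
    static List<Integer> positions(List<Tok> stream) {
        List<Integer> result = new ArrayList<>();
        int position = 0;
        for (Tok t : stream) {
            position += t.positionIncrement;
            result.add(position);
        }
        return result;
    }

    /**
     * Drops stop words. When preserveGaps is true, the increment of each dropped
     * token is folded into the next surviving token so later positions stay put.
     */
    static List<Tok> dropStopWords(List<Tok> in, Set<String> stopWords, boolean preserveGaps) {
        List<Tok> out = new ArrayList<>();
        int skipped = 0;
        for (Tok t : in) {
            if (stopWords.contains(t.term)) {
                if (preserveGaps) {
                    skipped += t.positionIncrement;
                }
                continue;
            }
            out.add(new Tok(t.term, t.positionIncrement + skipped));
            skipped = 0;
        }
        return out;
    }
}
```

With the stream "Dog," / "and" / "Cat" and "and" as the only stop word, dropping without preserving gaps puts "Cat" at position 2, as in the listing above, while preserving gaps keeps it at position 3.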


QUESTION 2: As I explained, I'm storing the String representing the
XPath of the token as the Payload (well, the byte array of the String)
of each token. Is there a more efficient way to do this? Is this
abusing the Payload functionality, and will it turn around and bite me
when I get to indexing hundreds of thousands of documents? Perhaps I
shouldn't be relying on the Payload functionality before it is deemed
not experimental?

I feel these questions are both related to Lucene proper rather than
Solr, which is why I've posted here. If you think solr-user is a better
place to post my questions let me know.

Thanks for your input!
Tricia


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe[at]lucene.apache.org
For additional commands, e-mail: java-user-help[at]lucene.apache.org


gsingers at apache

Nov 17, 2007, 4:41 AM

Post #3 of 6
Re: Payloads, Tokenizers, and Filters. Oh My!

Inline below

On Nov 16, 2007, at 6:03 PM, Tricia Williams wrote:

> Hi All,
>
> I'll explain what I'm working on, and then I'll ask my two
> questions.
>
> I'm working on the issue https://issues.apache.org/jira/browse/SOLR-380
> which is a feature request that allows one to index a "Structured
> Document" which is anything that can be represented by XML in order
> to provide more context to hits in the result set. This allows us
> to do things like query the index for "Canada" and be able to not
> only say that that query matched a document titled "Some Nonsense"
> but also that the query term appeared on page 7 of chapter 1. We
> can then take this one step further and markup/highlight the image
> of this page based on our OCR and position hit.
> For example:
>
> <book title='Some Nonsense'><chapter title='One'><page name='1'>Some
> text from page one of a book.</page><page name='7'>Some more text
> from page seven of a book. Oh and I'm from Canada.</page></chapter></book>
>
> I accomplished this by creating a custom Tokenizer which strips
> the xml elements and stores them as a Payload at each of the Tokens
> created from the character data in the input. The payload is the
> string that describes the XPath at that location. So for <Canada>
> the payload is
> "/book[title='Some Nonsense']/chapter[title='One']/page[name='7']"
>
> The other part of this work is the SolrHighlighter which is less
> important to this list. I retrieve the TermPositions for the
> Query's Terms and use the TermPosition functionality to get back the
> payload for the hits and build output which shows hit positions
> categorized by the payload they are associated with.
>
> QUESTION 1: Applying TokenFilters to my Tokenizer creates some
> strange (in my opinion) behavior. First of all the TermPositions
> change and second the Payload is removed. Is this the expected
> behavior, or is this a bug? With the Payload being an "experimental
> feature" I can understand if this persistence just hasn't been
> implemented yet. But is it, or will it be?
>

Do you have other TokenFilters in your Analyzer? Are you reusing the
same Token or creating a new one in your TokenFilters? If you are
creating a new one, you will have to set the payload yourself, as it
won't be copied over. Perhaps we should add a constructor that takes a
payload. On the other hand, I think we are going to remove the Payload
object in favor of just using the byte array.
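
To illustrate the point about new Tokens, here is a minimal sketch. Tok is a hypothetical simplified class, not Lucene's Token, and the three methods stand in for three ways a lower-casing filter might be written.

```java
import java.util.Locale;

/** Hypothetical model of why a filter that builds a new token loses the payload. */
final class PayloadCarrySketch {

    static final class Tok {
        String term;
        byte[] payload;   // null means "no payload"
        Tok(String term) { this.term = term; }
    }

    /** Builds a fresh token and forgets the payload: the payload is lost. */
    static Tok lowerCaseNewToken(Tok in) {
        return new Tok(in.term.toLowerCase(Locale.ROOT));
    }

    /** Mutates and returns the incoming token: the payload survives untouched. */
    static Tok lowerCaseInPlace(Tok in) {
        in.term = in.term.toLowerCase(Locale.ROOT);
        return in;
    }

    /** Builds a fresh token but explicitly carries the payload across. */
    static Tok lowerCaseCopyingPayload(Tok in) {
        Tok out = new Tok(in.term.toLowerCase(Locale.ROOT));
        out.payload = in.payload;
        return out;
    }
}
```

Only the first variant drops the payload, which would match the empty payload shown for "Dog" after the WordDelimiterFilter step if that filter builds its output tokens from scratch.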


> In the following example I will denote a token by {pos,<term
> text>,<payload>}:
>
> input: <class name='mammalia'>Dog, and Cat</class>
>
> XmlPayloadTokenizer:
> {1,<Dog,>,</class[name='mammalia'][startPos='0']>},{2,<and>,</class[name='mammalia'][startPos='0']>},{3,<Cat>,</class[name='mammalia'][startPos='0']>}
> StopFilter:
> {1,<Dog,>,</class[name='mammalia'][startPos='0']>},{2,<Cat>,</class[name='mammalia'][startPos='0']>}
> WordDelimiterFilter:
> {1,<Dog>,<>} {2,<Cat>,</class[name='mammalia'][startPos='0']>}
> LowerCaseFilter:
> {1,<dog>,<>} {2,<cat>,</class[name='mammalia'][startPos='0']>}
>
>
> QUESTION 2: As I explained I'm storing the String representing the
> XPath of the token as the Payload (well the ByteArray of the String)
> of each token. Is there a more efficient way to do this? Is this
> exploiting Payload functionality and will it turn around and bite me
> when I get to indexing hundreds of thousands of documents? Perhaps
> I shouldn't be relying on the Payload functionality before it is
> deemed not experimental?
>

I think this is reasonable. Michael Busch gave a nice talk at
ApacheCon on payloads, which you can find at
http://people.apache.org/~buschmi/apachecon/AdvancedIndexingLuceneAtlanta07.ppt

I guess you just want to be careful about how big your payloads get.

One of the original use cases for payloads was for doing XPath queries.

Also, the only thing experimental about Payloads is the actual
signature of the methods, not the need for them. If anything, I think
you will see an expansion of payload capability in the future. Note
also that you will probably be interested in adding more Payload
querying capability. I am also in the process of adding the ability to
get payloads from Spans, but I am not sure whether this will get into
2.3.

Cheers,
Grant



pgwillia at uwaterloo

Nov 17, 2007, 11:00 AM

Post #4 of 6
Re: Payloads, Tokenizers, and Filters. Oh My!

Hi Grant,

Thanks for your response!

Taking a closer look, the TokenFilters that cause my problem with the
Payload are all from org.apache.solr.analysis rather than
org.apache.lucene.analysis. I had originally thought that all the
TokenFilters available through Solr's TokenFilterFactory(s) were part of
Lucene. But I guess there are TokenFilters specific to Solr, such as
the WordDelimiterFilter, that aren't aware of Payloads. Thanks for
saying exactly the right thing to make me realize that.

> I guess you just want to be careful about how big your payloads get.

Erik Hatcher suggested storing the bulky XPath strings in a table of
contents field and just storing a smaller representation of the
information at each token, with the intention of doing a lookup to get
the bulky stuff at query time.
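
A rough sketch of that suggestion, under assumptions of my own: each distinct path string gets a small integer id, the 4-byte id is what goes into the payload, and the id-to-path table is kept separately (for example in a stored field) for lookup at query time. PathTableSketch and its methods are hypothetical names, not part of Lucene or Solr.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical per-document table mapping bulky path strings to compact ids. */
final class PathTableSketch {
    private final Map<String, Integer> idsByPath = new HashMap<>();
    private final List<String> pathsById = new ArrayList<>();

    /** Returns the existing id for a path, or assigns the next free one. */
    int idFor(String path) {
        Integer id = idsByPath.get(path);
        if (id == null) {
            id = pathsById.size();
            idsByPath.put(path, id);
            pathsById.add(path);
        }
        return id;
    }

    /** Query-time lookup from the compact id back to the full path. */
    String pathFor(int id) {
        return pathsById.get(id);
    }

    /** The bytes that would be stored as the token's payload. */
    static byte[] encode(int id) {
        return ByteBuffer.allocate(4).putInt(id).array();
    }

    static int decode(byte[] payload) {
        return ByteBuffer.wrap(payload).getInt();
    }
}
```

Every token on the same page then carries the same four bytes instead of a copy of the full path string; a variable-length encoding could shrink that further at the cost of a little more code.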

> One of the original use cases for payloads was for doing XPath queries.

Has anyone actually completed anything with XPath queries and Payloads?

> Also, the only thing experimental about Payloads is the actual
> signature of the methods, not the need for them. If anything, I think
> you will see an expansion of payload capability in the future. Also
> note, that you will probably be interested in adding more Payload
> querying capability. And also note, I am in the process of adding the
> ability to get payloads from Spans, but I am not sure if this gets
> into 2.3 or not.

I look forward to seeing more of Payloads! I can already see how
they can be extremely useful.

Thanks,
Tricia



pgwillia at uwaterloo

Nov 18, 2007, 2:10 PM

Post #5 of 6
Re: Payloads, Tokenizers, and Filters. Oh My!

I apologize for cross-posting but I believe both Solr and Lucene users
and developers should be concerned with this. I am not aware of a
better way to reach both communities.

In this email I'm looking for comments on:

* Do TokenFilters belong in the Solr code base at all?
* How to deal with TokenFilters that add new Tokens to the stream?
* How to patch TokenFilters and Tokenizers using the model of
LUCENE-969 in the Solr code base and in Lucene contrib?

Earlier in this thread I identified that at least one TokenFilter is
eating Payloads (WordDelimiterFilter).

Yonik pointed out:
>> Yes, this will be an issue for many custom tokenizers that don't yet
>> know about payloads but that create tokens. It's not clear what to do
>> in some cases when multiple tokens are created from one... should
>> identical payloads be created for the new tokens... it depends on what
>> the semantics of those payloads are.
>>
And I responded:
> I suppose that it is only fair to take this on a case by case basis.
> Maybe we will have to write new TokenFilters for each Tokenizer that
> uses Payloads (but I sure hope not!). Maybe we can build some
> optional configuration options into the TokenFilter constructor that
> guide their behavior with regard to Payloads. Maybe there is
> something stored in the TokenStream that dictates how the Payloads are
> handled by the TokenFilters. Maybe there is no case where identical
> payloads would not be created for new tokens and we can just change
> the TokenFilter to deal with payloads directly in a uniform way.

I thought it might be useful to figure out which existing TokenFilters
need to know about Payloads. To this end I have taken an inventory of
the TokenFilters out there. I think it is fair to categorize them by
Add (A), Delete (D), Modify (M), Observe (O):

org.apache.solr.analysis.HyphenatedWordsFilter, DM
org.apache.solr.analysis.KeepWordFilter, D
org.apache.solr.analysis.LengthFilter, D
org.apache.solr.analysis.PatternReplaceFilter, M
org.apache.solr.analysis.PhoneticFilter, AM
org.apache.solr.analysis.RemoveDuplicatesTokenFilter, D
org.apache.solr.analysis.SynonymFilter, ADM
org.apache.solr.analysis.TrimFilter, M
org.apache.solr.analysis.WordDelimiterFilter, AM
org.apache.lucene.analysis.CachingTokenFilter, O
org.apache.lucene.analysis.ISOLatin1AccentFilter, M
org.apache.lucene.analysis.LengthFilter, D
org.apache.lucene.analysis.LowerCaseFilter, M
org.apache.lucene.analysis.PorterStemFilter, M
org.apache.lucene.analysis.StopFilter, D
org.apache.lucene.analysis.standard.StandardFilter, M
org.apache.lucene.analysis.br.BrazilianStemFilter, M
org.apache.lucene.analysis.cn.ChineseFilter, D
org.apache.lucene.analysis.de.GermanStemFilter, M
org.apache.lucene.analysis.el.GreekLowerCaseFilter, M
org.apache.lucene.analysis.fr.ElisionFilter, M
org.apache.lucene.analysis.fr.FrenchStemFilter, M
org.apache.lucene.analysis.ngram.EdgeNGramTokenFilter, AM
org.apache.lucene.analysis.ngram.NGramTokenFilter, AM
org.apache.lucene.analysis.nl.DutchStemFilter, M
org.apache.lucene.analysis.ru.RussianLowerCaseFilter, M
org.apache.lucene.analysis.ru.RussianStemFilter, M
org.apache.lucene.analysis.th.ThaiWordFilter, AM
org.apache.lucene.analysis.snowball.SnowballFilter, M

Some characteristics of Add (A), Delete (D), Modify (M), Observe (O):
Add: uses new Token() and a buffer of Tokens to consider before calling
input.next()
Delete: loops, ignoring tokens based on some criteria
Modify: uses new Token(), or the Token set methods
Observe: rare; only CachingTokenFilter

The categories of TokenFilters that are affected by Payloads are add and
modify. TokenFilters which only delete or observe return each Token they
pass through intact by default, hence the Payload will remain intact.

Maybe the Lucene community has thought about this problem? I noticed
that the org.apache.lucene.analysis TokenFilters in the modify category
(there are none in the add category) refrain from using new Token().
That led me to the comment in the JavaDocs:
>
> *NOTE:* As of 2.3, Token stores the term text internally as a
> malleable char[] termBuffer instead of String termText. The indexing
> code and core tokenizers have been changed to re-use a single Token
> instance, changing its buffer and other fields in-place as the Token
> is processed. This provides substantially better indexing performance
> as it saves the GC cost of new'ing a Token and String for every term.
> The APIs that accept String termText are still available but a warning
> about the associated performance cost has been added (below). The
> termText() method has been deprecated.
>
> Tokenizers and filters should try to re-use a Token instance when
> possible for best performance, by implementing the
> TokenStream.next(Token) API. Failing that, to create a new Token you
> should first use one of the constructors that starts with null text.
> Then you should call either termBuffer() or resizeTermBuffer(int) to
> retrieve the Token's termBuffer. Fill in the characters of your term
> into this buffer, and finally call setTermLength(int) to set the
> length of the term text. See LUCENE-969
> <https://issues.apache.org/jira/browse/LUCENE-969> for details.
>
The patch mentioned modifies the Tokenizers and TokenFilters in the
Lucene core code base to abide by these suggestions. This would mean
that the TokenFilters in my modify category would, by default, leave the
Payload of the modified Token intact. I would argue that when/if the
Solr community starts using Lucene 2.3, a similar patch should be
created for the TokenFilters there, but I wonder if the TokenFilters
belong in Solr's domain at all. At some point the TokenFilters and
Tokenizers in the contrib sections of Lucene should also be patched to
follow the suggestions.

If this occurs then we only have to consider the add case. I don't
think we can avoid looking at this on a case by case basis, but most of
the add cases are providing alternate terms for the same position. In
that case the payload would simply be copied to the new Token much like
the Token's positionIncrement.
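
As a sketch of that add case, assuming the added token is an alternate term for the same position: the new token gets a position increment of 0 and reuses the payload of the token it was derived from. Tok and addAlternates are hypothetical names, not Lucene or Solr classes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical model of a filter that injects alternate terms at the same position. */
final class SamePositionAddSketch {

    static final class Tok {
        final String term;
        final int positionIncrement;
        final byte[] payload;
        Tok(String term, int positionIncrement, byte[] payload) {
            this.term = term;
            this.positionIncrement = positionIncrement;
            this.payload = payload;
        }
    }

    /** Emits each token, followed by its alternate (if any) at the same position. */
    static List<Tok> addAlternates(List<Tok> in, Map<String, String> alternates) {
        List<Tok> out = new ArrayList<>();
        for (Tok t : in) {
            out.add(t);
            String alt = alternates.get(t.term);
            if (alt != null) {
                // Increment 0 stacks the alternate on the original's position;
                // the payload reference is shared (a real filter might copy the bytes).
                out.add(new Tok(alt, 0, t.payload));
            }
        }
        return out;
    }
}
```

Filters that split one token into several tokens at different positions are the harder case, since it depends on the semantics of the payload whether every piece should inherit it.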

Thanks for your input,
Tricia



hossman_lucene at fucit

Nov 20, 2007, 10:38 AM

Post #6 of 6
Re: Payloads, Tokenizers, and Filters. Oh My!

: I apologize for cross-posting but I believe both Solr and Lucene users and
: developers should be concerned with this. I am not aware of a better way to
: reach both communities.

some of these questions strike me as being largely unrelated. if
anyone wishes to follow up on them further, let's do it in (new) separate
threads for each topic, on the specific list appropriate to the topic...

: * Do TokenFilters belong in the Solr code base at all?

Yes, in so much as any java code belongs in the Solr code base (or the
nutch code base for that matter). They are separate projects with
separate communities and separate needs -- that doesn't mean that there
isn't code in Solr which could be useful to the broader community of
lucene-java; in that case the appropriate course of action is to open a
LUCENE issue to "promote" the code up into lucene-java, and a dependent
issue in SOLR to deprecate the current code and use the newer code
instead.

as some people may be aware, there was a discussion about this sort of
thing at ApacheCon during the Lucene BOF -- some reasons this doesn't
happen as often as it seems like it should are:
* the code may have subtle dependency tendrils that make it hard to
refactor from one code base to the other.
* the tests are frequently harder to "promote" than the code (in the
case of most Solr tests that use the TestHarness, it's probably easier
to write new tests from scratch)
* when promoting the code, it's the best time to consider whether the
existing API is really the "best" API before a lot of new people start
using it (compare Solr's FunctionQuery and Lucene's CustomScoreQuery
for example)
* someone needs to care enough to follow through on the promotion.

...further discussion is best suited for java-dev since the topic is not
Solr specific (there's a lot of Nutch code out there that people have asked
about promoting as well)

: * How to deal with TokenFilters that add new Tokens to the stream?

This is specifically regarding Payloads, yes? also a pretty clear cut
java-dev discussion (and one possibly already being discussed in the
monolithic Payload API thread i haven't started reading yet).
lucene-java sets the API and the semantics ... Solr code will follow them.

: * How to patch TokenFilters and Tokenizers using the model of
: LUCENE-969 in the Solr code base and in Lucene contrib?

open SOLR issues containing patches for any Solr code that needs to be
changed, and LUCENE issues containing patches for contrib code that needs
to be changed.

: I thought it might be useful to figure out which existing TokenFilters need to
: know about Payloads. To this end I have taken an inventory of the
: TokenFilters out there. I think it is fair to categorize them by Add (A),
: Delete (D), Modify (M), Observe (O):

again: this is a straightforward lucene-java question ... once the
semantics have been worked out, then there can be a Solr specific
discussion about following them.

(which is not to say that the Solr classes/use-cases shouldn't be
considered in the discussion, just that java-dev is the right place to
have the conversation)




-Hoss

