
Mailing List Archive: Lucene: Java-Dev

Lazy Field Loading

 

 



gsingers at syr

Mar 29, 2006, 5:31 AM

Post #1 of 19
Lazy Field Loading

I have a base implementation of lazy field loading that I am starting to
test and wanted to run my approach by everyone to hear their thoughts.

I have, as per Doug's suggestion from a while ago, created an interface
named Fieldable that is implemented by Field and a new, private class,
owned by FieldsReader. I have introduced an "enumerated" type to the
Field class named LazyLoad (which can be YES or NO, in the same spirit
as Field.TermVector). Any place that used to take Field now takes
Fieldable. This should be completely transparent and
backward-compatible. The existing constructors of Field all assume lazy
loading to be off.
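
For readers following along, here is a rough sketch of how the pieces
described above could fit together. It is not the patch: SketchField and
FieldableDemo are made-up names, the Fieldable methods shown are a guess
at a minimal surface, and LazyLoad is written as a class with a private
constructor on the assumption that it follows the same type-safe pattern
as Field.TermVector.

```java
// Hypothetical minimal surface for the Fieldable interface described above.
interface Fieldable {
    String name();
    String stringValue();
    boolean isLazy();
}

// Type-safe "enumerated" type written as a class with a private constructor.
final class LazyLoad {
    static final LazyLoad YES = new LazyLoad("YES");
    static final LazyLoad NO = new LazyLoad("NO");

    private final String label;

    private LazyLoad(String label) {
        this.label = label;
    }

    @Override
    public String toString() {
        return label;
    }
}

// Stand-in for Field; it only carries the flag and does no lazy I/O itself.
final class SketchField implements Fieldable {
    private final String name;
    private final String value;
    private final LazyLoad lazy;

    // Existing-style constructor: lazy loading defaults to off.
    SketchField(String name, String value) {
        this(name, value, LazyLoad.NO);
    }

    SketchField(String name, String value, LazyLoad lazy) {
        this.name = name;
        this.value = value;
        this.lazy = lazy;
    }

    @Override
    public String name() {
        return name;
    }

    @Override
    public String stringValue() {
        return value;
    }

    @Override
    public boolean isLazy() {
        return lazy == LazyLoad.YES;
    }
}

class FieldableDemo {
    // Code that used to accept Field now accepts Fieldable.
    static String describe(Fieldable f) {
        return f.name() + " lazy=" + f.isLazy();
    }

    public static void main(String[] args) {
        System.out.println(describe(new SketchField("title", "Lazy Field Loading")));
        System.out.println(describe(new SketchField("body", "...", LazyLoad.YES)));
    }
}
```

Callers written against the old two-argument constructor keep working
unchanged, which is the backward-compatibility point made above.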

On creation of a Field, a user can pass in LazyLoad.YES or NO to a
constructor that takes either a String value or a byte array (it does
not apply to the Reader constructors since they do not store their
content). Indexing and writing of fields take place as normal; the only
difference is an extra bit in the written field data that marks the
field as lazy.
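
As an illustration only, the "extra bit" could be one more flag in a
per-field status byte. The flag names and bit positions below are
assumptions made for this sketch, not the values used by the real
field-writing code.

```java
// Sketch of packing per-field flags into one byte. All constants are
// hypothetical; only the idea of "one extra bit marks the field as lazy"
// comes from the description above.
class FieldFlagsSketch {
    static final int TOKENIZED = 1;      // assumed existing flag
    static final int BINARY = 1 << 1;    // assumed existing flag
    static final int LAZY = 1 << 2;      // hypothetical new bit

    // Combine the flags into the byte that would be written with the field.
    static byte encode(boolean tokenized, boolean binary, boolean lazy) {
        int bits = 0;
        if (tokenized) bits |= TOKENIZED;
        if (binary) bits |= BINARY;
        if (lazy) bits |= LAZY;
        return (byte) bits;
    }

    // Test the lazy bit when the field is read back.
    static boolean isLazy(byte bits) {
        return (bits & LAZY) != 0;
    }

    public static void main(String[] args) {
        byte bits = encode(true, false, true);
        System.out.println("bits=" + bits + " lazy=" + isLazy(bits));
    }
}
```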

On reading the field back, if it is lazy, FieldsReader does not read the
value and construct a Field. Instead it constructs a LazyField instance,
which takes the pointer into the fieldsStream and the amount of data to
read. Since this instance is a private class of FieldsReader, it keeps
access to the fieldsStream. Thus, when an application goes to access the
value of the field, we check whether it has already been loaded. If it
has not, we load it using the fieldsStream, the pointer and the length
to read.
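
A minimal sketch of the load-on-first-access idea, under stated
assumptions: SketchLazyField is a made-up name, a RandomAccessFile
stands in for the fieldsStream, and the stored value is assumed to be
UTF-8 text. The real class would live inside FieldsReader and use
Lucene's own stream type.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for the private LazyField class described above.
class SketchLazyField {
    private final RandomAccessFile stream; // shared stream owned by the reader
    private final long pointer;            // offset where the stored value starts
    private final int length;              // number of bytes to read
    private String value;                  // stays null until first access
    int loads = 0;                         // counts disk reads, for the demo only

    SketchLazyField(RandomAccessFile stream, long pointer, int length) {
        this.stream = stream;
        this.pointer = pointer;
        this.length = length;
    }

    // Load the value on first access, then serve the cached copy.
    synchronized String stringValue() {
        if (value == null) {
            byte[] buf = new byte[length];
            try {
                synchronized (stream) {    // seek and read must not interleave
                    stream.seek(pointer);
                    stream.readFully(buf);
                }
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
            value = new String(buf, StandardCharsets.UTF_8);
            loads++;
        }
        return value;
    }
}

class SketchLazyFieldDemo {
    // Writes two values back to back and returns a lazy view of the second.
    // The file handle is left open on purpose: the lazy field still needs it.
    static SketchLazyField openSecondValue() {
        try {
            File f = File.createTempFile("fields", ".bin");
            f.deleteOnExit();
            RandomAccessFile raf = new RandomAccessFile(f, "rw");
            byte[] first = "title".getBytes(StandardCharsets.UTF_8);
            byte[] second = "a large body".getBytes(StandardCharsets.UTF_8);
            raf.write(first);
            raf.write(second);
            return new SketchLazyField(raf, first.length, second.length);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        SketchLazyField field = openSecondValue();
        System.out.println(field.stringValue());
        System.out.println(field.stringValue());
        System.out.println("disk reads: " + field.loads);
    }
}
```

The sketch also shows why the lifetime question matters: if the
underlying stream is closed before the first call to stringValue(), the
read fails.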

Does anyone see any issues with this? I think it will only really pay
off on large stored fields, but have not quantified it yet. My main
concern is the semantics of the fieldsStream and whether that would be
closed behind the back of the LazyField implementation. My
understanding is that as long as the IndexReader is open, this stream
should also be open. Is that correct? What am I forgetting about?

If testing goes well, I should be able to button this up this week or
next and submit the patch.

--

Grant Ingersoll
Sr. Software Engineer
Center for Natural Language Processing
Syracuse University
School of Information Studies
335 Hinds Hall
Syracuse, NY 13244

http://www.cnlp.org
Voice: 315-443-5484
Fax: 315-443-6886


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe [at] lucene
For additional commands, e-mail: java-dev-help [at] lucene


erik at ehatchersolutions

Mar 29, 2006, 5:43 AM

Post #2 of 19
Re: Lazy Field Loading

Lazy loaded fields will be a nice addition to Lucene. I'm curious
why the flag is set at indexing time rather than it being something
that is controlled during retrieval somehow. I'm not sure what that
API would look like, but it seems it's a decision to be addressed
during searching and reading of an index rather than during indexing
itself.

Erik




gsingers at syr

Mar 29, 2006, 6:05 AM

Post #3 of 19
Re: Lazy Field Loading

Hmmm, I guess I always thought of it as a property of the field that
users would want to explicitly control. I assumed that most fields
would not be lazy and a few would be.
Now that you have backed me up a bit on it (in a good way), I think it
could just as easily be a parameter, so that any field over a
specified size would be lazily loaded. With this approach, I could see:

IndexReader.document(int docNumber, long maxFieldSizeToLoad);

and IndexReader.document(int docNum) would just call this new method
passing in some default value, say 2K or something.
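
To make the size-based idea concrete, here is a small sketch of the
decision it implies. SizeThresholdSketch and planLoading are made-up
names, and the map of field sizes stands in for whatever the reader
knows about the stored fields; only the threshold rule and the 2K
default come from the text above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of "any field over a specified size is lazily loaded".
class SizeThresholdSketch {
    // The "say 2K or something" default mentioned above.
    static final long DEFAULT_MAX_FIELD_SIZE_TO_LOAD = 2 * 1024;

    // For each stored field, report whether it would be loaded lazily.
    static Map<String, Boolean> planLoading(Map<String, Long> storedFieldSizes,
                                            long maxFieldSizeToLoad) {
        Map<String, Boolean> lazy = new LinkedHashMap<>();
        for (Map.Entry<String, Long> e : storedFieldSizes.entrySet()) {
            lazy.put(e.getKey(), e.getValue() > maxFieldSizeToLoad);
        }
        return lazy;
    }

    // The one-argument form just delegates with the default threshold.
    static Map<String, Boolean> planLoading(Map<String, Long> storedFieldSizes) {
        return planLoading(storedFieldSizes, DEFAULT_MAX_FIELD_SIZE_TO_LOAD);
    }

    public static void main(String[] args) {
        Map<String, Long> sizes = new LinkedHashMap<>();
        sizes.put("title", 20L);
        sizes.put("body", 50_000L);
        System.out.println(planLoading(sizes));
    }
}
```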

Or, we could pass in an array of field names to be lazily loaded too,
something like

IndexReader.document(int docNumber, String[] fieldNamesToLoadLazy);

The current way I have it looks something like this (with a few other
variations) for the Field constructors:

public Field(String name, String value, Store store, Index index,
             LazyLoad lazy)

and

public Field(String name, byte[] value, Store store, LazyLoad lazy)

I am happy to do either way since the underlying mechanics are pretty
similar. What do others think?

-Grant



donovan_aaron at bah

Mar 29, 2006, 6:13 AM

Post #4 of 19
RE: Lazy Field Loading

I've done a lot of work with Verity's search engine, and I like the way
they handle fields. At query time you specify the fields you want
returned from matching documents.

Aaron



gsingers at syr

Mar 29, 2006, 10:53 AM

Post #5 of 19
Re: Lazy Field Loading

Of course, another option is to make all fields lazy all the time, so
the user never even needs to think about it. We would need some strategy
for when the IndexReader gets closed, but we need that in all cases.




cutting at apache

Mar 29, 2006, 12:16 PM

Post #6 of 19
Re: Lazy Field Loading

Grant Ingersoll wrote:
> My main
> concern is the semantics of the fieldsStream and whether that would be
> closed behind the back of the LazyField implementation. My
> understanding is that as long as the IndexReader is open, this stream
> should also be open. Is that correct? What am I forgetting about?

You need to make sure that access to the stream is synchronized, so that
one thread doesn't move the file pointer while someone else is reading.
You could use a cloned stream in a ThreadLocal to avoid contention.
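
A rough sketch of the per-thread stream idea, with assumptions called
out: PerThreadStreamSketch is a made-up name, and opening a separate
read-only RandomAccessFile per thread stands in for cloning the stream
(each handle has its own file pointer, which is the property that
matters). A real version would also have to close these handles when
the reader is closed.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.atomic.AtomicBoolean;

// Each thread gets its own handle, so no thread can move another's file pointer.
class PerThreadStreamSketch {
    private final ThreadLocal<RandomAccessFile> localStream;

    PerThreadStreamSketch(File file) {
        this.localStream = ThreadLocal.withInitial(() -> {
            try {
                return new RandomAccessFile(file, "r");
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        });
    }

    // Seek and read on the calling thread's private handle; no locking needed.
    String read(long pointer, int length) {
        try {
            RandomAccessFile in = localStream.get();
            byte[] buf = new byte[length];
            in.seek(pointer);
            in.readFully(buf);
            return new String(buf, StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}

class PerThreadStreamDemo {
    // Creates a temp file holding two 4-byte values back to back.
    static File writeSample() {
        try {
            File f = File.createTempFile("fields", ".bin");
            f.deleteOnExit();
            try (RandomAccessFile out = new RandomAccessFile(f, "rw")) {
                out.write("aaaabbbb".getBytes(StandardCharsets.UTF_8));
            }
            return f;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Two threads read different regions repeatedly; true if no read was corrupted.
    static boolean readConcurrently(PerThreadStreamSketch streams) {
        AtomicBoolean ok = new AtomicBoolean(true);
        Thread a = new Thread(() -> {
            for (int i = 0; i < 1000; i++) {
                if (!streams.read(0, 4).equals("aaaa")) ok.set(false);
            }
        });
        Thread b = new Thread(() -> {
            for (int i = 0; i < 1000; i++) {
                if (!streams.read(4, 4).equals("bbbb")) ok.set(false);
            }
        });
        a.start();
        b.start();
        try {
            a.join();
            b.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
        return ok.get();
    }

    public static void main(String[] args) {
        PerThreadStreamSketch streams = new PerThreadStreamSketch(writeSample());
        System.out.println("reads consistent: " + readConcurrently(streams));
    }
}
```

The alternative is a single shared stream with the seek and the read
wrapped in one synchronized block, which is simpler but makes concurrent
readers wait on each other.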

Doug



gsingers at syr

Mar 31, 2006, 4:21 AM

Post #7 of 19
Re: Lazy Field Loading

OK, how about a vote on this.

I see several ways of implementing the front end to this:

1. Declarative: On construction of a Document, you declare the Field to
be Lazy.

2. Implicit: All fields are Lazy.

3. By Field size. Pass into IndexReader.document() the size of the
field above which it will be lazily loaded. A default size can also be
used.

4. By Field name. Pass in the names of the Fields that you want loaded
lazily.

Thanks,
Grant



markharw00d at yahoo

Mar 31, 2006, 5:10 AM

Post #8 of 19
Re: Lazy Field Loading

I'd prefer option 4.
Users should expect to provide some form of guidance
to the engine about how they are going to access the
data if it is expected to be retrieved efficiently.

Preferably this choice of field loading policy should
NOT be "baked in" at index time because index access
patterns can vary (ruling out options 1 and 3).

I think option 4, the reader.document(int docid,
String[] fields) approach, is a reasonable option and is
analogous to the "select a,b" part of a SQL statement.

It seems to be the most flexible and is not likely to
be seen as an unnecessary burden by end users familiar
with SQL. We should also have a "select *" equivalent
for those uninterested in being selective.
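
A small sketch of the "select a,b" versus "select *" reading of
option 4. Everything here is illustrative: FieldSelectionSketch is a
made-up name and an in-memory map stands in for one stored document.
Note that this follows the "return only the named fields" reading,
whereas option 4 as originally listed passes the names of the fields to
load lazily.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Stand-in for one stored document: field name -> stored value.
class FieldSelectionSketch {
    private final Map<String, String> stored = new LinkedHashMap<>();

    FieldSelectionSketch add(String name, String value) {
        stored.put(name, value);
        return this;
    }

    // "select a, b": only the named fields are returned, in stored order.
    Map<String, String> document(String[] fields) {
        Set<String> wanted = new HashSet<>(Arrays.asList(fields));
        Map<String, String> result = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : stored.entrySet()) {
            if (wanted.contains(e.getKey())) {
                result.put(e.getKey(), e.getValue());
            }
        }
        return result;
    }

    // "select *": every stored field is returned.
    Map<String, String> document() {
        return new LinkedHashMap<>(stored);
    }

    public static void main(String[] args) {
        FieldSelectionSketch doc = new FieldSelectionSketch()
                .add("url", "http://example.org")
                .add("contents", "a very large body")
                .add("descr", "short description");
        System.out.println(doc.document(new String[] {"url", "descr"}).keySet());
        System.out.println(doc.document().keySet());
    }
}
```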

I suspect your option "2" (all fields are implicitly
lazy) could have a hard time second-guessing how
people are accessing the docs?


Cheers
Mark




erik at ehatchersolutions

Mar 31, 2006, 5:29 AM

Post #9 of 19
Re: Lazy Field Loading

I prefer option #4 myself. Also note that a similar issue with
patches exists within JIRA:

<https://issues.apache.org:443/jira/browse/LUCENE-509>

Erik




gsingers at syr

Mar 31, 2006, 5:30 AM

Post #10 of 19
Re: Lazy Field Loading

mark harwood wrote:
> Preferably this choice of field loading policy should
> NOT be "baked in" at index time because index access
> patterns can vary (ruling out options 1 and 3)
>

I don't think option 3 is baked in at indexing time. I think it would
look like:

IndexReader.document(int docNumber, long maxFieldSizeToLoad);

and IndexReader.document(int docNum) would just call this new method,
passing in some default value, say 2K or something.



yseeley at gmail

Mar 31, 2006, 6:24 AM

Post #11 of 19
Re: Lazy Field Loading

On 3/31/06, Erik Hatcher <erik [at] ehatchersolutions> wrote:
> I prefer option #4 myself. Also note that a similar issue with
> patches exists within JIRA:
>
> <https://issues.apache.org:443/jira/browse/LUCENE-509>

Yes, I'd personally find a way to retrieve just fields x,y, and z more
useful than lazy loading.
It seems like lazy loading could be useful if you do something with
field values that is conditional on the value of other fields... a
case I haven't run into.

-Yonik
http://incubator.apache.org/solr Solr, The Open Source Lucene Search Server



markharw00d at yahoo

Mar 31, 2006, 6:31 AM

Post #12 of 19
Re: Lazy Field Loading

> I don't think option 3 is baked in at indexing time.

Sorry, I misread it. Yes, that is another option.

So if options 3 and 4 are about search-time selection
(based on size and fieldname respectively) can they be
generalized into a more wide-reaching retrieval API?

You can imagine a high-level retrieval language like
this:

Select url, length(contents), substring(descr,0,50)

...where we have 3 items being returned. The first item
(url) is a straight copy of the original field data,
the second is the size in bytes of the "contents"
field and the third is a summary of the "descr" field
(in this case a simple substring, but conceivably it could
be a more sophisticated summarizer, e.g. the highlighter).

If you think of each of these as retrieval functions
we have an API that looks something like this:

IndexReader.document(int doc,
                     RetrieveFunction[] retrievers);

interface RetrieveFunction {
    Object getValue(FieldMetaData f);
}

interface FieldMetaData {
    String getFieldName();
    int getSize();
    InputStream getInputStream();
}

The reader calls the retrievers with a FieldMetaData
object for each field and the data is only loaded from
disk if a RetrieveFunction "bites" and asks for the
InputStream to get the content for a field.
You can imagine the different RetrieveFunction
implementations could then not only choose which
fields are returned but also how much content and in
what format.
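
To make the idea a bit more tangible, here is a self-contained sketch
of the three retrievers from the "Select" example. The two interfaces
are the ones proposed above; InMemoryField, RetrieveSketch and the
helper names are invented for the example, an in-memory byte array
stands in for the stored data on disk, and returning null is assumed to
mean "this retriever is not interested in this field".

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

interface RetrieveFunction {
    Object getValue(FieldMetaData f); // null means "not interested" (assumption)
}

interface FieldMetaData {
    String getFieldName();
    int getSize();
    InputStream getInputStream();
}

// In-memory stand-in for a stored field; counts how often its content is opened.
class InMemoryField implements FieldMetaData {
    private final String name;
    private final byte[] bytes;
    int opened = 0;

    InMemoryField(String name, String value) {
        this.name = name;
        this.bytes = value.getBytes(StandardCharsets.UTF_8);
    }

    @Override public String getFieldName() { return name; }
    @Override public int getSize() { return bytes.length; }
    @Override public InputStream getInputStream() {
        opened++;
        return new ByteArrayInputStream(bytes);
    }
}

class RetrieveSketch {
    // "url": straight copy of the field data.
    static RetrieveFunction copy(String field) {
        return f -> f.getFieldName().equals(field) ? readString(f, f.getSize()) : null;
    }

    // "length(contents)": answered from metadata, content is never opened.
    static RetrieveFunction length(String field) {
        return f -> f.getFieldName().equals(field) ? Integer.valueOf(f.getSize()) : null;
    }

    // "substring(descr,0,n)": reads only the first n bytes (exact for ASCII text only).
    static RetrieveFunction prefix(String field, int n) {
        return f -> f.getFieldName().equals(field)
                ? readString(f, Math.min(n, f.getSize())) : null;
    }

    static String readString(FieldMetaData f, int byteCount) {
        try (InputStream in = f.getInputStream()) {
            byte[] buf = new byte[byteCount];
            int off = 0;
            while (off < byteCount) {
                int n = in.read(buf, off, byteCount - off);
                if (n < 0) break;
                off += n;
            }
            return new String(buf, 0, off, StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Stand-in for IndexReader.document(int doc, RetrieveFunction[] retrievers).
    static List<Object> document(List<? extends FieldMetaData> fields,
                                 RetrieveFunction[] retrievers) {
        List<Object> out = new ArrayList<>();
        for (RetrieveFunction r : retrievers) {
            for (FieldMetaData f : fields) {
                Object v = r.getValue(f);
                if (v != null) out.add(v);
            }
        }
        return out;
    }
}
```

In this sketch the "contents" field is never opened, because the length
retriever is satisfied by getSize() alone.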

I'm not sure if there is a sufficiently long list of
different retriever functions to make this a
useful approach.


Cheers
Mark



ab at getopt

Mar 31, 2006, 6:48 AM

Post #13 of 19
Re: Lazy Field Loading

Yonik Seeley wrote:
> On 3/31/06, Erik Hatcher <erik [at] ehatchersolutions> wrote:
>
>> I prefer option #4 myself. Also note that a similar issue with
>> patches exists within JIRA:
>>
>> <https://issues.apache.org:443/jira/browse/LUCENE-509>
>>
>
> Yes, I'd personally find a way to retrieve just fields x,y, and z more
> useful than lazy loading.
> It seems like lazy loading could be useful if you do something with
> field values that is conditional on the value of other fields... a
> case I haven't run into.
>

Use cases in Nutch would also indicate that #4 is the most convenient
option, and rule out options #1 and #3 (and perhaps #2 due to
efficiency). Various fields from Lucene indexes are used for e.g.
sorting, where the sort field is selected by users at run time. Some
field values help with Hits presentation, while other values should only
be retrieved when requesting all of a hit's "metadata" - again, using the
same index. So, option #4 would be really useful.

--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com





gsingers at syr

Mar 31, 2006, 12:42 PM

Post #14 of 19 (3210 views)
Permalink
Re: Lazy Field Loading [In reply to]

OK, #4 it is. I will most likely submit a patch this weekend.


--

Grant Ingersoll
Sr. Software Engineer
Center for Natural Language Processing
Syracuse University
School of Information Studies
335 Hinds Hall
Syracuse, NY 13244

http://www.cnlp.org
Voice: 315-443-5484
Fax: 315-443-6886




yseeley at gmail

Apr 4, 2006, 1:16 PM

Post #15 of 19 (3190 views)
Permalink
Re: Lazy Field Loading [In reply to]

On 3/31/06, Yonik Seeley <yseeley [at] gmail> wrote:
> > <https://issues.apache.org:443/jira/browse/LUCENE-509>
>
> Yes, I'd personally find a way to retrieve just fields x,y, and z more
> useful than lazy loading.

Thinking a little more, it would be nice if the field reading API was
opened up a little more so that multiple things could be done... even
construct different field/document objects (say a document
implementation that indexed the fields, etc).
That could be used to implement either lazy field loading, or loading
of specific fields.

Lazy loading alone doesn't really address LUCENE-509.

I was thinking something along the lines of

// an IndexReader would call FieldReader methods for each stored field
abstract class FieldReader {
  // users return true if this field should be read
  abstract boolean readField(int fieldnum, String fieldName);

  // returns true to keep reading the next field
  abstract boolean stringField(int fieldnum, byte[] utf8);
  // OR
  // abstract boolean stringField(int fieldnum, String str);

  // returns true to keep reading the next field
  abstract boolean binaryField(int fieldnum, byte[] data);
}

abstract class IndexReader {
  // expert level API
  abstract void readFields(int doc, FieldReader reader);
}

Just brainstorming so far...
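
As a rough illustration of how such a callback API could serve the "just fields x, y and z" case, here is a minimal Java sketch. It uses the String variant of stringField. SelectedFieldsReader and StoredDoc are hypothetical names, the readFields helper is a stand-in for the proposed IndexReader.readFields, and using the position of a field within the document as its fieldnum is an assumption of the sketch.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Callback class as brainstormed above (String variant of stringField).
abstract class FieldReader {
    abstract boolean readField(int fieldnum, String fieldName);
    abstract boolean stringField(int fieldnum, String str);
    abstract boolean binaryField(int fieldnum, byte[] data);
}

// Hypothetical reader that keeps only the requested string fields and
// stops the scan as soon as it has seen all of them.
final class SelectedFieldsReader extends FieldReader {
    private final Set<String> wanted;
    private final Map<Integer, String> names = new HashMap<>();
    final Map<String, String> values = new LinkedHashMap<>();

    SelectedFieldsReader(Set<String> wanted) { this.wanted = wanted; }

    @Override
    boolean readField(int fieldnum, String fieldName) {
        if (!wanted.contains(fieldName)) return false; // value is skipped
        names.put(fieldnum, fieldName);                // remember name for the value callback
        return true;
    }

    @Override
    boolean stringField(int fieldnum, String str) {
        values.put(names.get(fieldnum), str);
        return values.size() < wanted.size();          // false = stop reading
    }

    @Override
    boolean binaryField(int fieldnum, byte[] data) {
        return true;                                   // binary values ignored in this sketch
    }
}

final class StoredDoc {
    // Stand-in for IndexReader.readFields(doc, reader). names[i] and values[i]
    // describe stored field i. Returns how many values were actually decoded.
    static int readFields(String[] names, String[] values, FieldReader reader) {
        int decoded = 0;
        for (int i = 0; i < names.length; i++) {
            if (!reader.readField(i, names[i])) continue; // skip without decoding
            decoded++;
            if (!reader.stringField(i, values[i])) break; // reader has what it needs
        }
        return decoded;
    }
}
```

With "url" and "title" requested from a document whose fields are url, contents, title, descr, the sketch decodes two values and never visits descr. No Field or Document objects are created unless the reader chooses to build them.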

-Yonik
http://incubator.apache.org/solr Solr, The Open Source Lucene Search Server



gsingers at syr

Apr 4, 2006, 1:40 PM

Post #16 of 19 (3172 views)
Permalink
Re: Lazy Field Loading [In reply to]

You're right, more flexibility is needed, but in my mind it goes beyond
just field loading. I think this is what Doug was getting at (at least
partially) with http://wiki.apache.org/jakarta-lucene/Lucene2Whiteboard
#12. Although that item focuses on indexing, I think it should be
considered for searching too. I am not sure we should just continue
adding more and more methods onto IndexReader. I think the 2.x move
gives us an opportunity to refactor some of the things we think we can
make better.

I am not sure you need 509 when you have Lazy loading. In my mind, you
have the best of both worlds. You can get all the meta-info about all
the stored fields on the Document w/o the penalty of loading the actual
data.

My use case is below (my guess is this is quite common).

Run a search, get back your hits and display summary information on the
hits (i.e. the "small" fields). The user picks the Hit they want to see
more info on, and you go display the full document, including, most
likely, the info in the really large stored fields (i.e. the original
document). To date, I have been storing this info elsewhere b/c of the
loading penalty. With lazy loading, I don't need to do this. I can just
defer loading until the second level access is needed and I never load
it if the user doesn't ask for it.
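
A minimal Java sketch of the deferred access described here, assuming a generic holder whose loader runs only on the first get(). LazyValue is a hypothetical name and is not the LazyField class from the patch; in a real implementation the loader would read from the fields stream.

```java
import java.util.function.Supplier;

// Hypothetical lazy holder: the loader runs on first access only,
// and its result is cached for later calls.
final class LazyValue<T> {
    private Supplier<T> loader;
    private T value;
    private boolean loaded;

    LazyValue(Supplier<T> loader) { this.loader = loader; }

    synchronized T get() {
        if (!loaded) {
            value = loader.get(); // the expensive read happens here
            loaded = true;
            loader = null;        // the loader is no longer needed
        }
        return value;
    }

    synchronized boolean isLoaded() { return loaded; }
}
```

The summary page would only touch the small fields, so the large field's loader never runs unless the user opens the full document.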

In the case where you only get a few smaller fields, you have to go back
and get the document again when you want to display the contents of the
large field.

Of course, there are several other use cases where you may only want
certain fields, but I don't think there is much cost associated with
loading small fields, just the large ones, so you can just make them lazy.






yseeley at gmail

Apr 4, 2006, 2:20 PM

Post #17 of 19 (3186 views)
Permalink
Re: Lazy Field Loading [In reply to]

On 4/4/06, Grant Ingersoll <gsingers [at] syr> wrote:
> I am not sure you need 509 when you have Lazy loading.

It would be nice to avoid creating a Field object at all... we have
some crazy documents with more than 1000 fields :-) I think the Field
object itself takes up more room than the data.

For my use cases, specifying which fields should be lazily loaded
doesn't work well... I know which fields I want, not which ones I
don't.

> My use case is below (my guess is this is quite common).
>
> Run a search, get back your hits and display summary information on the
> hits (i.e. the "small" fields). User picks the Hit they want to see
> more info on, go display the full document

It seems like the only way this can work is if you keep the index
searcher open and cache the Hits object that the user used. How long
do you keep that searcher open waiting for the user to do something?
I guess it could work as long as you have logic to re-execute the
query if the searcher changes...

> , including, most likely, the
> info in the really large stored fields (i.e the original document). To
> date, I have been storing this info elsewhere b/c of the loading
> penalty. With lazy loading, I don't need to do this. I can just defer
> loading until the second level access is needed and I never load it if
> the user doesn't ask for it.

Actually, for really large text fields, I can see that you wouldn't
want lucene to re-parse the fields anyway, so I agree that lazy
loading helps there.

> In the case where you only get a few smaller fields, you have to go back
> and get the document again when you want to display the contents of the
> large field.
>
> Of course, there are several other use cases where you may only want
> certain fields, but I don't think there is much cost associated with
> loading small fields, just the large ones, so you can just make them lazy.

Part of the cost is iterating through all the fields of the Document
looking for the one or two you want.

-Yonik






gsingers at syr

Apr 4, 2006, 2:48 PM

Post #18 of 19 (3182 views)
Permalink
Re: Lazy Field Loading [In reply to]

Yonik Seeley wrote:
> On 4/4/06, Grant Ingersoll <gsingers [at] syr> wrote:
>
>> I am not sure you need 509 when you have Lazy loading.
>>
>
> It would be nice to avoid creating a Field object at all... we have
> some crazy documents with more than 1000 fields :-) I think the Field
> object itself takes up more room than the data.
>
> For my usecases, specifying which fields should be lazily loaded
> doesn't work well... I know which fields I want, not which ones I
> don't.
>
>
True, true. Looking at the code, I don't think it is that hard to
do. As 509 states, the main issue is that you still need to read in all
the Fields in a document.

Mark Harwood had an interesting post earlier on this same thread about
some other possibilities for interfaces.


>> My use case is below (my guess is this is quite common).
>>
>> Run a search, get back your hits and display summary information on the
>> hits (i.e. the "small" fields). User picks the Hit they want to see
>> more info on, go display the full document
>>
>
> It seems like the only way this can work is if you keep the index
> searcher open and cache the Hits object that the user used. How long
> do you keep that searcher open waiting for the user to do something?
> I guess it could work as long as you have logic to re-execute the
> query if the searcher changes...
>

Yeah, we aren't updating a lot, so we cache the searchers. If you
followed the other thread I have going on the "Semantics of
IndexInput...", Doug and I discussed that accessing the stream becomes
undefined after the stream is closed. So, while it does still work to
load in some cases, it isn't guaranteed and any application would need
to be able to handle this.
>
>> , including, most likely, the
>> info in the really large stored fields (i.e the original document). To
>> date, I have been storing this info elsewhere b/c of the loading
>> penalty. With lazy loading, I don't need to do this. I can just defer
>> loading until the second level access is needed and I never load it if
>> the user doesn't ask for it.
>>
>
> Actually, for really large text fields, I can see that you wouldn't
> want lucene to re-parse the fields anyway, so I agree that lazy
> loading helps there.
>
>
>> In the case where you only get a few smaller fields, you have to go back
>> and get the document again when you want to display the contents of the
>> large field.
>>
>> Of course, there are several other use cases where you may only want
>> certain fields, but I don't think there is much cost associated with
>> loading small fields, just the large ones, so you can just make them lazy.
>>
>
> Part of the cost is iterating through all the fields of the Document
> looking for the one or two you want.
>
>

Yeah, not sure if there is a good solution to this. Maybe the file
formats could be altered so that all the meta info about a field is
stored up front, with the field data following somehow. This would at
least speed it up. One of the things I think both SOLR and what we call
IR Tools at CNLP (see my ApacheCon talk) do is provide better access to
the metadata about fields/indexes, etc. It is hard, in Lucene, to know
what fields belong to what documents and how they are indexed. You must
save this information in your application, even though most, if not
all, of it is already in Lucene in some form.

I will take a crack at this sometime later and see if I can implement
some of the ideas we have discussed.

As I see it, we have a few goals:
1. Retrieve only the fields someone wants
2. Retrieve all fields, but leave some to be lazily loaded
3. Provide SQL-like functionality (as Mark suggested) [a bit harder and
more involved?]




yseeley at gmail

Apr 4, 2006, 3:07 PM

Post #19 of 19 (3184 views)
Permalink
Re: Lazy Field Loading [In reply to]

On 4/4/06, Grant Ingersoll <gsingers [at] syr> wrote:
> As I see it, we have a few goals:
> 1. Retrieve only the fields someone wants
> 2. Retrieve all fields, but leave some to be lazily loaded
> 3. Provide SQL-like functionality (as Mark suggested) [a bit harder and
> more involved?]

/** expert
* fields==null means all, fields=={} means none
* lazyFields==null means all, lazyFields=={} means none
*/
Document document(int doc, Set<String> fields, Set<String> lazyFields);

OR

something like the FieldSelector interface Doug mentioned in LUCENE-509
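
To spell out the null / empty-set convention from the comment above, here is a small Java sketch of the per-field decision. FieldAction and FieldChoice are hypothetical names, and letting `fields` take precedence over `lazyFields` (a field that is not selected is skipped even if it is listed as lazy) is an assumption; the post does not say how the two sets interact.

```java
import java.util.Set;

// Hypothetical outcome for one stored field.
enum FieldAction { LOAD, LAZY, SKIP }

final class FieldChoice {
    // null means "all fields", an empty set means "no fields".
    static boolean matches(Set<String> set, String name) {
        return set == null || set.contains(name);
    }

    // Assumption: selection is decided first, laziness second.
    static FieldAction choose(String name, Set<String> fields, Set<String> lazyFields) {
        if (!matches(fields, name)) return FieldAction.SKIP;
        return matches(lazyFields, name) ? FieldAction.LAZY : FieldAction.LOAD;
    }
}
```

Under this reading, document(doc, null, Collections.emptySet()) would behave like today's eager document(doc), while document(doc, null, null) would return every field lazily.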


-Yonik
