
Mailing List Archive: Lucene: Java-User

large document with multiple fields performance

 

 



SGreene at metalseconomics

Sep 8, 2009, 4:57 AM

Post #1 of 5
large document with multiple fields performance

Hello,

I am new to Lucene and am building an application that requires searching
documents with many fields.

A "project" ID is stored (not_analyzed), and all matching project
IDs will be returned and used to join other data from a database.

Will it perform better to store each comment field in a
separate document with the project ID and a comment ID, or to store all
the comments for a single project in a single document with multiple
fields?

Thanks,

Steve Greene


anshumg at gmail

Sep 8, 2009, 5:46 AM

Post #2 of 5
Re: large document with multiple fields performance

Hi Stephen,
Could you clarify the requirement a little more? Do you intend to have data in
the index as:
Document{
String Comment;
String CommentId;
String ProjectId;
}

How do you intend to index it, in terms of the document structure? Is there a primary
key? What would you search on, and what would you want back as the
result?
All said and done, it's not really an overhead as long as the number of
fields is within normal bounds.


--
Anshum Gupta
Naukri Labs!
http://ai-cafe.blogspot.com

The facts expressed here belong to everybody, the opinions to me. The
distinction is yours to draw............


>


SGreene at metalseconomics

Sep 8, 2009, 6:27 AM

Post #3 of 5
RE: large document with multiple fields performance

Hi Anshum,

Thank you for your reply. I have two options I am considering.
One would be:
Document {
String projectID;
String generalComment;
String workHistoryComment;
String environmentalComment;
String claimsComment;
...
}

And the document may contain upwards of 20 comment fields.

The other option would be to normalize the data:
Document {
String projectID;
String commentType;
String comment;
}

I will need to return only the projectID for all found documents. I have
implemented a custom Collector to capture the projectID for each
document. Then it occurred to me that I might be better served by the
normalized document model. But I am wondering which method will have
better performance: possibly returning 20 documents per hit, or having
to search 20 fields per document? (This also has implications for the
query: each search term will always search all fields, which is
somewhat easier in the normalized model than creating 20 "or"
queries.)

Thanks,

Steve
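
To make the trade-off concrete, here is a minimal plain-Java sketch of the
two layouts described above. It deliberately uses no Lucene API: the class
name LayoutSketch, both method names, and the substring match (standing in
for an analyzed term query) are assumptions made for illustration only. It
suggests that either layout can yield the same set of distinct project IDs;
the difference is where the work goes. In the wide layout the query has to
cover every comment field, while in the normalized layout the collector has
to de-duplicate project IDs.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical in-memory model of the two layouts; not Lucene code.
final class LayoutSketch {

    // Option 1: one document per project, one named field per comment type.
    // Input maps projectID -> (field name -> comment text).
    static Set<String> searchWide(Map<String, Map<String, String>> docsByProject, String term) {
        Set<String> ids = new LinkedHashSet<>();
        String needle = term.toLowerCase(Locale.ROOT);
        for (Map.Entry<String, Map<String, String>> doc : docsByProject.entrySet()) {
            // Checking every field mirrors OR-ing the term across all comment fields.
            for (String text : doc.getValue().values()) {
                if (text.toLowerCase(Locale.ROOT).contains(needle)) {
                    ids.add(doc.getKey());
                    break; // one matching field is enough to report the project
                }
            }
        }
        return ids;
    }

    // Option 2: one document per comment, as {projectID, commentType, comment}.
    static Set<String> searchNormalized(List<String[]> docs, String term) {
        Set<String> ids = new LinkedHashSet<>();
        String needle = term.toLowerCase(Locale.ROOT);
        for (String[] doc : docs) {
            // Only the single "comment" field is searched.
            if (doc[2].toLowerCase(Locale.ROOT).contains(needle)) {
                ids.add(doc[0]); // the set drops repeated project IDs
            }
        }
        return ids;
    }
}
```

In this sketch a project whose term appears in several comments is matched
several times in the normalized layout but reported once, which is the
de-duplication a custom Collector would need to do there.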


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe [at] lucene
For additional commands, e-mail: java-user-help [at] lucene


anshumg at gmail

Sep 8, 2009, 10:07 AM

Post #4 of 5
Re: large document with multiple fields performance

Hey Steve,

I'd suggest you go with the 20-field (non-normalized) model. I've used much
larger models and they work just fine. There wouldn't be any point in
increasing the complexity.
Hope that clarifies things a little at least :)
--
Anshum Gupta




SGreene at metalseconomics

Sep 13, 2009, 6:48 PM

Post #5 of 5
RE: large document with multiple fields performance

Hi Anshum,

Thanks for your insight. I will stick with the 20 fields.
I realized that I had neglected to mention that, in a separate query, I
will search on the primary key plus a search term to return details about
how many hits come from each field. Is it safe to assume that this will
also not be a problem, and that implementing a custom hit collector will do
the trick?

Thanks again,

Steve
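
As a rough illustration of the per-field count described above, here is a
minimal plain-Java sketch. It does not use any Lucene API and is not how a
collector would be written: the class name FieldHitCounter, the method
countPerField, and the whitespace tokenization (standing in for a real
analyzer) are all assumptions for illustration. It takes the fields of the
one document selected by the primary key and reports how many times the
term occurs in each field.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper; models one project's document as field name -> text.
final class FieldHitCounter {

    // Returns field name -> number of occurrences of term, omitting fields with none.
    static Map<String, Integer> countPerField(Map<String, String> fields, String term) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        String needle = term.toLowerCase(Locale.ROOT);
        for (Map.Entry<String, String> field : fields.entrySet()) {
            int n = 0;
            // Lowercasing and splitting on whitespace stands in for analysis.
            for (String token : field.getValue().toLowerCase(Locale.ROOT).split("\\s+")) {
                if (token.equals(needle)) {
                    n++;
                }
            }
            if (n > 0) {
                counts.put(field.getKey(), n);
            }
        }
        return counts;
    }
}
```

With the 20-field model, the result has at most one entry per comment
field, so the breakdown stays small even for long comments.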

