Login | Register For Free | Help
Search for: (Advanced)

Mailing List Archive: Lucene: Java-User

Is storing 20 fields in a lucene document desirable?

 

 

Lucene java-user RSS feed   Index | Next | Previous | View Threaded


kumarlimbu at gmail

Nov 20, 2007, 3:29 AM

Post #1 of 3 (234 views)
Permalink
Is storing 20 fields in a lucene document desirable?

Our document contains a total of 23 fields in one document and we STORE all
of them in lucene index.

We have recently had some performance issues and our analysis has shown the
bottleneck to be lucene search and retrieval.

We have been thinking about reducing the number of fields per document by
removing unnecessary fields and by merging fields with similar weightings.
Will reducing the number of fields help to optimize performance?

Another issue is we are currently retrieving around 9 fields after we do a
search. Some are long text of up to 1000 words. Is it a large overhead to
retrieve long fields?

We are considering the option of separating the search and retrieve parts so
that Lucene performs the search, MYSQL stores the data. We just store the
INDEXED field and primary key in the lucene index. After searching we only
return 1 field (primary key) instead of 9 fields. This field will be used
for retrieving the actual information from the MySQL database. Will reducing
the number of fields retrieved from lucene reduce the response time or will
using MySQL database make it worse?

So our main concern is to find if retrieving fields usually takes longer
than searching or not? What does lucene spend most time doing – search or
retrieval? We are also concerned that using MySQL will have performance
issues because we will be doing I/O for MySQL as well as Lucene. We also add
around 100k documents each day and remove around the same number of
documents. Will this frequent read and write have impact on performance?

Our current index size is around 8G and contains around 3M documents.

If there are any suggestions or questions, please reply.

Thanks you all!

Regards,
Kumar

--
View this message in context: http://www.nabble.com/Is-storing-20-fields-in-a-lucene-document-desirable--tf4842930.html#a13855272
Sent from the Lucene - Java Users mailing list archive at Nabble.com.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe[at]lucene.apache.org
For additional commands, e-mail: java-user-help[at]lucene.apache.org


gsingers at apache

Nov 20, 2007, 4:21 AM

Post #2 of 3 (226 views)
Permalink
Re: Is storing 20 fields in a lucene document desirable? [In reply to]

On Nov 20, 2007, at 6:29 AM, kumarlimbu wrote:

>
> Our document contains a total of 23 fields in one document and we
> STORE all
> of them in lucene index.
>
> We have recently had some performance issues and our analysis has
> shown the
> bottleneck to be lucene search and retrieval.

Perhaps you can share your information on java-dev along with any
detailed tests, etc. so that we can see if there is anything we can
improve.

>
>
> We have been thinking about reducing the number of fields per
> document by
> removing unnecessary fields and by merging fields with similar
> weightings.
> Will reducing the number of fields help to optimize performance?

Yes

>
>
> Another issue is we are currently retrieving around 9 fields after
> we do a
> search. Some are long text of up to 1000 words. Is it a large
> overhead to
> retrieve long fields?

Yes. Are you using FieldSelector? Also, I would only STORE those
fields that you actually need to display, not all 23. Do you display
all 9 fields right away or are some only when you choose a document?
If so, try the Lazy Loading piece of FieldSelector.

>
>
> We are considering the option of separating the search and retrieve
> parts so
> that Lucene performs the search, MYSQL stores the data. We just
> store the
> INDEXED field and primary key in the lucene index. After searching
> we only
> return 1 field (primary key) instead of 9 fields. This field will be
> used
> for retrieving the actual information from the MySQL database. Will
> reducing
> the number of fields retrieved from lucene reduce the response time
> or will
> using MySQL database make it worse?

Hard to say, you will likely have to try it out.

>
>
> So our main concern is to find if retrieving fields usually takes
> longer
> than searching or not? What does lucene spend most time doing –
> search or
> retrieval? We are also concerned that using MySQL will have
> performance
> issues because we will be doing I/O for MySQL as well as Lucene. We
> also add
> around 100k documents each day and remove around the same number of
> documents. Will this frequent read and write have impact on
> performance?
>

What version of Lucene are you using? Naturally, if you are updating
things that will have an effect, but that may not be a factor. I
would check out http://wiki.apache.org/lucene-java/BasicsOfPerformance

Also, you may want to try out (not in production) the trunk version of
Lucene.

Last, but not least, do you have some sort of cache for your
Documents? Obviously, you need to have appropriate semantics for when
the underlying docs change, but using a cache makes sense, too.

-Grant


--------------------------
Grant Ingersoll
http://lucene.grantingersoll.com

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ




---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe[at]lucene.apache.org
For additional commands, e-mail: java-user-help[at]lucene.apache.org


erickerickson at gmail

Nov 20, 2007, 7:56 AM

Post #3 of 3 (224 views)
Permalink
Re: Is storing 20 fields in a lucene document desirable? [In reply to]

How are you doing your search? When you say "lucene is the
bottleneck", that encompasses a lot. You really need to pinpoint things
a bit more....

1> are you iterating over the hits object for many docs? This is bad.

2> are you using a HitCollector and reading the doc each time you get
to the collect method? This is bad.

3> Is it the *search* or the fetch that's eating the time? Time *just* the
search call (making sure it doesn't go through a HitCollector) to answer
this.

4> How many documents are you returning? 5? 5,000? the answer may
tell us a lot.

5> are you opening a new reader each time you search? this is bad.

6> Have you looked over
http://wiki.apache.org/lucene-java/ImproveSearchingSpeed?

Before you try and implement an solution, you *must* figure out what
operation is
actually taking the time. an 8G index isn't that big, so I suspect that
there's
something else going on. What kind of throughput are you seeing? What
do you need? Can you post some sample timings? Can you post some sample
search code?

I can't tell you how many times I've been *sure* I knew what the problem
was, but
in the process of analyzing the problem discovered that my real issue was
something totally different.....

Best
Erick

On Nov 20, 2007 7:21 AM, Grant Ingersoll <gsingers[at]apache.org> wrote:

>
> On Nov 20, 2007, at 6:29 AM, kumarlimbu wrote:
>
> >
> > Our document contains a total of 23 fields in one document and we
> > STORE all
> > of them in lucene index.
> >
> > We have recently had some performance issues and our analysis has
> > shown the
> > bottleneck to be lucene search and retrieval.
>
> Perhaps you can share your information on java-dev along with any
> detailed tests, etc. so that we can see if there is anything we can
> improve.
>
> >
> >
> > We have been thinking about reducing the number of fields per
> > document by
> > removing unnecessary fields and by merging fields with similar
> > weightings.
> > Will reducing the number of fields help to optimize performance?
>
> Yes
>
> >
> >
> > Another issue is we are currently retrieving around 9 fields after
> > we do a
> > search. Some are long text of up to 1000 words. Is it a large
> > overhead to
> > retrieve long fields?
>
> Yes. Are you using FieldSelector? Also, I would only STORE those
> fields that you actually need to display, not all 23. Do you display
> all 9 fields right away or are some only when you choose a document?
> If so, try the Lazy Loading piece of FieldSelector.
>
> >
> >
> > We are considering the option of separating the search and retrieve
> > parts so
> > that Lucene performs the search, MYSQL stores the data. We just
> > store the
> > INDEXED field and primary key in the lucene index. After searching
> > we only
> > return 1 field (primary key) instead of 9 fields. This field will be
> > used
> > for retrieving the actual information from the MySQL database. Will
> > reducing
> > the number of fields retrieved from lucene reduce the response time
> > or will
> > using MySQL database make it worse?
>
> Hard to say, you will likely have to try it out.
>
> >
> >
> > So our main concern is to find if retrieving fields usually takes
> > longer
> > than searching or not? What does lucene spend most time doing –
> > search or
> > retrieval? We are also concerned that using MySQL will have
> > performance
> > issues because we will be doing I/O for MySQL as well as Lucene. We
> > also add
> > around 100k documents each day and remove around the same number of
> > documents. Will this frequent read and write have impact on
> > performance?
> >
>
> What version of Lucene are you using? Naturally, if you are updating
> things that will have an effect, but that may not be a factor. I
> would check out http://wiki.apache.org/lucene-java/BasicsOfPerformance
>
> Also, you may want to try out (not in production) the trunk version of
> Lucene.
>
> Last, but not least, do you have some sort of cache for your
> Documents? Obviously, you need to have appropriate semantics for when
> the underlying docs change, but using a cache makes sense, too.
>
> -Grant
>
>
> --------------------------
> Grant Ingersoll
> http://lucene.grantingersoll.com
>
> Lucene Helpful Hints:
> http://wiki.apache.org/lucene-java/BasicsOfPerformance
> http://wiki.apache.org/lucene-java/LuceneFAQ
>
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe[at]lucene.apache.org
> For additional commands, e-mail: java-user-help[at]lucene.apache.org
>
>

Lucene java-user RSS feed   Index | Next | Previous | View Threaded
 
 


Interested in having your list archived? Contact lists@gossamer-threads.com
 
  Web Applications & Managed Hosting Powered by Gossamer Threads Inc.