
Mailing List Archive: Lucene: Java-User

What is the best way to handle the primary key case during lucene indexing

 

 



java8964 at hotmail

Nov 16, 2009, 9:15 AM

Post #1 of 8 (860 views)
What is the best way to handle the primary key case during lucene indexing

Hi,

In our application, users can define a primary key for their documents. We are using Lucene 2.9.
When we index data coming from the client and the metadata defines a primary key,
we have to do a search/update for every row based on that primary key.

Here are our current problems:

1) If the metadata coming from the client defines a primary key (which can consist of one or multiple fields),
then for the data supplied by the client we have to make sure that a later row overrides an earlier row if they have the same primary key.
2) To do that, we first have to loop through the data to check whether any later row contains the same PK as an earlier row, so we build a map in memory in which the latest row replaces the earlier one.
This is a very expensive operation.
3) Even then, for every row that survives the filtering above, we still have to search the current index to see whether any document with the same PK exists, so that we can remove it before we add the new data to the index.

I would like to know whether anyone has had a similar PK requirement with Lucene. What is the best way to index data in this case?

First, is it possible to eliminate step 2 above?
The problem with Lucene is that when we add a document to the index, we can NOT search for it before committing.
But we commit only once, when the whole data file is finished. So we have to loop through the data once to check whether any rows in the data file share the same PK.
I am wondering whether the index writer can, before it commits anything, merge the PK data as new documents are added. What I mean is: if a previously added document already has the same PK, just remove it and let the newly added document take its place. If we can do this, the whole pre-checking step can be removed.

Second, for step 3 above, if searching the existing index is NOT avoidable, what is the fastest way to search by PK? Of course we have already indexed all the PK fields. When we add new data, we have to search the existing index by the PK fields for every row to see whether it exists. If it does, we remove it and add the new one.
We construct the query from the PK fields at run time, then search row by row. This is also very bad for indexing performance.

Here is what I am thinking:
1) Can I use the Indexreader.term(terms)? I heard it is much faster than searching with a query. Is that right?
2) Currently we search row by row. Should I do it in batches? For example, I would combine 100 PK searches into one search using a Boolean query, so one search would return all the documents in the index matching these 100 PKs. Then I could remove them from the index using the result set. That way I would need only 1/100 as many search requests as before. In theory this should be much faster than searching row by row.


Please let me know if you have any feedback. If you have ever dealt with PK support, please share your thoughts and experience.

Thanks for your kind help.
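
Step 2 in the post above (keeping only the last row for each primary key within one data file) can be sketched roughly as follows. This is a minimal illustration in plain Java, not code from the poster's application: the class name LastRowWins, the row layout (String[]) and the pkColumn parameter are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: rows are modelled as String[] and the PK is one column.
class LastRowWins {

    /** Returns one row per primary key; a later row replaces an earlier one. */
    static List<String[]> dedupe(List<String[]> rows, int pkColumn) {
        // LinkedHashMap keeps the position of the FIRST appearance of each key,
        // while put() overwrites the stored row with the latest one.
        Map<String, String[]> latest = new LinkedHashMap<>();
        for (String[] row : rows) {
            latest.put(row[pkColumn], row);
        }
        return new ArrayList<>(latest.values());
    }

    public static void main(String[] args) {
        List<String[]> rows = Arrays.asList(
                new String[] {"1", "a"},
                new String[] {"2", "b"},
                new String[] {"1", "c"});   // same PK as the first row
        for (String[] row : dedupe(rows, 0)) {
            System.out.println(row[0] + " -> " + row[1]);
        }
        // prints:
        // 1 -> c
        // 2 -> b
    }
}
```

The map holds one entry per distinct key, so memory grows with the number of distinct PKs in the file, which is the cost the post describes as expensive.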



jake.mannix at gmail

Nov 16, 2009, 9:44 AM

Post #2 of 8 (814 views)
Re: What is the best way to handle the primary key case during lucene indexing [In reply to]

The usual way to do this is to use:

IndexWriter.updateDocument(Term, Document)

This method deletes all documents with the given Term in it (this would be
your primary key), and then adds the Document you want to add. This is the
traditional way to do updates, and it is fast.

-jake
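
The delete-then-add behaviour described in this reply can be illustrated with a toy in-memory model. To be clear, this is not Lucene code: ToyIndex, Doc and bodyFor are hypothetical names invented for the sketch, and the model only mimics the semantics stated above (remove every document carrying the key term, then add the new one).

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Toy stand-in for an index; NOT the Lucene API, just the semantics described above.
class ToyIndex {

    static final class Doc {
        final String pk;
        final String body;
        Doc(String pk, String body) { this.pk = pk; this.body = body; }
    }

    private final List<Doc> docs = new ArrayList<>();

    /** Deletes all documents with the given key, then adds the new document. */
    void updateDocument(String pk, String body) {
        for (Iterator<Doc> it = docs.iterator(); it.hasNext();) {
            if (it.next().pk.equals(pk)) {
                it.remove();
            }
        }
        docs.add(new Doc(pk, body));
    }

    int size() { return docs.size(); }

    /** Returns the body stored for a key, or null if the key is absent. */
    String bodyFor(String pk) {
        for (Doc d : docs) {
            if (d.pk.equals(pk)) return d.body;
        }
        return null;
    }

    public static void main(String[] args) {
        ToyIndex idx = new ToyIndex();
        idx.updateDocument("42", "first");
        idx.updateDocument("42", "second");   // replaces the earlier document
        System.out.println(idx.size() + " " + idx.bodyFor("42"));   // prints "1 second"
    }
}
```

With the real IndexWriter the corresponding call would be updateDocument(new Term("pk", value), doc), where "pk" is an assumed field name and the field is indexed un-tokenized. The delete should also cover documents added earlier in the same uncommitted session, which would make the pre-scan in step 2 of the original post unnecessary, but that is worth confirming with a small test against 2.9.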





erickerickson at gmail

Nov 16, 2009, 9:45 AM

Post #3 of 8 (817 views)
Re: What is the best way to handle the primary key case during lucene indexing [In reply to]

What is the form of the unique key? I'm a bit confused here by your comment:
"which can contain one or multi fields".

But it seems like IndexWriter.deleteDocuments should work here. It's easy
if your PKs are single terms; there's even a deleteDocuments(Term[]) form.
But this really *requires* that your PKs are single terms in a field. If
your PKs are some sort of composite field, perhaps iw.deleteDocuments(Query[])
would help, where each query is enough to uniquely identify your document.

Best
Erick



erickerickson at gmail

Nov 16, 2009, 9:46 AM

Post #4 of 8 (821 views)
Re: What is the best way to handle the primary key case during lucene indexing [In reply to]

Sorry, forgot to add "then re-add the documents in question".



java8964 at hotmail

Nov 16, 2009, 11:07 AM

Post #5 of 8 (812 views)
RE: What is the best way to handle the primary key case during lucene indexing [In reply to]

What I mean is that for one index, the client can define multiple fields in the index as the primary key (a composite key).


java8964 at hotmail

Nov 16, 2009, 11:09 AM

Post #6 of 8 (806 views)
RE: What is the best way to handle the primary key case during lucene indexing [In reply to]

But can IndexWriter.updateDocument(Term, Document) handle the composite key case?

If my primary key contains field1 and field2, can I use one Term to include both field1 and field2?

Thanks



jake.mannix at gmail

Nov 16, 2009, 12:00 PM

Post #7 of 8 (809 views)
Re: What is the best way to handle the primary key case during lucene indexing [In reply to]

You will want to have one Lucene field which contains this composite key.
It could be the un-tokenized concatenation of all of the subkeys, for
example; then one Term would hold the full composite key, and the
updateDocument technique would work fine.

-jake
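
One way to build such a concatenated key is sketched below. The class name CompositeKey, the '|' separator and the backslash escaping are assumptions made for the example and are not something the thread prescribes; the escaping is there so that subkeys which themselves contain the separator do not collide after concatenation.

```java
// Hypothetical helper: joins the subkeys of a composite PK into one string.
class CompositeKey {

    private static final char SEP = '|';
    private static final char ESC = '\\';

    /** Joins subkeys with SEP, escaping any SEP or ESC inside a subkey. */
    static String join(String... subkeys) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < subkeys.length; i++) {
            if (i > 0) {
                sb.append(SEP);
            }
            for (char c : subkeys[i].toCharArray()) {
                if (c == SEP || c == ESC) {
                    sb.append(ESC);   // mark the next character as literal
                }
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(join("2009", "42"));   // prints 2009|42
        // Without escaping, ("a|b", "c") and ("a", "b|c") would both become a|b|c.
        System.out.println(join("a|b", "c").equals(join("a", "b|c")));   // prints false
    }
}
```

The resulting string would then be stored in a single un-tokenized field (Field.Index.NOT_ANALYZED in Lucene 2.9) and wrapped in one Term for updateDocument. The subkeys have to be passed in the same order every time, otherwise the same logical key yields different terms.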



erickerickson at gmail

Nov 16, 2009, 12:19 PM

Post #8 of 8 (814 views)
Re: What is the best way to handle the primary key case during lucene indexing [In reply to]

From your original e-mail "if the metadata contains the primary key defined,
we have to do the search/update for every row based on the primary key".

Jake and I are both assuming that you're using primary key in the database
sense. That is, there is exactly one document in the index with that primary
key, composite or not. Your statement above seems to indicate otherwise,
in which case you should pretty much disregard anything we've said and
you're stuck running the query that assembles the list of hits and updating
each one.

If you really mean a primary key, Jake's suggestion is certainly the
easiest. If you absolutely *can't* make a single-term primary key for
your index, you could use the iw.deleteDocuments(Query) form and re-add
the document.


Best
Erick

On Mon, Nov 16, 2009 at 3:00 PM, Jake Mannix <jake.mannix [at] gmail> wrote:

> You will want to have one Lucene field which contains this composite key.
> It could be the un-tokenized concatenation of all of the subkeys, for
> example; then one Term would hold the full composite key, and the
> updateDocument technique would work fine.
>
> -jake
>
> On Mon, Nov 16, 2009 at 11:09 AM, java8964 java8964 <java8964 [at] hotmail> wrote:
>
> >
> > But can IndexWriter.updateDocument(Term, Document) handle the composite
> > key case?
> >
> > If my primary key contains field1 and field2, can I use one Term to
> > include both field1 and field2?
> >
> > Thanks
> >
> > > Date: Mon, 16 Nov 2009 09:44:35 -0800
> > > Subject: Re: What is the best way to handle the primary key case during
> > > lucene indexing
> > > From: jake.mannix [at] gmail
> > > To: java-user [at] lucene
> > >
> > > The usual way to do this is to use:
> > >
> > > IndexWriter.updateDocument(Term, Document)
> > >
> > > This method deletes all documents with the given Term in it (this would
> > > be your primary key), and then adds the Document you want to add. This
> > > is the traditional way to do updates, and it is fast.
> > >
> > > -jake
> > >
>
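
The delete-then-add behaviour described for updateDocument in the quoted reply above can be pictured with a toy in-memory model. This is not Lucene code: `ToyIndex`, `Doc` and `bodyOf` are hypothetical names, and a real index does not scan a list. The sketch only shows why pushing every row through an update-by-key call leaves the last row for each key, in arrival order, without a separate pass over the batch.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

/** Toy stand-in for an index. NOT Lucene code; all names are hypothetical. */
final class ToyIndex {
    /** A "document" here is just a primary key plus a payload. */
    static final class Doc {
        final String pk;
        final String body;

        Doc(String pk, String body) {
            this.pk = pk;
            this.body = body;
        }
    }

    private final List<Doc> docs = new ArrayList<Doc>();

    /** Deletes every document carrying this key, then adds the new one. */
    void updateDocument(String pk, Doc doc) {
        for (Iterator<Doc> it = docs.iterator(); it.hasNext();) {
            if (it.next().pk.equals(pk)) {
                it.remove();
            }
        }
        docs.add(doc);
    }

    int size() {
        return docs.size();
    }

    /** Returns the body stored under pk, or null if no document has that key. */
    String bodyOf(String pk) {
        for (Doc d : docs) {
            if (d.pk.equals(pk)) {
                return d.body;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        ToyIndex index = new ToyIndex();
        // The third row repeats the key of the first row.
        String[][] rows = { {"k1", "first"}, {"k2", "other"}, {"k1", "second"} };
        for (String[] row : rows) {
            index.updateDocument(row[0], new Doc(row[0], row[1]));
        }
        // prints: 2 docs, k1 -> second
        System.out.println(index.size() + " docs, k1 -> " + index.bodyOf("k1"));
    }
}
```

Whether this lets you drop the in-memory map from step 2 of the original question depends on IndexWriter applying the delete to documents added earlier in the same uncommitted session, so that is worth checking against the 2.9 Javadoc before relying on it.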
