
Mailing List Archive: Lucene: Java-User

How to avoid duplicate records in lucene

 

 



sebasmtech at gmail

Jul 19, 2008, 4:30 AM

Post #1 of 13 (689 views)
Permalink
How to avoid duplicate records in lucene

Hi All,

Is there any possibility to avoid duplicate records in lucene 2.3.1?


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe[at]lucene.apache.org
For additional commands, e-mail: java-user-help[at]lucene.apache.org


markharw00d at yahoo

Jul 19, 2008, 8:44 AM

Post #2 of 13 (653 views)
Permalink
Re: How to avoid duplicate records in lucene [In reply to]

Sebastin wrote:
> Hi All,
>
> Is there any possibility to avoid duplicate records in lucene 2.3.1?
>

At index-time or query time?
See DuplicateFilter in contrib/queries for a query-time filter


Cheers
Mark




markrmiller at gmail

Jul 20, 2008, 7:11 PM

Post #3 of 13 (633 views)
Permalink
Re: How to avoid duplicate records in lucene [In reply to]

Sebastin wrote:
> Hi All,
>
> Is there any possibility to avoid duplicate records in lucene 2.3.1?
>
I don't believe that there is a very high performance way to do this.
You are basically going to have to query the index for an id before
adding a new doc. The best way I can think of off the top of my head is
to batch - first check that ids in the batch are unique, then check all
ids in the batch against the IndexReader, then add the ones that are not
dupes. Of course all of your docs would have to be added through this
single choke point so that you knew other threads had not added that id
after the first thread had looked but before it added the doc.

I think Mark H has you covered if getting the dupes out afterwards is okay.

- Mark
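The batching idea above can be sketched in plain Java. This is only an illustration under stated assumptions: the class name BatchDedupIndexer, the method addBatch, and the in-memory set of ids are hypothetical, and the set stands in for a real lookup against an IndexReader. No Lucene API is used here.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of "batch, check, then add" through one choke point.
class BatchDedupIndexer {
    // Assumption: stands in for the ids already present in the index.
    // A real version would check each id against an IndexReader instead.
    private final Set<String> indexedIds = new HashSet<String>();

    // synchronized makes this method the single choke point, so no other
    // thread can add an id between the check and the add.
    // Each record is a two-element array: {id, content}.
    public synchronized List<String> addBatch(List<String[]> batch) {
        // Step 1: make ids unique within the batch (first occurrence wins).
        Map<String, String> unique = new LinkedHashMap<String, String>();
        for (String[] record : batch) {
            if (!unique.containsKey(record[0])) {
                unique.put(record[0], record[1]);
            }
        }
        // Steps 2 and 3: check each id against the index, add only new ones.
        List<String> added = new ArrayList<String>();
        for (String id : unique.keySet()) {
            if (indexedIds.add(id)) { // add() returns false if already present
                added.add(id);        // a real version would add the document here
            }
        }
        return added;
    }
}
```

The method returns the ids that were actually added, so the caller can see which records were dropped as duplicates.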



sebasmtech at gmail

Jul 21, 2008, 6:40 AM

Post #4 of 13 (621 views)
Permalink
Re: How to avoid duplicate records in lucene [In reply to]

At search time, while querying the data.


erickerickson at gmail

Jul 21, 2008, 7:18 AM

Post #5 of 13 (621 views)
Permalink
Re: How to avoid duplicate records in lucene [In reply to]

Could you define duplicate? As far as I know, you don't
get the same (internal) doc id back more than once, so what
is a duplicate?

Best
Erick



markharw00d at yahoo

Jul 21, 2008, 11:44 AM

Post #6 of 13 (616 views)
Permalink
Re: How to avoid duplicate records in lucene [In reply to]

>>could you define duplicate?

That's your choice of field that you want to de-dup on.
That could be a field such as "DatabasePrimaryKey" or perhaps a field
containing an MD5 hash of document content.
The DuplicateFilter ensures only one document can exist in results for
each unique value for the choice of field.

Cheers
Mark
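To make the "one document per unique field value" idea concrete, here is a minimal plain-Java sketch that de-duplicates a list of hits on an MD5 hash of their content. It is not the DuplicateFilter implementation and uses no Lucene API. The names ContentHashDedup, md5Hex and dedupe are hypothetical, and keeping the first occurrence is an assumption made for the example.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: keep one hit per unique value of a de-dup key.
class ContentHashDedup {
    // The de-dup key: an MD5 hash of the document content, as lowercase hex.
    static String md5Hex(String content) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(content.getBytes(StandardCharsets.UTF_8));
            return String.format("%032x", new BigInteger(1, digest));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    // Returns the hits in their original order, keeping only the first
    // hit seen for each distinct content hash.
    static List<String> dedupe(List<String> hits) {
        Set<String> seenKeys = new HashSet<String>();
        List<String> kept = new ArrayList<String>();
        for (String content : hits) {
            if (seenKeys.add(md5Hex(content))) { // false if key already seen
                kept.add(content);
            }
        }
        return kept;
    }
}
```

In a real index the hash would typically be computed once at index time and stored in its own field, so that the query-time filter only has to compare field values.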



eksdev at yahoo

Jul 21, 2008, 11:52 AM

Post #7 of 13 (615 views)
Permalink
Re: How to avoid duplicate records in lucene [In reply to]

You could maintain a Bloom filter and run an exact search only for its "positives", to check whether they are false positives. If you have a small percentage of duplicates (unique documents dominate updates), this will help you a lot on the performance side.
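A rough plain-Java sketch of that idea follows. Everything in it is an assumption made for illustration: the class BloomDedup, the two simple hash functions, and the HashSet that stands in for the expensive exact search against the index. The point is only that the exact lookup runs when the Bloom filter reports a possible match.

```java
import java.util.BitSet;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch: a Bloom filter in front of an exact duplicate check.
class BloomDedup {
    private final BitSet bits;
    private final int size;
    // Assumption: stands in for the exact (expensive) search of the index.
    private final Set<String> index = new HashSet<String>();
    // Counts how often the exact lookup was needed.
    int exactLookups = 0;

    BloomDedup(int size) {
        this.size = size;
        this.bits = new BitSet(size);
    }

    // Two simple, illustrative hash functions mapped into [0, size).
    private int h1(String key) {
        return Math.floorMod(key.hashCode(), size);
    }

    private int h2(String key) {
        return Math.floorMod(new StringBuilder(key).reverse().toString().hashCode(), size);
    }

    // Returns true if the key was new and has been added, false if duplicate.
    boolean addIfNew(String key) {
        int a = h1(key);
        int b = h2(key);
        boolean maybeSeen = bits.get(a) && bits.get(b);
        if (maybeSeen) {
            // Possible duplicate: confirm with the exact check, because a
            // Bloom filter can report false positives.
            exactLookups++;
            if (index.contains(key)) {
                return false;
            }
        }
        // Definitely new (a Bloom filter has no false negatives).
        bits.set(a);
        bits.set(b);
        index.add(key);
        return true;
    }
}
```

When most incoming keys are new, the filter usually answers "not seen" and the exact lookup is skipped, which is where the saving comes from.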





erickerickson at gmail

Jul 22, 2008, 6:37 AM

Post #8 of 13 (593 views)
Permalink
Re: How to avoid duplicate records in lucene [In reply to]

Well, the point of my question was to ensure that we were all using common
terms. For all we know, the original questioner considered "duplicate"
records to be ones that had identical, or even similar, text. Nothing in the
original question indicated any de-dup happening.

I've often found that assumptions that we are all talking about the same
thing are...er...incorrect. And I don't want to waste my time answering
questions that weren't what was asked......

Best
Erick



markharw00d at yahoo

Jul 22, 2008, 11:09 AM

Post #9 of 13 (581 views)
Permalink
Re: How to avoid duplicate records in lucene [In reply to]

>>Well, the point of my question was to ensure that we were all using common terms.

Sorry, Erick. I thought your "define duplicate" question was asking me about DuplicateFilter's concept of duplicates rather than asking the original poster about his notion of what a duplicate document meant to him. You're right, it would be useful to understand more about the intention of the original message.

Cheers
Mark







erickerickson at gmail

Jul 22, 2008, 11:22 AM

Post #10 of 13 (582 views)
Permalink
Re: How to avoid duplicate records in lucene [In reply to]

NP, if my original reply had included my second one, then you'd have
known what I was talking about <G>...

I *love* it when I unknowingly demonstrate the issue I'm trying to clarify
<G>.

Best
Erick



sebasmtech at gmail

Jul 22, 2008, 9:36 PM

Post #11 of 13 (577 views)
Permalink
Re: How to avoid duplicate records in lucene [In reply to]

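Erick,

For example:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

// Open a writer on C:/index; true = create a new index, overwriting any existing one
IndexWriter writer = new IndexWriter("C:/index", new StandardAnalyzer(), true);

String records = "Lucene action book";

// One document with a single stored, tokenized "contents" field
Document doc = new Document();
doc.add(new Field("contents", records, Field.Store.YES, Field.Index.TOKENIZED));

writer.addDocument(doc);
writer.optimize();
writer.close();

When the record is inserted twice, querying for "Lucene" displays the same
record twice.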


erickerickson at gmail

Jul 23, 2008, 6:40 AM

Post #12 of 13 (567 views)
Permalink
Re: How to avoid duplicate records in lucene [In reply to]

Well, yes, that's expected behavior. Lucene makes no attempt to filter
"substantially similar" documents. From Lucene's perspective, you added
it twice, you must have had a good reason.

And no, it doesn't really display the same document twice. There are two
documents in Lucene that happen to have identical content. They'd
have different internal document IDs.

So what problem are you really trying to solve? Or are you just trying to
understand Lucene?

If you require that identical (or similar) documents are indexed only once,
you have to ensure that yourself. Lucene is an *engine*, not a full-blown
application. So the various kinds of business logic restrictions are up
to you to implement.

Best
Erick

On Wed, Jul 23, 2008 at 12:36 AM, Sebastin <sebasmtech[at]gmail.com> wrote:

>
> Erick,
>
> example,
>
> IndexWriter writer = new IndexWriter("C:/index", new StandardAnalyzer(), true);
>
> String records = "Lucene" + " " + "action" + " " + "book";
>
> Document doc = new Document();
>
> doc.add(new Field("contents", records, Field.Store.YES, Field.Index.TOKENIZED));
>
> writer.addDocument(doc);
> writer.optimize();
> writer.close();
>
>
> When the record is inserted twice, querying for "Lucene" displays the same
> record twice.


chris.lu at gmail

Jul 23, 2008, 4:19 PM

Post #13 of 13 (560 views)
Permalink
Re: How to avoid duplicate records in lucene [In reply to]

Sebastin,

Lucene is just like a plain database table: it has no uniqueness
constraint, so you can have two documents with exactly the same content.

What you should do is check for a duplicate before adding. If a
duplicate is found, delete the old Document and add the new Document.
This way, you control what your "unique key" is.
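
The delete-then-add pattern can be sketched as follows. This is a plain-Java illustration rather than Lucene code: KeyedIndex and upsert are made-up names, and the map is an assumption standing in for the index. In Lucene itself, IndexWriter.updateDocument(Term, Document) does the same thing in one call, deleting any documents that contain the given term and then adding the new one.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical in-memory stand-in for an index keyed by an application-chosen unique field. */
final class KeyedIndex {
    // Maps the unique key to the document content; insertion order is preserved.
    private final Map<String, String> docsByKey = new LinkedHashMap<>();

    /**
     * Adds the document under its key, first deleting any earlier document
     * with the same key. Returns true if an old document was replaced.
     */
    boolean upsert(String uniqueKey, String content) {
        boolean replaced = docsByKey.remove(uniqueKey) != null; // delete the old one, if any
        docsByKey.put(uniqueKey, content);                       // add the new one
        return replaced;
    }

    /** Number of documents currently held. */
    int size() {
        return docsByKey.size();
    }

    /** Content stored under the key, or null if there is none. */
    String get(String uniqueKey) {
        return docsByKey.get(uniqueKey);
    }
}
```

Adding the same key twice leaves a single document holding the latest content, which is the behavior the original question was after.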

--
Chris Lu
-------------------------
Instant Scalable Full-Text Search On Any Database/Application
site: http://www.dbsight.net
demo: http://search.dbsight.com
Lucene Database Search in 3 minutes:
http://wiki.dbsight.com/index.php?title=Create_Lucene_Database_Search_in_3_minutes
DBSight customer, a shopping comparison site, (anonymous per request) got
2.6 Million Euro funding!

