
Mailing List Archive: Lucene: General

how to Index only newly added documents?

 

 



t.sapra97 at gmail

Nov 3, 2009, 4:06 AM

Post #1 of 6
how to Index only newly added documents?

Hi people,

I am stuck on a problem. I have a resources directory containing a lot of
documents, and my Java program picks up documents from this directory. Is
there a way, using the Lucene APIs, to recognize documents that have already
been indexed, so that I can filter them out and index only the newly added
documents?

Thanks
Tarun
--
View this message in context: http://old.nabble.com/how-to-Index-only-newly-added-documents--tp26160082p26160082.html
Sent from the Lucene - General mailing list archive at Nabble.com.


rodrigofurtado at saneago

Nov 3, 2009, 4:26 AM

Post #2 of 6
Re: how to Index only newly added documents?

Take a look at this class:

org.pdfbox.searchengine.lucene.IndexFiles

It is an example class that creates an index and keeps it up to date when
you add documents to, or delete documents from, a directory.

You select the mode with the arguments you pass when you run the class.

To create the index directory, try this:

java -Xms256m -Xmx512m org.pdfbox.searchengine.lucene.IndexFiles -create -index <your_index_directory> <your_documents_directory>


To update an existing index only (picking up new or deleted files), try this
(note that the '-create' argument is not present):


java -Xms256m -Xmx512m org.pdfbox.searchengine.lucene.IndexFiles -index <your_index_directory> <your_documents_directory>


Bye
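
The incremental run described above can be pictured without Lucene at all: compare the set of paths already recorded in the index with the set of paths currently on disk. What follows is a minimal sketch of that general idea, not of what IndexFiles does internally. The class name IndexDiff and the use of plain path strings are assumptions made for illustration.

```java
import java.util.Set;
import java.util.TreeSet;

/** Works out which files an incremental run has to add and which it has to delete. */
class IndexDiff {
    final Set<String> toAdd;     // on disk, but not yet in the index
    final Set<String> toDelete;  // in the index, but no longer on disk

    IndexDiff(Set<String> toAdd, Set<String> toDelete) {
        this.toAdd = toAdd;
        this.toDelete = toDelete;
    }

    static IndexDiff compute(Set<String> indexedPaths, Set<String> pathsOnDisk) {
        // New files: everything on disk minus what is already indexed.
        Set<String> toAdd = new TreeSet<>(pathsOnDisk);
        toAdd.removeAll(indexedPaths);

        // Stale entries: everything indexed minus what is still on disk.
        Set<String> toDelete = new TreeSet<>(indexedPaths);
        toDelete.removeAll(pathsOnDisk);

        return new IndexDiff(toAdd, toDelete);
    }
}
```

The paths in toAdd would then be indexed and those in toDelete removed from the index. How the already indexed paths are obtained (for example from a stored path field) is left open here.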



diego.cassinera at mercadolibre

Nov 3, 2009, 4:34 AM

Post #3 of 6
Re: how to Index only newly added documents?

The API allows you to add documents to an index. However, it does not have any functionality to detect which ones are new or changed. Regardless, this is somewhat trivial to do: just write an indexing app that reads the file names from standard input. In a Linux shell, use find or ls and pipe the result to your app.
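
A minimal sketch of the reading side of such an app, assuming one file name per line on standard input. The class name StdinFileList is made up for illustration, and the actual indexing step is only a placeholder that prints the name; a real app would build a Lucene document there instead.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.List;

class StdinFileList {
    /** Reads one file name per line, trimming whitespace and skipping blank lines. */
    static List<String> readFileNames(BufferedReader in) {
        List<String> names = new ArrayList<>();
        in.lines()
          .map(String::trim)
          .filter(name -> !name.isEmpty())
          .forEach(names::add);
        return names;
    }

    public static void main(String[] args) {
        BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
        for (String name : readFileNames(in)) {
            // Placeholder for the real work: open the file and add it to the index.
            System.out.println("would index: " + name);
        }
    }
}
```

It could be fed with something like `find resources -type f -newer last-run.stamp | java StdinFileList`, where last-run.stamp is a hypothetical marker file that you touch after each successful run, so that find only lists files modified since then.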

Diego


t.sapra97 at gmail

Nov 3, 2009, 9:41 PM

Post #4 of 6
Re: how to Index only newly added documents?

Thanks for the reply!

But I need to filter out the already indexed documents. That is, if the
resources directory contains 2 documents which are already indexed, then when
2 more documents are added, the indexer should index only the newly added
documents into the already existing index location.
Thanks



simon.willnauer at googlemail

Nov 4, 2009, 8:19 AM

Post #5 of 6
Re: how to Index only newly added documents?

The common approach is to store a UUID field in the index and call
updateDocument with a delete term holding the document's UUID.
That way only the most recently added document for a given UUID ends up in
the index.
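
To make the semantics concrete, here is a small Lucene-free sketch. In Lucene itself the call would be IndexWriter.updateDocument(Term, Document), which deletes any document containing the term and then adds the new one. In the sketch a map stands in for the index, and deriving the UUID from the file path is an assumption; any stable identifier would do. The class and method names are made up for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.UUID;

class UuidUpsertSketch {
    // Stand-in for the index: document ID -> document content.
    private final Map<String, String> docsById = new LinkedHashMap<>();

    /** Name-based UUID: the same path always yields the same ID. */
    static String idFor(String path) {
        return UUID.nameUUIDFromBytes(path.getBytes(StandardCharsets.UTF_8)).toString();
    }

    /** Mirrors update-by-ID: any older entry with this ID is replaced by the new one. */
    void update(String path, String content) {
        docsById.put(idFor(path), content);
    }

    int size() {
        return docsById.size();
    }

    String contentFor(String path) {
        return docsById.get(idFor(path));
    }
}
```

Note that this approach does not skip documents that are already indexed. It re-indexes them and relies on the ID to prevent duplicates, which also picks up changed files.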

simon



skant at sloan

Nov 4, 2009, 10:22 AM

Post #6 of 6
Re: how to Index only newly added documents?

Like Simon mentioned, you might want to create a document identifier or
UUID, if you don't have one already, and use this code snippet to check
whether the document exists:

// Fields.DOCID_FIELD, mSearcher, mIndexWriter and doc come from your own code.
String docId = "1234567";
Term idTerm = new Term(Fields.DOCID_FIELD, docId);
if (mSearcher.docFreq(idTerm) > 0) {
    // A document with this ID is already indexed, so skip it.
    // To re-index it instead: mIndexWriter.updateDocument(idTerm, doc);
} else {
    // No document with this ID yet, so add it.
    mIndexWriter.addDocument(doc);
}
