
Mailing List Archive: Lucene: Java-User

checking existing docs before indexing

 

 



heba.farouk at yahoo

Jul 12, 2007, 6:27 AM

Post #1 of 5 (1224 views)
checking existing docs before indexing

Hello,
I'm a newbie to the Lucene world and I hope you can help me.
Is there any option in IndexWriter to check whether a document already exists before adding it to the index, or should I keep track of that manually?

Thanks in advance


Yours

Heba




erickerickson at gmail

Jul 12, 2007, 7:17 AM

Post #2 of 5 (1187 views)
Re: checking existing docs before indexing

You have to check yourself. Lucene has no concept of relations
*between* documents. What you're really asking for is something
like a database unique key. No such luck, you have to create
one yourself.

What I've done is post-process the entire index, removing duplicates.
This can be done quite efficiently with TermDocs/TermEnum, and you
can then institute policies like, say, LIFO or FIFO.

You could also certainly check before adding a document, also
using TermEnum/TermDocs.
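
The post-processing pass described above can be sketched without Lucene at all. The following is a minimal in-memory model, not Lucene code: the list position stands in for a document number, each string stands in for the value of a hypothetical unique-key field, and the `keepNewest` flag is one assumed reading of the "LIFO or FIFO" policies mentioned. The class and method names are invented for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// In-memory model only: list position stands in for a document number,
// and each string stands in for the value of a hypothetical unique-key field.
class DuplicateSweep {

    /**
     * Returns the positions of the documents to delete so that each key survives once.
     * keepNewest = true keeps the last document added per key, false keeps the first.
     */
    static List<Integer> docsToDelete(List<String> keys, boolean keepNewest) {
        Map<String, Integer> survivor = new HashMap<>();
        List<Integer> toDelete = new ArrayList<>();
        for (int docId = 0; docId < keys.size(); docId++) {
            String key = keys.get(docId);
            Integer previous = survivor.get(key);
            if (previous == null) {
                survivor.put(key, docId);   // first time this key is seen
            } else if (keepNewest) {
                toDelete.add(previous);     // the older copy loses
                survivor.put(key, docId);
            } else {
                toDelete.add(docId);        // the newer copy loses
            }
        }
        return toDelete;
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList("a.txt", "b.txt", "a.txt", "a.txt");
        System.out.println(docsToDelete(keys, true));   // prints [0, 2]
        System.out.println(docsToDelete(keys, false));  // prints [2, 3]
    }
}
```

In a real index the keys would come from enumerating the terms of the key field, and the returned positions would be the document numbers to delete.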

Best
Erick



neeraj.gupta.2 at hewitt

Jul 12, 2007, 7:20 AM

Post #3 of 5 (1155 views)
Re: checking existing docs before indexing

Hi,

You can use the updateDocument() method of IndexWriter to update an existing
document. It looks for documents matching the given Term and deletes them if
any exist. It then adds the provided document to the index in both cases,
whether or not a matching document existed.
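
The delete-then-add behaviour described above can be illustrated with a small in-memory model. This is a sketch of the semantics only, not Lucene code: `TinyIndex`, its methods and the "path"/"content" field names are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Simplified in-memory model of "delete by term, then add"; this is not Lucene code.
class TinyIndex {
    private final List<Map<String, String>> docs = new ArrayList<>();

    // Plain add: never checks whether a matching document already exists.
    void add(Map<String, String> doc) {
        docs.add(doc);
    }

    // Removes every document whose field equals value, then adds the new document.
    void update(String field, String value, Map<String, String> doc) {
        docs.removeIf(d -> value.equals(d.get(field)));
        docs.add(doc);
    }

    int size() {
        return docs.size();
    }

    // Helper that builds a two-field document (hypothetical field names).
    static Map<String, String> doc(String path, String content) {
        Map<String, String> d = new HashMap<>();
        d.put("path", path);
        d.put("content", content);
        return d;
    }

    public static void main(String[] args) {
        TinyIndex index = new TinyIndex();
        index.update("path", "a.txt", doc("a.txt", "first version"));
        index.update("path", "a.txt", doc("a.txt", "second version"));
        System.out.println(index.size()); // prints 1: the update replaced the old copy
        index.add(doc("a.txt", "third version"));
        System.out.println(index.size()); // prints 2: a plain add created a duplicate
    }
}
```

Because the update always adds, re-indexing the same file any number of times leaves exactly one copy, which is why no separate existence check is needed in this model.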

Cheers,
Neeraj


The information contained in this e-mail and any accompanying documents may contain information that is confidential or otherwise protected from disclosure. If you are not the intended recipient of this message, or if this message has been addressed to you in error, please immediately alert the sender by reply e-mail and then delete this message, including any attachments. Any dissemination, distribution or other use of the contents of this message by anyone other than the intended recipient
is strictly prohibited.


samuel.lemoine at lingway

Jul 12, 2007, 9:08 AM

Post #4 of 5 (1140 views)
Re: checking existing docs before indexing

I also used updateDocument() for this, but I ran into the issue that it
takes a Term as its argument, so it may delete other documents that
contain the same term. To avoid this, my conclusion was to give every
document a stored, untokenized field that serves as a key: a string that
uniquely identifies the document and distinguishes it from all others
(such as a URL or a file path).
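
Why the key field should be untokenized can be shown with a toy model. The Term passed to a lookup is compared against the indexed terms as-is and is not run through an analyzer, so a key that was split into tokens at indexing time no longer matches its full value. In the sketch below, `tokenize()` is a crude stand-in for an analyzer (it is not StandardAnalyzer), and all names are invented for the example.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

// Model only: tokenize() is a crude stand-in for an analyzer, not Lucene's StandardAnalyzer.
class KeyFieldDemo {

    // Lowercases the text and splits it on anything that is not a letter or digit.
    static List<String> tokenize(String text) {
        List<String> terms = new ArrayList<>();
        for (String piece : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!piece.isEmpty()) {
                terms.add(piece);
            }
        }
        return terms;
    }

    // Terms indexed for a field: the whole value if untokenized, its tokens otherwise.
    static List<String> indexedTerms(String value, boolean tokenized) {
        return tokenized ? tokenize(value) : Collections.singletonList(value);
    }

    // A term lookup is an exact comparison; the lookup value itself is not analyzed.
    static boolean matches(String lookup, String storedValue, boolean tokenized) {
        return indexedTerms(storedValue, tokenized).contains(lookup);
    }

    public static void main(String[] args) {
        String path = "/docs/Report.txt";
        System.out.println(matches(path, path, false));                 // prints true
        System.out.println(matches(path, path, true));                  // prints false
        System.out.println(matches("docs", "/docs/Other.txt", true));   // prints true
    }
}
```

In this model a tokenized key fails in both directions: the full path matches nothing, so an update would never remove the old copy, while a single token such as "docs" matches unrelated documents, which an update would then delete.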

Sam


PS: Here is the sample code I wrote during my internship; it is quite
simple to grasp.
(I removed the original comments, as they were in French.)
The method most likely to interest you is addDocument(String).
Hope it helps.

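import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.io.Reader;

// The original post showed no imports; Logger is assumed to be log4j's.
import org.apache.log4j.Logger;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;

public class Indexer {

    private static final Logger theLogger = Logger.getLogger(Indexer.class);

    private Analyzer theAnalyzer;
    private IndexWriter theIndexWriter;
    private Reader theReaderContent;
    private String theIndexPath;

    public Indexer(String anIndexPath) {
        theAnalyzer = new StandardAnalyzer();
        theIndexPath = anIndexPath;
    }

    // Indexes one file, replacing any document already indexed under the same path.
    public void addDocument(String aFileName) {
        try {
            theIndexWriter = new IndexWriter(theIndexPath, theAnalyzer);
        } catch (IOException e) {
            theLogger.error(e);
            return; // without a writer there is nothing to do
        }

        try {
            // FileNotFoundException is an IOException, so one catch covers both.
            theReaderContent = new FileReader(aFileName);

            Document doc = new Document();
            TokenStream tokenStreamContent = new StandardTokenizer(theReaderContent);
            // "path" is stored and indexed as a single untokenized term: the unique key.
            Field docPath = new Field("path", aFileName, Field.Store.YES,
                    Field.Index.UN_TOKENIZED);
            Field docContent = new Field("content", tokenStreamContent);
            doc.add(docPath);
            doc.add(docContent);

            // theIndexWriter.addDocument(doc);
            // Deletes every document whose "path" term equals aFileName, then adds doc.
            theIndexWriter.updateDocument(new Term("path", aFileName), doc);
        } catch (IOException e) {
            theLogger.error(e);
        } finally {
            try {
                theIndexWriter.close();
            } catch (IOException e) {
                theLogger.error(e);
            }
        }
    }

    // Despite its name, this optimizes the index; it does not sort anything.
    public void sort() {
        try {
            theIndexWriter = new IndexWriter(theIndexPath, theAnalyzer);
            theIndexWriter.optimize();
            theIndexWriter.close();
        } catch (IOException e) {
            theLogger.error(e);
        }
    }

    // Indexes every entry directly inside the given directory, then optimizes.
    public void addAllDocuments(String aDirectoryPath) {
        File directory = new File(aDirectoryPath);
        File[] subDirectory = directory.listFiles();
        if (subDirectory == null) { // not a directory, or it could not be read
            theLogger.error("Cannot list " + aDirectoryPath);
            return;
        }
        System.out.println(subDirectory.length + " files were indexed.");
        for (File file : subDirectory) {
            addDocument(file.getPath());
        }
        this.sort();
    }
}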




neeraj.gupta.2 at hewitt

Jul 12, 2007, 11:45 PM

Post #5 of 5 (1150 views)
Re: checking existing docs before indexing

Yes, you need to store one untokenized field that identifies the
exact document you want to update.

You can also find out whether any such document exists in your index
by using the deleteDocuments() method of IndexReader. It returns the number
of documents deleted for the Term provided. Note that the matching documents
really are deleted, so this only suits the case where you re-add the document
afterwards.

Cheers,
Neeraj
