julien.nioche at lingway
Feb 13, 2002, 10:08 AM
Post #1 of 4
Re : How does Lucene handle phrases containing words that are not indexed?
By the way, I was wondering if there is any Analyzer that uses the following
constructor:
public Token(String text, int start, int end, String typ) ?
Maybe it could be interesting to build an analyzer that recognizes
punctuation marks and
keeps them in the index as Tokens with a given type (say for example
The advantage is that information could be used by a
to avoid PhraseQuery matches that span a punctuation mark. Since PhraseQueries
are used for compound words
(e.g. "personal computer") with a given slop value (say 3), it would be
great not to match things such as "It is not personal. My computer hates
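
To make the idea of typed punctuation tokens concrete, here is a minimal sketch in plain Java. It does not use Lucene's Analyzer or Token classes: TypedToken, PunctuationAwareTokenizer and the type labels "word" and "punctuation" are hypothetical names chosen for illustration, and the splitting rule (letters and digits form words, every other non-whitespace character is one punctuation token) is an assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical stand-in for a typed token: text, character offsets and a type label. */
final class TypedToken {
    final String text;
    final int start;
    final int end;
    final String type;

    TypedToken(String text, int start, int end, String type) {
        this.text = text;
        this.start = start;
        this.end = end;
        this.type = type;
    }
}

final class PunctuationAwareTokenizer {
    static final String WORD = "word";          // assumed type label
    static final String PUNCT = "punctuation";  // assumed type label

    /** Emits lowercased word tokens and one token per punctuation character. */
    static List<TypedToken> tokenize(String text) {
        List<TypedToken> tokens = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            char c = text.charAt(i);
            if (Character.isLetterOrDigit(c)) {
                // Consume a maximal run of letters and digits as one word.
                int start = i;
                while (i < text.length() && Character.isLetterOrDigit(text.charAt(i))) {
                    i++;
                }
                String word = text.substring(start, i).toLowerCase(Locale.ROOT);
                tokens.add(new TypedToken(word, start, i, WORD));
            } else {
                // Whitespace is dropped; anything else is kept as punctuation.
                if (!Character.isWhitespace(c)) {
                    tokens.add(new TypedToken(String.valueOf(c), i, i + 1, PUNCT));
                }
                i++;
            }
        }
        return tokens;
    }
}
```

Because the punctuation mark is kept as a token of its own, it occupies a position between "personal" and "my", which is the information a phrase matcher would need.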
A solution could be to set a slop value of zero, but that is not possible in
my case (I use a module that generates compound terms with slop values, in
order to handle morphological variations, e.g. in French "gestion de la casse"
and "gestion des casses" which are represented by "gestion casse"^3 and
This involves creating a subclass of PhraseQuery, or modifying it by adding a
boolean to it and changing the phraseFreq() method so that it checks that
there is no Token with a punctuation type within the scope of the slop.
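
The check described above can be sketched in plain Java as follows. This is not Lucene's PhraseQuery or its phraseFreq() method: Tok, SloppyPhraseSketch and the stopAtPunctuation flag are hypothetical names, and "slop" is simplified here to mean the total number of tokens allowed between in-order phrase terms, which is an assumption and not Lucene's exact definition.

```java
import java.util.List;

/** Hypothetical token carrying a punctuation flag; not Lucene's Token class. */
final class Tok {
    final String text;
    final boolean punctuation;

    private Tok(String text, boolean punctuation) {
        this.text = text;
        this.punctuation = punctuation;
    }

    static Tok word(String text) { return new Tok(text, false); }
    static Tok punct(String text) { return new Tok(text, true); }
}

final class SloppyPhraseSketch {
    /**
     * Returns true if the phrase terms occur in order with at most `slop`
     * intervening tokens in total. When stopAtPunctuation is set, a match
     * may not span a punctuation token.
     */
    static boolean matches(List<Tok> tokens, List<String> phrase,
                           int slop, boolean stopAtPunctuation) {
        if (phrase.isEmpty()) {
            return false;
        }
        for (int s = 0; s < tokens.size(); s++) {
            Tok t = tokens.get(s);
            if (!t.punctuation && t.text.equals(phrase.get(0))
                    && extend(tokens, phrase, s, 1, slop, stopAtPunctuation)) {
                return true;
            }
        }
        return false;
    }

    /** Tries to place phrase term termIdx after position prev within the remaining slop. */
    private static boolean extend(List<Tok> tokens, List<String> phrase, int prev,
                                  int termIdx, int slopLeft, boolean stopAtPunctuation) {
        if (termIdx == phrase.size()) {
            return true; // every term has been placed
        }
        for (int j = prev + 1; j < tokens.size() && j - prev - 1 <= slopLeft; j++) {
            Tok t = tokens.get(j);
            if (t.punctuation) {
                if (stopAtPunctuation) {
                    return false; // any later position would span this mark
                }
                continue; // plain slop matching: punctuation is just another token
            }
            int gap = j - prev - 1;
            if (t.text.equals(phrase.get(termIdx))
                    && extend(tokens, phrase, j, termIdx + 1, slopLeft - gap, stopAtPunctuation)) {
                return true;
            }
        }
        return false;
    }
}
```

With the tokens personal, ".", my, computer and the phrase "personal computer" at slop 3, this sketch matches when stopAtPunctuation is false and rejects the match when it is true, while gestion, de, la, casse still matches "gestion casse" at slop 3 in both modes.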
What do you think about it? Has anyone already tried something in that
direction? Does it imply heavy changes?
Hugo: maybe you could store your stopwords as tokens with a different type?
----- Original Message -----
From: "hugo burm" <hugob [at] xs4all>
To: <lucene-user [at] jakarta>
Sent: Wednesday, February 13, 2002 5:32 PM
Subject: How does Lucene handle phrases containing words that are not indexed?
> How does Lucene handle phrases (literals) containing words that are not
> indexed (e.g. stopwords, one-letter words, numbers)? I did some tests
> (lucene demo, my own 120000 xml documents, Cocoon search) and in all cases
> it looks like when you search for the phrase "a specification" it
> also finds documents which contain "the specification". (or: "D.
> instead of "G. Washington").
> Of course you can change the index behaviour and make sure there are no
> stopwords, and all one-letter words and numbers are indexed. But that is
> a bad approach. A better approach: 1) find all indexed words in the phrase
> and from these words find all documents containing these words. 2) check the
> occurrence of the phrase by opening the original document. I am wondering:
> does Lucene perform step 2)? Of course this step burns some CPU cycles.
> hugob [at] xs4all