
Mailing List Archive: Lucene: Java-User

Case Sensitivity

 

 



ArunaR at opin

Jan 21, 2002, 3:04 PM

Post #1 of 47 (1126 views)
Permalink
Case Sensitivity

Hi All,
I have noticed that I can not search using capital letters for some reason.
If I try to do a search on "SPINAL CORD" and if I use a query like SPI* AND
COR*, I get no results back. If I use lowercase (spi* AND cor*) however, I
get the results back. I am using a standard analyzer. Does anyone know why?
Thanks!

--
To unsubscribe, e-mail: <mailto:lucene-user-unsubscribe[at]jakarta.apache.org>
For additional commands, e-mail: <mailto:lucene-user-help[at]jakarta.apache.org>


brian at quiotix

Jan 21, 2002, 3:09 PM

Post #2 of 47 (1100 views)
Permalink
Re: Case Sensitivity [In reply to]

> I have noticed that I can not search using capital letters for some reason.
> If I try to do a search on "SPINAL CORD" and if I use a query like SPI* AND
> COR*, I get no results back. If I use lowercase (spi* AND cor*) however, I
> get the results back. I am using a standard analyzer. Does anyone know why?
> Thanks!

You need to use the _same_ analyzer for analyzing the documents when
indexing them as you do when you parse the query. You may be using
a different analyzer for tokenization than for query parsing...



DCutting at grandcentral

Jan 21, 2002, 3:58 PM

Post #3 of 47 (1097 views)
Permalink
RE: Case Sensitivity [In reply to]

Wildcard queries are case sensitive, while other queries depend on the
analyzer used for the field searched. The standard analyzer lowercases, so
lowercased terms are indexed. Thus your "SPINAL CORD" query is lowercased
and matches the indexed terms "spinal" and "cord". However, since prefixes
should not be stemmed they are not run through an analyzer and are hence
case sensitive. Your index contains no terms starting with "SPI" or "COR",
since all terms were lowercased when indexed.
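A minimal sketch of the behaviour described above, written in plain Java rather than against Lucene's API. The class `PrefixCaseDemo` and its methods are hypothetical names used only for illustration; the sorted set stands in for the index's term dictionary, and the lowercasing in `index` stands in for what a lowercasing analyzer does.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.TreeSet;

// Toy model of a term dictionary; this is NOT Lucene code.
class PrefixCaseDemo {

    // Index-time analysis: split on whitespace and lowercase each token,
    // roughly what a lowercasing analyzer would leave in the index.
    static TreeSet<String> index(String text) {
        TreeSet<String> terms = new TreeSet<>();
        for (String token : text.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                terms.add(token.toLowerCase(Locale.ROOT));
            }
        }
        return terms;
    }

    // Prefix lookup: the prefix is compared character by character
    // against the indexed terms, with no analysis applied to it.
    static List<String> prefixMatches(TreeSet<String> terms, String prefix) {
        List<String> hits = new ArrayList<>();
        for (String term : terms.tailSet(prefix)) {
            if (!term.startsWith(prefix)) {
                break; // sorted order: no later term can match either
            }
            hits.add(term);
        }
        return hits;
    }

    public static void main(String[] args) {
        TreeSet<String> terms = index("SPINAL CORD");
        System.out.println(prefixMatches(terms, "SPI")); // no indexed term starts with "SPI"
        System.out.println(prefixMatches(terms, "spi")); // finds "spinal"
    }
}
```

Because the indexed terms are all lowercase, only the lowercase prefix can ever find anything.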

This question is frequent enough that we should probably fix it. Perhaps a
method should be added to Analyzer:
public boolean isLowercased(String fieldName);
When this returns true, the query parser could lowercase prefix and range query
terms. Fellow Lucene developers, what do you think of that?
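One way to picture the proposal, as a sketch only: the interface `CaseAwareAnalyzer` and the helper `PrefixTermNormalizer` below are hypothetical and are not part of Lucene. Only the method signature `isLowercased(String fieldName)` is taken from the message; everything else is an assumption about how a query parser might use it.

```java
import java.util.Locale;

// Hypothetical interface modelling the proposed method; not a Lucene class.
interface CaseAwareAnalyzer {
    boolean isLowercased(String fieldName);
}

// Hypothetical helper a query parser could call for prefix and range terms.
class PrefixTermNormalizer {

    // Lowercase the raw term only when the field's analyzer lowercases
    // at index time; otherwise leave it untouched.
    static String normalize(CaseAwareAnalyzer analyzer, String fieldName, String term) {
        return analyzer.isLowercased(fieldName)
                ? term.toLowerCase(Locale.ROOT)
                : term;
    }

    public static void main(String[] args) {
        // Assumption for the demo: only the "body" field is lowercased.
        CaseAwareAnalyzer analyzer = field -> field.equals("body");
        System.out.println(normalize(analyzer, "body", "SPI")); // spi
        System.out.println(normalize(analyzer, "id", "SPI"));   // SPI
    }
}
```

The check is per field, so a field indexed without lowercasing keeps its case-sensitive prefix behaviour.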

Doug



brian at quiotix

Jan 21, 2002, 4:12 PM

Post #4 of 47 (1100 views)
Permalink
Re: Case Sensitivity [In reply to]

> Wildcard queries are case sensitive, while other queries depend on the
> analyzer used for the field searched. The standard analyzer lowercases, so
> lowercased terms are indexed. Thus your "SPINAL CORD" query is lowercased
> and matches the indexed terms "spinal" and "cord". However, since prefixes
> should not be stemmed they are not run through an analyzer and are hence
> case sensitive. Your index contains no terms starting with "SPI" or "COR",
> since all terms were lowercased when indexed.
>
> This question is frequent enough that we should probably fix it. Perhaps a
> method should be added Analyzer:
> public boolean isLowercased(String fieldName);
> When this is true, the query parser could lowercase prefix and range query
> terms. Fellow Lucene developers, what do you think of that?

Something should be done, but I'm not sure this is the best way to do
this. Perhaps extend Analyzer to work in two modes;
"tokenization-only" and "tokenization + term normalization".





michal.plechawski at nutechsolutions

Jan 22, 2002, 1:13 AM

Post #5 of 47 (1099 views)
Permalink
Re: Case Sensitivity - and more [In reply to]

Hi,

I have never written anything to the list but in fact, I am doing some
development using Lucene.
I think that Brian's idea is more flexible and extendable. In my
application, I need three or more kinds of analyzers: for counting tfidf
statistics, for indexing (compute more, e.g. summaries) and for document
classification (compute document-to-class assignment and store outside the
index) and for some minor things.
My experience shows that in complex Lucene applications there is a
substantial need for many different Analyzers or, as a better solution, many
faces of the same Analyzer at the same time. Something should be done
here.

Another story: why did you put document deletion in IndexReader? I guess
the main reason was the implementation, but from the API point of view it is
horrible. I have an abstraction 'Index' in my code with both add and remove
operations, and switching between IndexReader and IndexWriter is not something
I enjoy; I am now forced to add a cache for performance. I
think one of the reasons is inconsistent document id support: delete
assumes that documents can be uniquely identified, while
IndexWriter has nothing like that. I think adding an id to documents would be
very helpful for us developers, but it may be very hard to implement.

One last thing: did you ever think about adding transactions to Lucene? Maybe
very simple exclusive-write transactions, where reads are neither transacted nor
isolated, writes are exclusive (I
guess that is the case in 1.2; I use 1.0), and one may commit or roll back all
changes made during the last session. Would it be hard?

With all these issues added, Lucene would be mature enough to be used as an
indexing engine in mission-critical applications.

Regards,
Michal





DCutting at grandcentral

Jan 24, 2002, 11:42 AM

Post #6 of 47 (1099 views)
Permalink
RE: Case Sensitivity - and more [In reply to]

> From: Michal Plechawski
>
> I think that Brian's idea is more flexible and extendable. In my
> application, I need three or more kinds of analyzers: for
> counting tfidf
> statistics, for indexing (compute more, e.g. summaries) and
> for document
> classification (compute document-to-class assignment and
> store outside the
> index) and for some minor things.
> My experience shows that in complex Lucene applications there is a
> substantial need for many different Analyzers or - better
> solution - many
> faces of the same Analyzer in the same time. Something should be done
> here.

Currently it is easy to use different analyzers for different purposes, no?
I'm not sure how Brian's proposal (bi-modal analyzers: tokenize only &
tokenize+normalize) addresses your needs.

> Another story is - why did you put document deletion to
> IndexReader? I guess
> the main reason was the implementation, but from the API
> point of view it is
> horrible.

Yes, sorry. I wonder if it would have been better to instead call
IndexWriter IndexAdder or something, to make clear that it can only add
documents. Perhaps someday this can be fixed.

> Last thing - did you ever think about adding transactions to
> Lucene? May be
> very simple exclusive-write transactions - e.g. reads are not
> transacted nor
> isolated, and writes are done in such a way - the write is
> exclusive (I
> guess it is in 1.2, I use 1.0), and one may commit/rollback
> all changes made
> during last session. Would it be hard?

That is in fact what is done in 1.2.

Doug



michal.plechawski at nutechsolutions

Jan 25, 2002, 2:53 AM

Post #7 of 47 (1097 views)
Permalink
Re: Case Sensitivity - and more [In reply to]

> Currently it is easy to use different analyzers for different purposes, no?
> I'm not sure how Brian's proposal (bi-modal analyzers: tokenize only &
> tokenize+normalize) addresses your needs.

OK, maybe I made my point badly. But Brian's proposal, as I see it, was to
_group_ two tokenizers that differ in a single thing. The query parser
would use TWO analyzers, one for things that need normalization and
another for things that need none. It is extremely important
that these two analyzers are compatible (i.e. differ only in normalization),
especially for applications juggling many types of analyzers
(e.g. multilingual ones). It must not happen that the normalized analyzer is
English and the unnormalized one is German, for example, and the Lucene API
should support dealing with such pairs (offering something like an Analyzers
class with two parts, normalized() and unnormalized()).

> Yes, sorry. I wonder if it would have been better to instead call
> IndexWriter IndexAdder or something, to make clear that it can only add
> documents. Perhaps someday this can be fixed.

I agree it would be better to call it IndexAdder. I guess it would be a
major architectural change to make it possible to:
1) identify a document by a unique numeric id
2) check that this id is unique
3) delete the document with a given id by calling an
IndexWriter method
OK, I can live without this, but document uniqueness and identification
would be very helpful for any "mission-critical" application of Lucene
where duplicate documents are unacceptable and where the index
changes quite often.

> That is in fact what is done in 1.2.

Thanks, I didn't know.

Regards,
Michal




brian at quiotix

Jan 25, 2002, 3:24 AM

Post #8 of 47 (1100 views)
Permalink
Re: Case Sensitivity - and more [In reply to]

> Ok, maybe I misled a point a bit. But Brian's proposal as I see it was to
> _group_ two tokenizers that differ in a single thing.

I don't think that's what I was proposing... I was recognizing that
sometimes the analysis process is a composite one, and I was advocating
that the composition be made explicit since there are some cases where
only tokenization, but not normalization, is desired.
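A rough sketch of what "making the composition explicit" could look like, in plain Java rather than Lucene's Analyzer API. The class `ComposedAnalysis` and its method names are hypothetical; whitespace splitting and lowercasing are stand-ins chosen for brevity, not what any particular Lucene analyzer does.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.function.UnaryOperator;

// Hypothetical two-stage analysis: tokenization and normalization are
// separate steps, so a caller can ask for the first one alone.
class ComposedAnalysis {

    // Stage 1: tokenization only (split on whitespace, keep case).
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    // Stage 1 + stage 2: tokenization followed by term normalization.
    static List<String> analyze(String text, UnaryOperator<String> normalizer) {
        List<String> terms = new ArrayList<>();
        for (String token : tokenize(text)) {
            terms.add(normalizer.apply(token));
        }
        return terms;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Spinal Cord"));
        System.out.println(analyze("Spinal Cord", s -> s.toLowerCase(Locale.ROOT)));
    }
}
```

Both modes share the same tokenizer here, so they cannot drift apart in how they split text.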




michal.plechawski at nutechsolutions

Jan 25, 2002, 11:21 AM

Post #9 of 47 (1101 views)
Permalink
Re: Case Sensitivity - and more [In reply to]

That's one way to make the analysis composition explicit. Another way is
to have the Analyzer interface return two token streams: normalizedStream()
and unnormalizedStream(). I won't argue which is better.
BTW: many thanks for adding the ability to analyze different fields with
different token streams in 1.2; that was the real problem in 1.0.

Michal



ArunaR at opin

Apr 3, 2002, 12:27 PM

Post #10 of 47 (1101 views)
Permalink
RE: Case Sensitivity [In reply to]

Hi,
I worked around the problem by converting everything to lowercase in my code
prior to indexing into Lucene and also prior to searching for a string.
Of course, I also had to use pattern matching to change Boolean operators such
as AND and OR back to uppercase, because Lucene expects those to be
uppercase.
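The workaround above might look roughly like the following. This is a sketch under assumptions, not the poster's actual code: the class `QueryLowercaser` is hypothetical, and treating NOT as an operator alongside AND and OR is an addition not mentioned in the message.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical pre-processing step applied to the query string before it
// is handed to the query parser.
class QueryLowercaser {

    // Whole-word operators as they appear after lowercasing.
    // Including "not" is an assumption; the message names only AND and OR.
    private static final Pattern OPERATOR = Pattern.compile("\\b(and|or|not)\\b");

    static String lowercaseKeepingOperators(String query) {
        // Step 1: lowercase everything, including wildcard prefixes.
        String lowered = query.toLowerCase(Locale.ROOT);

        // Step 2: restore the Boolean operators to uppercase.
        Matcher matcher = OPERATOR.matcher(lowered);
        StringBuffer result = new StringBuffer();
        while (matcher.find()) {
            matcher.appendReplacement(result, matcher.group(1).toUpperCase(Locale.ROOT));
        }
        matcher.appendTail(result);
        return result.toString();
    }

    public static void main(String[] args) {
        System.out.println(lowercaseKeepingOperators("SPI* AND COR*"));
    }
}
```

One caveat of this approach: a literal search word such as "and" or "or" typed by the user is also turned into an operator, including inside quoted phrases.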

-----Original Message-----
From: Alan Weissman [mailto:aweissman[at]clientelligence.net]
Sent: Wednesday, April 03, 2002 1:26 PM
To: Lucene Users List
Subject: Case Sensitivity


What can I do to configure Lucene to make it case insensitive?

Thanks,
Alan




ArunaR at opin

Apr 3, 2002, 12:38 PM

Post #11 of 47 (1101 views)
Permalink
RE: Case Sensitivity [In reply to]

Hi,
I am using StandardAnalyzer; the problem was with wildcard queries being
case sensitive. Even with StandardAnalyzer, you have to worry about case
sensitivity in this case. Thanks for the tip on the example Analyzer; I will
take a peek.



jmadden at ics

Apr 3, 2002, 12:40 PM

Post #12 of 47 (1099 views)
Permalink
RE: Case Sensitivity [In reply to]

Alan, Aruna:

The built-in solution is to use LowerCaseFilter in your Analyzer. (The
SimpleAnalyzer, StopAnalyzer, and StandardAnalyzer classes already do
this; see the Lucene API docs to see which filters each uses.) The FAQ
includes an example implementation of an Analyzer if you want to build
your own.

Joshua

jmadden[at]ics.uci.edu...Obscurium Per Obscurius...www.ics.uci.edu/~jmadden
Joshua Madden: Information Scientist, Musician, Philosopher-At-Tall
It's that moment of dawning comprehension that I live for--Bill Watterson
My opinions are too rational and insightful to be those of any organization.



dckorah at gmail

Aug 13, 2008, 9:15 AM

Post #13 of 47 (1077 views)
Permalink
RE: Case Sensitivity [In reply to]

Also would like to highlight the version of Lucene I am using; It is 2.0.0.

_____

From: Dino Korah [mailto:dckorah[at]gmail.com]
Sent: 13 August 2008 17:10
To: 'java-user[at]lucene.apache.org'
Subject: Case Sensitivity


Hi All,

Once I have indexed a bunch of documents with a StandardAnalyzer (and assuming
that reindexing them is not worth the effort), is there
a way to search the index without case sensitivity?
I do not use any sophisticated Analyzer that makes use of
LowerCaseTokenizer.

Please let me know if there is a solution to circumvent this case
sensitivity problem.

Many thanks
Dino


sarowe at syr

Aug 13, 2008, 9:27 AM

Post #14 of 47 (1074 views)
Permalink
RE: Case Sensitivity [In reply to]

Hi Dino,

StandardAnalyzer incorporates StandardTokenizer, StandardFilter, LowerCaseFilter, and StopFilter. Any index you create using it will only provide case-insensitive matching.

Steve


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe[at]lucene.apache.org
For additional commands, e-mail: java-user-help[at]lucene.apache.org


erickerickson at gmail

Aug 13, 2008, 9:47 AM

Post #15 of 47 (1074 views)
Permalink
Re: Case Sensitivity [In reply to]

What analyzer are you using at *query* time? I suspect that's where your
problem lies if you indeed "don't use any sophisticated analyzers", since
you *are* using a sophisticated analyzer at index time. You almost
invariably want to use the same analyzer at query time and index time.

Please start a separate thread with your second question. Google
"Thread Hijacking" for the explanation of why that's a good idea.

Best
Erick



ksmmlist at gmail

Aug 14, 2008, 7:31 AM

Post #16 of 47 (1047 views)
Permalink
Re: Case Sensitivity [In reply to]

Thanks for your reply, Erick.


> About the only way to do this that I know of is to
> index the data three times, once without any case
> changing, once uppercased and once lowercased.
> You'll have to watch your analyzer, probably making
> up your own (easily done, see the synonym analyzer
> in Lucene in Action).
>
> Your example doesn't tell us anything, since the critical
> information is the *analyzer* you use, both at query and
> at index times. The analyzer is responsible for any
> transformations, like case folding, tokenizing, etc.


In my example I wanted to show that I indexed the field with Field.Index.NO_NORMS.

As I understand it, this means that the field contains the original string
regardless of which analyzer I chose (StandardAnalyzer by default).

I build all queries myself, without using parsers.
For example: new TermQuery(new Term("filed", "MaMa"));


I agree that the implementation you suggest is possible,
but it would multiply the size of the index.

But are there other possibilities, such as a custom query, possibly
similar to RegexQuery/RegexTermEnum, that would compare terms
at its own discretion?
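The idea behind such a term-enumerating query can be sketched in plain Java. The class `CaseInsensitiveTermScan` is hypothetical and does not use or reproduce the RegexQuery or RegexTermEnum API; the sorted set simply stands in for the case-preserved terms of one field.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;

// Toy model of a query that walks the term dictionary and applies its own
// comparison to every term, instead of looking up one exact term.
class CaseInsensitiveTermScan {

    // Returns every indexed term equal to the wanted term, ignoring case.
    static List<String> matchIgnoringCase(SortedSet<String> terms, String wanted) {
        List<String> hits = new ArrayList<>();
        for (String term : terms) {
            if (term.equalsIgnoreCase(wanted)) {
                hits.add(term);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        SortedSet<String> terms =
                new TreeSet<>(Arrays.asList("MaMa", "mama", "MAMA", "papa"));
        System.out.println(matchIgnoringCase(terms, "mama"));
    }
}
```

The trade-off is the usual one for this kind of query: the index stays at its original size, but every lookup costs a scan over all distinct terms of the field rather than a single dictionary seek.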



>
> But what is your use-case for needing both upper and
> lower case comparisons? I have a hard time coming
> up with a reason to do both that wouldn't be satisfied
> by just a caseless search.
>
> Best
> Erick
>
> On Thu, Aug 14, 2008 at 4:47 AM, Sergey Kabashnyuk
> <ksmmlist[at]gmail.com>wrote:
>
>> Hello.
>>
>> I have the similar question.
>>
>> I need to implement
>> 1. Case-sensitive search.
>> 2. Lowercase search for a specific field.
>> 3. Uppercase search for a specific field.
>>
>> For now I use
>> new Field("PROPERTIES",
>>           content,
>>           Field.Store.NO,
>>           Field.Index.NO_NORMS,
>>           Field.TermVector.NO)
>> for the original string and do a case-sensitive search.
>>
>> But does anyone have an idea how to implement the second and third types of
>> search?
>>
>> Thanks
>>
>>
>>
>> Hi All,
>>> Once I index a bunch of documents with a StandardAnalyzer (and if the
>>> effort
>>> I need to put in to reindex the documents is not worth the effort), is
>>> there
>>> a way to search on the index without case sensitivity.
>>> I do not use any sophisticated Analyzer that makes use of
>>> LowerCaseTokenizer.
>>> Please let me know if there is a solution to circumvent this case
>>> sensitivity problem.
>>> Many thanks
>>> Dino
>>>
>>>
>> --
>> Sergey Kabashnyuk
>> eXo Platform SAS
>>
>>
>>
--
Sergey Kabashnyuk
eXo Platform SAS



cdoronc at gmail

Aug 14, 2008, 7:42 AM

Post #17 of 47 (1054 views)
Permalink
Re: Case Sensitivity [In reply to]

>
> In example I want to show what I stored field as Field.Index.NO_NORMS
>
> As I understand it means what field contains original string
> despite what analyzer I chose(StandardAnalyzer by default).
>

This would be achieved by UN_TOKENIZED.

The NO_NORMS just guides Lucene to avoid normalizing
results by document length for this field (and to avoid allocating
resources for that).

Other than that I join Erick in wondering why all three options are needed.
It would help the list to help you if you provide
a few simple examples of: document, query, expected result.

Doron


erickerickson at gmail

Aug 14, 2008, 8:17 AM

Post #18 of 47 (1046 views)
Permalink
Re: Case Sensitivity [In reply to]

Be aware that StandardAnalyzer lowercases all the input,
both at index and query times. Field.Store.YES will store
the original text without any transformations, so doc.get(<field>)
will return the original text. However, no matter what the
Field.Store value, the *indexed* tokens (when you use
Field.Index.TOKENIZED)
are passed through the analyzer.

For instance, indexing "MIXed CasE TEXT" in a
field called "myfield" with Field.Store.YES,
Field.Index.TOKENIZED would index the
following tokens (with StandardAnalyzer).
mixed
case
text

and searches (with StandardAnalyzer) would match
any case in the query terms (e.g. MIXED would hit,
as would mixed as would CaSE).

However, doc.get("myfield") would return
"MIXed CasE TEXT"

As Doron said, though, a few use cases would
help us provide better answers.
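To make the stored-versus-indexed distinction above concrete, here is a small model in plain Java. It is a sketch, not Lucene code: the class `StoredVersusIndexed` is hypothetical, and its `analyze` method (whitespace split plus lowercasing) is only a simplified stand-in for what StandardAnalyzer does.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Toy model of one field that is both stored and indexed.
class StoredVersusIndexed {
    final String stored;         // returned unchanged, like a stored field value
    final List<String> indexed;  // analyzed tokens, which searches run against

    StoredVersusIndexed(String text) {
        this.stored = text;
        this.indexed = analyze(text);
    }

    // Simplified analysis: split on whitespace and lowercase.
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                tokens.add(token.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    // The query text goes through the same analysis before comparison,
    // so its case does not matter.
    boolean matches(String queryText) {
        List<String> queryTokens = analyze(queryText);
        return !queryTokens.isEmpty() && indexed.containsAll(queryTokens);
    }

    public static void main(String[] args) {
        StoredVersusIndexed field = new StoredVersusIndexed("MIXed CasE TEXT");
        System.out.println(field.indexed);          // lowercased tokens
        System.out.println(field.matches("MIXED")); // query case is folded too
        System.out.println(field.stored);           // original text, untouched
    }
}
```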

Best
Erick




andre.rubin at gmail

Aug 14, 2008, 3:16 PM

Post #19 of 47 (1034 views)
Permalink
Re: Case Sensitivity [In reply to]

Sergey,

Based on a recent discussion I posted:
http://www.nabble.com/Searching-Tokenized-x-Un_tokenized-td18882569.html
, you cannot use UN_TOKENIZED because no analyzer is run
over such a field.

My suggestion: use a tokenized field and a custom-made Analyzer.
I haven't figured out all the details for you, but I think it's possible.

Andre



ksmmlist at gmail

Aug 15, 2008, 12:51 AM

Post #20 of 47 (1025 views)
Permalink
Re: Case Sensitivity [In reply to]

Hello

Here is my use case. The content of the field in each document:

Doc1 - field "text" - "Field Without Norms"
Doc2 - field "text" - "field without norms"
Doc3 - field "text" - "FIELD WITHOUT NORMS"

Query                                             Expected result
1. new Term("text", "Field Without Norms")        doc1
2. new Term("text", "field without norms")        doc2
3. new Term("text", "FIELD WITHOUT NORMS")        doc3
4. lowercase("text", "field without norms")       doc1, doc2, doc3
5. uppercase("text", "FIELD WITHOUT NORMS")       doc1, doc2, doc3

I store the "text" field like this:
new Field("text", content, Field.Store.NO, Field.Index.NO_NORMS, Field.TermVector.NO)
using StandardAnalyzer, and queries 1-3 work exactly as I need. The
question is how to create queries 4-5.

Thanks
Sergey Kabashnyuk
eXo Platform SAS




cdoronc at gmail

Aug 16, 2008, 1:00 PM

Post #21 of 47 (1006 views)
Permalink
Re: Case Sensitivity [In reply to]

Hi Sergey, it seems that cases 4 and 5 are equivalent,
both meaning case-insensitive, right? Otherwise please
explain the difference.

If it is required to support both case-sensitive search
(cases 1, 2, 3) and case-insensitive search (case 4/5), then
both forms must be saved in the index, in two separate
fields (as Erick mentioned, I think).
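
A minimal sketch of the two-field idea, written as a plain-Java model rather than Lucene code. The class name, the field name "text_lc" and the in-memory maps are illustrative assumptions, not anything from the Lucene API. Each value is indexed once verbatim and once lowercased, and the kind of query decides which field is consulted:

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Plain-Java model of a two-field index; not Lucene API. */
class TwoFieldIndexSketch {
    // field name -> term -> ids of the documents containing that term
    private final Map<String, Map<String, Set<String>>> postings = new HashMap<>();

    /** Indexes the value verbatim in "text" and lowercased in "text_lc". */
    void add(String docId, String value) {
        put("text", value, docId);
        put("text_lc", value.toLowerCase(Locale.ROOT), docId);
    }

    private void put(String field, String term, String docId) {
        postings.computeIfAbsent(field, f -> new HashMap<>())
                .computeIfAbsent(term, t -> new TreeSet<>())
                .add(docId);
    }

    /** Case-sensitive lookup: the query term is used exactly as given. */
    Set<String> exact(String term) {
        return lookup("text", term);
    }

    /** Case-insensitive lookup: the query term is lowercased first. */
    Set<String> ignoreCase(String term) {
        return lookup("text_lc", term.toLowerCase(Locale.ROOT));
    }

    private Set<String> lookup(String field, String term) {
        Map<String, Set<String>> byTerm =
                postings.getOrDefault(field, Collections.<String, Set<String>>emptyMap());
        return byTerm.getOrDefault(term, Collections.<String>emptySet());
    }

    public static void main(String[] args) {
        TwoFieldIndexSketch index = new TwoFieldIndexSketch();
        index.add("doc1", "Field Without Norms");
        index.add("doc2", "field without norms");
        index.add("doc3", "FIELD WITHOUT NORMS");
        System.out.println(index.exact("Field Without Norms"));
        System.out.println(index.ignoreCase("FIELD WITHOUT NORMS"));
    }
}
```

In this model, queries 1-3 from Sergey's table go to the verbatim field, and queries 4 and 5 both go to the lowercased field, which is why they are equivalent. The cost is the extra field in the index.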

Hope this helps,
Doron



dckorah at gmail

Aug 19, 2008, 5:57 AM

Post #22 of 47 (957 views)
Permalink
RE: Case Sensitivity [In reply to]

Hi Guys,

From the discussion here, what I understand is that if I use
StandardAnalyzer on TOKENIZED fields for both indexing and querying, I
shouldn't have any problems with cases. But if I have any UN_TOKENIZED
fields, there will be problems unless I case-normalize them myself before
adding them as fields to the document.

In my case I have a mixed scenario. I am indexing emails, and the email
addresses are indexed UN_TOKENIZED. I also have a second set of
custom-tokenized fields, which keep the tokens in individual fields with the
same name.

For example, if the email had a from address "John Smith"
<J.Smith[at]world.net>, my document looks like this:

------------------8<----------------
to: ... - UN_TOKENIZED
from: J.Smith[at]world.net - UN_TOKENIZED
From-tokenized: John - UN_TOKENIZED
From-tokenized: Smith - UN_TOKENIZED
From-tokenized: J - UN_TOKENIZED
From-tokenized: Smith - UN_TOKENIZED
From-tokenized: world.net - UN_TOKENIZED
From-tokenized: world - UN_TOKENIZED
From-tokenized: net - UN_TOKENIZED
Subject: ... - TOKENIZED
Body: ... - TOKENIZED
------------------8<----------------

Does this mean that wherever I use UN_TOKENIZED, the values do not go through
the StandardAnalyzer before being indexed, but they do when they are searched
on? If that is the case, do I need to normalise them before adding them to the
document?

I would also like to know whether it is better to employ an EmailAnalyzer that
makes a TokenStream out of the given email address, rather than using a
simplistic function that gives me a list of string pieces and adding them
one by one. For searches, would both approaches give the same result?

Many thanks,
Dino





sarowe at syr

Aug 19, 2008, 9:42 AM

Post #23 of 47 (956 views)
Permalink
RE: Case Sensitivity [In reply to]

Hi Dino,

I think you'd benefit from reading some FAQ answers, like:

"Why is it important to use the same analyzer type during indexing and search?"
<http://wiki.apache.org/lucene-java/LuceneFAQ#head-0f374b0fe1483c90fe7d6f2c44472d10961ba63c>

Also, have a look at the AnalysisParalysis wiki page for some hints:
<http://wiki.apache.org/lucene-java/AnalysisParalysis>

On 08/19/2008 at 8:57 AM, Dino Korah wrote:
> From the discussion here what I could understand was, if I am using
> StandardAnalyzer on TOKENIZED fields, for both Indexing and Querying,
> I shouldn't have any problems with cases.

If by "shouldn't have problems with cases" you mean "can match case-insensitively", then this is true.

> But if I have any UN_TOKENIZED fields there will be problems if I do
> not case-normalize them myself before adding them as a field to the
> document.

Again, assuming that by "case-normalize" you mean "downcase", and that you want case-insensitive matching, and that you use the StandardAnalyzer (or some other downcasing analyzer) at query-time, then this is true.

> In my case I have a mixed scenario. I am indexing emails and the email
> addresses are indexed UN_TOKENIZED. I do have a second set of custom
> tokenized field, which keep the tokens in individual fields
> with same name.
[...]
> Does it mean that where ever I use UN_TOKENIZED, they do not get through
> the StandardAnalyzer before getting Indexed, but they do when they are
> searched on?

This is true.

> If that is the case, Do I need to normalise them before adding to
> document?

If you want case-insensitive matching, then yes, you do need to normalize them before adding them to the document.
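
To make the mismatch concrete, here is a small plain-Java model (not Lucene API; the class and method names are made up for illustration, and the address is the sample one from the thread). The field stores its value exactly as added, while the query side lowercases its term the way a downcasing analyzer would:

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/** Plain-Java model of an un-analyzed field; not Lucene API. */
class UntokenizedFieldSketch {
    // Terms of one un-analyzed field, stored exactly as they were added.
    private final Set<String> terms = new HashSet<>();

    /** Case normalization done by hand, since no analyzer runs on the field. */
    static String normalize(String value) {
        return value.toLowerCase(Locale.ROOT);
    }

    void addVerbatim(String value) {
        terms.add(value);
    }

    void addNormalized(String value) {
        terms.add(normalize(value));
    }

    /** Models a query whose term was lowercased by the query-time analyzer. */
    boolean matchesAnalyzedQuery(String userInput) {
        return terms.contains(normalize(userInput));
    }

    public static void main(String[] args) {
        UntokenizedFieldSketch raw = new UntokenizedFieldSketch();
        raw.addVerbatim("J.Smith@world.net");
        // Indexed term keeps its capitals, query term does not: no match.
        System.out.println(raw.matchesAnalyzedQuery("J.Smith@world.net"));

        UntokenizedFieldSketch normalized = new UntokenizedFieldSketch();
        normalized.addNormalized("J.Smith@world.net");
        // Both sides are lowercased: match.
        System.out.println(normalized.matchesAnalyzedQuery("J.Smith@world.net"));
    }
}
```

The point of the sketch is only that the same normalization has to be applied on both sides; whether it is done by hand before adding the field or by an analyzer is a separate choice.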

> I also would like to know if it is better to employ an EmailAnalyzer
> that makes a TokenStream out of the given email address, rather
> than using a simplistic function that gives me a list of string pieces
> and adding them one by one. With searches, would both the approaches
> give same result?

Yes, both approaches give the same result. When you add string pieces one-by-one, you are adding multiple same-named fields. By contrast, the EmailAnalyzer approach would add a single field, and would allow you to control positions (via Token.setPositionIncrement(): <http://lucene.apache.org/java/2_3_2/api/org/apache/lucene/analysis/Token.html#setPositionIncrement(int)>), e.g. to improve phrase handling. Also, if you make up an EmailAnalyzer, you can use it to search against your tokenized email field, along with other analyzer(s) on other field(s), using the PerFieldAnalyzerWrapper <http://lucene.apache.org/java/2_3_2/api/org/apache/lucene/analysis/PerFieldAnalyzerWrapper.html>.
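
As a rough illustration of the position control mentioned above, the following plain-Java sketch models what a hypothetical EmailAnalyzer might emit for one address. It is not Lucene code: the Tok class, the splitting rules and the choice of increments are all assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Plain-Java model of a hypothetical EmailAnalyzer's output; not Lucene API. */
class EmailTokensSketch {

    /** A term plus its position increment relative to the previous token. */
    static final class Tok {
        final String term;
        final int positionIncrement;

        Tok(String term, int positionIncrement) {
            this.term = term;
            this.positionIncrement = positionIncrement;
        }

        @Override
        public String toString() {
            return term + "(+" + positionIncrement + ")";
        }
    }

    /** Lowercases the address and splits it into local-part and domain pieces. */
    static List<Tok> tokenize(String address) {
        String lower = address.toLowerCase(Locale.ROOT);
        int at = lower.indexOf('@');
        if (at < 0) {
            throw new IllegalArgumentException("not an email address: " + address);
        }
        List<Tok> out = new ArrayList<>();
        // Each piece of the local part advances the position by one.
        for (String piece : lower.substring(0, at).split("\\.")) {
            out.add(new Tok(piece, 1));
        }
        String domain = lower.substring(at + 1);
        out.add(new Tok(domain, 1));
        // The first domain label shares the position of the full domain
        // (increment 0); later labels advance by one as usual.
        int increment = 0;
        for (String label : domain.split("\\.")) {
            out.add(new Tok(label, increment));
            increment = 1;
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("J.Smith@world.net"));
    }
}
```

Adding the pieces one by one as same-named fields gives the same set of terms, but an analyzer-style token stream is where increments like these can be set.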

Steve



dckorah at gmail

Aug 20, 2008, 8:21 AM

Post #24 of 47 (937 views)
Permalink
RE: Case Sensitivity [In reply to]

Hi Steve,

Thanks a lot for that.

I have a question on TokenStreams and email addresses, but I will post it
in a separate thread.

Many thanks,
Dino




andre.rubin at gmail

Aug 21, 2008, 12:21 AM

Post #25 of 47 (931 views)
Permalink
Re: Case Sensitivity [In reply to]

Just to add to that: as I said before, in my case I found it more useful not
to use UN_TOKENIZED. Instead, I used TOKENIZED with a custom analyzer that
uses the KeywordTokenizer (the entire input as a single token) with the
LowerCaseFilter. This way I get the best of both worlds.

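import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.KeywordTokenizer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;

// Treats the whole field value as one token and lowercases it.
public class KeywordLowerAnalyzer extends Analyzer {

    public KeywordLowerAnalyzer() {
    }

    public TokenStream tokenStream(String fieldName, Reader reader) {
        // KeywordTokenizer emits the entire input as a single token.
        TokenStream result = new KeywordTokenizer(reader);
        // LowerCaseFilter then lowercases that token.
        result = new LowerCaseFilter(result);
        return result;
    }

}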

