Mailing List Archive: Lucene: Java-User

Case Sensitivity


dckorah at gmail

Aug 22, 2008, 2:50 AM

Post #26 of 47
RE: Case Sensitivity

That is very clever. With that, the text we index still passes through the
analyser, but it does not get split into separate tokens. The query text hits
the analyser the same way when we search, again as a single token.

Brilliant!!


-----Original Message-----
From: Andre Rubin [mailto:andre.rubin [at] gmail]
Sent: 21 August 2008 08:21
To: java-user [at] lucene
Subject: Re: Case Sensitivity

Just to add to that: as I said before, in my case I found it more useful not
to use UN_TOKENIZED. Instead, I used TOKENIZED with a custom analyzer that
uses the KeywordTokenizer (the entire input as a single token) followed by the
LowerCaseFilter. This way I get the best of both worlds.

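import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.KeywordTokenizer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;

// Emits the entire field value as a single, lowercased token.
public class KeywordLowerAnalyzer extends Analyzer {

    public TokenStream tokenStream(String fieldName, Reader reader) {
        // KeywordTokenizer returns the whole input as one token.
        TokenStream result = new KeywordTokenizer(reader);
        // LowerCaseFilter then downcases that token.
        result = new LowerCaseFilter(result);
        return result;
    }

}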
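
To illustrate why this matches case-insensitively when the same analyzer is used for indexing and searching, here is a minimal plain-Java sketch. It models the analyzer rather than calling it: the class and method names are hypothetical, it does not use the Lucene API, and it only assumes that the analyzer emits the whole input as one lowercased token.

```java
import java.util.Locale;

// Plain-Java model of a keyword + lowercase analysis chain.
// Hypothetical helper for illustration only; this is not Lucene code.
final class KeywordLowerModel {

    // Model of the analyzer: the whole input becomes one token, lowercased.
    static String analyze(String input) {
        return input.toLowerCase(Locale.ROOT);
    }

    // Index value and query text go through the same analysis, and a match
    // requires the two resulting tokens to be equal.
    static boolean matches(String indexedValue, String queryText) {
        return analyze(indexedValue).equals(analyze(queryText));
    }

    public static void main(String[] args) {
        System.out.println(analyze("Dino.Korah@Example.COM"));
        System.out.println(matches("Dino.Korah@Example.COM", "dino.korah@example.com"));
    }
}
```

Because the value is never split, a query for only part of the value (for example "dino") would not match in this model.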

On Wed, Aug 20, 2008 at 10:21 AM, Dino Korah <dckorah [at] gmail> wrote:
> Hi Steve,
>
> Thanks a lot for that.
>
> I have a question on TokenStreams and email addresses, but I will post
> it on a separate thread.
>
> Many thanks,
> Dino
>
>
> -----Original Message-----
> From: Steven A Rowe [mailto:sarowe [at] syr]
> Sent: 19 August 2008 17:43
> To: java-user [at] lucene
> Subject: RE: Case Sensitivity
>
> Hi Dino,
>
> I think you'd benefit from reading some FAQ answers, like:
>
> "Why is it important to use the same analyzer type during indexing and
> search?"
> <http://wiki.apache.org/lucene-java/LuceneFAQ#head-0f374b0fe1483c90fe7d6f2c44472d10961ba63c>
>
> Also, have a look at the AnalysisParalysis wiki page for some hints:
> <http://wiki.apache.org/lucene-java/AnalysisParalysis>
>
> On 08/19/2008 at 8:57 AM, Dino Korah wrote:
>> From the discussion here what I could understand was, if I am using
>> StandardAnalyzer on TOKENIZED fields, for both Indexing and Querying,
>> I shouldn't have any problems with cases.
>
> If by "shouldn't have problems with cases" you mean "can match
> case-insensitively", then this is true.
>
>> But if I have any UN_TOKENIZED fields there will be problems if I do
>> not case-normalize them myself before adding them as a field to the
>> document.
>
> Again, assuming that by "case-normalize" you mean "downcase", and that
> you want case-insensitive matching, and that you use the
> StandardAnalyzer (or some other downcasing analyzer) at query-time, then
> this is true.
>
>> In my case I have a mixed scenario. I am indexing emails and the
>> email addresses are indexed UN_TOKENIZED. I do have a second set of
>> custom tokenized field, which keep the tokens in individual fields
>> with same name.
> [...]
>> Does it mean that where ever I use UN_TOKENIZED, they do not get
>> through the StandardAnalyzer before getting Indexed, but they do when
>> they are searched on?
>
> This is true.
>
>> If that is the case, Do I need to normalise them before adding to
>> document?
>
> If you want case-insensitive matching, then yes, you do need to
> normalize them before adding them to the document.
>
>> I also would like to know if it is better to employ an EmailAnalyzer
>> that makes a TokenStream out of the given email address, rather than
>> using a simplistic function that gives me a list of string pieces and
>> adding them one by one. With searches, would both the approaches give
>> same result?
>
> Yes, both approaches give the same result. When you add string pieces
> one-by-one, you are adding multiple same-named fields. By contrast,
> the EmailAnalyzer approach would add a single field, and would allow
> you to control positions (via Token.setPositionIncrement():
> <http://lucene.apache.org/java/2_3_2/api/org/apache/lucene/analysis/Token.html#setPositionIncrement(int)>),
> e.g. to improve phrase handling.
> Also, if you make up an EmailAnalyzer, you can use it to search
> against your tokenized email field, along with other analyzer(s) on
> other field(s), using the PerFieldAnalyzerWrapper
> <http://lucene.apache.org/java/2_3_2/api/org/apache/lucene/analysis/PerFieldAnalyzerWrapper.html>.
>
> Steve
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe [at] lucene
> For additional commands, e-mail: java-user-help [at] lucene
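
The per-field idea in Steve's reply above can be sketched in plain Java. This is only a model of the dispatch, under the assumption that each field name maps to its own analysis function with a default for every other field. All names here (PerFieldModel, sample, the "email" and "body" fields) are hypothetical, and the whitespace-splitting default is a stand-in, not the behaviour of StandardAnalyzer or the real PerFieldAnalyzerWrapper API.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;

// Plain-Java model of per-field analysis: a field may have its own analysis
// function; all other fields fall back to a default. Not Lucene code.
final class PerFieldModel {
    private final Function<String, List<String>> defaultAnalyzer;
    private final Map<String, Function<String, List<String>>> perField = new HashMap<>();

    PerFieldModel(Function<String, List<String>> defaultAnalyzer) {
        this.defaultAnalyzer = defaultAnalyzer;
    }

    void addAnalyzer(String fieldName, Function<String, List<String>> analyzer) {
        perField.put(fieldName, analyzer);
    }

    // Pick the analysis function registered for the field, else the default.
    List<String> analyze(String fieldName, String text) {
        return perField.getOrDefault(fieldName, defaultAnalyzer).apply(text);
    }

    // Sample setup: lowercase + whitespace split by default (a simplification),
    // and a single lowercased token for the hypothetical "email" field.
    static PerFieldModel sample() {
        PerFieldModel model = new PerFieldModel(
                s -> Arrays.asList(s.toLowerCase(Locale.ROOT).split("\\s+")));
        model.addAnalyzer("email",
                s -> Collections.singletonList(s.toLowerCase(Locale.ROOT)));
        return model;
    }

    public static void main(String[] args) {
        PerFieldModel model = sample();
        System.out.println(model.analyze("body", "Hello World"));
        System.out.println(model.analyze("email", "Dino.Korah@Example.com"));
    }
}
```

The point of the design is that the same wrapper object can be handed to both the indexing side and the query side, so each field is analyzed consistently in both places.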


dckorah at gmail

Aug 26, 2008, 4:11 AM

Post #27 of 47
RE: Case Sensitivity

A few more case-sensitivity questions.

Based on the discussion at http://markmail.org/message/q7dqr4r7o6t6dgo5 and
on this thread, is it right to say that a field, if it is either UN_TOKENIZED
or NO_NORMS-ized, does not get analyzed while indexing? Does that mean we need
to case-normalize (down-case) those fields beforehand?

Does it mean that, if I can afford it, I should use norms?

Many thanks,
Dino





dckorah at gmail

Aug 26, 2008, 6:17 AM

Post #28 of 47
RE: Case Sensitivity

I think I should rephrase my question.

[Context: using the out-of-the-box StandardAnalyzer for indexing and
searching.]

Is it right to say that a field, if it is either UN_TOKENIZED or NO_NORMS-ized
(field.setOmitNorms(true)), does not get analyzed while indexing? Does that
mean that, because the query text does go through the analyzer when we search,
we need to analyze those fields differently in the analyzer we use for
searching? And doesn't that mean that a setOmitNorms(true) field also does not
get tokenized?

What is the best solution if one were to add one set of fields UN_TOKENIZED
and others TOKENIZED, a few of the latter with setOmitNorms(true) (the index
writer is plain StandardAnalyzer based)? A per-field analyzer at query time?

Many thanks,
Dino




otis_gospodnetic at yahoo

Aug 26, 2008, 10:30 PM

Post #29 of 47
Re: Case Sensitivity

Dino, you lost me half-way through your email :(

NO_NORMS does not mean the field is not tokenized.
UN_TOKENIZED does mean the field is not tokenized.


Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch





otis_gospodnetic at yahoo

Aug 26, 2008, 10:32 PM

Post #30 of 47
Re: Case Sensitivity

Dino,

If a field is not tokenized then it is indexed as is.
For example: "Dino Korah" would get indexed just like that. It would not get split into multiple tokens, it would not be lowercased, it would not have any stop words removed from it, etc.

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
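
As a rough illustration of the point above, here is a plain-Java sketch that models an untokenized field as a set of exact terms. It is not Lucene code and the class and method names are hypothetical; the only assumption is that an untokenized value is stored verbatim and looked up by exact, case-sensitive equality.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Plain-Java model of an untokenized field: the value goes into the term
// dictionary exactly as given. Hypothetical names; not Lucene code.
final class UntokenizedModel {
    private final Set<String> terms = new HashSet<>();

    // Index the value verbatim: no splitting, no lowercasing, no stop words.
    void addUntokenized(String value) {
        terms.add(value);
    }

    // Exact, case-sensitive term lookup.
    boolean hasTerm(String term) {
        return terms.contains(term);
    }

    public static void main(String[] args) {
        UntokenizedModel index = new UntokenizedModel();
        index.addUntokenized("Dino Korah");
        System.out.println(index.hasTerm("Dino Korah")); // exact form
        System.out.println(index.hasTerm("dino korah")); // downcased form

        // Downcasing before adding makes the downcased lookup succeed.
        index.addUntokenized("Dino Korah".toLowerCase(Locale.ROOT));
        System.out.println(index.hasTerm("dino korah"));
    }
}
```

In this model, a query term that has been downcased by an analyzer finds nothing unless the value was downcased before it was added.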





lucene at mikemccandless

Aug 27, 2008, 2:36 AM

Post #31 of 47
Re: Case Sensitivity

Actually, as confusing as it is, Field.Index.NO_NORMS means
Field.Index.UN_TOKENIZED plus field.setOmitNorms(true).

Probably we should rename it to Field.Index.UN_TOKENIZED_NO_NORMS?

Mike
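
The relationship described above can be written down as a small plain-Java model. The enum name, its two flags, and the helper method are hypothetical and are not the Lucene Field.Index class; only the relationships between the three options come from this thread.

```java
// Plain-Java model of the index options discussed above. Hypothetical enum;
// it only records whether a mode tokenizes and whether it omits norms.
enum IndexModeModel {
    TOKENIZED(true, false),
    UN_TOKENIZED(false, false),
    NO_NORMS(false, true); // untokenized, and norms are omitted as well

    final boolean tokenized;
    final boolean omitNorms;

    IndexModeModel(boolean tokenized, boolean omitNorms) {
        this.tokenized = tokenized;
        this.omitNorms = omitNorms;
    }

    // True when a value indexed in this mode is not run through the analyzer.
    boolean bypassesAnalyzer() {
        return !tokenized;
    }

    public static void main(String[] args) {
        for (IndexModeModel mode : values()) {
            System.out.println(mode + " tokenized=" + mode.tokenized
                    + " omitNorms=" + mode.omitNorms);
        }
    }
}
```

Read this way, the two flags are independent, and the name NO_NORMS mentions only one of the two things it changes.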

Otis Gospodnetic wrote:

> Dino, you lost me half-way through your email :(
>
> NO_NORMS does not mean the field is not tokenized.
> UN_TOKENIZED does mean the field is not tokenized.
>
>


dckorah at gmail

Aug 27, 2008, 3:40 AM

Post #32 of 47
RE: Case Sensitivity

Thanks Otis & Mike.

Probably we should keep it the way it is now. It would be better to include
more information on the various combinations of these options and their
effect on the final result (the set of terms that get into the index). It
would be nicer still if we could mention the search scenario as well. To be
honest, it took me a while to get a grip on it.

On the same topic, what would be the effect of the following code?

Document doc = new Document();
Field f = new Field("body", bodyText, Field.Store.NO, Field.Index.TOKENIZED);
f.setOmitNorms(true);

Would that be equivalent to

Document doc = new Document();
Field f = new Field("body", bodyText, Field.Store.NO, Field.Index.NO_NORMS);

so that Field.Index.TOKENIZED has no effect after f.setOmitNorms(true)?


Many thanks,
Dino


-----Original Message-----
From: Michael McCandless [mailto:lucene [at] mikemccandless]
Sent: 27 August 2008 10:37
To: java-user [at] lucene
Subject: Re: Case Sensitivity


Actually, as confusing as it is, Field.Index.NO_NORMS means
Field.Index.UN_TOKENIZED plus field.setOmitNorms(true).

Probably we should rename it to Field.Index.UN_TOKENiZED_NO_NORMS?

Mike

Otis Gospodnetic wrote:

> Dino, you lost me half-way through your email :(
>
> NO_NORMS does not mean the field is not tokenized.
> UN_TOKENIZED does mean the field is not tokenized.
>
>
> Otis--
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>
>
>
> ----- Original Message ----
>> From: Dino Korah <dckorah [at] gmail>
>> To: java-user [at] lucene
>> Sent: Tuesday, August 26, 2008 9:17:49 AM
>> Subject: RE: Case Sensitivity
>>
>> I think I should rephrase my question.
>>
>> [. Context: Using out of the box StandardAnalyzer for indexing and
>> searching.
>> ]
>>
>> Is it right to say that a field, if either UN_TOKENIZED or NO_NORMS-
>> ized (
>> field.setOmitNorms(true) ), it doesn't get analyzed while indexing?
>> Which means that when we search, it gets thru the analyzer and we
>> need to analyze them differently in the analyzer we use for
>> searching?
>> Doesn't it mean that a setOmitNorms(true) field also doesn't get
>> tokenized?
>>
>> What is the best solution if one was to add a set of fields
>> UN_TOKENIZED and others TOKENIZED, of the later set a few with
>> setOmitNorms(true) (the index writer is plain StandardAnalyzer
>> based)? A per field analyzer at query time ?!
>>
>> Many thanks,
>> Dino
>>
>>
>> -----Original Message-----
>> From: Dino Korah [mailto:dckorah [at] gmail]
>> Sent: 26 August 2008 12:12
>> To: 'java-user [at] lucene'
>> Subject: RE: Case Sensitivity
>>
>> A little more case sensitivity questions.
>>
>> Based on the discussion on
>> http://markmail.org/message/q7dqr4r7o6t6dgo5
>> and
>> on this thread, is it right to say that a field, if either
>> UN_TOKENIZED or NO_NORMS-ized, it doesn't get analyzed while
>> indexing? Which means we need to case-normalize (down-case) those
>> fields before hand?
>>
>> Doest it mean that if I can afford, I should use norms.
>>
>> Many thanks,
>> Dino
>>
>>
>>
>> -----Original Message-----
>> From: Steven A Rowe [mailto:sarowe [at] syr]
>> Sent: 19 August 2008 17:43
>> To: java-user [at] lucene
>> Subject: RE: Case Sensitivity
>>
>> Hi Dino,
>>
>> I think you'd benefit from reading some FAQ answers, like:
>>
>> "Why is it important to use the same analyzer type during indexing
>> and search?"
>> <http://wiki.apache.org/lucene-java/LuceneFAQ#head-0f374b0fe1483c90fe7d6f2c44472d10961ba63c>
>>
>> Also, have a look at the AnalysisParalysis wiki page for some hints:
>>
>>
>> On 08/19/2008 at 8:57 AM, Dino Korah wrote:
>>> From the discussion here what I could understand was, if I am using
>>> StandardAnalyzer on TOKENIZED fields, for both Indexing and
>>> Querying, I shouldn't have any problems with cases.
>>
>> If by "shouldn't have problems with cases" you mean "can match
>> case-insensitively", then this is true.
>>
>>> But if I have any UN_TOKENIZED fields there will be problems if I do
>>> not case-normalize them myself before adding them as a field to the
>>> document.
>>
>> Again, assuming that by "case-normalize" you mean "downcase", and
>> that you want case-insensitive matching, and that you use the
>> StandardAnalyzer (or some other downcasing analyzer) at query-time,
>> then this is true.
>>
>>> In my case I have a mixed scenario. I am indexing emails and the
>>> email addresses are indexed UN_TOKENIZED. I do have a second set of
>>> custom tokenized field, which keep the tokens in individual fields
>>> with same name.
>> [...]
>>> Does it mean that where ever I use UN_TOKENIZED, they do not get
>>> through the StandardAnalyzer before getting Indexed, but they do
>>> when they are searched on?
>>
>> This is true.
>>
>>> If that is the case, Do I need to normalise them before adding to
>>> document?
>>
>> If you want case-insensitive matching, then yes, you do need to
>> normalize them before adding them to the document.
>>
>>> I also would like to know if it is better to employ an EmailAnalyzer
>>> that makes a TokenStream out of the given email address, rather than
>>> using a simplistic function that gives me a list of string pieces
>>> and adding them one by one. With searches, would both the approaches
>>> give same result?
>>
>> Yes, both approaches give the same result. When you add string
>> pieces one-by-one, you are adding multiple same-named fields. By
>> contrast, the EmailAnalyzer approach would add a single field, and
>> would allow you to control positions (via
>> Token.setPositionIncrement()), e.g. to improve phrase handling.
>> Also, if you make up an EmailAnalyzer, you can use it to search
>> against your tokenized email field, along with other analyzer(s) on
>> other field(s), using the PerFieldAnalyzerWrapper.
>>
>> Steve
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe [at] lucene
>> For additional commands, e-mail: java-user-help [at] lucene


lucenelist2007 at danielnaber

Aug 27, 2008, 3:47 AM

Post #33 of 47 (1443 views)
Permalink
Re: Case Sensitivity [In reply to]

On Wednesday, 27 August 2008, Michael McCandless wrote:

> Probably we should rename it to Field.Index.UN_TOKENIZED_NO_NORMS?

I think it's enough if the api doc explains it, no need to rename it.
What's more confusing is that (UN_)TOKENIZED should actually be called
(UN_)ANALYZED IMHO.

Regards
Daniel

--
http://www.danielnaber.de



lucene at mikemccandless

Aug 27, 2008, 4:37 AM

Post #34 of 47 (1429 views)
Permalink
Re: Case Sensitivity [In reply to]

Or ... split the two notions apart so that you have Field.Index.
[UN_]ANALYZED and, separately, Field.Index.[NO_]NORMS which could then
be combined together in all 4 combinations (we'd have to fix the
Parameter class to let you build up a new Parameter by combining
existing ones...).
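
A rough sketch of what "splitting the two notions apart" could look like, written in plain Java rather than against the Lucene classes. The names `Analysis`, `Norms` and `IndexOptions` are hypothetical and exist only for this illustration; the sketch does not use the Parameter class mentioned above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model: two independent choices instead of one combined constant.
enum Analysis { ANALYZED, NOT_ANALYZED }

enum Norms { NORMS, NO_NORMS }

// Pairs one choice from each axis. Not a Lucene class.
final class IndexOptions {
    final Analysis analysis;
    final Norms norms;

    IndexOptions(Analysis analysis, Norms norms) {
        this.analysis = analysis;
        this.norms = norms;
    }

    // Enumerates every combination of the two axes (2 x 2 = 4).
    static List<IndexOptions> allCombinations() {
        List<IndexOptions> all = new ArrayList<>();
        for (Analysis a : Analysis.values()) {
            for (Norms n : Norms.values()) {
                all.add(new IndexOptions(a, n));
            }
        }
        return all;
    }

    @Override
    public String toString() {
        return analysis + "+" + norms;
    }
}
```

With two axes kept separate, the four combinations fall out of the type system instead of needing four hand-named constants.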

I think naming things well is just as important as good javadocs
explaining things.

But: I think these changes should probably wait until we work out how
to refactor AbstractField/Fieldable/Field?

Mike


otis_gospodnetic at yahoo

Aug 27, 2008, 8:08 AM

Post #35 of 47 (1432 views)
Permalink
Re: Case Sensitivity [In reply to]

Nah, I think the names are fine, I simply forgot. I looked at the javadocs, it clearly says NO_NORMS doesn't get passed through an Analyzer. Maybe in 3.0 we can switch to NOT_ANALYZED, as suggested, to reflect reality more closely.


Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



----- Original Message ----
> From: Michael McCandless <lucene [at] mikemccandless>
> To: java-user [at] lucene
> Sent: Wednesday, August 27, 2008 5:36:46 AM
> Subject: Re: Case Sensitivity
>
>
> Actually, as confusing as it is, Field.Index.NO_NORMS means
> Field.Index.UN_TOKENIZED plus field.setOmitNorms(true).
>
> Probably we should rename it to Field.Index.UN_TOKENIZED_NO_NORMS?
>
> Mike
>
> Otis Gospodnetic wrote:
>
> > Dino, you lost me half-way through your email :(
> >
> > NO_NORMS does not mean the field is not tokenized.
> > UN_TOKENIZED does mean the field is not tokenized.


lucene at mikemccandless

Aug 27, 2008, 4:26 PM

Post #36 of 47 (1413 views)
Permalink
Re: Case Sensitivity [In reply to]

OK, I'll open an issue to do this renaming in 3.0, which actually means
we do the renaming in 2.4 or 2.9 (deprecating the old names) and then
remove the old names in 3.0.
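
As a sketch of that deprecation path: the old constant names can stay for one release cycle as aliases of the new ones. `IndexMode` below is a hypothetical stand-in, not the real Field.Index class, and only the names come from this thread.

```java
// Hypothetical stand-in for Field.Index; not the real Lucene class.
final class IndexMode {
    private final String name;

    private IndexMode(String name) {
        this.name = name;
    }

    // New names, as proposed in the thread.
    static final IndexMode ANALYZED = new IndexMode("ANALYZED");
    static final IndexMode NOT_ANALYZED = new IndexMode("NOT_ANALYZED");

    // Old names kept as aliases for one deprecation cycle, then removed.
    @Deprecated
    static final IndexMode TOKENIZED = ANALYZED;
    @Deprecated
    static final IndexMode UN_TOKENIZED = NOT_ANALYZED;

    @Override
    public String toString() {
        return name;
    }
}
```

Because the old names point at the same objects, existing code keeps compiling (with a deprecation warning) and behaves identically until the aliases are dropped.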

Mike



otis_gospodnetic at yahoo

Aug 28, 2008, 9:54 AM

Post #37 of 47 (1401 views)
Permalink
Re: Case Sensitivity [In reply to]

So in other words, it *is* possible to have the field both tokenized and its norms omitted?


Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



----- Original Message ----
> From: Karl Wettin <karl.wettin [at] gmail>
> To: java-user [at] lucene
> Sent: Thursday, August 28, 2008 5:52:54 AM
> Subject: Re: Case Sensitivity
>
>
> On 28 Aug 2008, at 11:46, Andrzej Bialecki wrote:
>
> > Karl Wettin wrote:
> >> On 28 Aug 2008, at 10:58, Dino Korah wrote:
> >>> Document doc = new Document();
> >>> Field f = new Field("body", bodyText, Field.Store.NO,
> >>> Field.Index.TOKENIZED);
> >>> f.setOmitNorms(true);
> >>>
> >>> Would that be equivalent to
> >>>
> >>> Document doc = new Document();
> >>> Field f = new Field("body", bodyText,
> >>> Field.Store.NO, Field.Index.NO_NORMS);
> >>>
> >>> And Field.Index.TOKENIZED has no effect after
> >>> f.setOmitNorms(true); ?
> >> Yes, those two have the same effect.
> >
> > I don't think so - these two scenarios are different.
> >
> > When you create a Field using Index.NO_NORMS, the constructor makes
> > sure that:
> > isIndexed = true;
> > isTokenized = false;
> > omitNorms = true;
> >
> > When you create a Field using Index.TOKENIZED, the constructor sets
> > these flags:
> > isIndexed = true;
> > isTokenized = true;
> >
> > Then, when you call setOmitNorms(true), it does NOT affect
> > isTokenized, it sets only omitNorms. So the flags are set now like
> > this:
> > isIndexed = true;
> > isTokenized = true;
> > omitNorms = true;
> >
> > The end result of processing such a field is (I believe)
> > conceptually equivalent to adding as many Fields as there are
> > tokens, each with omitNorms=true.
>
> Oh, you are of course right, I was too quick to read. Sorry.
>
>
> karl


ab at getopt

Aug 28, 2008, 10:39 AM

Post #38 of 47 (1407 views)
Permalink
Re: Case Sensitivity [In reply to]

Otis Gospodnetic wrote:
> So in other words, it *is* possible to have the field both tokenized and its norms omitted?

Yes. Probably this is an unintended side-effect of adding setOmitNorms,
but I think it's useful and IMHO we should keep it.
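
A small plain-Java model of the three flags discussed in this thread (isIndexed, isTokenized, omitNorms) may make the difference easier to see. `FieldFlags` is a hypothetical class that only mirrors the behaviour as described in the thread; it is not Lucene's Field implementation.

```java
// Plain-Java model of the three flags; FieldFlags is hypothetical and
// only mirrors the behaviour described in the thread.
final class FieldFlags {
    boolean isIndexed;
    boolean isTokenized;
    boolean omitNorms;

    // Models constructing a field with Index.NO_NORMS.
    static FieldFlags noNorms() {
        FieldFlags f = new FieldFlags();
        f.isIndexed = true;
        f.isTokenized = false;
        f.omitNorms = true;
        return f;
    }

    // Models constructing a field with Index.TOKENIZED.
    static FieldFlags tokenized() {
        FieldFlags f = new FieldFlags();
        f.isIndexed = true;
        f.isTokenized = true;
        return f;
    }

    // Models setOmitNorms: changes omitNorms only, never isTokenized.
    void setOmitNorms(boolean omit) {
        this.omitNorms = omit;
    }
}
```

In this model, `FieldFlags.tokenized()` followed by `setOmitNorms(true)` ends up with isTokenized still true, whereas `FieldFlags.noNorms()` has isTokenized false, so the two are not equivalent.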


--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com




otis_gospodnetic at yahoo

Aug 28, 2008, 10:42 AM

Post #39 of 47 (1402 views)
Permalink
Re: Case Sensitivity [In reply to]

Yes. And I think I have used this "trick" a couple of years ago, but have since forgotten about it. :)

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch





lucene at mikemccandless

Aug 28, 2008, 10:44 AM

Post #40 of 47 (1407 views)
Permalink
Re: Case Sensitivity [In reply to]

In fact I plan to add it as Field.Index.ANALYZED_NO_NORMS, in this
issue:

https://issues.apache.org/jira/browse/LUCENE-1366

Mike



ab at getopt

Aug 28, 2008, 10:50 AM

Post #41 of 47 (1398 views)
Permalink
Re: Case Sensitivity [In reply to]

Michael McCandless wrote:
>
> In fact I plan to add it as Field.Index.ANALYZED_NO_NORMS, in this issue:
>
> https://issues.apache.org/jira/browse/LUCENE-1366

This has consequences when searching - so if we expose it the javadoc
has to be really good at explaining what's going on :)


--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com




yonik at apache

Aug 28, 2008, 10:50 AM

Post #42 of 47 (1403 views)
Permalink
Re: Case Sensitivity [In reply to]

On Thu, Aug 28, 2008 at 1:44 PM, Michael McCandless
<lucene [at] mikemccandless> wrote:
>
> In fact I plan to add it as Field.Index.ANALYZED_NO_NORMS, in this issue:

I wasn't originally going to add a Field.Index at all for omitNorms,
but Doug suggested it.
The problem with this type-safe way of doing things is the
combinatorial explosion.

-Yonik



lucene at mikemccandless

Aug 28, 2008, 11:16 AM

Post #43 of 47 (1401 views)
Permalink
Re: Case Sensitivity [In reply to]

Yonik Seeley wrote:

> On Thu, Aug 28, 2008 at 1:44 PM, Michael McCandless
> <lucene [at] mikemccandless> wrote:
>>
>> In fact I plan to add it as Field.Index.ANALYZED_NO_NORMS, in this
>> issue:
>
> I wasn't originally going to add a Field.Index at all for omitNorms,
> but Doug suggested it.
> The problem with this type-safe way of doing things is the
> combinatorial explosion.

Yeah I realize that. Now that we have omitTF as an option we could
really go crazy ;)

I figured since we already have NOT_ANALYZED_NO_NORMS we may as well
round it out with ANALYZED_NO_NORMS, and then stop there. Plus,
people have been surprised that you could do ANALYZED_NO_NORMS, yet it
is useful.

Mike



lucene at mikemccandless

Aug 28, 2008, 11:17 AM

Post #44 of 47 (1398 views)
Permalink
Re: Case Sensitivity [In reply to]

Andrzej Bialecki wrote:

> Michael McCandless wrote:
>> In fact I plan to add it as Field.Index.ANALYZED_NO_NORMS, in this
>> issue:
>> https://issues.apache.org/jira/browse/LUCENE-1366
>
> This has consequences when searching - so if we expose it the
> javadoc has to be really good at explaining what's going on :)

Agreed, I'll fix the javadocs and mark these as Expert.

Mike



anthony.urso at gmail

Sep 11, 2008, 6:47 PM

Post #45 of 47 (1221 views)
Permalink
Re: Case Sensitivity [In reply to]

On Thu, Aug 28, 2008 at 11:16 AM, Michael McCandless
<lucene [at] mikemccandless> wrote:
>
> Yonik Seeley wrote:
>>
>> I wasn't originally going to add a Field.Index at all for omitNorms,
>> but Doug suggested it.
>> The problem with this type-safe way of doing things is the
>> combinatorial explosion.
>
> Yeah I realize that. Now that we have omitTF as an option we could really
> go crazy ;)
>
> I figured since we already have NOT_ANALYZED_NO_NORMS we may as well round
> it out with ANALYZED_NO_NORMS, and then stop there. Plus, people have been
> surprised that you could do ANALYZED_NO_NORMS, yet it is useful.

Why not make this flag field into a bitmap?

Cheers,
Anthony



lucene at mikemccandless

Sep 19, 2008, 5:11 AM

Post #46 of 47 (1188 views)
Permalink
Re: Case Sensitivity [In reply to]

Anthony Urso wrote:

> On Thu, Aug 28, 2008 at 11:16 AM, Michael McCandless
> <lucene [at] mikemccandless> wrote:
>>
>> Yonik Seeley wrote:
>>>
>>> I wasn't originally going to add a Field.Index at all for omitNorms,
>>> but Doug suggested it.
>>> The problem with this type-safe way of doing things is the
>>> combinatorial explosion.
>>
>> Yeah I realize that. Now that we have omitTF as an option we could
>> really
>> go crazy ;)
>>
>> I figured since we already have NOT_ANALYZED_NO_NORMS we may as
>> well round
>> it out with ANALYZED_NO_NORMS, and then stop there. Plus, people
>> have been
>> surprised that you could do ANALYZED_NO_NORMS, yet it is useful.
>
> Why not make this flag field into a bitmap?

I think that makes sense, at some point in the future (when we clean
up Fieldable/AbstractField/Field?). This way you can OR together
things like NORMS/NO_NORMS, ANALYZED/NOT_ANALYZED, INCLUDE_TF/OMIT_TF,
etc.
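
A minimal sketch of the bitmap idea, assuming plain int bit flags. The class name `IndexFlags`, the bit values, and the convention that an unset bit means the default (NOT_ANALYZED, NORMS, INCLUDE_TF) are all made up for illustration; only the option names come from this thread.

```java
// Hypothetical bit flags; names follow the thread, values are made up.
final class IndexFlags {
    static final int ANALYZED = 1;       // unset bit means NOT_ANALYZED
    static final int NO_NORMS = 1 << 1;  // unset bit means NORMS
    static final int OMIT_TF  = 1 << 2;  // unset bit means INCLUDE_TF

    private IndexFlags() {}

    // True when the given flag bit is set in the combined value.
    static boolean has(int flags, int flag) {
        return (flags & flag) != 0;
    }

    // Spells out the choice made on each of the three axes.
    static String describe(int flags) {
        return (has(flags, ANALYZED) ? "ANALYZED" : "NOT_ANALYZED")
            + "|" + (has(flags, NO_NORMS) ? "NO_NORMS" : "NORMS")
            + "|" + (has(flags, OMIT_TF) ? "OMIT_TF" : "INCLUDE_TF");
    }
}
```

A caller would write something like `IndexFlags.ANALYZED | IndexFlags.NO_NORMS`. Three independent options give eight combinations from three constants, which avoids naming each combination separately, at the cost of the compile-time checking the type-safe constants provide.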

Mike



ab at getopt

Sep 19, 2008, 5:20 AM

Post #47 of 47 (1187 views)
Permalink
Re: Case Sensitivity [In reply to]

Michael McCandless wrote:
>
> Anthony Urso wrote:
>
>> On Thu, Aug 28, 2008 at 11:16 AM, Michael McCandless
>> <lucene [at] mikemccandless> wrote:
>>>
>>> Yonik Seeley wrote:
>>>>
>>>> I wasn't originally going to add a Field.Index at all for omitNorms,
>>>> but Doug suggested it.
>>>> The problem with this type-safe way of doing things is the
>>>> combinatorial explosion.
>>>
>>> Yeah I realize that. Now that we have omitTF as an option we could
>>> really
>>> go crazy ;)
>>>
>>> I figured since we already have NOT_ANALYZED_NO_NORMS we may as well
>>> round
>>> it out with ANALYZED_NO_NORMS, and then stop there. Plus, people
>>> have been
>>> surprised that you could do ANALYZED_NO_NORMS, yet it is useful.
>>
>> Why not make this flag field into a bitmap?
>
> I think that makes sense, at some point in the future (when we clean up
> Fieldable/AbstractField/Field?). This way you can OR together things
> like NORMS/NO_NORMS, ANALYZED/NOT_ANALYZED, INCLUDE_TF/OMIT_TF, etc.

+1 on that. AFAIR the original motivation for these type-safe
enumerations was that some combination of flags are invalid /
unsupported, and then you would discover it only at runtime. But the
problems with this approach seem to outweigh the benefits ...

Perhaps we could provide static methods on Fieldable that test the
validity of flag combinations with a particular version of Lucene?
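
One way such a static check might look, sketched in plain Java. Everything here is hypothetical: the `FieldFlag` enum, the `FieldFlagValidator` class, and the rules themselves (each option requiring INDEXED) are assumptions chosen for illustration and are not taken from any Lucene release. The per-version aspect is left out.

```java
import java.util.Set;

// Hypothetical flags; not Lucene's API.
enum FieldFlag { INDEXED, ANALYZED, NO_NORMS, TERM_VECTORS }

final class FieldFlagValidator {
    private FieldFlagValidator() {}

    // Returns null when the combination is accepted, otherwise a reason.
    // The rules below are illustrative assumptions only.
    static String whyInvalid(Set<FieldFlag> flags) {
        if (!flags.contains(FieldFlag.INDEXED)) {
            if (flags.contains(FieldFlag.ANALYZED)) {
                return "ANALYZED requires INDEXED";
            }
            if (flags.contains(FieldFlag.NO_NORMS)) {
                return "NO_NORMS requires INDEXED";
            }
            if (flags.contains(FieldFlag.TERM_VECTORS)) {
                return "TERM_VECTORS requires INDEXED";
            }
        }
        return null;
    }

    // Convenience wrapper for callers that only need a yes/no answer.
    static boolean isValid(Set<FieldFlag> flags) {
        return whyInvalid(flags) == null;
    }
}
```

A caller could pass `EnumSet.of(FieldFlag.INDEXED, FieldFlag.NO_NORMS)` and check the result before building the field, which moves the discovery of an unsupported combination to an explicit call rather than an exception deep inside indexing.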

--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com


