
Mailing List Archive: Lucene: Java-User

Lucene tokenization

 

 



nilesh.vijay at gmail

Mar 27, 2012, 11:03 AM

Post #1 of 3
Lucene tokenization

I have a string 01a_b-_-c-d which is tokenized as
01a_b
c
d

and the string a_b-_-c_d which is tokenized as
a
b
c
d

Why is there a difference when there is a digit at the beginning? I am
using the standard, unstemmed tokenizer.


sarowe at syr

Mar 27, 2012, 11:11 AM

Post #2 of 3
RE: Lucene tokenization

Hi Nilesh,

Which version of Lucene are you using? StandardTokenizer behavior changed in v3.1.

Steve


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe [at] lucene
For additional commands, e-mail: java-user-help [at] lucene


paul at hoplahup

Mar 27, 2012, 11:34 PM

Post #3 of 3
Re: Lucene tokenization

Nilesh,

The StandardAnalyzer is full of generally useful special cases, including email and number detection.
I suppose you have hit one such special case, which presumably has some justification.
I can't tell you why it behaves this way, but I can tell you it is really hard to change, because others rely on this behavior somehow (I think).

paul
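
For illustration, here is a rough sketch of one rule that could explain the output reported above. It assumes the pre-3.1 StandardTokenizer grammar (kept in later releases as ClassicTokenizer), whose "number" rule joins alphanumeric segments across a single `_`, `-`, `/`, `.` or `,` as long as every other segment contains a digit. This is a plain-Java approximation of that one rule, not Lucene's code: it ignores the e-mail, host name, acronym and apostrophe rules, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified model of the classic "number" rule only.
class ClassicNumRuleSketch {

    // Punctuation that may join two alphanumeric segments.
    private static boolean isJoiner(char c) {
        return c == '_' || c == '-' || c == '/' || c == '.' || c == ',';
    }

    private static boolean hasDigit(String s) {
        return s.chars().anyMatch(Character::isDigit);
    }

    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            if (!Character.isLetterOrDigit(text.charAt(i))) {
                i++; // separators never start a token
                continue;
            }
            // Collect alphanumeric segments linked by exactly one joiner each.
            List<String> segments = new ArrayList<>();
            List<Integer> ends = new ArrayList<>();
            int pos = i;
            while (true) {
                int start = pos;
                while (pos < text.length()
                        && Character.isLetterOrDigit(text.charAt(pos))) {
                    pos++;
                }
                segments.add(text.substring(start, pos));
                ends.add(pos);
                boolean joins = pos + 1 < text.length()
                        && isJoiner(text.charAt(pos))
                        && Character.isLetterOrDigit(text.charAt(pos + 1));
                if (!joins) {
                    break;
                }
                pos++; // step over the joiner
            }
            // Longest run of two or more segments in which all even-indexed
            // or all odd-indexed segments contain a digit; otherwise the
            // token is just the first segment.
            int tokenEnd = ends.get(0);
            boolean evenOk = true;
            boolean oddOk = true;
            for (int k = 0; k < segments.size(); k++) {
                boolean digit = hasDigit(segments.get(k));
                if (k % 2 == 0) {
                    evenOk &= digit;
                } else {
                    oddOk &= digit;
                }
                if (k >= 1 && (evenOk || oddOk)) {
                    tokenEnd = ends.get(k);
                }
            }
            tokens.add(text.substring(i, tokenEnd));
            i = tokenEnd;
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("01a_b-_-c-d")); // prints [01a_b, c, d]
        System.out.println(tokenize("a_b-_-c_d"));   // prints [a, b, c, d]
    }
}
```

In this model, "01a" contains a digit, so "01a_b" is kept together as one number-like token, while "a_b" has no digit in either segment and is split at the underscore. From 3.1 on, StandardTokenizer follows the Unicode text segmentation rules (UAX #29) instead, so the same input may tokenize differently depending on the version in use.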


On 27 March 2012 at 20:03, Nilesh Vijaywargiay wrote:

> I have a string 01a_b-_-c-d which is tokenized as
> 01a_b
> c
> d
>
> and the string a_b-_-c_d which is tokenized as
> a
> b
> c
> d
>
> why is there a difference when there is a digit at the beginning? I am
> using standard unstemmed tokenizer.


