
Mailing List Archive: Lucene: Java-User

Tokenizer

 

 



jsondag2 at uiuc

Jul 30, 2007, 9:05 AM

Post #1 of 2
Tokenizer

I have two questions.

First, is there a tokenizer that simply makes a token out of every word?
That is, one that looks for two whitespace characters and makes a token
out of the characters between them?

If this tokenizer exists, is there a difference between doing that and
simply storing the field in the document with Field.Index = UN_TOKENIZED?

--JP


a.schrijvers at hippo

Jul 30, 2007, 9:15 AM

Post #2 of 2
RE: Tokenizer

Hello,

> I have two questions.
>
> First, Is there a tokenizer that takes every word and simply
> makes a token
> out of it?

org.apache.lucene.analysis.WhitespaceTokenizer

> So it looks for two white spaces and takes the characters
> between them and makes a token out of them?
>
> If this tokenizer exists, is there a difference between doing that and
> simply storing the field in the document with Field.Index =
> UN_TOKENIZED?

Yes, it is certainly different. UN_TOKENIZED does what it says: it takes a String and puts it "AS IS" into your index as one single TERM. You might want this, for example, when you need to sort on the caption or title of a document. TOKENIZED, in combination with org.apache.lucene.analysis.WhitespaceTokenizer, splits your string at whitespace and indexes each resulting token as a separate term.
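
To make the difference concrete, here is a rough sketch in plain Java. It does not use Lucene and is not how Lucene implements either option; the class TermSketch and its two methods are made-up names that only mimic which terms would end up in the index in each case.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Illustration only: plain Java, not Lucene code. All names are hypothetical.
final class TermSketch {

    // Roughly what whitespace tokenization yields: one term per
    // run of non-whitespace characters.
    static List<String> whitespaceTerms(String value) {
        List<String> terms = new ArrayList<String>();
        for (String piece : value.trim().split("\\s+")) {
            if (!piece.isEmpty()) {
                terms.add(piece);
            }
        }
        return terms;
    }

    // Roughly what UN_TOKENIZED yields: the whole value, unchanged,
    // as one single term.
    static List<String> untokenizedTerms(String value) {
        return Collections.singletonList(value);
    }

    public static void main(String[] args) {
        String title = "Lucene in Action";
        System.out.println(whitespaceTerms(title));   // [Lucene, in, Action]
        System.out.println(untokenizedTerms(title));  // [Lucene in Action]
    }
}
```

With the tokenized variant a search for the single word "Action" can match the field; with the untokenized variant only the exact full string "Lucene in Action" is a term, which is what makes it suitable for sorting.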

Regards Ard

>
> --JP
>

