Mailing List Archive: Lucene: Java-User

replace values in index

 

 



jeff.lucene at gmail

Jul 12, 2007, 6:40 AM

Post #1 of 2 (473 views)
replace values in index

I have documents with lots of text. Part of the text is in the following
format:

word1,word2,word3,word4,word5

I am currently using the StandardAnalyzer and everything is working great
with the other data, except I can't query for 'word3' because ',' isn't
treated as a token separator. Is there an easy way to add ',' as a token
separator?

Thanks,

-Jeff


markrmiller at gmail

Jul 12, 2007, 7:07 AM

Post #2 of 2 (423 views)
Re: replace values in index

While it is possible to alter the StandardAnalyzer, depending on more
details of your source text, it may be better to use a different
analyzer or make your own. The StandardAnalyzer is quite slow if you do
not need all of its features, and modifying it will make it harder to
keep up with bug fixes or improvements.
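
As a rough illustration of the "make your own" route, here is a minimal
plain-Java sketch of the splitting rule a simple custom tokenizer might
apply: break on anything that is not a letter or digit, and lowercase what
remains. The class and method names are hypothetical, and this is not Lucene
code; in Lucene the same rule would have to be wired into a tokenizer and
analyzer of your own.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: not part of Lucene, just the splitting rule itself.
final class SimpleSplitSketch {

    // Splits text on every character that is not a letter or digit
    // and lowercases the resulting tokens.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetterOrDigit(c)) {
                current.append(Character.toLowerCase(c));
            } else if (current.length() > 0) {
                // Any other character (comma, space, ...) ends the token.
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("word1,word2,word3,word4,word5"));
    }
}
```

With a rule like this, 'word3' comes out as its own token, at the cost of
losing the special handling StandardAnalyzer gives to things like e-mail
addresses and host names.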

That said, StandardAnalyzer does split on commas, so you might want to
check into what's really going on.

I suspect that 'word1,word2,word3,word4,word5' is being recognized as a
NUM by StandardAnalyzer. A NUM match will keep a comma-delimited list
intact as long as every other word contains a digit.
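
To make the "every other word contains a digit" condition concrete, here is
a small plain-Java sketch of that check. It is only an approximation of the
NUM rule as described above, not the actual grammar: the class and method
names are made up, it assumes the joining punctuation is one of _ - / . ,
and it looks at a whole string rather than doing real tokenization.

```java
import java.util.regex.Pattern;

// Hypothetical sketch approximating the NUM condition; not Lucene code.
final class NumRuleSketch {

    private static final Pattern SEGMENT = Pattern.compile("[A-Za-z0-9]+");
    private static final Pattern HAS_DIGIT = Pattern.compile(".*[0-9].*");

    // True if the text is two or more alphanumeric words joined by
    // punctuation, where every other word contains a digit.
    static boolean looksLikeNum(String text) {
        // Assumed punctuation set: _ - / . ,
        String[] segments = text.split("[_\\-/.,]", -1);
        if (segments.length < 2) {
            return false;
        }
        for (String s : segments) {
            if (!SEGMENT.matcher(s).matches()) {
                return false;
            }
        }
        // The digit-bearing words may sit at even or at odd positions.
        return everyOtherHasDigit(segments, 0) || everyOtherHasDigit(segments, 1);
    }

    private static boolean everyOtherHasDigit(String[] segments, int start) {
        for (int i = start; i < segments.length; i += 2) {
            if (!HAS_DIGIT.matcher(segments[i]).matches()) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(looksLikeNum("word1,word2,word3,word4,word5"));
        System.out.println(looksLikeNum("alpha,beta,gamma"));
    }
}
```

Under this approximation the list from the question qualifies, because
every word in it ends in a digit, while a list of plain words such as
'alpha,beta,gamma' does not and would be split as usual.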

You might alter the <#P regular expression in StandardTokenizer.jj (the
grammar behind StandardAnalyzer) by taking out the ','. You will lose
certain NUM matches (like the match you're getting <g>), but commas will
stop gluing your words together.
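
The effect of dropping ',' from the punctuation set can be seen in a toy
model. The plain-Java sketch below is hypothetical and much cruder than the
real tokenizer: it treats a run of letters, digits and "joiner" characters
as one chunk, keeps the chunk whole if every other word in it contains a
digit, and otherwise splits it into words. The joiner set is passed in as a
parameter so the two configurations can be compared.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical toy model; the real tokenizer has more rules than this.
final class JoinerSetSketch {

    // Tokenizes text, treating the characters in 'joiners' as punctuation
    // that may hold a NUM-like token together.
    static List<String> tokenize(String text, String joiners) {
        List<String> tokens = new ArrayList<>();
        StringBuilder chunk = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetterOrDigit(c) || joiners.indexOf(c) >= 0) {
                chunk.append(c);
            } else {
                flush(chunk, joiners, tokens); // any other character separates
            }
        }
        flush(chunk, joiners, tokens);
        return tokens;
    }

    private static void flush(StringBuilder chunk, String joiners, List<String> out) {
        if (chunk.length() == 0) {
            return;
        }
        // Break the chunk into words at each joiner character.
        List<String> segments = new ArrayList<>();
        StringBuilder segment = new StringBuilder();
        for (int i = 0; i < chunk.length(); i++) {
            char c = chunk.charAt(i);
            if (joiners.indexOf(c) >= 0) {
                segments.add(segment.toString());
                segment.setLength(0);
            } else {
                segment.append(c);
            }
        }
        segments.add(segment.toString());
        boolean intact = segments.size() >= 2 && !segments.contains("")
                && (everyOtherHasDigit(segments, 0) || everyOtherHasDigit(segments, 1));
        if (intact) {
            out.add(chunk.toString()); // kept as a single NUM-like token
        } else {
            for (String s : segments) {
                if (!s.isEmpty()) {
                    out.add(s);
                }
            }
        }
        chunk.setLength(0);
    }

    private static boolean everyOtherHasDigit(List<String> segments, int start) {
        for (int i = start; i < segments.size(); i += 2) {
            if (!hasDigit(segments.get(i))) {
                return false;
            }
        }
        return true;
    }

    private static boolean hasDigit(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (Character.isDigit(s.charAt(i))) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String text = "word1,word2,word3,word4,word5";
        System.out.println(tokenize(text, "_-/.,")); // ',' is a joiner
        System.out.println(tokenize(text, "_-/."));  // ',' removed
    }
}
```

In this model the first call keeps the whole list as one token and the
second yields five separate words, which is the behaviour the grammar
change is meant to produce.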

- Mark

