
bimargulies at gmail
Mar 30, 2012, 10:07 AM
Post #4 of 6
Re: Problems Indexing/Parsing Tibetan Text
fileformat.info

On Mar 30, 2012, at 1:04 PM, Denis Brodeur <denisbrodeur [at] gmail> wrote:

> Thanks Robert. That makes sense. Do you have a link handy where I can
> find this information? i.e. word boundary/punctuation for any unicode
> character set?
>
> On Fri, Mar 30, 2012 at 12:57 PM, Robert Muir <rcmuir [at] gmail> wrote:
>
>> On Fri, Mar 30, 2012 at 12:46 PM, Denis Brodeur <denisbrodeur [at] gmail>
>> wrote:
>>> Hello, I'm currently working out some problems when searching for Tibetan
>>> Characters. More specifically: /u0f10-/u0f19. We are using the
>>
>> unicode doesn't consider most of these characters part of a word: most
>> are punctuation and symbols
>> (except 0f18 and 0f19 which are combining characters that combine with
>> digits).
>>
>> for example 0f14 is a text delimiter.
>>
>> in general standardtokenizer discards punctuation and is geared at
>> word boundaries, just like
>> you would have trouble searching on characters like '(', etc in
>> english. So i think its totally expected.
>>
>> --
>> lucidimagination.com
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe [at] lucene
>> For additional commands, e-mail: java-user-help [at] lucene
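
For anyone who wants to check the character properties locally rather than looking them up one at a time, here is a minimal sketch in Python (not from the original messages). It uses the standard `unicodedata` module to print the general category of each code point in the range discussed above. The helper name `describe_range` is made up for illustration. The output depends on the Unicode version bundled with your Python build. This only reports general categories: it does not reproduce StandardTokenizer's word-break rules, so treat it as a rough guide to why a character is likely to be dropped.

```python
import unicodedata


def describe_range(start=0x0F10, end=0x0F19):
    """Return (code point, general category, name) for each character in the range.

    Categories starting with 'P' are punctuation, 'S' are symbols,
    'M' are combining marks, and 'L'/'N' are letters and numbers.
    """
    rows = []
    for cp in range(start, end + 1):
        ch = chr(cp)
        rows.append(
            (
                f"U+{cp:04X}",
                unicodedata.category(ch),
                # Fall back to a placeholder if the code point has no name.
                unicodedata.name(ch, "<unnamed>"),
            )
        )
    return rows


if __name__ == "__main__":
    for code, category, name in describe_range():
        print(code, category, name)
```

Characters reported as punctuation or symbols are the ones a word-oriented tokenizer would be expected to discard, which matches the behaviour described in the thread.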