Mailing List Archive: Lucene: Java-User

Problems Indexing/Parsing Tibetan Text

 

 



denisbrodeur at gmail

Mar 30, 2012, 9:46 AM

Post #1 of 6
Problems Indexing/Parsing Tibetan Text

Hello, I'm currently working through some problems searching for Tibetan
characters, specifically \u0f10-\u0f19. We are using the
StandardAnalyzer (3.4), and I've narrowed the problem down to
StandardTokenizerImpl throwing these characters away: in
getNextToken(), they fall through to case 1: /* Not numeric, word, ideographic,
hiragana, or SE Asian -- ignore it */. So the question is: is this the
expected behaviour, and if it is, what would be the best general way to
support code points that the StandardAnalyzer does not recognize?


rcmuir at gmail

Mar 30, 2012, 9:57 AM

Post #2 of 6
Re: Problems Indexing/Parsing Tibetan Text

On Fri, Mar 30, 2012 at 12:46 PM, Denis Brodeur <denisbrodeur [at] gmail> wrote:
> Hello, I'm currently working out some problems when searching for Tibetan
> Characters.  More specifically: \u0f10-\u0f19.  We are using the

Unicode doesn't consider most of these characters part of a word: most
are punctuation and symbols (except 0f18 and 0f19, which are combining
characters that combine with digits).

For example, 0f14 is a text delimiter.

In general, StandardTokenizer discards punctuation and is geared toward
word boundaries, just as you would have trouble searching on characters
like '(' in English. So I think it's totally expected.
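
One way to check this classification from plain Java is to ask the JDK for each code point's general category. This is only a minimal sketch using `java.lang.Character`; the class and method names (`TibetanCategories`, `categoryName`) are made up for illustration, and the categories reported depend on the Unicode version bundled with the running JDK, which may differ from the data StandardTokenizer was generated from.

```java
public class TibetanCategories {

    // Map a java.lang.Character general-category constant to a short label.
    // Only the categories expected in this range get a specific label.
    static String categoryName(int codePoint) {
        switch (Character.getType(codePoint)) {
            case Character.OTHER_PUNCTUATION: return "Po (punctuation)";
            case Character.OTHER_SYMBOL:      return "So (symbol)";
            case Character.NON_SPACING_MARK:  return "Mn (combining mark)";
            case Character.UNASSIGNED:        return "Cn (unassigned in this JDK)";
            default: return "other (type " + Character.getType(codePoint) + ")";
        }
    }

    public static void main(String[] args) {
        // Walk the range from the original question: U+0F10 through U+0F19.
        for (int cp = 0x0F10; cp <= 0x0F19; cp++) {
            System.out.printf("U+%04X %-45s %s letterOrDigit=%b%n",
                    cp, Character.getName(cp), categoryName(cp),
                    Character.isLetterOrDigit(cp));
        }
    }
}
```

None of these code points should report as a letter or digit, which is consistent with a word-oriented tokenizer not emitting them as tokens.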

--
lucidimagination.com

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe [at] lucene
For additional commands, e-mail: java-user-help [at] lucene


denisbrodeur at gmail

Mar 30, 2012, 10:03 AM

Post #3 of 6
Re: Problems Indexing/Parsing Tibetan Text

Thanks Robert. That makes sense. Do you have a link handy where I can
find this information, i.e. the word boundary/punctuation properties of
any Unicode character?



bimargulies at gmail

Mar 30, 2012, 10:07 AM

Post #4 of 6
Re: Problems Indexing/Parsing Tibetan Text

fileformat.info



rcmuir at gmail

Mar 30, 2012, 10:09 AM

Post #5 of 6
Re: Problems Indexing/Parsing Tibetan Text

On Fri, Mar 30, 2012 at 1:03 PM, Denis Brodeur <denisbrodeur [at] gmail> wrote:
> Thanks Robert.  That makes sense.  Do you have a link handy where I can
> find this information? i.e. word boundary/punctuation for any unicode
> character set?
>

Yeah, usually I use
http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[\u0f10-\u0f19]&g=

You can then click on a character and see all of its properties easily.

(The site seems to be having some issues today.)

--
lucidimagination.com



mintern at easyesi

Mar 30, 2012, 11:11 AM

Post #6 of 6
Re: Problems Indexing/Parsing Tibetan Text

Another good reference is this one: http://unicode.org/reports/tr29/

Since the latest Lucene uses this as the basis of its text
segmentation, it's worth getting familiar with it.
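
To get a feel for word-boundary segmentation without Lucene on the classpath, the JDK's `java.text.BreakIterator` can serve as a rough stand-in. It is not Lucene's tokenizer, and its rules are not guaranteed to match UAX #29 or StandardTokenizer exactly, so treat this as an illustrative sketch. The class name `WordBoundaryDemo` and the rule "keep a segment only if it contains a letter or digit" are assumptions chosen to mimic a tokenizer that discards punctuation-only segments.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class WordBoundaryDemo {

    // Split text at the JDK's word boundaries and keep only segments that
    // contain at least one letter or digit (an assumed filtering rule).
    static List<String> words(String text) {
        List<String> out = new ArrayList<>();
        BreakIterator it = BreakIterator.getWordInstance(Locale.ROOT);
        it.setText(text);
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String segment = text.substring(start, end);
            if (segment.codePoints().anyMatch(Character::isLetterOrDigit)) {
                out.add(segment);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Punctuation-only segments such as "(" and ")" are dropped.
        System.out.println(words("hello (world)"));
        // U+0F40 is a Tibetan letter; U+0F14 is the mark discussed in this thread.
        System.out.println(words("\u0F40\u0F14\u0F40"));
    }
}
```

Running it on text that mixes Tibetan letters with the marks in question shows how the JDK's rules treat them; the result may differ from Lucene's, which is exactly why checking against tr29 is worthwhile.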

