
Mailing List Archive: kinosearch: discuss

Unicode problem

 

 



sprout at cpan

Mar 3, 2008, 8:43 AM

Post #1 of 5 (1236 views)
Permalink
Unicode problem

There seems to be a problem with KinoSearch’s Unicode support. Greek
words can be listed in the index, but they always have a doc_freq of
0. The attached script demonstrates this problem. This is the output
it gives me:

Greek occurs in 1 document.
Hmm occurs in 1 document.
as occurs in 1 document.
in occurs in 1 document.
interesting occurs in 1 document.
or occurs in 1 document.
say occurs in 1 document.
they occurs in 1 document.
ἐνδιαφέρον occurs in 0 documents.

It didn’t give me any wide char warnings, so I looked into it further
and found that ‘ἐνδιαφέρον’ came out encoded as UTF-8
("\341\274\220\316\275\316\264\316\271\316\261\317\206\341\275\263\317\201\316\277\316\275"),
so maybe that’s part of the problem.
Attachments: unitest (1.00 KB)


marvin at rectangular

Mar 3, 2008, 3:10 PM

Post #2 of 5 (1179 views)
Permalink
Re: Unicode problem [In reply to]

On Mar 3, 2008, at 8:43 AM, Father Chrysostomos wrote:

> I looked into it further and found that ‘ἐνδιαφέρον’
> came out encoded as UTF-8
> ("\341\274\220\316\275\316\264\316\271\316\261\317\206\341\275\263\317\201\316\277\316\275"),

If we isolate the original and use Devel::Peek to inspect it...

use Devel::Peek;
my $greek = 'ἐνδιαφέρον';
Dump($greek);

... this is what we see:

SV = PV(0x91e374) at 0x8972dc
REFCNT = 1
FLAGS = (PADBUSY,PADMY,POK,pPOK,UTF8)
PV = 0x11742e0 "\341\274\220\316\275\316\264\316\271\316\261\317\206\341\275\263\317\201\316\277\316\275"\0 [UTF8 "\x{1f10}\x{3bd}\x{3b4}\x{3b9}\x{3b1}\x{3c6}\x{1f73}\x{3c1}\x{3bf}\x{3bd}"]
CUR = 22
LEN = 24

Here's what's coming out of $lexicon->get_term:

SV = PV(0x91d0dc) at 0x912f80
REFCNT = 1
FLAGS = (PADBUSY,PADMY,POK,pPOK)
PV = 0x117fce0 "\341\274\220\316\275\316\264\316\271\316\261\317\206\341\275\263\317\201\316\277\316\275"\0
CUR = 22
LEN = 24

The strings have the same byte sequence, but the second one is missing
the UTF8 flag, so Perl is interpreting it as Latin1.
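The difference can be reproduced in plain Perl without KinoSearch. This is a minimal sketch: the variable names are made up for illustration, the code points are copied from the first dump, and the octets from both dumps. It shows two scalars that hold identical octets yet do not compare equal, because only one of them is treated as a character string.

```perl
use strict;
use warnings;

# Character string built from the code points in the first dump.
# Code points above 255 force Perl to store it with the UTF8 flag on.
my $chars = "\x{1f10}\x{3bd}\x{3b4}\x{3b9}\x{3b1}\x{3c6}\x{1f73}\x{3c1}\x{3bf}\x{3bd}";

# Byte string holding the 22 octets seen in both dumps, UTF8 flag off.
my $bytes = "\341\274\220\316\275\316\264\316\271\316\261\317\206"
          . "\341\275\263\317\201\316\277\316\275";

# Encoding a copy of $chars exposes its UTF-8 octets for comparison.
my $octets = $chars;
utf8::encode($octets);

print "same octets: ", ($octets eq $bytes ? "yes" : "no"), "\n";
print "same string: ", ($chars  eq $bytes ? "yes" : "no"), "\n";
printf "lengths: %d vs %d\n", length($chars), length($bytes);
```

The octets match, but the strings differ: Perl counts 10 characters in the flagged scalar and 22 Latin1 characters in the unflagged one.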

When we submit that scalar to $reader->doc_freq, the XS binding
extracts the string using SvPVutf8, which causes the supposedly Latin1
string to be, ahem, "upgraded" to UTF8. The resulting garbage isn't
in the index.
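The effect of that "upgrade" can be approximated from Perl. In this sketch, utf8::encode on the unflagged scalar stands in for the octets that SvPVutf8 would hand to the C code; that equivalence is my assumption about the XS side, not something taken from the KinoSearch source. Every one of the 22 octets is above 0x7F, so each is re-encoded as a two-octet sequence.

```perl
use strict;
use warnings;

# The 22 octets from the dumps, with no UTF8 flag, as get_term returned them.
my $bytes = "\341\274\220\316\275\316\264\316\271\316\261\317\206"
          . "\341\275\263\317\201\316\277\316\275";

# Treat each octet as a Latin1 character and encode it as UTF-8.
my $mangled = $bytes;
utf8::encode($mangled);

printf "before: %d octets, after: %d octets\n",
    length($bytes), length($mangled);

# The first octet, \341 (U+00E1 when read as Latin1), becomes \303\241.
printf "first two octets after: \\%o\\%o\n",
    ord(substr($mangled, 0, 1)), ord(substr($mangled, 1, 1));
```

The 44-octet result is the double-encoded term that doc_freq ends up looking for, which explains the count of 0.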

The problem was a missing SvUTF8_on in the XS binding for
Lexicon_Get_Term. Fixed by r3103. Thanks for the report.
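For anyone stuck on a release without the fix, a rough Perl-level analogue is to decode the returned term before passing it on. This is only a sketch with a hypothetical $term standing in for the scalar from $lexicon->get_term. Note that utf8::decode validates the octets, whereas SvUTF8_on simply sets the flag.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the unflagged scalar returned by get_term.
my $term = "\341\274\220\316\275\316\264\316\271\316\261\317\206"
         . "\341\275\263\317\201\316\277\316\275";

# Reinterpret the octets as UTF-8 in place; this turns the UTF8 flag on.
utf8::decode($term) or die "term is not valid UTF-8";

# The word as a character string, using the code points from the dump.
my $expected = "\x{1f10}\x{3bd}\x{3b4}\x{3b9}\x{3b1}\x{3c6}\x{1f73}\x{3c1}\x{3bf}\x{3bd}";

print "flag on: ", (utf8::is_utf8($term) ? "yes" : "no"), "\n";
print "matches: ", ($term eq $expected ? "yes" : "no"), "\n";
printf "length:  %d\n", length($term);
```

Once the flag is on, the term is 10 characters long and compares equal to the original word, so extracting it as UTF-8 no longer re-encodes it.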

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/


_______________________________________________
KinoSearch mailing list
KinoSearch [at] rectangular
http://www.rectangular.com/mailman/listinfo/kinosearch


sprout at cpan

Mar 4, 2008, 2:43 PM

Post #3 of 5 (1185 views)
Permalink
Re: Unicode problem [In reply to]

On Mar 3, 2008, at 3:10 PM, Marvin Humphrey wrote:

> The problem was a missing SvUTF8_on in the XS binding for
> Lexicon_Get_Term. Fixed by r3103. Thanks for the report.

Here's a test for it.
Attachments: open_clSTSxHC.txt (0.99 KB)


sprout at cpan

Mar 11, 2008, 11:57 AM

Post #4 of 5 (1161 views)
Permalink
Re: Unicode problem [In reply to]

I sent this a week ago. Did you get it?

On Mar 4, 2008, at 2:43 PM, Father Chrysostomos wrote:

>
> On Mar 3, 2008, at 3:10 PM, Marvin Humphrey wrote:
>
>> The problem was a missing SvUTF8_on in the XS binding for
>> Lexicon_Get_Term. Fixed by r3103. Thanks for the report.
>
> Here's a test for it.
>
Attachments: open_clSTSxHC.txt (0.99 KB)


marvin at rectangular

Mar 11, 2008, 3:43 PM

Post #5 of 5 (1174 views)
Permalink
Re: Unicode problem [In reply to]

On Mar 11, 2008, at 11:57 AM, Father Chrysostomos wrote:

> I sent this a week ago. Did you get it?

Applied as r3119. Thanks!

I've been working on code to automatically generate most of the KS XS
bindings, hopefully reducing the frequency of one-off bugs like this.
The test will come in handy for verifying the proper behavior after
the refactor.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/


