
marvin at rectangular
Mar 3, 2008, 3:10 PM
Post #2 of 5
On Mar 3, 2008, at 8:43 AM, Father Chrysostomos wrote:

> I looked into it further and found that ‘ἐνδιαφέρον’ came out encoded as UTF-8
> ("\341\274\220\316\275\316\264\316\271\316\261\317\206\341\275\263\317\201\316\277\316\275"),

If we isolate the original and use Devel::Peek to inspect it...

    use Devel::Peek;
    my $greek = 'ἐνδιαφέρον';
    Dump($greek);

... this is what we see:

    SV = PV(0x91e374) at 0x8972dc
      REFCNT = 1
      FLAGS = (PADBUSY,PADMY,POK,pPOK,UTF8)
      PV = 0x11742e0 "\341\274\220\316\275\316\264\316\271\316\261\317\206\341\275\263\317\201\316\277\316\275"\0 [UTF8 "\x{1f10}\x{3bd}\x{3b4}\x{3b9}\x{3b1}\x{3c6}\x{1f73}\x{3c1}\x{3bf}\x{3bd}"]
      CUR = 22
      LEN = 24

Here's what's coming out of $lexicon->get_term:

    SV = PV(0x91d0dc) at 0x912f80
      REFCNT = 1
      FLAGS = (PADBUSY,PADMY,POK,pPOK)
      PV = 0x117fce0 "\341\274\220\316\275\316\264\316\271\316\261\317\206\341\275\263\317\201\316\277\316\275"\0
      CUR = 22
      LEN = 24

The strings have the same byte sequence, but the second one is missing the UTF8 flag, so Perl is interpreting it as Latin1. When we submit that scalar to $reader->doc_freq, the XS binding extracts the string using SvPVutf8, which causes the supposedly Latin1 string to be, ahem, "upgraded" to UTF8. The resulting garbage isn't in the index.

The problem was a missing SvUTF8_on in the XS binding for Lexicon_Get_Term. Fixed by r3103.

Thanks for the report.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
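For anyone who wants to reproduce the effect without KinoSearch, here is a minimal pure-Perl sketch of the double encoding described above. It is an illustration only, not the code from r3103: utf8::upgrade stands in for what SvPVutf8 does on the XS side, decode('UTF-8', ...) stands in for the effect of the missing SvUTF8_on, and the variable names $unflagged, $mangled and $repaired are made up for this example.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# The term as a character string (UTF8 flag on). Built from code points
# so the example does not depend on the source file's encoding.
my $greek = "\x{1f10}\x{3bd}\x{3b4}\x{3b9}\x{3b1}\x{3c6}\x{1f73}\x{3c1}\x{3bf}\x{3bd}";

# Same 22 bytes, but with the UTF8 flag off. This mimics the scalar that
# get_term returned before the fix (an assumption based on the dumps).
my $unflagged = encode('UTF-8', $greek);

# Perl treats each of those bytes as one Latin-1 character. Upgrading the
# scalar re-encodes every high byte as two bytes internally, which is
# roughly what SvPVutf8 does to an unflagged scalar.
my $mangled = $unflagged;
utf8::upgrade($mangled);
my $mangled_bytes = encode('UTF-8', $mangled);

# Decoding the raw bytes recovers the original characters, comparable to
# turning the flag on for bytes that are already valid UTF-8.
my $repaired = decode('UTF-8', $unflagged);

printf "characters in original:  %d\n", length $greek;
printf "bytes without the flag:  %d\n", length $unflagged;
printf "bytes after the upgrade: %d\n", length $mangled_bytes;
printf "mangled matches term:    %s\n", ($mangled  eq $greek ? 'yes' : 'no');
printf "repaired matches term:   %s\n", ($repaired eq $greek ? 'yes' : 'no');
```

The mangled string is twice as long in bytes as the original because every byte of the term is above 0x7F, so a lookup with it cannot match the term stored in the index.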