
uhlar at fantomas
May 18, 2012, 4:45 AM
Post #7 of 7
>On 18/05/12 03:18, David F. Skoll wrote:
>> I looked at the regex and it seems that Perl treats är as having a
>> word boundary in the \b sense between the "ä" and the "r"

On 18.05.12 07:26, Jason Haar wrote:
>A bit OT, but is it because your perl is running under "C" locale
>instead of se? i.e. would the word boundary definition change under
>different localization contexts? Doesn't help solve the problem for you,
>but it certainly flags a potential issue with a tonne of the rules in SA...

SA would need to switch to the correct locale before processing the
e-mail to avoid this error. The correct locale could differ between
users and even between individual mails.

I'm not sure this is the way to go, although there may be isolated cases
where it helps. I'm more in favor of advanced processing: taking the
language of the mail into account and/or comparing the matched strings
against words in different languages, e.g. FRT_SOMA misfiring on the word
"somar" (donkey), FRT_PENIS1 on "penize" (money), FUZZY_CREDIT on
"kredit" (credit) etc.

-- 
Matus UHLAR - fantomas, uhlar [at] fantomas ; http://www.fantomas.sk/
Warning: I wish NOT to receive e-mail advertising to this address.
Remember half the people you know are below average.
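
For anyone who wants to see the word-boundary effect from the quoted discussion in isolation, here is a minimal sketch. It uses Python's `re` module as an analogy only, not SpamAssassin's Perl engine, and the sample text and the pattern `\br\b` are made up for illustration. The assumption is simply that an ASCII-only definition of "word character" treats "ä" as a non-word character, while a Unicode-aware definition does not.

```python
import re

# Hypothetical sample text containing the Swedish word "är".
text = "det är bra"

# ASCII-only word semantics: "ä" does not count as a word character,
# so \b sees a boundary between "ä" and "r" and a lone "r" is found.
ascii_hit = re.search(r"\br\b", text, flags=re.ASCII)

# Unicode-aware semantics (the default for str patterns in Python 3):
# "ä" is a word character, so there is no boundary inside "är".
unicode_hit = re.search(r"\br\b", text)

print("ASCII semantics match:  ", ascii_hit is not None)
print("Unicode semantics match:", unicode_hit is not None)
```

The same pattern gives different results depending only on how "word character" is defined, which is the kind of difference the locale question is about.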