
public at khwilliamson
Jun 3, 2012, 9:02 PM
Inefficiencies in Perl's case mapping of Unicode
People might find this illuminating. I examined the steps involved in taking an above-Latin1 character and finding its equivalent under another case mapping, such as its uppercase or its foldcase. The bottom-level function that does all such conversions is to_utf8_case(). It maps the input character, stored as UTF-8, into its other-case equivalent. What happens is this:

1) The UTF-8 is converted to UTF-32.
2) The UTF-32 is converted back to UTF-8.
3) The bytes of the UTF-8 are used as a key to look up in a hash of characters that have a multi-character map (such as uc(ß) is SS). If found, the value is the map stored as a UTF-8 string, which is copied to the location pointed to by an input parameter; then go to step 6.
4) In the more likely case that the map is a single character, swash_fetch() is called with the parameter being in UTF-8. This function checks its hash of previously looked-up values, where the keys are UTF-8 and the values are UTF-32. If the input UTF-8 has been seen before, it returns the stored UTF-32. If it hasn't been seen before, it converts the UTF-8 to UTF-32 and goes out to the disk files generated by mktables to find its map, which, if found, it stores and returns in UTF-32.
5a) If the returned UTF-32 is non-zero, to_utf8_case() assumes this means that there was a mapping, and converts it to UTF-8 for return in the location pointed to by an input parameter.
5b) If the returned UTF-32 is zero, it assumes that the mapping of the character is to itself; the UTF-32 found in step 1 is converted back to (the input) UTF-8 for return in the location pointed to by an input parameter.
6) The first character of the map is converted to UTF-32 and returned.

(To be more precise, on 64-bit machines, UTF-64 is used instead of UTF-32.) I have omitted steps unnecessary for this discussion.

There are several things apparent to me. One is that there are altogether too many conversions between UTF-8 and UTF-32. These are not trivial.
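To give a rough sense of what each such conversion costs, here is a minimal sketch in C of one decode and one encode. The names sketch_utf8_decode() and sketch_utf8_encode() are made up for illustration and are not Perl's own routines; the sketch assumes well-formed input of at most 4 bytes, does no validation, and ignores the EBCDIC handling that the real code has to deal with.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not Perl's API: decode one well-formed UTF-8
 * sequence (1 to 4 bytes) into a code point. The number of bytes
 * consumed is stored in *len. Malformed input is not detected. */
static uint32_t sketch_utf8_decode(const unsigned char *s, size_t *len)
{
    uint32_t cp;
    size_t n, i;

    /* The lead byte gives the sequence length and the top bits. */
    if (s[0] < 0x80)      { cp = s[0];        n = 1; }
    else if (s[0] < 0xE0) { cp = s[0] & 0x1F; n = 2; }
    else if (s[0] < 0xF0) { cp = s[0] & 0x0F; n = 3; }
    else                  { cp = s[0] & 0x07; n = 4; }

    /* Each continuation byte contributes its low 6 bits. */
    for (i = 1; i < n; i++)
        cp = (cp << 6) | (s[i] & 0x3F);

    *len = n;
    return cp;
}

/* Hypothetical helper, not Perl's API: encode a code point as UTF-8
 * into out, which must have room for 4 bytes. Returns the number of
 * bytes written. */
static size_t sketch_utf8_encode(uint32_t cp, unsigned char *out)
{
    assert(cp <= 0x10FFFF);

    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    out[0] = (unsigned char)(0xF0 | (cp >> 18));
    out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (unsigned char)(0x80 | (cp & 0x3F));
    return 4;
}
```

Each direction is a chain of branches, shifts and masks per character, and the steps above go through several of them for a single lookup.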
Another (very minor) thing is that the code assumes that NUL will always map to itself. This assumption should be commented, but is likely to always be true. Finally, when the mapping is to itself, we do even more extra work at the end, since the output is just the input.

There are also several functions which take UTF-32 as input and have the same outputs as this one, such as to_uni_upper(). All these do is add a Step 0, which converts the UTF-32 to UTF-8, and then call to_utf8_case() (which immediately converts it back to UTF-32, and so on).

The reason for the apparently redundant conversions in Steps 1 and 2 is that on EBCDIC machines the code needs to convert things to conform with the Latin1 tables generated by mktables. This could be #ifdef'd out (with some refactoring), but even better would be (as I have proposed in another email) for mktables to generate EBCDIC tables, which would eliminate the need for this altogether.

The comments about why the key for the multi-char maps is stored in UTF-8 indicate that it is for speed, to avoid having to find the UTF-32. But we do, in fact, find it anyway.

I'm just pointing out the inefficiencies here. There are a number of things that could be done to cut some of the conversions down.
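As one illustration of the kind of saving available, here is a toy sketch in C of the single-character path in which a character that maps to itself has its input bytes copied straight to the output rather than being re-encoded and then re-decoded. Everything in it is an assumption made for the example: the names toy_upper_lookup(), toy_decode(), toy_encode() and toy_to_upper() are invented, the lookup is keyed by code point (unlike swash_fetch(), which takes UTF-8), the table holds a single entry, and multi-character maps and EBCDIC are left out. It is not the actual to_utf8_case() code.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Toy stand-in for the table lookup (hypothetical). Returns the
 * uppercase code point, or 0 when the character maps to itself.
 * Like the real code, this treats 0 as "no mapping", so it relies
 * on NUL mapping to itself. */
static uint32_t toy_upper_lookup(uint32_t cp)
{
    return (cp == 0x101) ? 0x100 : 0;   /* only U+0101 -> U+0100 */
}

/* Decode a well-formed UTF-8 sequence of known length n (1 to 4). */
static uint32_t toy_decode(const unsigned char *s, size_t n)
{
    static const unsigned char lead_mask[5] = { 0, 0x7F, 0x1F, 0x0F, 0x07 };
    uint32_t cp = s[0] & lead_mask[n];
    size_t i;

    for (i = 1; i < n; i++)
        cp = (cp << 6) | (s[i] & 0x3F);
    return cp;
}

/* Encode cp (<= 0x10FFFF) into out (room for 4 bytes); returns length. */
static size_t toy_encode(uint32_t cp, unsigned char *out)
{
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    out[0] = (unsigned char)(0xF0 | (cp >> 18));
    out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (unsigned char)(0x80 | (cp & 0x3F));
    return 4;
}

/* Map one character. Writes the result to out, stores its length in
 * *out_len, and returns the resulting code point. */
static uint32_t toy_to_upper(const unsigned char *in, size_t in_len,
                             unsigned char *out, size_t *out_len)
{
    uint32_t cp = toy_decode(in, in_len);    /* the only decode */
    uint32_t mapped = toy_upper_lookup(cp);

    if (mapped == 0) {
        /* Maps to itself: the output is just the input bytes. */
        memcpy(out, in, in_len);
        *out_len = in_len;
        return cp;
    }

    /* A real mapping: one encode, and the code point is already
     * in hand, so nothing needs decoding again. */
    *out_len = toy_encode(mapped, out);
    return mapped;
}
```

In this toy version the self-mapping case costs one decode and a memcpy(), and the mapped case costs one decode and one encode.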