public at khwilliamson
Jun 3, 2012, 9:02 PM
Inefficiencies in Perl's case mapping of Unicode
People might find this illuminating. I examined the steps involved in
taking an above-Latin1 character and finding the equivalent of it in
another case mapping, such as its uppercase, or its foldcase. The
bottom level function to do all such conversions is to_utf8_case(). It
maps the input character stored as UTF-8 into its other-case equivalent.
What happens is this:
1) The UTF-8 is converted to UTF-32
2) The UTF-32 is converted back to UTF-8
3) The bytes of the UTF-8 are used as a key to look up in a
hash of characters that have a multi-character map (such as
uc(ß) is SS). If found, the value is the map stored as a
UTF-8 string, which is copied to the location pointed to by
an input parameter; then go to step 6
4) In the more likely case that the map is a single character,
swash_fetch() is called with the parameter being in UTF-8.
This function checks its hash of previously looked-up
values, where the keys are UTF-8, and the values are UTF-32.
If the input UTF-8 has been seen before, it returns the
stored UTF-32. If it hasn't been seen before, it converts
the UTF-8 to UTF-32, and goes out to the disk files
generated by mktables to find its map, which, if found, it
stores, and returns in UTF-32.
5a) If the returned UTF-32 is non-zero, to_utf8_case() assumes
this means that there was a mapping, and converts it to
UTF-8 for return in the location pointed to by an input parameter
5b) If the returned UTF-32 is zero, it assumes that the mapping
of the character is to itself; the UTF-32 found in step 1 is
converted back to (the input) UTF-8 for return in the
location pointed to by an input parameter
6) The first character of the map is converted to UTF-32 and returned
(To be more precise, on 64-bit machines, UTF-64 is used instead of
UTF-32.) I have omitted steps unnecessary for this discussion.
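To make the number of conversions in the steps above easier to see, here is a toy Python model of the flow. It is only a sketch and not Perl's C source: every name in it (toy_to_upper, toy_swash_fetch, the two small tables, the stats counter) is a hypothetical stand-in, Python code points stand in for UTF-32, and the EBCDIC handling is ignored.

```python
# Toy model of steps 1-6; NOT Perl's C source. Every name is hypothetical.

stats = {"conversions": 0}  # counts UTF-8 <-> code point conversions


def decode_first(utf8: bytes) -> int:
    """Code point of the first character (stands in for UTF-8 -> UTF-32)."""
    stats["conversions"] += 1
    return ord(utf8.decode("utf-8")[0])


def encode(cp: int) -> bytes:
    """UTF-8 bytes for a code point (stands in for UTF-32 -> UTF-8)."""
    stats["conversions"] += 1
    return chr(cp).encode("utf-8")


# Stand-in for the hash of multi-character maps, keyed by UTF-8 bytes.
MULTI_CHAR_UPPER = {"\u00df".encode("utf-8"): b"SS"}  # uc(ß) is SS

# Stand-in for the tables that mktables generates (two sample entries).
SINGLE_CHAR_UPPER = {0x3B1: 0x391, 0x430: 0x410}

swash_cache = {}  # UTF-8 key -> code point, 0 meaning "no mapping found"


def toy_swash_fetch(utf8: bytes) -> int:
    """Step 4: return the cached map, consulting the tables on a miss."""
    if utf8 not in swash_cache:
        cp = decode_first(utf8)
        swash_cache[utf8] = SINGLE_CHAR_UPPER.get(cp, 0)
    return swash_cache[utf8]


def toy_to_upper(utf8: bytes):
    """Return (first code point of the map, the map as UTF-8 bytes)."""
    cp = decode_first(utf8)                       # step 1
    key = encode(cp)                              # step 2
    if key in MULTI_CHAR_UPPER:                   # step 3
        out = MULTI_CHAR_UPPER[key]
    else:
        mapped = toy_swash_fetch(key)             # step 4
        out = encode(mapped if mapped else cp)    # step 5a / 5b
    return decode_first(out), out                 # step 6
```

In this model, the first lookup of a character with a single-character map costs five conversions, a repeat lookup that hits the cache costs four, and a multi-character map costs three.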
There are several things apparent to me. One is that there are
altogether too many conversions between UTF-8 and UTF-32. These are not
trivial. Another (very minor) thing is that the code assumes that NUL
always maps to itself. This assumption should be noted in a comment, but
is likely to always be true. Finally, in the case of the mapping being
to itself, we do even more extra work at the end, since the output is
just the input.
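As an illustration of the last point, the sketch below shows one way the identity case could avoid the extra work. It is a hypothetical Python helper (finish_mapping is an invented name, not anything in the Perl source), and it assumes the caller still has the input bytes and the code point found in step 1.

```python
# Hypothetical sketch, not Perl's implementation.

def finish_mapping(input_utf8: bytes, input_cp: int, mapped_cp: int):
    """Return (first code point, output UTF-8) for a single-character map.

    mapped_cp == 0 is taken to mean "maps to itself", as in step 5b.
    """
    if mapped_cp == 0:
        # Identity map: the output is the input, so reuse the bytes and
        # the code point already in hand. U+0000 itself also lands here,
        # which is the NUL assumption mentioned above.
        return input_cp, input_utf8
    # Real mapping: one conversion is still needed to produce the output.
    return mapped_cp, chr(mapped_cp).encode("utf-8")
```

With this shape, the identity case performs no conversions at all at the end, and the real-mapping case performs exactly one.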
There are also several functions which take UTF-32 as input and have the
same outputs as this one, such as to_uni_upper(). All these do is add a
Step 0 which converts the UTF-32 to UTF-8 and then calls to_utf8_case()
(which converts it immediately back to UTF-32, and so on.)
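The round trip added by Step 0 can be seen in a small Python trace. This is a toy illustration only: toy_uni_upper and toy_utf8_upper are hypothetical stand-ins for a wrapper such as to_uni_upper() and for the UTF-8 based worker, and Python's own str.upper() is used in place of Perl's tables.

```python
# Toy illustration only; the names are hypothetical, not Perl's C API.

trace = []  # records each conversion in the order it happens


def to_utf8(cp: int) -> bytes:
    trace.append("code point -> UTF-8")
    return chr(cp).encode("utf-8")


def to_code_point(utf8: bytes) -> int:
    trace.append("UTF-8 -> code point")
    return ord(utf8.decode("utf-8")[0])


def toy_utf8_upper(utf8: bytes) -> int:
    """Stand-in for the UTF-8 based worker; returns the first mapped code point."""
    cp = to_code_point(utf8)        # the worker's step 1
    return ord(chr(cp).upper()[0])  # Python's mapping stands in for the tables


def toy_uni_upper(cp: int) -> int:
    """Stand-in for a wrapper that takes a code point as input."""
    return toy_utf8_upper(to_utf8(cp))  # "Step 0", then the worker
```

Calling toy_uni_upper(0x3B1) leaves "code point -> UTF-8" followed immediately by "UTF-8 -> code point" in trace, which is the redundant pair described above.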
The reason for the apparently redundant conversions in Steps 1 and 2 is
that on EBCDIC machines the code needs to convert things to conform with
the Latin1 tables generated by mktables. This could be #ifdef'd out (with some
refactoring), but even better would be (as I have proposed in another
email) for mktables to generate EBCDIC tables, which would eliminate the
whole necessity of this.
The comments about why the key for the multi-char maps is stored in
UTF-8 indicate that it is for speed, to avoid having to find the UTF-32.
But we, in fact, find it anyway.
I'm just pointing out the inefficiencies here. There are a number of
things that could be done to cut some of the conversions down.