
public at khwilliamson
Nov 27, 2009, 8:44 AM
Post #1 of 9
I believe the current definition of \X is flawed. First of all, it isn't the Unicode concept it purports to be. \X is defined as qr/(?>\PM\pM*)/, and in several places the documentation says that this is a Unicode "combining character sequence". The current definition for that concept is qr/ {base}? ( \pM | \N{ZWJ} | \N{ZWNJ} )+ /x, where {base} is something like, but not exactly, \PM. Assume for the sake of argument that it is exactly \PM. Note that the base is optional in the Unicode definition, but not in the Perl one. Further note that the \pM is optional in the Perl definition, but not in the Unicode one. This means, for example, that \X matches 'A', but a "combining character sequence" does not.

In other places the Perl documentation says that \X is an "extended Unicode combining character sequence". I don't know what this means. Unicode has an "extended combining character sequence", which is similar to the regular one but also includes Hangul (Korean) syllables; again, that is not the Perl definition. Perhaps what is meant is that Perl has extended (modified) the Unicode concept to be more like what it wanted.

The Perl documentation says "\X matches quite well what normal (non-Unicode-programmer) usage would consider a single character", that is, a logical character. As Unicode TR29 says, "It is important to recognize that what the user thinks of as a "character" -- a basic unit of a writing system for a language -- may not be just a single Unicode code point. Instead, that basic unit may be made up of multiple Unicode code points. To avoid ambiguity with the computer use of the term character, this is called a user-perceived character. For example, “G” + acute-accent is a user-perceived character: users think of it as a single character, yet is actually represented by two Unicode code points. These user-perceived characters are approximated by what is called a grapheme cluster, which can be determined programmatically."
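
To make the mismatch between the two patterns concrete, here is a minimal sketch in Python rather than Perl. It hand-codes both definitions using general categories from the standard unicodedata module, and it takes {base} to be exactly \PM, as assumed above. The function names perl_x_len and ccs_len are made up for this illustration and are not part of Perl or Unicode.

```python
import unicodedata

ZWJ, ZWNJ = "\u200d", "\u200c"

def is_mark(ch):
    # \pM: general category Mn, Mc or Me
    return unicodedata.category(ch).startswith("M")

def perl_x_len(s, i=0):
    # Length matched at s[i:] by the current \X, (?>\PM\pM*); 0 means no match.
    if i >= len(s) or is_mark(s[i]):
        return 0                      # \PM is mandatory, so a leading mark fails
    j = i + 1
    while j < len(s) and is_mark(s[j]):
        j += 1                        # \pM* takes every following mark
    return j - i

def ccs_len(s, i=0):
    # Length matched by {base}? ( \pM | ZWJ | ZWNJ )+ with {base} assumed to be \PM.
    def run(j):
        k = j
        while k < len(s) and (is_mark(s[k]) or s[k] in (ZWJ, ZWNJ)):
            k += 1
        return k - j
    if i < len(s) and not is_mark(s[i]):
        n = run(i + 1)                # try with a base character first
        if n:
            return 1 + n
    return run(i)                     # otherwise the base is omitted

print(perl_x_len("A"), ccs_len("A"))                # 1 0
print(perl_x_len("\u0301"), ccs_len("\u0301"))      # 0 1
print(perl_x_len("e\u0301"), ccs_len("e\u0301"))    # 2 2
```

The two definitions agree on a base followed by marks, and disagree on a bare base ('A') and on a bare mark (U+0301 COMBINING ACUTE ACCENT).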

What this means, I believe, is that \X should be a "grapheme cluster" instead of something like a "combining character sequence", and I propose to change it to be so. Actually, I propose to change it to be the latest type of grapheme cluster, the "extended grapheme cluster". The "combining character sequence" is intended by Unicode to be used for normalization purposes, not to define logical characters. (Unicode did not always have the concept of the grapheme cluster defined, but it does now, and it appears to me to be what \X is supposed to mean; and in the areas that Unicode has examined, to use Jarkko's words, "Unicode knows better than Perl".)

What are the implications of changing? It turns out that the \pM* component of the Perl definition of \X is almost the same as the corresponding part of an extended grapheme cluster. The difference is that the Unicode definition includes 11 more characters than the current Perl one: ZWJ, ZWNJ, two Japanese characters that should be marks but aren't (because they were brought in with characteristics based on a pre-existing standard), 4 Thai characters, and 3 Laotian characters. So I think this part of the definition change should not adversely affect any existing code.

The principal difference is the beginning component. The Perl definition can fail to match input if the next character is a mark. The Unicode definition is guaranteed to match at least one character. The Perl behavior seems like a bug to me: \X is like a logical '.'; '.' always matches a character, therefore so should \X, and Unicode agrees. It is rare to have a mark in isolation, but it can happen. The Standard gives the example of text talking about a mark.

The Unicode definition also forbids the splitting of a CR LF sequence. As far as I know, these rarely occur in Perl because of the input processing, but one can certainly create a string with this sequence.

I started to have a side discussion with Tom about this, but now think the wider community should be involved. If people feel this would break existing code, we could add the capability to revert to the old semantics by adding another switch to the legacy pragma.

For your information, the basic Unicode extended grapheme cluster definition, using their terminology, is reproduced below. For further information, see http://www.unicode.org/reports/tr29/. Comments?

    ( CRLF | Prepend* ( Hangul-syllable | !Control ) ( Grapheme_Extend | Spacing_Mark )* | . )

Prepend matches 5 Thai and 5 Lao characters that behave weirdly in other ways as well. Control is not just control characters, but a few other things as well, included to make the definition work out. The combination of Grapheme_Extend or'd with Spacing_Mark is the same as \pM plus the 11 characters I mentioned above.
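
As a rough illustration of how the definition above behaves on an isolated mark and on CR LF, here is a deliberately simplified sketch in Python. It is not the TR29 algorithm. It ignores Prepend and the Hangul-syllable rules entirely, approximates Control as general categories Cc, Cf, Zl and Zp other than ZWJ and ZWNJ, and approximates Grapheme_Extend | Spacing_Mark as \pM plus ZWJ and ZWNJ only (the other 9 extra characters are left out). The helper names are invented for this example.

```python
import unicodedata

ZWJ, ZWNJ = "\u200d", "\u200c"

def is_extend(ch):
    # Assumption: Grapheme_Extend | Spacing_Mark approximated as \pM plus ZWJ, ZWNJ.
    return unicodedata.category(ch).startswith("M") or ch in (ZWJ, ZWNJ)

def is_control(ch):
    # Assumption: Control approximated as Cc, Cf, Zl, Zp, excluding ZWJ and ZWNJ.
    return (unicodedata.category(ch) in ("Cc", "Cf", "Zl", "Zp")
            and ch not in (ZWJ, ZWNJ))

def clusters(s):
    # Split s using the reduced pattern: CRLF | !Control Extend* | any one character.
    out, i = [], 0
    while i < len(s):
        if s.startswith("\r\n", i):
            j = i + 2                 # CR LF is kept together
        else:
            j = i + 1                 # at least one character is always consumed
            if not is_control(s[i]):
                while j < len(s) and is_extend(s[j]):
                    j += 1            # attach following marks to the first character
        out.append(s[i:j])
        i = j
    return out

print(clusters("e\u0301x"))     # ['é', 'x'] with the first item being two code points
print(clusters("\u0301a"))      # a leading mark forms a cluster by itself
print(clusters("a\r\nb"))       # ['a', '\r\n', 'b']
```

Unlike (?>\PM\pM*), every position in the string starts a cluster of at least one character, so a scan with this pattern can never get stuck on a mark.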