
Mailing List Archive: Perl: porters

What should \X match?


public at khwilliamson

Nov 27, 2009, 8:44 AM

Post #1 of 9
What should \X match?

I believe the current definition of \X is flawed. First of all it isn't
the Unicode concept it purports to be.

\X is defined as qr/(?>\PM\pM*)/, and in several places the
documentation says that this is a Unicode "combining character
sequence". The current definition of that concept is
qr/ {base}? ( \pM | \N{ZWJ} | \N{ZWNJ} )+ /x, where {base} is something
like, but not exactly, \PM. Assume for the sake of argument that it is
exactly \PM. Note that it is optional in the Unicode definition, but not
in the Perl one. Further note that the \pM is optional in the Perl
definition, but not in the Unicode one. This means, for example, that \X
matches 'A', but a 'combining character sequence' does not.
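
The mismatch can be checked with a small sketch. It assumes, as above, that {base} is exactly \PM, and it writes ZWJ and ZWNJ as \x{200D} and \x{200C}. The sub names are made up for illustration.

```perl
use strict;
use warnings;

# True if the whole string is one match of Perl's documented \X pattern.
sub is_perl_x {
    my ($str) = @_;
    return $str =~ /\A(?>\PM\pM*)\z/ ? 1 : 0;
}

# True if the whole string is one "combining character sequence",
# simplified by assuming {base} is exactly \PM.
# U+200D is ZWJ, U+200C is ZWNJ.
sub is_combining_sequence {
    my ($str) = @_;
    return $str =~ /\A\PM?(?:\pM|\x{200D}|\x{200C})+\z/ ? 1 : 0;
}

for my $str ("A", "e\x{301}", "\x{301}") {
    my $label = join ' ', map { sprintf 'U+%04X', ord($_) } split //, $str;
    printf "%-14s perl_x=%d ccs=%d\n",
        $label, is_perl_x($str), is_combining_sequence($str);
}
```

'A' matches only the Perl pattern, an isolated U+0301 matches only the Unicode-style one, and 'e' followed by U+0301 matches both.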

In other places in the Perl documentation, it says that \X is an
"extended Unicode combining character sequence". I don't know what this
means. Unicode has an "extended combining character sequence" which is
similar to the regular one, but includes Hangul (Korean) syllables,
again it's not the Perl definition. Perhaps what is meant is that Perl
has extended (modified) the Unicode concept to be more like what it wanted.

The Perl documentation says "\X matches quite well what normal
(non-Unicode-programmer) usage would consider a single character", that
is, a logical character. As Unicode TR29 says, "It is important to
recognize that what the user thinks of as a "character"—a basic unit of
a writing system for a language—may not be just a single Unicode code
point. Instead, that basic unit may be made up of multiple Unicode code
points. To avoid ambiguity with the computer use of the term character,
this is called a user-perceived character. For example, “G” +
acute-accent is a user-perceived character: users think of it as a
single character, yet is actually represented by two Unicode code
points. These user-perceived characters are approximated by what is
called a grapheme cluster, which can be determined programmatically."

What this means, I believe, is that \X should be a "grapheme cluster"
instead of something like a "combining character sequence". And I
propose to change it to be so. Actually, I propose to change it to be
the latest type of grapheme cluster, the "extended grapheme cluster".
The "combining character sequence" is intended by Unicode to be used
for normalization purposes, not to define logical characters.

(Unicode did not always have the concept of the grapheme cluster
defined, but it does now, and it appears to me to be what \X is supposed
to mean, and in the areas that Unicode has examined, to use Jarkko's
words, "Unicode knows better than Perl".)

What are the implications of changing? It turns out that the \pM*
component of the Perl definition of \X is almost the same as the
corresponding part of an extended grapheme cluster. The difference is
that the Unicode definition includes 11 more characters than the current
Perl one: ZWJ, ZWNJ, two Japanese characters that should be marks but
aren't because they were brought in with characteristics based on a
pre-existing standard, 4 Thai characters, and 3 Laotian characters. So I
think this part of the definition change should not adversely affect any
existing code.

The principal difference is the beginning component. The Perl
definition can fail to match input if the next character is a mark. The
Unicode definition is guaranteed to match at least one character. And
this seems like a bug in the Perl definition to me. \X is like a
logical '.'; '.' always matches a character, and therefore so should \X,
and Unicode agrees. It is rare for a mark to be in isolation, but it can
happen. The Standard gives the example of text talking about a mark.

The Unicode definition also forbids splitting a CR LF sequence. As far
as I know, these sequences rarely occur in Perl strings because of the
input processing, but one can certainly create a string containing one.

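
Both behaviours can be seen by writing the old pattern out explicitly, so the result does not depend on what \X means in the Perl being run. The "patched" pattern below is only an illustration of the two changes just described (keep CR LF together, fall back to any single character). It is not the full extended grapheme cluster, and the sub names are invented.

```perl
use strict;
use warnings;

# Number of pieces found by Perl's documented \X pattern.
sub count_legacy {
    my ($str) = @_;
    my @pieces = $str =~ /((?>\PM\pM*))/g;
    return scalar @pieces;
}

# Same pattern with two of the proposed changes bolted on:
# CR LF is kept together, and any single character is accepted
# as a last resort.
sub count_patched {
    my ($str) = @_;
    my @pieces = $str =~ /(\x{0D}\x{0A}|(?>\PM\pM*)|.)/gs;
    return scalar @pieces;
}

my $mark_first = "\x{301}abc";      # starts with an isolated combining acute
my $crlf       = "\x{0D}\x{0A}";

printf "mark first: legacy=%d patched=%d\n",
    count_legacy($mark_first), count_patched($mark_first);
printf "CR LF:      legacy=%d patched=%d\n",
    count_legacy($crlf), count_patched($crlf);
```

With the old pattern the leading mark is silently skipped (3 pieces for 4 characters) and CR LF comes out as 2 pieces. The patched pattern yields 4 and 1.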
I started to have a side discussion with Tom about this, but now think
the wider community should be involved. If people feel this would break
existing code, we could add the capability to revert to the old
semantics, by adding another switch to the legacy pragma.

For your information, the Unicode extended grapheme cluster definition,
using their terminology, is reproduced below. For further information,
see http://www.unicode.org/reports/tr29/.

Comments


( CRLF
| Prepend* ( Hangul-syllable | !Control )
( Grapheme_Extend | Spacing_Mark)*
| . )

Prepend matches 5 Thai and 5 Lao characters that behave weirdly in other
ways as well. Control covers not just control characters but a few other
things as well, to make the definition work out. The combination of
Grapheme_Extend or'd with Spacing_Mark is the same as \pM plus the 11
characters I mentioned above.
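
As a rough sketch only, the grammar above can be transcribed into a Perl pattern. Everything concrete here is an assumption made for illustration: the Hangul ranges cover just the main jamo block and the precomposed syllables, Prepend is taken to be U+0E40..U+0E44 and U+0EC0..U+0EC4, Control is approximated by the Cc, Cf, Zl and Zp categories minus ZWJ/ZWNJ, and the extender class is \pM plus ZWJ/ZWNJ (the other nine extra characters are left out).

```perl
use strict;
use warnings;

# Hangul pieces (approximate ranges, see the assumptions above).
my $L   = qr/[\x{1100}-\x{115F}]/;    # leading consonants
my $V   = qr/[\x{1160}-\x{11A7}]/;    # vowels
my $T   = qr/[\x{11A8}-\x{11FF}]/;    # trailing consonants
my $syl = qr/[\x{AC00}-\x{D7A3}]/;    # precomposed syllables
my $hangul = qr/ $L* (?: $V+ | $syl $V* ) $T* | $L+ | $T+ /x;

# Assumed stand-ins for Prepend, (Grapheme_Extend | Spacing_Mark)
# and !Control.
my $prepend = qr/[\x{0E40}-\x{0E44}\x{0EC0}-\x{0EC4}]/;
my $extend  = qr/[\pM\x{200C}\x{200D}]/;
my $base    = qr/[\x{200C}\x{200D}]|[^\p{Cc}\p{Cf}\p{Zl}\p{Zp}]/;

# The three alternatives of the grammar, in the same order.
my $cluster = qr/
      \x{0D}\x{0A}
    | $prepend* (?: $hangul | $base ) $extend*
    | .
/xs;

# Split a string into clusters according to the sketch.
sub clusters {
    my ($str) = @_;
    my @found = $str =~ /($cluster)/g;
    return @found;
}

for my $str ("a\x{0D}\x{0A}b", "e\x{301}x", "\x{1100}\x{1161}", "\x{301}") {
    my $label = join ' ', map { sprintf 'U+%04X', ord($_) } split //, $str;
    printf "%-28s %d cluster(s)\n", $label, scalar(my @c = clusters($str));
}
```

Because every alternative consumes at least one character, the pattern can never fail on non-empty input, which is the guarantee the old definition lacks.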


john.imrie at vodafoneemail

Nov 27, 2009, 12:02 PM

Post #2 of 9
Re: What should \X match?

<cut>
> What this means, I believe, is that \X should be a "grapheme cluster"
> instead of something like a "combining character sequence", . And I
> propose to change it to be so. Actually, I propose to change it to be
> the latest type of grapheme cluster, the "extended grapheme cluster".
> The "combining character sequence" is intended by Unicode to be used
> for normalization purposes, not to define logical characters.
<cut>

IIRC a grapheme cluster also contains locale-dependent constructs such as
'ch' in German and 'll' in Welsh. Will your new code handle this or
should we ignore these constructs? The reason I bring this up is because
Unicode collation code makes use of these.

John

______________________________________________
This email has been scanned by Netintelligence
http://www.netintelligence.com/email


jhi at iki

Nov 27, 2009, 2:05 PM

Post #3 of 9
Re: What should \X match?

I think the answer is simply that the definition of \X very probably
just was never properly updated as the definitions and terminology of
Unicode changed over time.

What we need to do is not to get too hung up on the exact words used
in the current documentation and instead try to follow the *intention*
of \X which I believe was simply "single base character followed by zero
or more diacritics". If the current Unicode doesn't include anything
quite that simple, but it does contain something similar, then let's
take that. And then update the implementation/documentation to agree
with the current state of things. And by "we" I mean "you".


public at khwilliamson

Nov 27, 2009, 3:07 PM

Post #4 of 9
Re: What should \X match?

John wrote:
> <cut>
>> What this means, I believe, is that \X should be a "grapheme cluster"
>> instead of something like a "combining character sequence", . And I
>> propose to change it to be so. Actually, I propose to change it to be
>> the latest type of grapheme cluster, the "extended grapheme cluster".
>> The "combining character sequence" is intended by Unicode to be used
>> for normalization purposes, not to define logical characters.
> <cut>
>
> IIRC a grapheme cluster also contains local dependent constructs such as
> 'ch' in German and 'll' in Welsh. Will your new code handle this or
> should we ignore these constructs? The reason I bring this up is because
> Unicode collation code makes use of these.

Unicode specifies a basic grapheme cluster, and applications are free to
tailor it for various locales. But I don't think such tailoring belongs
in the Perl core, so I was proposing to use the un-tailored version.


john.imrie at vodafoneemail

Nov 27, 2009, 3:57 PM

Post #5 of 9
Re: What should \X match?

Karl,
>
> Unicode specifies a basic grapheme cluster, and applications are free
> to tailor it for various locales. But I don't think such tailoring
> belongs in the Perl core, so I was proposing to use the un-tailored
> version

Thanks for the reply, that's probably the best way to go then.

John



ben at morrow

Nov 27, 2009, 4:26 PM

Post #6 of 9
Re: What should \X match?

Quoth public [at] khwilliamson (karl williamson):
> John wrote:
> >
> > IIRC a grapheme cluster also contains local dependent constructs such as
> > 'ch' in German and 'll' in Welsh. Will your new code handle this or
> > should we ignore these constructs? The reason I bring this up is because
> > Unicode collation code makes use of these.
>
> Unicode specifies a basic grapheme cluster, and applications are free to
> tailor it for various locales. But I don't think such tailoring belongs
> in the Perl core, so I was proposing to use the un-tailored version

Is it worth providing a way for extensions to change what \X means (and,
possibly, any other Unicode constructions that can be locale-dependent)?
It would presumably be necessary for such changes to be lexically
scoped, and recorded in any regexen compiled while they were in scope.
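
For what it's worth, the lexical-scoping half of that could lean on the %^H hints hash described in perlpragma. The sketch below is purely hypothetical: GraphemeTailoring is an invented pragma name, it is defined inline rather than as a real module, and nothing here touches the regex engine.

```perl
use strict;
use warnings;

BEGIN {
    package GraphemeTailoring;    # hypothetical pragma, defined inline

    # "use GraphemeTailoring 'cy'" records the locale in the hints hash
    # of the scope currently being compiled.
    sub import {
        my (undef, $locale) = @_;
        $^H{'GraphemeTailoring/locale'} = $locale;
    }

    # "no GraphemeTailoring" clears it again for the rest of the scope.
    sub unimport {
        delete $^H{'GraphemeTailoring/locale'};
    }

    # Report the tailoring that was in scope where the caller was compiled.
    sub in_effect {
        my $hints = (caller(0))[10];
        return $hints ? $hints->{'GraphemeTailoring/locale'} : undef;
    }

    $INC{'GraphemeTailoring.pm'} = __FILE__;   # so "use" does not look on disk
}

sub inside_scope {
    use GraphemeTailoring 'cy';
    return GraphemeTailoring::in_effect();
}

sub outside_scope {
    return GraphemeTailoring::in_effect();
}

printf "inside: %s, outside: %s\n",
    inside_scope()  // 'none',
    outside_scope() // 'none';
```

The setting is visible only to code compiled inside the block that said "use". How a compiled regex would record and consult it is left open.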

Ben


john.imrie at vodafoneemail

Nov 27, 2009, 4:36 PM

Post #7 of 9
Re: What should \X match?

> Is it worth providing a way for extensions to change what \X means (and,
> possibly, any other Unicode constructions that can be locale-dependant)?
> It would presumably be necessary for such changes to be lexically
> scoped, and recorded in any regexen compiled while they were in scope.
>
> Ben
>
>
You then run into the problem of what to do with qr//. Does it keep the
locale it was compiled with or the one it's used in?

I think part of the discussion on what \d and \w should match touched on
this. But I can't remember the outcome.

John



ikegami at adaelis

Nov 27, 2009, 8:22 PM

Post #8 of 9
Re: What should \X match?

On Fri, Nov 27, 2009 at 11:44 AM, karl williamson
<public [at] khwilliamson> wrote:

> What this means, I believe, is that \X should be a "grapheme cluster"
> instead of something like a "combining character sequence", . And I propose
> to change it to be so. Actually, I propose to change it to be the latest
> type of grapheme cluster, the "extended grapheme cluster".
>

I agree. It seems to me that's the intent of \X.

One example is:

"\x{1100}\x{1161}" =~ /^\X$/

Currently, it doesn't match. It should, since it's the decomposed form of
U+AC00. It is covered by "Hangul-syllable" in your suggestion. It's really
no different than

"\x{0065}\x{0301}" =~ /^\X$/

which does match.
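
A small sketch to reproduce the comparison. The old definition is written out explicitly as (?>\PM\pM*) so that column does not depend on the Perl version. The "builtin" column uses whatever \X means in the Perl running the script. The sub name is invented.

```perl
use strict;
use warnings;

# True if the whole string is a single \X under the old definition.
sub legacy_single {
    my ($str) = @_;
    return $str =~ /^(?>\PM\pM*)$/ ? 1 : 0;
}

for my $str ("\x{1100}\x{1161}", "\x{0065}\x{0301}") {
    my $label = join ' ', map { sprintf 'U+%04X', ord($_) } split //, $str;
    printf "%-16s legacy=%d builtin=%d\n",
        $label,
        legacy_single($str),
        ($str =~ /^\X$/ ? 1 : 0);
}
```

The legacy column shows 0 for the jamo pair and 1 for 'e' plus acute, in line with the report above. The builtin column depends on the Perl version in use.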


jesse at fsck

Nov 29, 2009, 1:54 PM

Post #9 of 9
Re: What should \X match?

On Fri, Nov 27, 2009 at 05:05:56PM -0500, Jarkko Hietaniemi wrote:
> I think the answer is simply that the definition of \X very probably
> just was never properly updated as the definitions and terminology of
> Unicode changed over time.
>
> What we need to do is to not to get too hung up on the exact words used
> in the current documentation and instead try to follow the *intention*
> of \X which I believe was simply "single base character followed by zero
> or more diacritics". If the current Unicode doesn't include anything
> quite that simple, but it does contain something similar, then let's
> take that. And then update the implementation/documentation to agree
> with the current state of things. And be "we" I mean "you".

+1

--
