
public at khwilliamson
Nov 21, 2009, 11:07 PM
PATCH #69018; revamped mktables
A revised mktables is available at git://github.com/khwilliamson/perl.git (the branch is called mktables). It fixes the minor bug #69018, concerning accepting the erroneous \p{Script=InGreek}, and perhaps other bugs; I need to look. It also fixes a number of things which I have not bothered to write bug reports on, many of which have been aired on the p5p list over the last several months.

SKIP: personal_narrative {

This is close to a complete rewrite of mktables. I did not set out to do this, but as the work progressed, I discovered more and more things wrong. Having really looked into the Unicode history now, it appears to me that when the original mktables was written, it was not clear what direction Unicode would go in, and it went in a different direction than anticipated. It took me quite a while to understand the distinction between some data structures that were muddled. After that, the code I was writing got clearer. I learned a lot about Unicode; at some point I came to the shocking realization, when wondering "why in tarnation did the code do that?", that I had come to know more about the Unicode standard than some of the patchers did.

The old version's tables are mostly correct, but there are a number of problems with them, some subtle, some not so. The combining class table is so wrong that it could easily be the butt of jokes; I've thought of a few myself. Another problem was that many of the newer Unicode tables are unreadable by the old mktables without extensive munging of them.

} end SKIP

SKIP: design goals {

I eventually gave up trying to fit into the existing mktables, and just rewrote things. I've tried to make this work on autopilot, so that new Unicode releases will require a minimum of fuss. The pod and .t files are generated from the data, so that other things don't have to be patched to keep up. I've also added much more input validation, so that if a new enum value is added to a field, we will know, instead of blindly ignoring it.
There are more goals, but it's getting late, and hard for me to think.

} end SKIP

The last part of this email includes text about the changes, intended for perldelta. I'm not sure who's supposed to patch that. In addition to those, here are the other things that have changed, but aren't notable enough to mention externally. Hence, the most important changes are after the line of ### in this email.

I've changed the main Makefile to call mktables differently (besides the parameters telling it where to put the pod and test files). mktables need not actually run very often. The inputs are pretty static, just like Encode's, and the dependency is on far more files than Makefile knows about. This patch fixes the bug in mktables wherein it wrongly calculated whether it should run or not, so I've removed the -w option from the call. When called without -w, mktables will check and do nothing, quickly, if nothing is out-of-date. If people don't trust that, we could change it so the apparent critical dependencies are still known to Makefile, using the -w option to force mktables to run when those dependencies trigger it, and then have an unconditional call to mktables as well, without the -w, so that it can check its own dependency list.

That said, mktables is currently running too often; something in Makefile is removing some of these output files when I don't think it should. I haven't had a chance to investigate this.

The files that are generated for case mapping and folding continue to have two parts, the regular part and a hash for special cases. It turns out that a number of the special case entries could be handled just as well using the regular method, so they have been moved there.

All duplicate files have been eliminated. That means that if two properties match the same exact set of code points, one file serves both. This was not so much to save disk space as to save memory, since the same swash can now serve multiple properties.
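
The sharing of one file between properties that match the same code points can be pictured with the sketch below. It is only an illustration of the idea, not the code in mktables: the function name share_tables, the layout of %tables and the generated file names are all hypothetical.

```perl
use strict;
use warnings;

# Hypothetical input: property name => list of [start, end] code point ranges.
my %tables = (
    Cntrl           => [ [0x00, 0x1F], [0x7F, 0x9F] ],
    Control         => [ [0x00, 0x1F], [0x7F, 0x9F] ],
    ASCII_Hex_Digit => [ [0x30, 0x39], [0x41, 0x46], [0x61, 0x66] ],
);

# Map every property to a file name, reusing the first file that was
# assigned to an identical set of ranges.
sub share_tables {
    my ($tables) = @_;
    my (%file_for_key, %file_for_property);
    for my $name (sort keys %$tables) {
        # Serialize the ranges so that identical tables get identical keys.
        my $key = join ',', map { sprintf '%X-%X', @$_ } @{ $tables->{$name} };
        $file_for_key{$key} = "$name.pl" unless exists $file_for_key{$key};
        $file_for_property{$name} = $file_for_key{$key};
    }
    return \%file_for_property;
}

my $files = share_tables(\%tables);
printf "%-16s => %s\n", $_, $files->{$_} for sort keys %$files;
```

In this toy input Cntrl and Control end up pointing at the same file, so whatever is loaded for one can be reused for the other.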
There are new options to mktables:

  -globlist  attempts to process all .txt files in the directory structure. The ones it doesn't know how to handle are processed assuming that they follow the typical .txt syntax.
  -P dir     tells mktables to create perluniprops.pod in dir. Makefile has been changed so this goes in the standard pod directory.
  -T path    tells mktables to create a .t file as 'path'. Makefile has been changed so this goes into t/re/uniprops.t
  -p         tells mktables to give progress information as it works.
  -c         tells mktables to not output range counts in the .pl files it generates. These are by default output as comments; I have found them helpful for debugging, and they don't add much disk space.

Canonical.pl and Exact.pl have been replaced by Heavy.pl, which allows for more straightforward code in utf8_heavy.pl.

There are several new features which lay the groundwork for fixing charnames to know about all code points and named sequences; for \X to match more correctly; and for allowing other tables such as To/Digit.pl to be read by the Perl core.

I removed a test from re/pat_advanced which relied on the old erroneous definition of \w, which included superscripts as part of a word, and changed another test in regexp_unicode_prop.t, again because of a changed property definition.

I have compared the outputs of this version and the previous and am confident that all the differences are correct.

I tried to be scrupulous about using File::Spec, but tested this only on Linux and Windows boxes, so there may be mistakes that should be smoked out.

I tried running perlcritic on this, but it crashed, apparently at an innocuous place. I did mostly follow Perl Best Practices.

The rest of this is text intended to be suitable for perldelta. NOTE that this includes some anticipated documentation changes that haven't been submitted yet.

######################################################

Perl can now handle every Unicode character property.
A new pod, perluniprops, lists all available non-Unihan character properties. By default the Unihan properties and certain others (deprecated and Unicode internal-only ones) are not exposed. See below for more details on these; there is also a section in the pod listing them, and why they are not exposed.

Perl now fully supports the Unicode compound-style of using '=' and ':' in writing regular expressions: \p{property=value} and \p{property:value} (both of which mean the same thing).

Perl now fully supports the Unicode loose matching rules for text between the braces in \p{...} constructs. In addition, Perl also allows underscores between digits of numbers.

All the Unicode-defined synonyms for properties and property values are now accepted.

\p{...} matches using the Canonical_Combining_Class property were completely broken in previous Perls. This is now fixed.

In previous Perls, the Unicode Decomposition_Type=Compat property and a Perl extension had the same name, which led to neither matching all the correct values (with more than 100 mistakes in one, and several thousand in the other). The Perl extension has now been renamed to be Decomposition_Type=Noncanonical (short: dt=noncanon). It has the same meaning as was previously intended, namely the union of all the non-canonical Decomposition types, with Unicode Compat being just one of those.

\p{Uppercase} and \p{Lowercase} have been brought into line with the Unicode definitions. This means they each match a few more characters than previously.

\p{Cntrl} now matches the same characters as \p{Control}. This means it no longer will match Private Use (gc=co), Surrogates (gc=cs), nor Format (gc=cf) code points. The Format code points represent the biggest possible problem. All but 36 of them are either officially deprecated or strongly discouraged from being used. Of those 36, likely the most widely used are the soft hyphen (U+00AD), and BOM, ZWSP, ZWNJ, WJ, and similar, plus Bi-directional controls.
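
A short sketch of how the compound forms and the \p{Cntrl} change might be checked, assuming a Perl that includes these changes. The variable names and sample characters are my own choices for illustration; the property spellings are the ones described above plus their standard Unicode short synonyms (sc, Grek, ccc).

```perl
use strict;
use warnings;

my $alpha       = "\x{3B1}";   # GREEK SMALL LETTER ALPHA
my $acute       = "\x{301}";   # COMBINING ACUTE ACCENT, combining class 230
my $tab         = "\t";        # U+0009, General_Category=Cc
my $soft_hyphen = "\x{AD}";    # U+00AD SOFT HYPHEN, General_Category=Cf

# '=' and ':' are interchangeable, case is ignored, and the Unicode
# short synonyms are accepted.
my @greek = (
    qr/\p{Script=Greek}/,
    qr/\p{Script:Greek}/,
    qr/\p{script=greek}/,
    qr/\p{sc=Grek}/,
);
my $hits = grep { $alpha =~ $_ } @greek;
printf "%d of %d spellings matched alpha\n", $hits, scalar @greek;

# Canonical_Combining_Class by long name and value name, or short name and number.
my $ccc_long  = $acute =~ /\p{Canonical_Combining_Class=Above}/ ? 1 : 0;
my $ccc_short = $acute =~ /\p{ccc=230}/                         ? 1 : 0;
print "ccc long form: $ccc_long, short form: $ccc_short\n";

# \p{Cntrl} now means \p{Control}: Format characters no longer match.
sub yn { return $_[0] ? 'yes' : 'no' }
printf "tab:         Cntrl=%s Format=%s\n",
    yn($tab =~ /\p{Cntrl}/), yn($tab =~ /\p{gc=cf}/);
printf "soft hyphen: Cntrl=%s Format=%s\n",
    yn($soft_hyphen =~ /\p{Cntrl}/), yn($soft_hyphen =~ /\p{gc=cf}/);
```

Code that relied on \p{Cntrl} to catch the soft hyphen or the zero-width characters would need to test \p{gc=cf} as well.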
\p{Alpha} now matches the same characters as \p{Alphabetic}. The Perl definition included a number of things that aren't really alpha (all marks), while omitting many that were. The Unicode definition is clearly better, so we are switching to it. As a direct consequence, the definitions of \p{Alnum} and \p{Word}, which depend on Alpha, also change.

\p{Word} also now doesn't match certain characters it wasn't supposed to, such as fractions.

\p{Print} no longer matches the line control characters: tab, lf, cr, ff, vt, and nel. This brings it in line with the documentation.

\p{Decomposition_Type=Canonical} now includes the Hangul syllables.

The Numeric type property has been extended to include the Unihan characters.

There is a new Perl extension, the 'Present_In', or simply 'In', property. This is an extension of the Unicode Age property: whereas \p{Age=5.0} matches only code points added in 5.0, \p{In=5.0} matches any code point whose usage has been determined as of Unicode version 5.0.

A number of properties did not have the correct values for unassigned code points. This is now fixed. The affected properties are Bidi_Class, East_Asian_Width, Joining_Type, Decomposition_Type, Hangul_Syllable_Type, Numeric_Type, and Line_Break.

The Default_Ignorable_Code_Point, ID_Continue, and ID_Start properties have been updated to their current definitions.

Certain properties that are supposed to be Unicode internal-only were erroneously exposed by previous Perls. Use of these in regular expressions will now generate a deprecation warning message, if those warnings are enabled. The properties are: Other_Alphabetic, Other_Default_Ignorable_Code_Point, Other_Grapheme_Extend, Other_ID_Continue, Other_ID_Start, Other_Lowercase, Other_Math, and Other_Uppercase.

An installation can now fairly easily change Perl to operate on any Unicode release. Perl is shipped with the latest official release, but an installation can now download any prior release, and Perl will work with that.
Instructions are in perlunicode.pod.

An installation can now fairly easily change which Unicode properties Perl understands. As mentioned above, certain properties are by default turned off. These include all the Unihan properties (which should be accessible via the CPAN module Unicode::Unihan) and any deprecated or Unicode internal-only property that Perl has never exposed.

The files in the To directory are now more clearly marked as being stable and directly usable by applications. New hash entries in them give the format of the normal entries, which allows for easier machine parsing. Perl can generate files in this directory for any property, though most are suppressed. An installation can choose to change which get written. Instructions are in perluniprops.pod.
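
As a quick check of two of the property changes listed above, the sketch below contrasts \p{Age=5.0} with the Present_In extension and tests a fraction against \p{Word}. It assumes a Perl that includes these changes; the helper name age_flags and the choice of sample characters are mine, for illustration only.

```perl
use strict;
use warnings;

# Report which of the two version properties a character matches.
sub age_flags {
    my ($char) = @_;
    my $in  = $char =~ /\p{In=5.0}/  ? 'In=5.0'  : '-';
    my $age = $char =~ /\p{Age=5.0}/ ? 'Age=5.0' : '-';
    return "$in $age";
}

my @samples = (
    [ 'U+0041 LATIN CAPITAL LETTER A (present since Unicode 1.1)', "A"         ],
    [ 'U+1B05 BALINESE LETTER AKARA (added in Unicode 5.0)',       "\x{1B05}"  ],
    [ 'U+1F600 GRINNING FACE (not in Unicode 5.0)',                "\x{1F600}" ],
);
printf "%-62s %s\n", $_->[0], age_flags($_->[1]) for @samples;

# VULGAR FRACTION ONE HALF is no longer a word character.
my $half = "\x{BD}";
print "one half matches \\p{Word}: ", ($half =~ /\p{Word}/ ? 'yes' : 'no'), "\n";
```

Only the Balinese letter should match both properties; the capital A matches In=5.0 alone, because it was already present before 5.0.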