pf at artcom-gmbh
Mar 15, 2000, 3:42 AM
Unicode in Python and Tcl/Tk compared (was Unicode patches checked in...)
> > Fredrik Lundh writes:
> > >didn't notice this before, but I just realized that after the
> > >latest round of patches, the python15.dll is now 700k larger
> > >than it was for 1.5.2 (more than twice the size).
> "Andrew M. Kuchling" wrote:
> > Most of that is due to Modules/unicodedata.c, which is 2.1Mb of source
> > code, and produces a 632168-byte .o file on my Sparc. (Will some
> > compiler systems choke on a file that large? Could we read database
> > info from a file instead, or mmap it into memory?)
M.-A. Lemburg wrote:
> That is due to the unicodedata module being compiled
> into the DLL statically. On Unix you can build it shared too
> -- there are no direct references to it in the implementation.
> I suppose that on Windows the same should be done... the
> question really is whether this is intended or not -- moving
> the module into a DLL is at least technically no problem
> (someone would have to supply a patch for the MSVC project
> files though).
> Note that unicodedata is only needed by programs which do
> a lot of Unicode manipulations and in the future probably
> by some codecs too.
Now that the Unicode patches have been checked in, and since Fredrik Lundh
noticed a considerable increase in the size of the Python DLL,
obviously caused mostly by those tables, I began to fear
that a Python/Tcl/Tk based application could eat up much more memory
if we update from Python 1.5.2 and Tcl/Tk 8.0.5
to Python 1.6 and Tcl/Tk 8.3.0.
As some of you certainly know, some kind of unicode support has
also been added to Tcl/Tk since 8.1. So I did some research and
would like to share what I have found out so far:
Here are the compared sizes of the tcl/tk shared libs on Linux:
old:                 | new:                 | bloat increase in %:
libtcl8.0.so  533414 | libtcl8.3.so  610241 | 14.4 %
libtk8.0.so   714908 | libtk8.3.so   811916 | 13.6 %
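For anyone who wants to recheck the percentages, here is a minimal Python
sketch of the arithmetic. The helper name growth_percent is made up for this
illustration; the byte counts are the ones from the table above.

```python
def growth_percent(old_size, new_size):
    """Return the relative size increase from old_size to new_size, in percent."""
    return (new_size - old_size) * 100.0 / old_size


# Byte counts copied from the table above: (old, new).
sizes = {
    "libtcl": (533414, 610241),
    "libtk": (714908, 811916),
}

for name, (old, new) in sizes.items():
    print("%-6s %7d -> %7d  %+.1f %%" % (name, old, new, growth_percent(old, new)))
```

The same helper can be applied to any other pair of file sizes, for example
values obtained from os.path.getsize() on two builds.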
Unicode support wasn't the only change to Tcl/Tk, so this
seems reasonable. Unfortunately there is no Python shared library,
so a direct comparison of increased memory consumption is impossible.
Nevertheless I have the following figures (stripped binary sizes of
the Python interpreter, in bytes):
CVS_10-02-00 393668 (a month before unicode)
CVS_12-03-00 507448 (just after unicode)
That is an increase of "only" 111 kBytes. Not so bad, but nevertheless
a "bloat increase" of 28.9 %. In addition there is now the unicodedata
module, which (I guess) will also be loaded if the application starts using
some of the new features.
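Whether the unicodedata tables are already part of a running process can be
checked from Python itself. The following is only a sketch, assuming a current
Python 3 interpreter rather than the 1.6 CVS tree discussed here; is_loaded is
a hypothetical helper name, and whether unicodedata is a shared or a static
module depends on how the interpreter was built.

```python
import sys


def is_loaded(module_name):
    """Report whether a module has already been imported into this process."""
    return module_name in sys.modules


before = is_loaded("unicodedata")

# The import is the point at which the character tables get pulled in
# (if nothing else in the process has imported the module earlier).
import unicodedata

after = is_loaded("unicodedata")

print("loaded before import:", before)
print("loaded after import: ", after)
print(unicodedata.name("\u00e4"))
```

An application that never imports unicodedata, directly or through a codec
that needs it, would not show the module in sys.modules.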
Since I haven't dealt with Unicode in the past, I feel unable to
compare the Unicode implementations of both systems, or to judge what impact
they will have on real memory usage and, even more important, on
the functionality when both packages are used together.
Tcl/Tk keeps around a sub-directory called 'encoding', which --I guess--
contains information somehow similar or related to that in 'unicodedata.so',
but separated into several files?
So below I have included shortened excerpts from the 200k+ tcl8.3.0/changes
and tk8.3.0/changes files about Unicode. Maybe someone
else more involved with Unicode can shed some light on this topic?
Do we need some changes to Tkinter.py or _tkinter or both?
---- 8< ---- 8< ---- cut here ---- 8< ---- schnipp ---- 8< ---- schnapp ----
======== Changes for 8.1 go below this line ========
6/18/97 (new feature) Tcl now supports international character sets:
- All C APIs now accept UTF-8 strings instead of iso8859-1 strings,
wherever you see "char *", unless explicitly noted otherwise.
- All Tcl strings represented in UTF-8, which is a convenient
multi-byte encoding of Unicode. Variable names, procedure names,
and all other values in Tcl may include arbitrary Unicode characters.
For example, the Tcl command "string length" returns how many
Unicode characters are in the argument string.
- For Java compatibility, embedded null bytes in C strings are
represented as \xC080 in UTF-8 strings, but the null byte at the end
of a UTF-8 string remains \0. Thus Tcl strings once again do not
contain null bytes, except for termination bytes.
- For Java compatibility, "\uXXXX" is used in Tcl to enter a Unicode
character. "\u0000" through "\uffff" are acceptable Unicode characters.
- "\xXX" is used to enter a small Unicode character (between 0 and 255)
- Tcl automatically translates between UTF-8 and the normal encoding for
the platform during interactions with the system.
- The fconfigure command now supports a -encoding option for specifying
the encoding of an open file or socket. Tcl will automatically
translate between the specified encoding and UTF-8 during I/O.
See the directory library/encoding to find out what encodings are
supported (eventually there will be an "encoding" command that
makes this information more accessible).
- There are several new C APIs that support UTF-8 and various encodings.
See Utf.3 for procedures that translate between Unicode and UTF-8
and manipulate UTF-8 strings. See Encoding.3 for procedures that
create new encodings and translate between encodings. See
ToUpper.3 for procedures that perform case conversions on UTF-8 strings.
1/16/98 (new feature) Tk now supports international characters sets:
- Font display mechanism overhauled to display Unicode strings
containing full set of international characters. You do not need
Unicode fonts on your system in order to use tk or see international
characters. For those familiar with the Japanese or Chinese patches,
there is no "-kanjifont" option. Characters from any available fonts
will automatically be used if the widget's originally selected font is
not capable of displaying a given character.
- Textual widgets are international aware. For instance, cursor
positioning commands would now move the cursor forwards/back by 1
international character, not by 1 byte.
- Input Method Editors (IMEs) work on Mac and Windows. Unix is still in
progress.
10/15/98 (bug fix) Changed regexp and string commands to properly
handle case folding according to the Unicode character tables.
10/21/98 (new feature) Added an "encoding" command to facilitate
translations of strings between different character encodings. See
the encoding.n manual entry for more details. (stanton)
11/3/98 (bug fix) The regular expression character classification
syntax now includes Unicode characters in the supported character classes.
11/17/98 (bug fix) "scan" now correctly handles Unicode characters.
11/19/98 (bug fix) Fixed menus and titles so they properly display
Unicode characters under Windows. [Bug: 819] (stanton)
4/2/99 (new apis) Made various Unicode utility functions public.
Tcl_UtfToUniCharDString, Tcl_UniCharToUtfDString, Tcl_UniCharLen,
Tcl_UniCharNcmp, Tcl_UniCharIsAlnum, Tcl_UniCharIsAlpha,
Tcl_UniCharIsDigit, Tcl_UniCharIsLower, Tcl_UniCharIsSpace,
Tcl_UniCharIsUpper, Tcl_UniCharIsWordChar, Tcl_WinUtfToTChar,
4/5/99 (bug fix) Fixed handling of Unicode in text searches. The
-count option was returning byte counts instead of character counts.
5/18/99 (bug fix) Fixed clipboard code so it handles Unicode data
properly on Windows NT and 95. [Bug: 1791] (stanton)
6/3/99 (bug fix) Fixed selection code to handle Unicode data in
COMPOUND_TEXT and STRING selections. [Bug: 1791] (stanton)
6/7/99 (new feature) Optimized string index, length, range, and
append commands. Added a new Unicode object type. (hershey)
6/14/99 (new feature) Merged string and Unicode object types. Added
new public Tcl API functions: Tcl_NewUnicodeObj, Tcl_SetUnicodeObj,
Tcl_GetUnicode, Tcl_GetUniChar, Tcl_GetCharLength, Tcl_GetRange,
6/23/99 (new feature) Updated Unicode character tables to reflect
Unicode 2.1 data. (stanton)
--- Released 8.3.0, February 10, 2000 --- See ChangeLog for details ---
---- 8< ---- 8< ---- cut here ---- 8< ---- schnipp ---- 8< ---- schnapp ----
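Two points from the 8.1 entry in the excerpt are easy to illustrate from the
Python side: the difference between counting characters and counting UTF-8
bytes, and the convention of writing an embedded null as the two bytes C0 80.
This is a minimal sketch, assuming a current Python 3; to_tcl_utf8 is a
hypothetical helper written for this illustration and is not part of Tkinter
or _tkinter.

```python
text = "Gr\u00fc\u00dfe"        # "Gruesse" spelled with u-umlaut and sharp s
utf8 = text.encode("utf-8")

print(len(text))   # number of Unicode characters
print(len(utf8))   # number of bytes in the UTF-8 form


def to_tcl_utf8(s):
    """Encode s as UTF-8, writing U+0000 as the two bytes C0 80.

    In standard UTF-8 the byte 0x00 only ever encodes U+0000, so a plain
    byte replacement is sufficient here.
    """
    return s.encode("utf-8").replace(b"\x00", b"\xc0\x80")


encoded = to_tcl_utf8("a\x00b")
print(encoded)
```

A C function that receives such a string as "char *" can keep treating the
first zero byte as the terminator, which is the behaviour the changelog entry
describes.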
Sorry if this was boring old stuff for some of you.
Best Regards, Peter
Peter Funk, Oldenburger Str.86, D-27777 Ganderkesee, Germany, Fax:+49 4222950260
office: +49 421 20419-0 (ArtCom GmbH, Grazer Str.8, D-28359 Bremen)