[Python-Dev] Unicode in Python and Tcl/Tk compared (was Unicode patches checked in...)

Peter Funk pf@artcom-gmbh.de
Wed, 15 Mar 2000 11:42:26 +0100 (MET)


Hi!

> > Fredrik Lundh writes:
> > >didn't notice this before, but I just realized that after the
> > >latest round of patches, the python15.dll is now 700k larger
> > >than it was for 1.5.2 (more than twice the size).
> > 
> "Andrew M. Kuchling" wrote:
> > Most of that is due to Modules/unicodedata.c, which is 2.1Mb of source
> > code, and produces a 632168-byte .o file on my Sparc.  (Will some
> > compiler systems choke on a file that large?  Could we read database
> > info from a file instead, or mmap it into memory?)
> 
M.-A. Lemburg wrote:
> That is due to the unicodedata module being compiled
> into the DLL statically. On Unix you can build it shared too
> -- there are no direct references to it in the implementation.
> I suppose that on Windows the same should be done... the
> question really is whether this is intended or not -- moving
> the module into a DLL is at least technically no problem
> (someone would have to supply a patch for the MSVC project
> files though).
> 
> Note that unicodedata is only needed by programs which do
> a lot of Unicode manipulations and in the future probably
> by some codecs too.

Now that the Unicode patches have been checked in, and Fredrik Lundh
has noticed a considerable increase in the size of the Python DLL,
apparently caused mostly by those tables, I became concerned that a
Python/Tcl/Tk based application could eat up much more memory
if we update from Python 1.5.2 and Tcl/Tk 8.0.5
to Python 1.6 and Tcl/Tk 8.3.0.

As some of you certainly know, Unicode support has also been
part of Tcl/Tk since 8.1.  So I did some research and
would like to share what I have found out so far:

Here is a size comparison (in bytes) of the Tcl/Tk shared libs on Linux:

   old:                   | new:                   | bloat increase in %:
   -----------------------+------------------------+---------------------
   libtcl8.0.so    533414 | libtcl8.3.so    610241 | 14.4 %
   libtk8.0.so     714908 | libtk8.3.so     811916 | 13.6 %

Unicode support wasn't the only change to Tcl/Tk, so this seems
reasonable.  Unfortunately there is no Python shared library,
so a direct comparison of the increased memory consumption is impossible.
Nevertheless, I have the following figures (stripped binary sizes of
the Python interpreter, in bytes):
   1.5.2           382616 
   CVS_10-02-00    393668 (a month before unicode)
   CVS_12-03-00    507448 (just after unicode)
That is an increase of "only" 111 kBytes over the CVS_10-02-00
snapshot.  Not so bad, but nevertheless a "bloat increase" of 32.6 %
relative to 1.5.2 (28.9 % relative to CVS_10-02-00).  And additionally
there are now
   unicodedata.so  634940 
   _codecsmodule.so 38955 
which (I guess) will also be loaded if the application starts using some
of the new features.
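
For anyone who wants to check the percentages, here is a small sketch
that recomputes them from the byte counts quoted above.  The helper
name `increase` is made up for illustration, and the script is written
for a current Python interpreter, not for 1.5.2.  Note that the result
depends on which baseline is used.

```python
# File sizes in bytes, copied from the figures quoted above.
tcl_tk = {
    "libtcl": (533414, 610241),   # 8.0 -> 8.3
    "libtk": (714908, 811916),    # 8.0 -> 8.3
}
python_sizes = {
    "1.5.2": 382616,
    "CVS_10-02-00": 393668,
    "CVS_12-03-00": 507448,
}


def increase(old, new):
    """Return (growth in bytes, growth in percent of the old size)."""
    delta = new - old
    return delta, 100.0 * delta / old


for name, (old, new) in tcl_tk.items():
    delta, pct = increase(old, new)
    print("%-8s +%d bytes (%.1f %%)" % (name, delta, pct))

# The same final binary, measured against two different baselines.
for base in ("1.5.2", "CVS_10-02-00"):
    delta, pct = increase(python_sizes[base], python_sizes["CVS_12-03-00"])
    print("%-13s -> CVS_12-03-00: +%d bytes (%.1f %%)" % (base, delta, pct))
```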

Since I haven't dealt with Unicode in the past, I feel unable to
compare the Unicode implementations of the two systems, or to judge
what impact they will have on real memory usage and, even more
important, on the functionality of both packages when they are used
together through Tkinter.

Tcl/Tk keeps around a sub-directory called 'encoding', which --I guess--
contains information somewhat similar or related to that in 'unicodedata.so',
but split across several files?

So below I have included abridged excerpts from the 200k+
tcl8.3.0/changes and tk8.3.0/changes files concerning Unicode.  Maybe
someone more involved with Unicode can shed some light on this topic?

Do we need some changes to Tkinter.py or _tkinter or both?

---- 8< ---- 8< ---- cut here ---- 8< ---- schnipp ---- 8< ---- schnapp ----
[...]
======== Changes for 8.1 go below this line ========

6/18/97 (new feature) Tcl now supports international character sets:
    - All C APIs now accept UTF-8 strings instead of iso8859-1 strings,
      wherever you see "char *", unless explicitly noted otherwise.
    - All Tcl strings represented in UTF-8, which is a convenient
      multi-byte encoding of Unicode.  Variable names, procedure names,
      and all other values in Tcl may include arbitrary Unicode characters.
      For example, the Tcl command "string length" returns how many
      Unicode characters are in the argument string.
    - For Java compatibility, embedded null bytes in C strings are
      represented as \xC080 in UTF-8 strings, but the null byte at the end
      of a UTF-8 string remains \0.  Thus Tcl strings once again do not
      contain null bytes, except for termination bytes.
    - For Java compatibility, "\uXXXX" is used in Tcl to enter a Unicode
      character.  "\u0000" through "\uffff" are acceptable Unicode 
      characters.  
    - "\xXX" is used to enter a small Unicode character (between 0 and 255)
      in Tcl.
    - Tcl automatically translates between UTF-8 and the normal encoding for
      the platform during interactions with the system.
    - The fconfigure command now supports a -encoding option for specifying
      the encoding of an open file or socket.  Tcl will automatically
      translate between the specified encoding and UTF-8 during I/O. 
      See the directory library/encoding to find out what encodings are
      supported (eventually there will be an "encoding" command that
      makes this information more accessible).
    - There are several new C APIs that support UTF-8 and various encodings.
      See Utf.3 for procedures that translate between Unicode and UTF-8
      and manipulate UTF-8 strings. See Encoding.3 for procedures that
      create new encodings and translate between encodings.  See
      ToUpper.3 for procedures that perform case conversions on UTF-8
      strings.
[...]
1/16/98 (new feature) Tk now supports international characters sets:
    - Font display mechanism overhauled to display Unicode strings
      containing full set of international characters.  You do not need
      Unicode fonts on your system in order to use tk or see international
      characters.  For those familiar with the Japanese or Chinese patches,
      there is no "-kanjifont" option.  Characters from any available fonts
      will automatically be used if the widget's originally selected font is
      not capable of displaying a given character.  
    - Textual widgets are international aware.  For instance, cursor
      positioning commands would now move the cursor forwards/back by 1
      international character, not by 1 byte.  
    - Input Method Editors (IMEs) work on Mac and Windows.  Unix is still in
      progress.
[...]
10/15/98 (bug fix) Changed regexp and string commands to properly
handle case folding according to the Unicode character
tables. (stanton)

10/21/98 (new feature) Added an "encoding" command to facilitate
translations of strings between different character encodings.  See
the encoding.n manual entry for more details. (stanton)

11/3/98 (bug fix) The regular expression character classification
syntax now includes Unicode characters in the supported
classes. (stanton)
[...]
11/17/98 (bug fix) "scan" now correctly handles Unicode
characters. (stanton)
[...]
11/19/98 (bug fix) Fixed menus and titles so they properly display
Unicode characters under Windows. [Bug: 819] (stanton)
[...]
4/2/99 (new apis)  Made various Unicode utility functions public.
Tcl_UtfToUniCharDString, Tcl_UniCharToUtfDString, Tcl_UniCharLen,
Tcl_UniCharNcmp, Tcl_UniCharIsAlnum, Tcl_UniCharIsAlpha,
Tcl_UniCharIsDigit, Tcl_UniCharIsLower, Tcl_UniCharIsSpace,
Tcl_UniCharIsUpper, Tcl_UniCharIsWordChar, Tcl_WinUtfToTChar,
Tcl_WinTCharToUtf (stanton)
[...]
4/5/99 (bug fix) Fixed handling of Unicode in text searches.  The
-count option was returning byte counts instead of character counts.
[...]
5/18/99 (bug fix) Fixed clipboard code so it handles Unicode data
properly on Windows NT and 95. [Bug: 1791] (stanton)
[...]
6/3/99  (bug fix) Fixed selection code to handle Unicode data in
COMPOUND_TEXT and STRING selections.  [Bug: 1791] (stanton)
[...]
6/7/99  (new feature) Optimized string index, length, range, and
append commands. Added a new Unicode object type. (hershey)
[...]
6/14/99 (new feature) Merged string and Unicode object types.  Added
new public Tcl API functions:  Tcl_NewUnicodeObj, Tcl_SetUnicodeObj,
Tcl_GetUnicode, Tcl_GetUniChar, Tcl_GetCharLength, Tcl_GetRange,
Tcl_AppendUnicodeToObj. (hershey)
[...]
6/23/99 (new feature) Updated Unicode character tables to reflect
Unicode 2.1 data. (stanton)
[...]

--- Released 8.3.0, February 10, 2000 --- See ChangeLog for details ---
---- 8< ---- 8< ---- cut here ---- 8< ---- schnipp ---- 8< ---- schnapp ----
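
Two points from the 8.1 notes above may be easier to see in code: that
"string length" counts characters instead of bytes, and that an
embedded null is stored as the two bytes C0 80 so that the UTF-8 string
contains no zero byte except the terminator.  The following is only a
sketch in Python (a current interpreter is assumed); `to_tcl_utf8` is a
hypothetical helper that mimics the convention described in the changes
file, not a function provided by Tcl, Tkinter or _tkinter.

```python
text = "Gr\u00fc\u00dfe"        # "Gruesse" written with u-umlaut and sharp s
utf8 = text.encode("utf-8")

print(len(text))   # number of Unicode characters
print(len(utf8))   # number of bytes in the UTF-8 representation


def to_tcl_utf8(s):
    """Encode s as UTF-8, writing U+0000 as the two bytes C0 80.

    Hypothetical helper for illustration only.  Replacing zero bytes
    after encoding is safe, because in UTF-8 a zero byte can only come
    from the character U+0000 itself.
    """
    return s.encode("utf-8").replace(b"\x00", b"\xc0\x80")


encoded = to_tcl_utf8("a\x00b")
print(encoded)
```

Whether _tkinter has to perform a conversion like this itself when
handing strings to Tcl 8.1 or later is part of the question raised
above.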

Sorry if this was boring old stuff for some of you.

Best Regards, Peter
-- 
Peter Funk, Oldenburger Str.86, D-27777 Ganderkesee, Germany, Fax:+49 4222950260
office: +49 421 20419-0 (ArtCom GmbH, Grazer Str.8, D-28359 Bremen)