Hi!
Fredrik Lundh writes:
didn't notice this before, but I just realized that after the latest round of patches, the python15.dll is now 700k larger than it was for 1.5.2 (more than twice the size).
"Andrew M. Kuchling" wrote:
Most of that is due to Modules/unicodedata.c, which is 2.1Mb of source code, and produces a 632168-byte .o file on my Sparc. (Will some compiler systems choke on a file that large? Could we read database info from a file instead, or mmap it into memory?)
M.-A. Lemburg wrote:
That is due to the unicodedata module being compiled into the DLL statically. On Unix you can build it shared too -- there are no direct references to it in the implementation. I suppose that on Windows the same should be done... the question really is whether this is intended or not -- moving the module into a DLL is at least technically no problem (someone would have to supply a patch for the MSVC project files though).
Note that unicodedata is only needed by programs which do a lot of Unicode manipulations and in the future probably by some codecs too.
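For readers who have not used the module, here is a minimal sketch of the kind of per-character lookups that unicodedata serves. It assumes a current Python 3 interpreter (the module as it existed at the time of this thread was smaller), and the helper name describe is made up for illustration.

```python
import unicodedata


def describe(ch):
    """Return (general category, decimal digit value or None) for one character."""
    # decimal() returns the supplied default (None here) for non-digit characters.
    return unicodedata.category(ch), unicodedata.decimal(ch, None)


if __name__ == "__main__":
    # U+0663 is ARABIC-INDIC DIGIT THREE.
    for ch in ("A", "a", "7", "\u0663"):
        print(repr(ch), describe(ch))
```

A program that never asks such questions never needs the tables, which is the argument for keeping them out of the core DLL.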
Now that the unicode patches have been checked in, and Fredrik Lundh has noticed a considerable increase in the size of the python DLL, obviously caused mostly by those tables, I had some fear that a Python/Tcl/Tk based application could eat up much more memory if we update from Python 1.5.2 and Tcl/Tk 8.0.5 to Python 1.6 and Tcl/Tk 8.3.0.

As some of you certainly know, some kind of unicode support has also been added to Tcl/Tk since 8.1. So I did some research and would like to share what I have found out so far.

Here are the compared sizes of the tcl/tk shared libs on Linux:

   old:                |  new:                  | bloat increase in %:
-----------------------+------------------------+---------------------
libtcl8.0.so    533414 | libtcl8.3.so    610241 | 14.4 %
libtk8.0.so     714908 | libtk8.3.so     811916 | 13.6 %

The addition of unicode wasn't the only change to Tcl/Tk, so this seems reasonable. Unfortunately there is no python shared library, so a direct comparison of increased memory consumption is impossible. Nevertheless I have the following figures (stripped binary sizes of the Python interpreter):

   1.5.2           382616
   CVS_10-02-00    393668   (a month before unicode)
   CVS_12-03-00    507448   (just after unicode)

That is an increase of "only" 111 kBytes (28.9 %) over the snapshot taken a month before unicode. Not so bad, but measured against 1.5.2 it is nevertheless a "bloat increase" of 32.6 %. And additionally there is now

   unicodedata.so     634940
   _codecsmodule.so    38955

which (I guess) will also be loaded if the application starts using some of the new features.

Since I haven't dealt with unicode in the past, I feel unable to compare the implementations of unicode in both systems, what impact they will have on real memory performance and, even more important, on the functionality of the combined use of both packages together with Tkinter.

Tcl/Tk keeps around a sub-directory called 'encoding', which --I guess-- contains information somehow similar or related to that in 'unicodedata.so', but separated into several files?
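The size figures above can be re-derived with a few lines of Python. This is only a sketch: the byte counts are copied from this message, and the names SIZES, growth and report are made up for illustration. It lists the Python interpreter twice, once against 1.5.2 and once against the CVS snapshot taken a month before unicode, because the two baselines give different percentages.

```python
# Sizes in bytes (old, new), copied from the figures in this message.
SIZES = {
    "libtcl 8.0 -> 8.3": (533414, 610241),
    "libtk 8.0 -> 8.3": (714908, 811916),
    "python vs 1.5.2": (382616, 507448),
    "python vs CVS_10-02-00": (393668, 507448),
}


def growth(old, new):
    """Return (absolute increase in bytes, relative increase in percent)."""
    delta = new - old
    return delta, 100.0 * delta / old


def report(sizes=SIZES):
    """Format one line per comparison."""
    lines = []
    for name, (old, new) in sizes.items():
        delta, percent = growth(old, new)
        lines.append("%-24s %7d -> %7d  +%6d bytes  %5.1f %%"
                     % (name, old, new, delta, percent))
    return lines


if __name__ == "__main__":
    print("\n".join(report()))
```

The 111 kBytes quoted above corresponds to the CVS_10-02-00 baseline (113780 bytes), while the 32.6 % corresponds to the 1.5.2 baseline.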
So below I have included shortened excerpts from the 200k+ tcl8.3.0/changes and tk8.3.0/changes files about unicode. Maybe someone else more involved with unicode can shed some light on this topic?

Do we need some changes to Tkinter.py or _tkinter or both?

---- 8< ---- 8< ---- cut here ---- 8< ---- schnipp ---- 8< ---- schnapp ----
[...]
======== Changes for 8.1 go below this line ========

6/18/97 (new feature) Tcl now supports international character sets:
    - All C APIs now accept UTF-8 strings instead of iso8859-1 strings,
      wherever you see "char *", unless explicitly noted otherwise.
    - All Tcl strings represented in UTF-8, which is a convenient
      multi-byte encoding of Unicode.  Variable names, procedure names,
      and all other values in Tcl may include arbitrary Unicode characters.
      For example, the Tcl command "string length" returns how many
      Unicode characters are in the argument string.
    - For Java compatibility, embedded null bytes in C strings are
      represented as \xC080 in UTF-8 strings, but the null byte at the end
      of a UTF-8 string remains \0.  Thus Tcl strings once again do not
      contain null bytes, except for termination bytes.
    - For Java compatibility, "\uXXXX" is used in Tcl to enter a Unicode
      character.  "\u0000" through "\uffff" are acceptable Unicode
      characters.
    - "\xXX" is used to enter a small Unicode character (between 0 and 255)
      in Tcl.
    - Tcl automatically translates between UTF-8 and the normal encoding
      for the platform during interactions with the system.
    - The fconfigure command now supports a -encoding option for
      specifying the encoding of an open file or socket.  Tcl will
      automatically translate between the specified encoding and UTF-8
      during I/O.  See the directory library/encoding to find out what
      encodings are supported (eventually there will be an "encoding"
      command that makes this information more accessible).
    - There are several new C APIs that support UTF-8 and various
      encodings.  See Utf.3 for procedures that translate between Unicode
      and UTF-8 and manipulate UTF-8 strings.  See Encoding.3 for
      procedures that create new encodings and translate between
      encodings.  See ToUpper.3 for procedures that perform case
      conversions on UTF-8 strings.
[...]
1/16/98 (new feature) Tk now supports international characters sets:
    - Font display mechanism overhauled to display Unicode strings
      containing full set of international characters.  You do not need
      Unicode fonts on your system in order to use tk or see international
      characters.  For those familiar with the Japanese or Chinese patches,
      there is no "-kanjifont" option.  Characters from any available fonts
      will automatically be used if the widget's originally selected font
      is not capable of displaying a given character.
    - Textual widgets are international aware.  For instance, cursor
      positioning commands would now move the cursor forwards/back by 1
      international character, not by 1 byte.
    - Input Method Editors (IMEs) work on Mac and Windows.  Unix is still
      in progress.
[...]
10/15/98 (bug fix) Changed regexp and string commands to properly
handle case folding according to the Unicode character tables. (stanton)

10/21/98 (new feature) Added an "encoding" command to facilitate
translations of strings between different character encodings.  See
the encoding.n manual entry for more details. (stanton)

11/3/98 (bug fix) The regular expression character classification
syntax now includes Unicode characters in the supported classes. (stanton)
[...]
11/17/98 (bug fix) "scan" now correctly handles Unicode characters. (stanton)
[...]
11/19/98 (bug fix) Fixed menus and titles so they properly display
Unicode characters under Windows. [Bug: 819] (stanton)
[...]
4/2/99 (new apis) Made various Unicode utility functions public.
Tcl_UtfToUniCharDString, Tcl_UniCharToUtfDString, Tcl_UniCharLen,
Tcl_UniCharNcmp, Tcl_UniCharIsAlnum, Tcl_UniCharIsAlpha,
Tcl_UniCharIsDigit, Tcl_UniCharIsLower, Tcl_UniCharIsSpace,
Tcl_UniCharIsUpper, Tcl_UniCharIsWordChar, Tcl_WinUtfToTChar,
Tcl_WinTCharToUtf (stanton)
[...]
4/5/99 (bug fix) Fixed handling of Unicode in text searches.  The
-count option was returning byte counts instead of character counts.
[...]
5/18/99 (bug fix) Fixed clipboard code so it handles Unicode data
properly on Windows NT and 95. [Bug: 1791] (stanton)
[...]
6/3/99 (bug fix) Fixed selection code to handle Unicode data in
COMPOUND_TEXT and STRING selections. [Bug: 1791] (stanton)
[...]
6/7/99 (new feature) Optimized string index, length, range, and
append commands.  Added a new Unicode object type. (hershey)
[...]
6/14/99 (new feature) Merged string and Unicode object types.  Added
new public Tcl API functions: Tcl_NewUnicodeObj, Tcl_SetUnicodeObj,
Tcl_GetUnicode, Tcl_GetUniChar, Tcl_GetCharLength, Tcl_GetRange,
Tcl_AppendUnicodeToObj. (hershey)
[...]
6/23/99 (new feature) Updated Unicode character tables to reflect
Unicode 2.1 data. (stanton)
[...]
--- Released 8.3.0, February 10, 2000 --- See ChangeLog for details ---
---- 8< ---- 8< ---- cut here ---- 8< ---- schnipp ---- 8< ---- schnapp ----

Sorry if this was boring old stuff for some of you.

Best Regards, Peter
--
Peter Funk, Oldenburger Str.86, D-27777 Ganderkesee, Germany, Fax:+49 4222950260
office: +49 421 20419-0 (ArtCom GmbH, Grazer Str.8, D-28359 Bremen)