Unicode patches checked in
I've just checked in a massive patch from Marc-Andre Lemburg which adds Unicode support to Python. This work was financially supported by Hewlett-Packard. Marc-Andre has done a tremendous amount of work, for which I cannot thank him enough.

We're still awaiting some more things: Marc-Andre gave me documentation patches which will be reviewed by Fred Drake before they are checked in; Fredrik Lundh has developed a new regular expression engine which is Unicode-aware and which should be checked in real soon now. Also, the documentation is probably incomplete and will be updated, and of course there may be bugs -- this should be considered alpha software. However, I believe it is quite good already, otherwise I wouldn't have checked it in!

I'd like to invite everyone with an interest in Unicode or Python 1.6 to check out this new Unicode-aware Python, so that we can ensure a robust code base by the time Python 1.6 is released (planned release date: June 1, 2000). The download links are below.

Links:

http://www.python.org/download/cvs.html
    Instructions on how to get access to the CVS version. (David Ascher is making nightly tarballs of the CVS version available at http://starship.python.net/crew/da/pythondists/)

http://starship.python.net/crew/lemburg/unicode-proposal.txt
    The latest version of the specification on which Marc has based his implementation.

http://www.python.org/sigs/i18n-sig/
    Home page of the i18n-sig (Internationalization SIG), which has lots of other links about this and related issues.

http://www.python.org/search/search_bugs.html
    The Python Bugs List. Use this for all bug reports.

Note that next Tuesday I'm going on a 10-day trip, with limited time to read email and no time to solve problems. The usual crowd will take care of urgent updates.

See you at the Intel Computing Continuum Conference in San Francisco or at the Python Track at Software Development 2000 in San Jose!

--Guido van Rossum (home page: http://www.python.org/~guido/)
I've just checked in a massive patch from Marc-Andre Lemburg which adds Unicode support to Python.
massive, indeed.

didn't notice this before, but I just realized that after the latest round of patches, the python15.dll is now 700k larger than it was for 1.5.2 (more than twice the size). my original unicode DLL was 13k. hmm...

</F>
Fredrik Lundh writes:
didn't notice this before, but I just realized that after the latest round of patches, the python15.dll is now 700k larger than it was for 1.5.2 (more than twice the size).
Most of that is due to Modules/unicodedata.c, which is 2.1Mb of source code, and produces a 632168-byte .o file on my Sparc. (Will some compiler systems choke on a file that large? Could we read database info from a file instead, or mmap it into memory?)

--
A.M. Kuchling    http://starship.python.net/crew/amk/
"Are you OK, dressed like that? You don't seem to notice the cold."
"I haven't come ten thousand miles to discuss the weather, Mr Moberly."
    -- Moberly and the Doctor, in "The Seeds of Doom"
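[Editorial aside] The mmap idea Andrew floats can be sketched in a few lines of modern Python. The record layout and file name here are invented for illustration; they are not the actual unicodedata format:

```python
import mmap
import struct

# Assumed record layout (made up for this sketch): four one-byte fields
# per code point -- category, combining class, bidi class, mirrored flag.
RECORD = struct.Struct("<BBBB")

def write_db(path, records):
    # Fixed-width records: a lookup is a seek, no parsing needed.
    with open(path, "wb") as f:
        for rec in records:
            f.write(RECORD.pack(*rec))

def lookup(path, codepoint):
    # mmap the file: the OS pages the data in on demand and shares the
    # read-only pages between all processes mapping the same file.
    with open(path, "rb") as f:
        mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        try:
            off = codepoint * RECORD.size
            return RECORD.unpack(mm[off:off + RECORD.size])
        finally:
            mm.close()

write_db("unidata.bin", [(1, 0, 0, 0), (2, 230, 1, 0)])
print(lookup("unidata.bin", 1))  # -> (2, 230, 1, 0)
```

The point of the sketch is that nothing is parsed or copied at import time; the database stays on disk until a page is actually touched.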
"Andrew M. Kuchling" wrote:
Fredrik Lundh writes:
didn't notice this before, but I just realized that after the latest round of patches, the python15.dll is now 700k larger than it was for 1.5.2 (more than twice the size).
Most of that is due to Modules/unicodedata.c, which is 2.1Mb of source code, and produces a 632168-byte .o file on my Sparc. (Will some compiler systems choke on a file that large? Could we read database info from a file instead, or mmap it into memory?)
That is due to the unicodedata module being compiled into the DLL statically. On Unix you can build it shared too -- there are no direct references to it in the implementation. I suppose that on Windows the same should be done... the question really is whether this is intended or not -- moving the module into a DLL is at least technically no problem (someone would have to supply a patch for the MSVC project files though).

Note that unicodedata is only needed by programs which do a lot of Unicode manipulations and in the future probably by some codecs too.

--
Marc-Andre Lemburg
______________________________________________________________________
Business: http://www.lemburg.com/
Python Pages: http://www.lemburg.com/python/
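[Editorial aside] On Unix, the shared build M-A describes is a Modules/Setup edit; this is a sketch, assuming the module line matches the source tree of that era:

```
# Modules/Setup -- everything below a "*shared*" marker is built as a
# dynamically loaded module (unicodedata.so) instead of being linked
# statically into the interpreter:
*shared*
unicodedata unicodedata.c
```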
Hi!
Fredrik Lundh writes:
didn't notice this before, but I just realized that after the latest round of patches, the python15.dll is now 700k larger than it was for 1.5.2 (more than twice the size).
"Andrew M. Kuchling" wrote:
Most of that is due to Modules/unicodedata.c, which is 2.1Mb of source code, and produces a 632168-byte .o file on my Sparc. (Will some compiler systems choke on a file that large? Could we read database info from a file instead, or mmap it into memory?)
M.-A. Lemburg wrote:
That is due to the unicodedata module being compiled into the DLL statically. On Unix you can build it shared too -- there are no direct references to it in the implementation. I suppose that on Windows the same should be done... the question really is whether this is intended or not -- moving the module into a DLL is at least technically no problem (someone would have to supply a patch for the MSVC project files though).
Note that unicodedata is only needed by programs which do a lot of Unicode manipulations and in the future probably by some codecs too.
Now that the unicode patches are checked in, and as Fredrik Lundh noticed a considerable increase in the size of the python DLL, which was obviously mostly caused by those tables, I had some fear that a Python/Tcl/Tk based application could eat up much more memory if we update from Python 1.5.2 and Tcl/Tk 8.0.5 to Python 1.6 and Tcl/Tk 8.3.0. As some of you certainly know, some kind of unicode support has also been added to Tcl/Tk since 8.1. So I did some research and would like to share what I have found out so far.

Here are the compared sizes of the tcl/tk shared libs on Linux:

    old:                  | new:                  | bloat increase in %:
    ----------------------+-----------------------+---------------------
    libtcl8.0.so   533414 | libtcl8.3.so   610241 | 14.4 %
    libtk8.0.so    714908 | libtk8.3.so    811916 | 13.6 %

The addition of unicode wasn't the only change to Tcl/Tk, so this seems reasonable. Unfortunately there is no Python shared library, so a direct comparison of increased memory consumption is impossible. Nevertheless I have the following figures (stripped binary sizes of the Python interpreter):

    1.5.2           382616
    CVS_10-02-00    393668   (a month before unicode)
    CVS_12-03-00    507448   (just after unicode)

That is an increase of "only" 111 kBytes. Not so bad, but nevertheless a "bloat increase" of 32.6 %. And additionally there is now

    unicodedata.so     634940
    _codecsmodule.so    38955

which (I guess) will also be loaded if the application starts using some of the new features.

Since I didn't take care of unicode in the past, I feel unable to compare the implementations of unicode in both systems, what impact they will have on real memory performance, and, even more important, on the functionality of the combined use of both packages together with Tkinter. Tcl/Tk keeps around a sub-directory called 'encoding', which --I guess-- contains information somehow similar or related to that in 'unicodedata.so', but separated into several files?
So below I have included shortened excerpts from the 200k+ tcl8.3.0/changes and tk8.3.0/changes files about unicode. Maybe someone else more involved with unicode can shed some light on this topic? Do we need some changes to Tkinter.py or _tkinter or both?

---- 8< ---- 8< ---- cut here ---- 8< ---- schnipp ---- 8< ---- schnapp ----
[...]
======== Changes for 8.1 go below this line ========

6/18/97 (new feature) Tcl now supports international character sets:
    - All C APIs now accept UTF-8 strings instead of iso8859-1 strings, wherever you see "char *", unless explicitly noted otherwise.
    - All Tcl strings are represented in UTF-8, which is a convenient multi-byte encoding of Unicode. Variable names, procedure names, and all other values in Tcl may include arbitrary Unicode characters. For example, the Tcl command "string length" returns how many Unicode characters are in the argument string.
    - For Java compatibility, embedded null bytes in C strings are represented as \xC080 in UTF-8 strings, but the null byte at the end of a UTF-8 string remains \0. Thus Tcl strings once again do not contain null bytes, except for termination bytes.
    - For Java compatibility, "\uXXXX" is used in Tcl to enter a Unicode character. "\u0000" through "\uffff" are acceptable Unicode characters.
    - "\xXX" is used to enter a small Unicode character (between 0 and 255) in Tcl.
    - Tcl automatically translates between UTF-8 and the normal encoding for the platform during interactions with the system.
    - The fconfigure command now supports a -encoding option for specifying the encoding of an open file or socket. Tcl will automatically translate between the specified encoding and UTF-8 during I/O. See the directory library/encoding to find out what encodings are supported (eventually there will be an "encoding" command that makes this information more accessible).
    - There are several new C APIs that support UTF-8 and various encodings. See Utf.3 for procedures that translate between Unicode and UTF-8 and manipulate UTF-8 strings. See Encoding.3 for procedures that create new encodings and translate between encodings. See ToUpper.3 for procedures that perform case conversions on UTF-8 strings.
[...]
1/16/98 (new feature) Tk now supports international character sets:
    - Font display mechanism overhauled to display Unicode strings containing the full set of international characters. You do not need Unicode fonts on your system in order to use Tk or see international characters. For those familiar with the Japanese or Chinese patches, there is no "-kanjifont" option. Characters from any available fonts will automatically be used if the widget's originally selected font is not capable of displaying a given character.
    - Textual widgets are internationally aware. For instance, cursor positioning commands now move the cursor forwards/back by 1 international character, not by 1 byte.
    - Input Method Editors (IMEs) work on Mac and Windows. Unix is still in progress.
[...]
10/15/98 (bug fix) Changed regexp and string commands to properly handle case folding according to the Unicode character tables. (stanton)
10/21/98 (new feature) Added an "encoding" command to facilitate translations of strings between different character encodings. See the encoding.n manual entry for more details. (stanton)
11/3/98 (bug fix) The regular expression character classification syntax now includes Unicode characters in the supported classes. (stanton)
[...]
11/17/98 (bug fix) "scan" now correctly handles Unicode characters. (stanton)
[...]
11/19/98 (bug fix) Fixed menus and titles so they properly display Unicode characters under Windows. [Bug: 819] (stanton)
[...]
4/2/99 (new apis) Made various Unicode utility functions public: Tcl_UtfToUniCharDString, Tcl_UniCharToUtfDString, Tcl_UniCharLen, Tcl_UniCharNcmp, Tcl_UniCharIsAlnum, Tcl_UniCharIsAlpha, Tcl_UniCharIsDigit, Tcl_UniCharIsLower, Tcl_UniCharIsSpace, Tcl_UniCharIsUpper, Tcl_UniCharIsWordChar, Tcl_WinUtfToTChar, Tcl_WinTCharToUtf (stanton)
[...]
4/5/99 (bug fix) Fixed handling of Unicode in text searches. The -count option was returning byte counts instead of character counts.
[...]
5/18/99 (bug fix) Fixed clipboard code so it handles Unicode data properly on Windows NT and 95. [Bug: 1791] (stanton)
[...]
6/3/99 (bug fix) Fixed selection code to handle Unicode data in COMPOUND_TEXT and STRING selections. [Bug: 1791] (stanton)
[...]
6/7/99 (new feature) Optimized string index, length, range, and append commands. Added a new Unicode object type. (hershey)
[...]
6/14/99 (new feature) Merged string and Unicode object types. Added new public Tcl API functions: Tcl_NewUnicodeObj, Tcl_SetUnicodeObj, Tcl_GetUnicode, Tcl_GetUniChar, Tcl_GetCharLength, Tcl_GetRange, Tcl_AppendUnicodeToObj. (hershey)
[...]
6/23/99 (new feature) Updated Unicode character tables to reflect Unicode 2.1 data. (stanton)
[...]
--- Released 8.3.0, February 10, 2000 --- See ChangeLog for details ---
---- 8< ---- 8< ---- cut here ---- 8< ---- schnipp ---- 8< ---- schnapp ----

Sorry if this was boring old stuff for some of you.

Best Regards, Peter
--
Peter Funk, Oldenburger Str.86, D-27777 Ganderkesee, Germany, Fax:+49 4222950260
office: +49 421 20419-0 (ArtCom GmbH, Grazer Str.8, D-28359 Bremen)
M.-A. Lemburg wrote:
Note that unicodedata is only needed by programs which do a lot of Unicode manipulations and in the future probably by some codecs too.
Perhaps it would make sense to move the Unicode database on the Python side (write it in Python)? Or init the database dynamically in the unicodedata module on import? It's quite big, so if it's possible to avoid the static declaration (and if the unicodedata module is enabled by default), I'd vote for a dynamic initialization of the database from reference (Python?) file(s).

M-A, is something in this spirit doable?

--
Vladimir MARANGOZOV | Vladimir.Marangozov@inrialpes.fr
http://sirac.inrialpes.fr/~marangoz | tel:(+33-4)76615277 fax:76615252
Vladimir Marangozov wrote:
M.-A. Lemburg wrote:
Note that unicodedata is only needed by programs which do a lot of Unicode manipulations and in the future probably by some codecs too.
Perhaps it would make sense to move the Unicode database on the Python side (write it in Python)? Or init the database dynamically in the unicodedata module on import? It's quite big, so if it's possible to avoid the static declaration (and if the unicodedata module is enabled by default), I'd vote for a dynamic initialization of the database from reference (Python?) file(s).
The unicodedatabase module contains the Unicode database as static C data - this makes it shareable among (Python) processes. Python modules don't provide this feature: instead a dictionary would have to be built on import which would increase the heap size considerably. Those dicts would *not* be shareable.

--
Marc-Andre Lemburg
______________________________________________________________________
Business: http://www.lemburg.com/
Python Pages: http://www.lemburg.com/python/
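[Editorial aside] The heap-cost gap Marc-Andre describes is easy to make concrete in today's Python. Exact sizes vary by interpreter version, and the 4-bytes-per-entry record is an assumption for the sketch:

```python
import sys

N = 65536  # one entry per 16-bit code point

# A per-process heap dict: one int key plus one record per code point.
db_dict = {cp: (1, 0, 0, 0) for cp in range(N)}

# The same information flattened into one block, as static C data would
# be -- a single read-only object the OS could share between processes.
db_flat = bytes(N * 4)

dict_cost = sys.getsizeof(db_dict) + sum(sys.getsizeof(k) for k in db_dict)
flat_cost = sys.getsizeof(db_flat)
assert dict_cost > 5 * flat_cost  # the dict costs several times more
```

The dict pays for its hash table and for every boxed key; the flat block pays only a constant object header.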
[me]
Perhaps it would make sense to move the Unicode database on the Python side (write it in Python)? Or init the database dynamically in the unicodedata module on import? It's quite big, so if it's possible to avoid the static declaration (and if the unicodedata module is enabled by default), I'd vote for a dynamic initialization of the database from reference (Python?) file(s).
[Marc-Andre]
The unicodedatabase module contains the Unicode database as static C data - this makes it shareable among (Python) processes.
The static data is shared if the module is a shared object (.so). If unicodedata is not a .so, then you'll have a separate copy of the database in each process.
Python modules don't provide this feature: instead a dictionary would have to be built on import which would increase the heap size considerably. Those dicts would *not* be shareable.
I haven't mentioned dicts, have I? I suggested that the entries in the C version of the database be rewritten in Python (or a text file). The unicodedata module would, in its init function, allocate memory for the database and would populate it before returning "import okay" to Python -- this is one way to init the db dynamically, among others.

As to sharing the database among different processes, this is a classic IPC pb, which has nothing to do with the static C declaration of the db. Or, hmmm, one of us is royally confused <wink>.

--
Vladimir MARANGOZOV | Vladimir.Marangozov@inrialpes.fr
http://sirac.inrialpes.fr/~marangoz | tel:(+33-4)76615277 fax:76615252
Vladimir Marangozov wrote:
[me]
Perhaps it would make sense to move the Unicode database on the Python side (write it in Python)? Or init the database dynamically in the unicodedata module on import? It's quite big, so if it's possible to avoid the static declaration (and if the unicodedata module is enabled by default), I'd vote for a dynamic initialization of the database from reference (Python?) file(s).
[Marc-Andre]
The unicodedatabase module contains the Unicode database as static C data - this makes it shareable among (Python) processes.
The static data is shared if the module is a shared object (.so). If unicodedata is not a .so, then you'll have a separate copy of the database in each process.
Uhm, comparing the two versions Python 1.5 and the current CVS Python, I get these figures on Linux, executing ./python -i -c '1/0':

    Python 1.5: 1208kB / 728kB (resident/shared)
    Python CVS: 1280kB / 808kB (resident/shared)

Not much of a change if you ask me, and the CVS version has the unicodedata module linked statically... so there's got to be some sharing and load-on-demand going on behind the scenes: this is what I was referring to when I mentioned static C data. The OS can deal much better with these sharing techniques and delayed loads than anything we could implement on top of it in C or Python.

But perhaps this is Linux-specific...
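[Editorial aside] Resident/shared figures like these can be read straight from /proc on Linux; a small sketch (the statm field order comes from proc(5); the sample line and 4kB page size are assumptions):

```python
import os

def parse_statm(text, page_kb=4):
    # /proc/<pid>/statm fields: size resident shared text lib data dt,
    # all counted in pages (a 4kB page size is assumed here).
    size, resident, shared = (int(v) * page_kb for v in text.split()[:3])
    return size, resident, shared

# Invented sample line: 302 resident pages, 202 of them shared.
print(parse_statm("1000 302 202 50 0 400 0"))  # -> (4000, 1208, 808)

if os.path.exists("/proc/self/statm"):  # Linux only
    with open("/proc/self/statm") as f:
        print("this process:", parse_statm(f.read()))
```

The shared column counts pages backed by files (program text, shared libraries, mapped read-only data), which is exactly the part the OS can reuse across processes.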
Python modules don't provide this feature: instead a dictionary would have to be built on import which would increase the heap size considerably. Those dicts would *not* be shareable.
I haven't mentioned dicts, have I? I suggested that the entries in the C version of the database be rewritten in Python (or a text file). The unicodedata module would, in its init function, allocate memory for the database and would populate it before returning "import okay" to Python -- this is one way to init the db dynamically, among others.
I'm leaving this as an exercise to the interested reader ;-) Really, if you have better ideas for the unicodedata module, please go ahead.
As to sharing the database among different processes, this is a classic IPC pb, which has nothing to do with the static C declaration of the db. Or, hmmm, one of us is royally confused <wink>.
Could you check this on other platforms ? Perhaps Linux is doing more than other OSes are in this field. -- Marc-Andre Lemburg ______________________________________________________________________ Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/
I just uploaded the first public SRE snapshot to:

    http://w1.132.telia.com/~u13208596/sre.htm

-- this kit contains windows binaries only (make sure you have built the interpreter from a recent CVS version)

-- the engine fully supports unicode target strings. (not sure about the pattern compiler, though...)

-- it's probably buggy as hell.

for things I'm working on at this very moment, see:

    http://w1.132.telia.com/~u13208596/sre/status.htm

I hope to get around to fixing the core dump (it crashes halfway through sre_fulltest.py, for no apparent reason) and the backreferencing problem later today. stay tuned.

</F>

PS. note that "public" doesn't really mean "suitable for the c.l.python crowd", or "suitable for production use". in other words, let's keep this one on this list for now. thanks!
On Wed, 15 Mar 2000, Vladimir Marangozov wrote:
[me]
Perhaps it would make sense to move the Unicode database on the Python side (write it in Python)? Or init the database dynamically in the unicodedata module on import? It's quite big, so if it's possible to avoid the static declaration (and if the unicodedata module is enabled by default), I'd vote for a dynamic initialization of the database from reference (Python?) file(s).
[Marc-Andre]
The unicodedatabase module contains the Unicode database as static C data - this makes it shareable among (Python) processes.
The static data is shared if the module is a shared object (.so). If unicodedata is not a .so, then you'll have a separate copy of the database in each process.
Nope. A shared module means that multiple executables can share the code. Whether the const data resides in an executable or a .so, the OS will map it into readonly memory and share it across all processes.
Python modules don't provide this feature: instead a dictionary would have to be built on import which would increase the heap size considerably. Those dicts would *not* be shareable.
I haven't mentioned dicts, have I? I suggested that the entries in the C version of the database be rewritten in Python (or a text file). The unicodedata module would, in its init function, allocate memory for the database and would populate it before returning "import okay" to Python -- this is one way to init the db dynamically, among others.
This would place all that data into the per-process heap. Definitely not shared, and definitely a big hit for each Python process.
As to sharing the database among different processes, this is a classic IPC pb, which has nothing to do with the static C declaration of the db. Or, hmmm, one of us is royally confused <wink>.
This isn't IPC. It is sharing of some constant data. The most effective way to manage this is through const C data. The OS will properly manage it.

And sorry, David, but mmap'ing a file will simply add complexity. As jcw mentioned, the OS is pretty much doing this anyhow when it deals with a const data segment in your executable. I don't believe this is Linux-specific. This kind of stuff has been done for a *long* time on other platforms, too.

Side note: the most effective way of exposing this const data up to Python (without shoving it onto the heap) is through buffers created via:

    PyBuffer_FromMemory(ptr, size)

This allows the data to reside in const, shared memory while it is also exposed up to Python.

Cheers,
-g

--
Greg Stein, http://www.lyra.org/
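[Editorial aside] There is no pure-Python way to demonstrate PyBuffer_FromMemory itself, but the zero-copy idea it implements has a close Python-level analogue in memoryview; this illustrates the concept, not the C API:

```python
# A memoryview exposes existing memory to Python code without copying
# it onto the heap -- the same idea as wrapping const C data in a
# buffer object via PyBuffer_FromMemory(ptr, size).
table = bytes(range(256))   # stands in for the const C data segment
view = memoryview(table)

assert view[65] == ord("A")   # reads go straight through to the original
assert view.obj is table      # no copy was made: same backing object

# Slices of a memoryview are views too, still zero-copy:
assert view[65:91].tobytes() == bytes(range(65, 91))
```

The data stays in one place; only a small view object is created on the heap.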
Greg Stein wrote:
[me]
The static data is shared if the module is a shared object (.so). If unicodedata is not a .so, then you'll have a separate copy of the database in each process.
Nope. A shared module means that multiple executables can share the code. Whether the const data resides in an executable or a .so, the OS will map it into readonly memory and share it across all processes.
I must have been drunk yesterday <wink>. You're right.
I don't believe this is Linux-specific. This kind of stuff has been done for a *long* time on other platforms, too.
Yes.
Side note: the most effective way of exposing this const data up to Python (without shoving it onto the heap) is through buffers created via: PyBuffer_FromMemory(ptr, size) This allows the data to reside in const, shared memory while it is also exposed up to Python.
And to avoid the size increase of the Python library, perhaps unicodedata needs to be left commented out by default in Setup.in (for the release, not now). As M-A pointed out, the module isn't necessary for the normal operation of the interpreter.

--
Vladimir MARANGOZOV | Vladimir.Marangozov@inrialpes.fr
http://sirac.inrialpes.fr/~marangoz | tel:(+33-4)76615277 fax:76615252
Vladimir Marangozov wrote:
Greg Stein wrote:
Side note: the most effective way of exposing this const data up to Python (without shoving it onto the heap) is through buffers created via: PyBuffer_FromMemory(ptr, size) This allows the data to reside in const, shared memory while it is also exposed up to Python.
And to avoid the size increase of the Python library, perhaps unicodedata needs to be left commented out by default in Setup.in (for the release, not now). As M-A pointed out, the module isn't necessary for the normal operation of the interpreter.
Sounds like a familiar idea. :-)

BTW, yesterday evening I wrote an analysis script to see how far this data is compactable without going into real compression; just redundancy folding and byte/short indexing were used. If I'm not wrong, this reduces the size of the database to less than 25kb. That small amount of extra data would make the uncommenting feature quite unimportant, except for the issue of building tiny Pythons.

ciao - chris
--
Christian Tismer :^) mailto:tismer@appliedbiometrics.com
Applied Biometrics GmbH : Have a break! Take a ride on Python's
Kaunstr. 26 : *Starship* http://starship.python.net
14163 Berlin : PGP key -> http://wwwkeys.pgp.net
PGP Fingerprint E182 71C7 1A9D 66E9 9D15 D3CC D4D7 93E2 1FAE F6DF
we're tired of banana software - shipped green, ripens at home
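[Editorial aside] The redundancy folding plus short indexing Christian mentions can be sketched like this; the sample property records are invented, standing in for the real per-code-point tuples:

```python
from array import array

def fold(records):
    # Redundancy folding: keep each distinct record once, and index it
    # per code point with an unsigned short ('H'), which is enough for
    # fewer than 65536 distinct records.
    unique, index, seen = [], array("H"), {}
    for rec in records:
        if rec not in seen:
            seen[rec] = len(unique)
            unique.append(rec)
        index.append(seen[rec])
    return unique, index

# Most code points share a handful of property records:
records = [("Lu", 0, "L"), ("Ll", 0, "L"), ("Lu", 0, "L")] * 1000
unique, index = fold(records)
assert len(unique) == 2                    # two distinct records survive
assert unique[index[2]] == ("Lu", 0, "L")  # lookups reconstruct the data
```

Storage drops from one full record per code point to one full record per *distinct* value plus two bytes per code point, which is where the large savings come from.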
The unicodedatabase module contains the Unicode database as static C data - this makes it shareable among (Python) processes.
Python modules don't provide this feature: instead a dictionary would have to be built on import which would increase the heap size considerably. Those dicts would *not* be shareable.
I know it's complicating things, but wouldn't an mmap'ed buffer allow inter-process sharing while keeping DLL size down and everything on-disk until needed? Yes, I know, mmap calls aren't uniform across platforms and mmap isn't supported on all platforms -- I still think that it's silly not to use it on those platforms where it is available, and I'd like to see mmap unification move forward, so this is as good a motivation as any to bite the bullet.

Just a thought,

--david
David Ascher wrote: [shareable unicodedatabase]
I know it's complicating things, but wouldn't an mmap'ed buffer allow inter-process sharing while keeping DLL size down and everything on-disk until needed?
AFAIK, on platforms which support mmap, static data already gets mmap'ed in by the OS (just like all code), so this might have little effect. I'm more concerned by the distribution size increase. -jcw
"M.-A. Lemburg" wrote: ...
Note that unicodedata is only needed by programs which do a lot of Unicode manipulations and in the future probably by some codecs too.
Would it be possible to make the Unicode support configurable?

My problem is that the patches in the CVS are of different kinds. Some are error corrections and enhancements which I would definitely like to use. Others are brand new features like the Unicode support. Absolutely great stuff! But this will most probably change a number of times again, and I think it is a bad idea to include it in my Stackless distribution.

I'd appreciate it very much if I could use the same CVS tree for testing new stuff, and to build my distribution with new features switched off. Please :-)

ciao - chris
--
Christian Tismer :^) mailto:tismer@appliedbiometrics.com
Applied Biometrics GmbH : Have a break! Take a ride on Python's
Kaunstr. 26 : *Starship* http://starship.python.net
14163 Berlin : PGP key -> http://wwwkeys.pgp.net
PGP Fingerprint E182 71C7 1A9D 66E9 9D15 D3CC D4D7 93E2 1FAE F6DF
we're tired of banana software - shipped green, ripens at home
Christian Tismer wrote:
"M.-A. Lemburg" wrote: ...
Note that unicodedata is only needed by programs which do a lot of Unicode manipulations and in the future probably by some codecs too.
Would it be possible to make the Unicode support configurable?
This is currently not planned, as the Unicode integration touches many different parts of the interpreter to enhance string/Unicode integration... sorry. Also, I'm not sure whether adding #ifdefs throughout the code would increase its elegance ;-)
My problem is that patches in the CVS are of different kinds. Some are error corrections and enhancements which I would definitely like to use. Others are brand new features like the Unicode support. Absolutely great stuff! But this will most probably change a number of times again, and I think it is a bad idea when I include it into my Stackless distribution.
Why not? All you have to do is rebuild the distribution every time you push a new version -- just like I did for the Unicode version before the CVS checkin was done.
I'd appreciate it very much if I could use the same CVS tree for testing new stuff, and to build my distribution, with new features switched off. Please :-)
-- Marc-Andre Lemburg ______________________________________________________________________ Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/
"M.-A. Lemburg" wrote:
Christian Tismer wrote:
...
Absolutely great stuff! But this will most probably change a number of times again, and I think it is a bad idea when I include it into my Stackless distribution.
Why not? All you have to do is rebuild the distribution every time you push a new version -- just like I did for the Unicode version before the CVS checkin was done.
But how can I then publish my source code, when I always pull Unicode into it? I don't like to be exposed to side effects like 700kb of code bloat, just by chance, since it is in the dist right now (and will vanish again).

I don't say there must be #ifdefs all over the place, but can I build without *using* Unicode? I don't want to introduce something new to my users that they didn't ask for. And I don't want to take care of their installations. Finally, I will for sure not replace a 500k DLL with a 1.2M monster, so this is definitely not what I want at the moment.

How do I build a dist that doesn't need to change a lot of stuff in the user's installation? Note that Stackless Python is a drop-in replacement, not a Python distribution. Or should it be?

ciao - chris (who really wants to get SLP 1.1 out)
--
Christian Tismer :^) mailto:tismer@appliedbiometrics.com
Applied Biometrics GmbH : Have a break! Take a ride on Python's
Kaunstr. 26 : *Starship* http://starship.python.net
14163 Berlin : PGP key -> http://wwwkeys.pgp.net
PGP Fingerprint E182 71C7 1A9D 66E9 9D15 D3CC D4D7 93E2 1FAE F6DF
we're tired of banana software - shipped green, ripens at home
Christian Tismer wrote:
"M.-A. Lemburg" wrote:
Christian Tismer wrote:
...
Absolutely great stuff! But this will most probably change a number of times again, and I think it is a bad idea when I include it into my Stackless distribution.
Why not? All you have to do is rebuild the distribution every time you push a new version -- just like I did for the Unicode version before the CVS checkin was done.
But how can I then publish my source code, when I always pull Unicode into it? I don't like to be exposed to side effects like 700kb code bloat, just by chance, since it is in the dist right now (and will vanish again).
All you have to do is build the unicodedata module shared and not statically bound into python.dll. This one module causes most of the code bloat...
I don't say there must be #ifdefs all and everywhere, but can I build without *using* Unicode? I don't want to introduce something new to my users what they didn't ask for. And I don't want to take care about their installations. Finally I will for sure not replace a 500k DLL by a 1.2M monster, so this is definitely not what I want at the moment.
How do I build a dist that doesn't need to change a lot of stuff in the user's installation?
I don't think that the Unicode stuff will disable the running environment... (haven't tried this though). The unicodedata module is not used by the interpreter and the rest is imported on-the-fly, not during init time, so at least in theory, not using Unicode will result in Python not looking for e.g. the encodings package.
Note that Stackless Python is a drop-in replacement, not a Python distribution. Or should it be?
Probably... I think it's simply easier to install and probably also easier to maintain because it doesn't cause dependencies on other "default" installations. The user will then explicitly know that she is installing something a little different from the default distribution... -- Marc-Andre Lemburg ______________________________________________________________________ Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/
CT:
How do I build a dist that doesn't need to change a lot of stuff in the user's installation?
somewhere in this thread, Guido wrote:
BTW, I added a tag "pre-unicode" to the CVS tree to the revisions before the Unicode changes were made.
maybe you could base SLP on that one? </F>
Fredrik Lundh wrote:
CT:
How do I build a dist that doesn't need to change a lot of stuff in the user's installation?
somewhere in this thread, Guido wrote:
BTW, I added a tag "pre-unicode" to the CVS tree to the revisions before the Unicode changes were made.
maybe you could base SLP on that one?
I have no idea how this works. Would this mean that I cannot get patches which come after Unicode?

Meanwhile, I've looked into the sources. It is easy for me to get rid of the problem by supplying my own unicodedata.c, where I replace all functions by some unimplemented exception.

Furthermore, I wondered about the data format. Is the Unicode database used in your package as well? Otherwise, I see only references from unicodedata.c, and that means the data structure can be massively enhanced. At the moment, that baby is 64k entries long, with four bytes and an optional string. This is a big waste. The strings are almost all some distinct <xxx> prefixes, together with a list of hex smallwords. This is done as strings; probably this makes 80 percent of the space.

The only function that uses the "decomposition" field (namely the string) is unicodedata_decomposition. It does nothing more than to wrap it into a PyObject. We can do a little better here. I guess I can bring it down to a third of this space without much effort, just by using
- binary encoding for the <xxx> tags as enumeration
- binary encoding of the hexed entries
- omission of the spaces
Instead of 64k of structures which contain pointers anyway, I can use a 64k pointer array with offsets into one packed table.

The unicodedata access functions would change *slightly*, just building some hex strings and so on. I guess this is not a time critical section? Should I try this evening? :-)

cheers - chris

-- 
Christian Tismer :^) mailto:tismer@appliedbiometrics.com
Applied Biometrics GmbH : Have a break! Take a ride on Python's
Kaunstr. 26 : *Starship* http://starship.python.net
14163 Berlin : PGP key -> http://wwwkeys.pgp.net
PGP Fingerprint E182 71C7 1A9D 66E9 9D15 D3CC D4D7 93E2 1FAE F6DF
we're tired of banana software - shipped green, ripens at home
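Christian's packing idea can be illustrated with a small sketch; the tag list and helper names below are hypothetical, but it shows all three savings he lists (enumerated tags, binary code points, no spaces):

```python
import struct

# Sketch of packing one UnicodeData decomposition string such as
# "<compat> 0020 0308": the <xxx> tag becomes a one-byte enumeration
# code, each hex word becomes a 16-bit integer, and the spaces vanish.
TAGS = ["", "<compat>", "<noBreak>", "<super>", "<sub>", "<fraction>"]

def pack(decomp):
    """Encode: tag code, word count, then the code points as big-endian shorts."""
    parts = decomp.split()
    if parts and parts[0].startswith("<"):
        tag, words = TAGS.index(parts[0]), parts[1:]
    else:
        tag, words = 0, parts
    data = struct.pack("BB", tag, len(words))
    for w in words:
        data += struct.pack(">H", int(w, 16))
    return data

def unpack(data):
    """Decode back into the original text form, as the access function would."""
    tag, n = struct.unpack("BB", data[:2])
    words = struct.unpack(">%dH" % n, data[2:2 + 2 * n])
    parts = ([TAGS[tag]] if tag else []) + ["%04X" % w for w in words]
    return " ".join(parts)

packed = pack("<compat> 0020 0308")
assert unpack(packed) == "<compat> 0020 0308"   # lossless round trip
assert len(packed) == 6                         # vs. 18 bytes as a string
```

In the real module these packed records would live in one byte table, with the per-character array holding only offsets into it, exactly the pointer-array layout Christian proposes.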
Christian Tismer wrote:
Fredrik Lundh wrote:
CT:
How do I build a dist that doesn't need to change a lot of stuff in the user's installation?
somewhere in this thread, Guido wrote:
BTW, I added a tag "pre-unicode" to the CVS tree to the revisions before the Unicode changes were made.
maybe you could base SLP on that one?
I have no idea how this works. Would this mean that I cannot get patches which come after Unicode?
Meanwhile, I've looked into the sources. It is easy for me to get rid of the problem by supplying my own unicodedata.c, where I replace all functions by some unimplemented exception.
No need (see my other posting): simply disable the module altogether... this shouldn't hurt any part of the interpreter as the module is a user-land only module.
Furthermore, I wondered about the data format. Is the Unicode database used in your package as well? Otherwise, I see only references from unicodedata.c, and that means the data structure can be massively enhanced. At the moment, that baby is 64k entries long, with four bytes and an optional string. This is a big waste. The strings are almost all some distinct <xxx> prefixes, together with a list of hex smallwords. This is done as strings; probably this makes 80 percent of the space.
I have made no attempt to optimize the structure... (due to lack of time mostly) the current implementation is really not much different from a rewrite of the UnicodeData.txt file available at the unicode.org site. If you want to, I can mail you the marshalled Python dict version of that database to play with.
The only function that uses the "decomposition" field (namely the string) is unicodedata_decomposition. It does nothing more than to wrap it into a PyObject. We can do a little better here. I guess I can bring it down to a third of this space without much effort, just by using
- binary encoding for the <xxx> tags as enumeration
- binary encoding of the hexed entries
- omission of the spaces
Instead of 64k of structures which contain pointers anyway, I can use a 64k pointer array with offsets into one packed table.
The unicodedata access functions would change *slightly*, just building some hex strings and so on. I guess this is not a time critical section?
It may be if these functions are used in codecs, so you should pay attention to speed too...
Should I try this evening? :-)
Sure :-) go ahead...

-- 
Marc-Andre Lemburg
Business: http://www.lemburg.com/
Python Pages: http://www.lemburg.com/python/
"M.-A. Lemburg" wrote:
Christian Tismer wrote:
[the old data compression guy has been reanimated]
If you want to, I can mail you the marshalled Python dict version of that database to play with. ...
Should I try this evening? :-)
Sure :-) go ahead...
Thank you. Meanwhile I've heard that there is some well-known bot working on that under the hood, with a much better approach than mine. So I'll take your advice, and continue to write silly stackless enhancements. They say this is my destiny :-)

ciao - continuous

-- 
Christian Tismer :^) mailto:tismer@appliedbiometrics.com
Applied Biometrics GmbH : Have a break! Take a ride on Python's
Kaunstr. 26 : *Starship* http://starship.python.net
14163 Berlin : PGP key -> http://wwwkeys.pgp.net
PGP Fingerprint E182 71C7 1A9D 66E9 9D15 D3CC D4D7 93E2 1FAE F6DF
we're tired of banana software - shipped green, ripens at home
"FL" == Fredrik Lundh
writes:
FL> somewhere in this thread, Guido wrote:
>> BTW, I added a tag "pre-unicode" to the CVS tree to the revisions
>> before the Unicode changes were made.
FL> maybe you could base SLP on that one?

/F's got it exactly right. Check out a new directory using a stable tag (maybe you want to base your changes on the pre-unicode tag, or Python 1.5.2?). Patch in that subtree and then eventually you'll have to merge your changes into the head of the branch.

-Barry
On Wed, 15 Mar 2000, Christian Tismer wrote:
... Would it be possible to make the Unicode support configurable?
This might be interesting from the standpoint of those guys who are doing the tiny Python interpreter thingy for embedded systems.
My problem is that patches in the CVS are of different kinds. Some are error corrections and enhancements which I would definitely like to use. Others are brand new features like the Unicode support. Absolutely great stuff! But this will most probably change a number of times again, and I think it is a bad idea to include it in my Stackless distribution.
I'd appreciate it very much if I could use the same CVS tree for testing new stuff, and to build my distribution, with new features switched off. Please :-)
But! I find this reason completely off the mark. In essence, you're arguing that we should not put *any* new feature into the CVS repository because it might mess up what *you* are doing. Sorry, but that just irks me. If you want a stable Python, then don't use the CVS version. Or base it off a specific tag in CVS. Or something. Just don't ask for development to be stopped.

Cheers,
-g
-- 
Greg Stein, http://www.lyra.org/
Greg Stein wrote:
On Wed, 15 Mar 2000, Christian Tismer wrote:
... Would it be possible to make the Unicode support configurable?
This might be interesting from the standpoint of those guys who are doing the tiny Python interpreter thingy for embedded systems.
My problem is that patches in the CVS are of different kinds. Some are error corrections and enhancements which I would definitely like to use. Others are brand new features like the Unicode support. Absolutely great stuff! But this will most probably change a number of times again, and I think it is a bad idea to include it in my Stackless distribution.
I'd appreciate it very much if I could use the same CVS tree for testing new stuff, and to build my distribution, with new features switched off. Please :-)
But! I find this reason completely off the mark. In essence, you're arguing that we should not put *any* new feature into the CVS repository because it might mess up what *you* are doing.
No, this is your interpretation, and a reduction which I can't follow. There are improvements and features in the CVS version which I need. I prefer to build against it, instead of the old 1.5.2. What's wrong with that? I want to find a way that gives me the least trouble in doing so.
Sorry, but that just irks me. If you want a stable Python, then don't use the CVS version. Or base it off a specific tag in CVS. Or something. Just don't ask for development to be stopped.
No, I ask for development to be stopped. Code freeze until Y3k :-) Why are you trying to put such nonsense into my mouth? You know that I know that you know better.

ciao - chris

-- 
Christian Tismer :^) mailto:tismer@appliedbiometrics.com
Applied Biometrics GmbH : Have a break! Take a ride on Python's
Kaunstr. 26 : *Starship* http://starship.python.net
14163 Berlin : PGP key -> http://wwwkeys.pgp.net
PGP Fingerprint E182 71C7 1A9D 66E9 9D15 D3CC D4D7 93E2 1FAE F6DF
we're tired of banana software - shipped green, ripens at home
On Thu, 16 Mar 2000, Christian Tismer wrote:
Greg Stein wrote: ...
Sorry, but that just irks me. If you want a stable Python, then don't use the CVS version. Or base it off a specific tag in CVS. Or something. Just don't ask for development to be stopped.
No, I ask for development to be stopped. Code freeze until Y3k :-) Why are you trying to put such a nonsense into my mouth? You know that I know that you know better.
Simply because that is what it sounds like on this side of my monitor :-) I'm seeing your request as asking for people to make special considerations in their patches for your custom distribution. While I don't have a problem with making Python more flexible for distro maintainers, it seemed like you were approaching it from the "wrong" angle. Like I said, making Unicode optional for the embedded space makes sense; making it optional so it doesn't bloat your distro didn't :-)

Not a big deal... it is mostly a perception on my part. I also tend to dislike things that hold development back.

Cheers,
-g
-- 
Greg Stein, http://www.lyra.org/
participants (11)
- Andrew M. Kuchling
- Barry A. Warsaw
- Christian Tismer
- David Ascher
- Fredrik Lundh
- Greg Stein
- Guido van Rossum
- Jean-Claude Wippler
- M.-A. Lemburg
- pf@artcom-gmbh.de
- Vladimir Marangozov