[Python-Dev] Can't compile _tkinter.c with Redhat 9 (post-SF#719880)

Martin v. Löwis martin@v.loewis.de
16 Jun 2003 09:09:46 +0200


"Jeff Hobbs" <jeffh@ActiveState.com> writes:

> Can someone explain to me why moving to UCS-4 is a good thing?  

Because it simplifies processing of non-BMP characters, as it restores
the property that you get one Unicode character per string index.
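To make the indexing property concrete, here is a minimal sketch in present-day Python (an assumption; it is not the interpreter under discussion) that counts how many fixed-width code units one character occupies with 16-bit and with 32-bit units. The helper name code_units is made up for illustration.

```python
def code_units(text, width):
    """Return the number of fixed-width code units needed to store text.

    width is 16 (UTF-16, as in a 16-bit build) or 32 (UCS-4/UTF-32).
    """
    encoding = {16: "utf-16-le", 32: "utf-32-le"}[width]
    return len(text.encode(encoding)) // (width // 8)


bmp_char = "\u20ac"          # EURO SIGN, inside the BMP
non_bmp_char = "\U00010400"  # DESERET CAPITAL LETTER LONG I, outside the BMP

print(code_units(bmp_char, 16), code_units(bmp_char, 32))          # 1 1
print(code_units(non_bmp_char, 16), code_units(non_bmp_char, 32))  # 2 1
```

With 16-bit units the non-BMP character takes two indices (a surrogate pair); with 32-bit units every character takes exactly one.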

> What UCS-4 support are you looking for that doesn't seem to exist?

It crashes when fed non-BMP characters. In addition, it lacks a
configuration option, or any kind of documentation telling packagers
how to build a UCS-4 Tcl/Tk.

> While Tcl is agnostic about non-BMP chars (all 2 of them ... ha ha),
> it does have correct UCS-4 support (not completely though with how
> RedHat patched it).  This has been discussed before briefly here:
> 
> https://sourceforge.net/tracker/?func=detail&aid=578030&group_id=10894&atid=110894

Which of the follow-up messages in this report do you consider
reliable information? davygrvy's comments appear to be irrelevant, as
they talk about Unicode 3.0, and so do keithp's. Your own comment
appears to talk about possible future changes, not the current code.

> A Tcl_UniChar is 32-bits and TCL_UTF_MAX is 6 (normally it is 3),
> which represents the number of utf-8 bytes that are valid in sequence.

Is that current code, or future code? How can I select a UCS-4 build
during configuration? In what way is the supported mechanism different
from the one that Redhat uses?
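As background to the two TCL_UTF_MAX values quoted above, here is a rough Python sketch of how many bytes a code point needs under the original UTF-8 scheme, which covered values up to 31 bits. It assumes TCL_UTF_MAX means "maximum UTF-8 bytes per character", as the quoted text describes; the function name utf8_length is hypothetical and this is not Tcl's code.

```python
def utf8_length(code_point):
    """Bytes needed for code_point under the original (up to 31-bit) UTF-8 scheme."""
    if code_point < 0:
        raise ValueError("code point must be non-negative")
    # Largest value representable with 1, 2, ... 6 bytes.
    limits = [0x7F, 0x7FF, 0xFFFF, 0x1FFFFF, 0x3FFFFFF, 0x7FFFFFFF]
    for length, limit in enumerate(limits, start=1):
        if code_point <= limit:
            return length
    raise ValueError("code point does not fit in 31 bits")


print(utf8_length(0xFFFF))      # 3: the largest BMP code point
print(utf8_length(0x10FFFF))    # 4: the largest Unicode code point
print(utf8_length(0x7FFFFFFF))  # 6: the largest 31-bit value
```

A limit of 3 bytes is therefore enough for the BMP only, while 6 bytes covers any value that fits a 32-bit character type.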

> I do realize that correct handling of non-BMP characters requires
> some more work, but that is orthogonal to this issue.  While UCS-4
> opens up more code points to allow non-BMP chars, there are very few
> in that range at this point.  

I couldn't find definitive numbers on distribution over planes, but I
found the following numbers:
- Unicode 3.0 has 49194 assigned characters
  (http://www.unicode.org/versions/Unicode3.0.html)
- Unicode 4.0 has 96248 graphic characters
  (http://www.unicode.org/versions/Unicode4.0.0/)

I don't know how many of the new assignments are in the BMP, but it
appears that there are roughly as many assigned BMP characters as
there are assigned characters outside the BMP.
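The arithmetic behind that estimate can be written out as a short Python sketch. It treats the two figures above as comparable (one counts assigned characters, the other graphic characters) and assumes, as an upper bound, that every addition since 3.0 landed outside the BMP; both are assumptions, not established numbers.

```python
unicode_3_0_assigned = 49194  # figure quoted above for Unicode 3.0
unicode_4_0_graphic = 96248   # figure quoted above for Unicode 4.0

# Characters added between the two versions, under the assumptions stated above.
added = unicode_4_0_graphic - unicode_3_0_assigned

print(added)                                   # 47054
print(round(added / unicode_3_0_assigned, 2))  # 0.96
```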

> The bigger issue is that in changing the basic Tcl_UniChar size, you
> break the binary compatibility rules.  RH9 is the only
> version/distro to use 32-bit Tcl_UniChar, which breaks compatibility
> with extensions built on other versions/distros.

Indeed. Python has added explicit mechanisms to detect such breakage,
by renaming all API functions depending on the width of a Unicode
character. That, at least, allows the breakage to be detected at
import time (missing symbols).
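The effect of that renaming can be modelled with a small Python sketch. This is a hypothetical simulation, not CPython's implementation: the helpers mangle, exported_symbols and load_extension are invented for illustration, and only the idea of embedding the width in the symbol name (for example PyUnicodeUCS2_FromUnicode) reflects the scheme described above.

```python
# Hypothetical model of width-dependent symbol names; not CPython's actual code.

def mangle(api_name, unicode_width):
    """Embed the Unicode width (2 or 4 bytes) in an exported symbol name."""
    prefix = "PyUnicode"
    rest = api_name[len(prefix):]
    return f"{prefix}UCS{unicode_width}{rest}"


def exported_symbols(unicode_width):
    """Symbols a (hypothetical) interpreter build of that width would export."""
    api = ["PyUnicode_FromUnicode", "PyUnicode_AsUnicode"]
    return {mangle(name, unicode_width) for name in api}


def load_extension(required, interpreter_width):
    """Fail the way a dynamic linker would if a required symbol is missing."""
    missing = sorted(set(required) - exported_symbols(interpreter_width))
    if missing:
        raise ImportError("undefined symbol: " + missing[0])
    return True


# An extension compiled against a 2-byte build refers to the UCS2 names.
extension_needs = [mangle("PyUnicode_FromUnicode", 2)]

print(load_extension(extension_needs, 2))  # True
try:
    load_extension(extension_needs, 4)
except ImportError as exc:
    print(exc)  # undefined symbol: PyUnicodeUCS2_FromUnicode
```

The mismatch surfaces as a load-time error instead of silent memory corruption, which is the point of the renaming.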

> Also, while Tcl can build and works just fine with 32-bit
> Tcl_UniChar, I don't recall testing Tk when I tested Tcl.
> Checking on a rebuild now, it does appear that Tk operates just
> fine.  However, it does consume a lot more memory.

When I tested it, I found that it would break very easily. I was using
the Redhat procedure, though, so I might have done something wrong.

> I finally found the source RPMs for Tcl that RH9 uses and checked
> out their patch.  It's not even correct.  You have to modify
> tcl/generic/regcustom.h as well to account for Tcl_UniChar being
> 32-bits.  

What is the specific change that one has to make? "You have to edit
multiple files to activate a feature" is a strange way of supporting
it...

> IOW, it's very annoying to me that someone at RedHat went
> blundering around in the dark making these modifications when it is
> fairly easy to find and communicate with the core developers on the
> what, how and why of doing things correctly.

Indeed. In this specific case, they made the Tcl change to support
UCS-4 Python, when it would have been cleaner, IMO, to fix
_tkinter. Alas, they did not contact us, either.

Regards,
Martin