[Python-Dev] Can't compile _tkinter.c with Redhat 9 (post-SF#719880)
Jeff Hobbs
jeff@hobbs.org
Mon, 16 Jun 2003 09:40:26 -0700
> From: martin@v.loewis.de
> Jeff Hobbs writes:
>
> > Can someone explain to me why moving to UCS-4 is a good thing?
>
> Because it simplifies processing of non-BMP characters, as it restores
> the property that you get one Unicode character per string index.
Right, fair enough, that's all well understood - when you have to
deal with characters between U+10000 and U+10FFFF. It was only
recently that such characters existed in more than a sprinkling.
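
To make the indexing point concrete, here is a minimal Python sketch. It is not Tcl or Python internals, and the helper names are made up for illustration. It counts how many 16-bit and 32-bit code units a string needs, and shows the surrogate pair that a 16-bit representation has to use for a non-BMP character:

```python
def utf16_units(text):
    """Number of 16-bit code units needed to store text as UTF-16."""
    return len(text.encode("utf-16-le")) // 2


def utf32_units(text):
    """Number of 32-bit code units needed to store text as UCS-4/UTF-32."""
    return len(text.encode("utf-32-le")) // 4


def surrogate_pair(code_point):
    """Split a non-BMP code point into its UTF-16 (high, low) surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points U+10000..U+10FFFF need surrogates")
    offset = code_point - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


bmp = "\u00e9"            # U+00E9, inside the BMP
linear_b = "\U00010000"   # U+10000, first Linear B syllable, outside the BMP

print(utf16_units(bmp), utf32_units(bmp))
print(utf16_units(linear_b), utf32_units(linear_b))
print([hex(unit) for unit in surrogate_pair(0x10000)])
```

With 32-bit units every character occupies exactly one index; with 16-bit units the Linear B character occupies two, which is the property Martin refers to.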
> > A Tcl_UniChar is 32 bits and TCL_UTF_MAX is 6 (normally it is 3),
> > which is the maximum number of UTF-8 bytes in one encoded character.
>
> Is that current code, or future code? How can I select a UCS-4 build
> during configuration? In what way is the supported mechanism different
> from the one that Redhat uses?
There is no "supported" UCS-4 mode for Tcl. You have to hand-twiddle
the sources, knowing where to poke. I can make the changes for 8.5
that allow for an easy configuration option to compile in UCS-4 mode.
I suppose I could also back-port it to 8.4.4. That won't address the
fact that we've never validated non-BMP support.
> I couldn't find definitive numbers on distribution over planes, but I
> found the following numbers:
> - Unicode 3.0 has 49194 assigned characters
> (http://www.unicode.org/versions/Unicode3.0.html)
> - Unicode 4.0 has 96248 graphic characters
> (http://www.unicode.org/versions/Unicode4.0.0/)
Right, and Unicode 4.0 is fresh out of diapers. You can't even get
the regular code charts yet, you have to view the 4.0 beta ones. With
4.0 the non-BMP planes finally get a notable number of characters, but they
are fairly weird ones that I'd be surprised to find a public font for.
You can see them at:
http://www.unicode.org/charts/u40-beta.html
They are the Linear B Syllabary on down.
> > The bigger issue is that in changing the basic Tcl_UniChar size, you
> > break the binary compatibility rules. RH9 is the only
> > version/distro to use 32-bit Tcl_UniChar, which breaks compatibility
> > with extensions built on other versions/distros.
>
> Indeed. Python has added explicit mechanisms to detect such breakage,
> by renaming all API functions depending on the width of a Unicode
> character. That, at least, allows the breakage to be detected at import
> time (missing symbols).
Tcl could do this, but we were very much taken by surprise that it
was pushed to use UCS-4 at all.
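
As a rough illustration of the renaming idea Martin describes, the following Python sketch simulates it. The real mechanism lives in C preprocessor macros and the dynamic linker; the helper functions `mangled` and `missing_symbols` are hypothetical and exist only for this example:

```python
def mangled(name, unichar_bytes):
    """Rename an API symbol so it carries the character width (2 or 4)."""
    prefix = "PyUnicode_"
    if not name.startswith(prefix):
        return name
    return "PyUnicodeUCS%d_%s" % (unichar_bytes, name[len(prefix):])


def missing_symbols(needed, exported):
    """Symbols an extension needs that the host build does not export."""
    return sorted(set(needed) - set(exported))


api = ["PyUnicode_FromUnicode", "PyUnicode_AsUnicode"]
host_exports = [mangled(name, 4) for name in api]  # interpreter built as UCS-4
ext_needs = [mangled(name, 2) for name in api]     # extension built for UCS-2

# A width mismatch shows up as unresolved names rather than silent corruption.
print(missing_symbols(ext_needs, host_exports))
```

An extension compiled for the other width asks for names the host never exports, so loading fails up front instead of misreading character buffers later.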
> > Checking on a rebuild now, it does appear that Tk operates just
> > fine. However, it does consume a lot more memory.
>
> When I tested it, I found that it would break very easily. I was using
> the Redhat procedure, though, so I might have done something wrong.
Can you feed me some sample scripts offline to test with?
> > I finally found the source RPMs for Tcl that RH9 uses and checked
> > out their patch. It's not even correct. You have to modify
> > tcl/generic/regcustom.h as well to account for Tcl_UniChar being
> > 32 bits.
>
> What is the specific change that one has to make? "You have to edit
> multiple files to activate a feature" is a strange way of supporting
> it...
Ha ha ... well, I did say it was never properly supported. That
no one bothered to ask how to do it correctly when that was clear is
not a good thing. What you have to do is modify generic/tcl.h to
set TCL_UTF_MAX to 6, typedef Tcl_UniChar as unsigned int (RH used
wchar_t), and then modify the bottom of generic/regcustom.h, where
you will see 3 lines that need changes for the new size of CHR
(which is Tcl_UniChar for the RE).
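
For readers wondering where the values 3 and 6 come from, the sketch below computes the encoded length of a code point under the original 1-to-6 byte UTF-8 scheme. This is illustrative Python, not Tcl source, and `utf8_length` is a made-up helper; which sequences a given Tcl build actually accepts is not something this thread establishes.

```python
def utf8_length(code_point):
    """Bytes needed for a code point under the original 1-to-6 byte UTF-8 scheme."""
    if code_point < 0:
        raise ValueError("code point must be non-negative")
    # Largest value representable with 1, 2, ... 6 bytes.
    limits = [0x7F, 0x7FF, 0xFFFF, 0x1FFFFF, 0x3FFFFFF, 0x7FFFFFFF]
    for nbytes, limit in enumerate(limits, start=1):
        if code_point <= limit:
            return nbytes
    raise ValueError("code point does not fit in 31 bits")


for cp in (0x41, 0xFFFF, 0x10000, 0x10FFFF, 0x7FFFFFFF):
    print(hex(cp), utf8_length(cp))
```

Everything in the BMP fits in 3 bytes, the range U+10000 to U+10FFFF needs 4, and 6 is the longest sequence the original scheme defines for a 31-bit value.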
Of course, that's what I think is needed. It should probably then
get extended tests for more characters and further expectations.
We should probably add a tcl_platform(unicharSize) var or something
so that users at the Tcl level know this as well. Again, this is
only something that I have tinkered with - not extensively tested.
Regards,
Jeff Hobbs The Tcl Guy
Senior Developer http://www.ActiveState.com/