"Jeff Hobbs" jeffh@ActiveState.com writes:
Can someone explain to me why moving to UCS-4 is a good thing?
Because it simplifies processing of non-BMP characters, as it restores the property that you get one Unicode character per string index.
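To make the indexing point concrete, here is a minimal Python sketch (not Tcl code; the helper names `utf16_units` and `code_points` are made up for this illustration). It compares the number of 16-bit units a string needs with the number of code points it contains, which is what a UCS-4 string is indexed by.

```python
def utf16_units(text):
    """Number of 16-bit code units needed to store text as UTF-16."""
    return len(text.encode("utf-16-le")) // 2


def code_points(text):
    """Number of Unicode code points, i.e. the length of a UCS-4 string."""
    return len(text.encode("utf-32-le")) // 4


bmp_char = "\u20ac"          # EURO SIGN, inside the BMP
non_bmp_char = "\U0001D11E"  # MUSICAL SYMBOL G CLEF, outside the BMP

for ch in (bmp_char, non_bmp_char):
    # A non-BMP character needs a surrogate pair (two units) in UTF-16,
    # so "one character per index" only holds with 32-bit units.
    print(hex(ord(ch)), utf16_units(ch), code_points(ch))
```

With 16-bit units, the non-BMP character occupies two string indices; with 32-bit units, every character occupies exactly one.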
What UCS-4 support are you looking for that doesn't seem to exist?
It crashes when fed non-BMP characters. In addition, it lacks a configuration option, or any kind of documentation telling packagers how to build a UCS-4 Tcl/Tk.
While Tcl is agnostic about non-BMP chars (all 2 of them ... ha ha), it does have correct UCS-4 support (not completely though with how RedHat patched it). This has been discussed before briefly here:
Which of the follow-up messages do you consider reliable information in this report? davygrvy's comments appear to be irrelevant, as they talk about Unicode 3.0; keithp's likewise. Your own comment appears to talk about possible future changes, instead of the current code.
A Tcl_UniChar is 32 bits and TCL_UTF_MAX is 6 (normally it is 3), which is the maximum number of UTF-8 bytes in the sequence for a single character.
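The values 3 and 6 follow from UTF-8 byte-length arithmetic: a 16-bit value never needs more than 3 bytes, while the original UTF-8 definition allowed up to 6 bytes to cover 31-bit values. The following Python sketch illustrates that arithmetic only; it is not Tcl's implementation, and the function name `utf8_length` is hypothetical.

```python
def utf8_length(code_point):
    """Bytes needed to encode code_point under the original UTF-8 scheme,
    which allowed sequences of up to 6 bytes for 31-bit values."""
    if code_point < 0:
        raise ValueError("code point must be non-negative")
    # Largest value representable with 1, 2, ..., 6 bytes.
    limits = [0x7F, 0x7FF, 0xFFFF, 0x1FFFFF, 0x3FFFFFF, 0x7FFFFFFF]
    for n_bytes, limit in enumerate(limits, start=1):
        if code_point <= limit:
            return n_bytes
    raise ValueError("code point does not fit in 31 bits")


print(utf8_length(0xFFFF))      # largest 16-bit value
print(utf8_length(0x1D11E))     # a non-BMP character
print(utf8_length(0x7FFFFFFF))  # largest 31-bit value
```

Current Unicode stops at U+10FFFF, so real characters never need more than 4 bytes; the 5- and 6-byte forms exist only in the older, wider definition.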
Is that current code, or future code? How can I select a UCS-4 build during configuration? In what way is the supported mechanism different from the one that Redhat uses?
I do realize that correct handling of non-BMP characters requires some more work, but that is orthogonal to this issue. While UCS-4 opens up more code points to allow non-BMP chars, there are very few in that range at this point.
I couldn't find definitive numbers on distribution over planes, but I found the following numbers:
- Unicode 3.0 has 49194 assigned characters (http://www.unicode.org/versions/Unicode3.0.html)
- Unicode 4.0 has 96248 graphic characters (http://www.unicode.org/versions/Unicode4.0.0/)
I don't know how many of the new assignments are in the BMP, but it appears that there are roughly as many assigned BMP characters as there are assigned characters outside the BMP.
The bigger issue is that in changing the basic Tcl_UniChar size, you break the binary compatibility rules. RH9 is the only version/distro to use 32-bit Tcl_UniChar, which breaks compatibility with extensions built on other versions/distros.
Indeed. Python has added explicit mechanisms to detect such breakage, by renaming all API functions depending on the width of a Unicode character. That, at least, allows the breakage to be detected at import time (missing symbols).
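Separately from the import-time symbol check described above, the Unicode width of a Python build can be inspected at run time through `sys.maxunicode`. The sketch below is only an illustration; the helper name `unicode_build_width` is made up here. Note that Python 3.3 and later always report 0x10FFFF, so the distinction only shows up on older interpreters.

```python
import sys


def unicode_build_width():
    """Return 2 for a narrow (UCS-2) build and 4 for a wide (UCS-4) build.

    Narrow builds report sys.maxunicode == 0xFFFF; wide builds, and every
    Python from 3.3 on, report 0x10FFFF.
    """
    return 4 if sys.maxunicode > 0xFFFF else 2


print("bytes per Unicode character in this build:", unicode_build_width())
```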
Also, while Tcl builds and works just fine with a 32-bit Tcl_UniChar, I don't recall testing Tk when I tested Tcl. Checking on a rebuild now, it does appear that Tk operates just fine. However, it does consume a lot more memory.
When I tested it, I found that it would break very easily. I was using the Redhat procedure, though, so I might have done something wrong.
I finally found the source RPMs for Tcl that RH9 uses and checked out their patch. It's not even correct. You have to modify tcl/generic/regcustom.h as well to account for Tcl_UniChar being 32 bits.
What is the specific change that one has to make? "You have to edit multiple files to activate a feature" is a strange way of supporting it...
IOW, it's very annoying to me that someone at RedHat went blundering around in the dark making these modifications when it is fairly easy to find and communicate with the core developers on the what, how and why of doing things correctly.
Indeed. In the specific case, they made the Tcl change to support UCS-4 Python, when it would have been cleaner, IMO, to fix _tkinter. Alas, they did not contact us, either.