From: firstname.lastname@example.org Jeff Hobbs writes:
Can someone explain to me why moving to UCS-4 is a good thing?
Because it simplifies processing of non-BMP characters, as it restores the property that you get one Unicode character per string index.
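To make the indexing point concrete, here is a minimal C sketch (not Tcl source; the names utf16_units and to_surrogates are made up for illustration). It shows how a code point above U+FFFF occupies two 16-bit units under UTF-16, whereas a 32-bit unit would hold it directly:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper type: the two 16-bit halves of a surrogate pair. */
struct surrogate_pair {
    uint16_t hi;
    uint16_t lo;
};

/* Number of 16-bit units one code point needs in UTF-16. */
size_t utf16_units(uint32_t cp)
{
    return (cp >= 0x10000) ? 2 : 1;
}

/* Split a non-BMP code point (U+10000..U+10FFFF) into its surrogate pair.
 * The caller is assumed to pass a value in that range. */
struct surrogate_pair to_surrogates(uint32_t cp)
{
    struct surrogate_pair p;
    uint32_t v = cp - 0x10000;               /* 20 significant bits remain */
    p.hi = (uint16_t)(0xD800 + (v >> 10));   /* top 10 bits */
    p.lo = (uint16_t)(0xDC00 + (v & 0x3FF)); /* bottom 10 bits */
    return p;
}
```

With 16-bit units, string index i no longer corresponds to character i once such a pair appears in the string; with 32-bit units it always does.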
Right, fair enough, that's all well understood - when you have to deal with characters between U+10000 and U+10FFFF. It was only recently that such characters existed in more than a sprinkling.
A Tcl_UniChar is 32 bits and TCL_UTF_MAX is 6 (normally it is 3), which is the maximum number of UTF-8 bytes in the sequence for a single character.
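For reference, a small C sketch of where the 3 and the 6 come from, assuming the original UTF-8 scheme that extends to 31-bit values. utf8_len is a hypothetical helper, not a Tcl function:

```c
#include <stdint.h>

/* Bytes needed to encode one code point in UTF-8, using the original
 * 31-bit definition of the encoding. Returns 0 for values that do not fit. */
int utf8_len(uint32_t cp)
{
    if (cp < 0x80)        return 1;  /* ASCII */
    if (cp < 0x800)       return 2;
    if (cp < 0x10000)     return 3;  /* rest of the BMP: the "3" case */
    if (cp < 0x200000)    return 4;  /* covers everything up to U+10FFFF */
    if (cp < 0x4000000)   return 5;
    if (cp <= 0x7FFFFFFF) return 6;  /* full 31-bit range: the "6" case */
    return 0;
}
```

Note that Unicode itself stops at U+10FFFF, so no assigned character needs more than 4 bytes; a limit of 6 simply covers the whole 31-bit range.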
Is that current code, or future code? How can I select a UCS-4 build during configuration? In what way is the supported mechanism different from the one that Redhat uses?
There is no "supported" UCS-4 mode for Tcl. You have to hand-twiddle the sources, knowing where to poke. I can make the changes for 8.5 that allow for an easy configuration option to compile in UCS-4 mode. I suppose I could also back-port it to 8.4.4. That won't address the fact that we've never validated non-BMP support.
I couldn't find definitive numbers on distribution over planes, but I found the following numbers:
Right, and Unicode 4.0 is fresh out of diapers. You can't even get the regular code charts yet; you have to view the 4.0 beta ones. With 4.0 the non-BMP planes finally get a notable number of characters, but they are fairly weird ones that I'd be surprised to find a public font for. You can see them at: http://www.unicode.org/charts/u40-beta.html They are the Linear B Syllabary on down.
The bigger issue is that in changing the basic Tcl_UniChar size, you break the binary compatibility rules. RH9 is the only version/distro to use 32-bit Tcl_UniChar, which breaks compatibility with extensions built on other versions/distros.
Indeed. Python has added explicit mechanisms to detect such breakage, by renaming all API functions depending on the width of a Unicode character. That, at least, allows the breakage to be detected at import time (missing symbols).
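The renaming trick can be sketched in C roughly as follows. This is an illustration only: DEMO_UNICHAR_SIZE, demo_unichar and demo_strlen are hypothetical names, not Python's or Tcl's actual API.

```c
#include <stddef.h>

/* Hypothetical build-time switch: bytes per Unicode unit. */
#ifndef DEMO_UNICHAR_SIZE
#define DEMO_UNICHAR_SIZE 4
#endif

/* The public name maps to a width-specific linker symbol, so an extension
 * compiled for one width cannot resolve its symbols against a library
 * compiled for the other. */
#if DEMO_UNICHAR_SIZE == 4
typedef unsigned int demo_unichar;
#define demo_strlen demoUCS4_strlen
#else
typedef unsigned short demo_unichar;
#define demo_strlen demoUCS2_strlen
#endif

/* Length, in units, of a zero-terminated string of demo_unichar. */
size_t demo_strlen(const demo_unichar *s)
{
    size_t n = 0;
    while (s[n] != 0) {
        n++;
    }
    return n;
}
```

Callers keep writing demo_strlen; only the exported symbol differs, so a width mismatch shows up as a load failure rather than as silent memory corruption.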
Tcl could do this, but we were very much taken by surprise that it was pushed into using UCS-4 at all.
Checking on a rebuild now, it does appear that Tk operates just fine. However, it does consume a lot more memory.
When I tested it, I found that it would break very easily. I was using the Redhat procedure, though, so I might have done something wrong.
Can you feed me some sample scripts offline to test with?
I finally found the source RPMs for Tcl that RH9 uses and checked out their patch. It's not even correct. You have to modify tcl/generic/regcustom.h as well to account for Tcl_UniChar being 32 bits.
What is the specific change that one has to make? "You have to edit multiple files to activate a feature" is a strange way of supporting it...
Ha ha ... well, I did say it was never properly supported. That no one bothered to ask how to do it correctly when that was clear is not a good thing. What you have to do is modify generic/tcl.h to set TCL_UTF_MAX to 6 and typedef Tcl_UniChar as unsigned int (RH used wchar_t), and then modify the bottom of generic/regcustom.h, where you will see 3 lines that need mods for the change in size of CHR (which is Tcl_UniChar for the RE).
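As a rough standalone sketch of the first two settings (stand-ins written for illustration, not copied from Tcl's headers; the regcustom.h lines are deliberately not reproduced here), one could pair the definitions with a compile-time check so that the two cannot drift apart:

```c
#include <limits.h>

/* Stand-ins for the two settings named above. In a real tree they live in
 * generic/tcl.h; these lines are an assumption about their shape. */
#define TCL_UTF_MAX 6
typedef unsigned int Tcl_UniChar;

/* Hypothetical guard (C11): a 6-byte UTF-8 limit only makes sense if the
 * character type is at least 32 bits wide, and vice versa. */
_Static_assert((TCL_UTF_MAX > 3) == (sizeof(Tcl_UniChar) * CHAR_BIT >= 32),
               "TCL_UTF_MAX and the width of Tcl_UniChar disagree");
```

A guard of this kind would fail the build when only one of the two settings has been changed, instead of leaving a half-patched build to misbehave at run time.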
Of course, that's what I think is needed. It should probably then get extended tests for more characters and further expectations. We should probably add a tcl_platform(unicharSize) var or something so that users at the Tcl level know this as well. Again, this is only something that I have tinkered with - not extensively tested.
Jeff Hobbs The Tcl Guy Senior Developer http://www.ActiveState.com/