
The current documentation for Py_UNICODE states: "This type represents a 16-bit unsigned storage type which is used by Python internally as basis for holding Unicode ordinals. On platforms where wchar_t is available and also has 16-bits, Py_UNICODE is a typedef alias for wchar_t to enhance native platform compatibility. On all other platforms, Py_UNICODE is a typedef alias for unsigned short." I propose changing this to: "This type represents the storage type which is used by Python internally as the basis for holding Unicode ordinals. On platforms where wchar_t is available, Py_UNICODE is a typedef alias for wchar_t to enhance native platform compatibility. On all other platforms, Py_UNICODE is a typedef alias for unsigned short. Extension module developers should make no assumptions about the size of this type on any given platform." If no one has a problem with that, I'll make the change in CVS. -- Nick

Nicholas Bastin <nbastin@opnet.com> writes:
The current documentation for Py_UNICODE states:
"This type represents a 16-bit unsigned storage type which is used by Python internally as basis for holding Unicode ordinals. On platforms where wchar_t is available and also has 16-bits, Py_UNICODE is a typedef alias for wchar_t to enhance native platform compatibility. On all other platforms, Py_UNICODE is a typedef alias for unsigned short."
I propose changing this to:
"This type represents the storage type which is used by Python internally as the basis for holding Unicode ordinals. On platforms where wchar_t is available, Py_UNICODE is a typedef alias for wchar_t to enhance native platform compatibility. On all other platforms, Py_UNICODE is a typedef alias for unsigned short. Extension module developers should make no assumptions about the size of this type on any given platform."
If no one has a problem with that, I'll make the change in CVS.
AFAIK, you can configure Python to use 16-bit or 32-bit Unicode chars, independent of the size of wchar_t. The HAVE_USABLE_WCHAR_T macro can be used by extension writers to determine whether Py_UNICODE is the same as wchar_t. At least that's my understanding, so the above still seems wrong. And +1 for trying to clean up this confusion. Thomas

Thomas Heller wrote:
AFAIK, you can configure Python to use 16-bit or 32-bit Unicode chars, independent of the size of wchar_t. The HAVE_USABLE_WCHAR_T macro can be used by extension writers to determine whether Py_UNICODE is the same as wchar_t.
note that "usable" is more than just "same size"; it also implies that the widechar predicates (iswalnum etc.) work properly with Unicode characters, under all locales. </F>

"Fredrik Lundh" <fredrik@pythonware.com> writes:
Thomas Heller wrote:
AFAIK, you can configure Python to use 16-bit or 32-bit Unicode chars, independent of the size of wchar_t. The HAVE_USABLE_WCHAR_T macro can be used by extension writers to determine whether Py_UNICODE is the same as wchar_t.
note that "usable" is more than just "same size"; it also implies that the widechar predicates (iswalnum etc.) work properly with Unicode characters, under all locales.
Ok, so who is going to collect the wisdom of this thread into the docs? Thomas

Fredrik Lundh wrote:
Thomas Heller wrote:
AFAIK, you can configure Python to use 16-bit or 32-bit Unicode chars, independent of the size of wchar_t. The HAVE_USABLE_WCHAR_T macro can be used by extension writers to determine whether Py_UNICODE is the same as wchar_t.
note that "usable" is more than just "same size"; it also implies that the widechar predicates (iswalnum etc.) work properly with Unicode characters, under all locales.
Only if you intend to use --with-wctypes; a configure option which will go away soon (for exactly the reason you are referring to: the widechar predicates don't work properly under all locales). -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, May 06 2005)
Python/Zope Consulting and Support ... http://www.egenix.com/ mxODBC.Zope.Database.Adapter ... http://zope.egenix.com/ mxODBC, mxDateTime, mxTextTools ... http://python.egenix.com/
::: Try mxODBC.Zope.DA for Windows,Linux,Solaris,FreeBSD for free ! ::::

Nicholas Bastin <nbastin@opnet.com> writes:
The current documentation for Py_UNICODE states:
"This type represents a 16-bit unsigned storage type which is used by Python internally as basis for holding Unicode ordinals. On platforms where wchar_t is available and also has 16-bits, Py_UNICODE is a typedef alias for wchar_t to enhance native platform compatibility. On all other platforms, Py_UNICODE is a typedef alias for unsigned short."
I propose changing this to:
"This type represents the storage type which is used by Python internally as the basis for holding Unicode ordinals. On platforms where wchar_t is available, Py_UNICODE is a typedef alias for wchar_t to enhance native platform compatibility.
This just isn't true. Have you read ./configure --help recently?
On all other platforms, Py_UNICODE is a typedef alias for unsigned short. Extension module developers should make no assumptions about the size of this type on any given platform."
I like this last sentence, though.
If no one has a problem with that, I'll make the change in CVS.
I have a problem with replacing one lie with another :) Cheers, mwh

On May 4, 2005, at 1:02 PM, Michael Hudson wrote:
Nicholas Bastin <nbastin@opnet.com> writes:
The current documentation for Py_UNICODE states:
"This type represents a 16-bit unsigned storage type which is used by Python internally as basis for holding Unicode ordinals. On platforms where wchar_t is available and also has 16-bits, Py_UNICODE is a typedef alias for wchar_t to enhance native platform compatibility. On all other platforms, Py_UNICODE is a typedef alias for unsigned short."
I propose changing this to:
"This type represents the storage type which is used by Python internally as the basis for holding Unicode ordinals. On platforms where wchar_t is available, Py_UNICODE is a typedef alias for wchar_t to enhance native platform compatibility.
This just isn't true. Have you read ./configure --help recently?
Ok, so the above statement is true if the user does not set --enable-unicode=ucs[24] (I was reading the wchar_t test in configure.in, and not the generated configure help). Alternatively, we shouldn't talk about the size at all, and just leave the first and last sentences: "This type represents the storage type which is used by Python internally as the basis for holding Unicode ordinals. Extension module developers should make no assumptions about the size of this type on any given platform." -- Nick
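As an aside for readers of this thread: the build variant being discussed could also be checked at runtime. A minimal sketch, assuming a 2.x-era interpreter where sys.maxunicode reflected the --enable-unicode choice (on Python 3.3 and later the distinction is gone and the value is always 0x10FFFF):

```python
import sys

# 0xFFFF   -> "narrow" build: UCS-2 storage, surrogate pairs for non-BMP
# 0x10FFFF -> "wide" build: UCS-4 storage, one unit per code point
if sys.maxunicode == 0xFFFF:
    build = 'narrow (UCS-2 storage)'
else:
    build = 'wide (UCS-4 storage)'
print(sys.maxunicode, build)
```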

Nicholas Bastin wrote:
"This type represents the storage type which is used by Python internally as the basis for holding Unicode ordinals. Extension module developers should make no assumptions about the size of this type on any given platform."
But people want to know "Is Python's Unicode 16-bit or 32-bit?" So the documentation should explicitly say "it depends". Regards, Martin

Martin v. Löwis wrote:
Nicholas Bastin wrote:
"This type represents the storage type which is used by Python internally as the basis for holding Unicode ordinals. Extension module developers should make no assumptions about the size of this type on any given platform."
But people want to know "Is Python's Unicode 16-bit or 32-bit?" So the documentation should explicitly say "it depends".
On a related note, it would help if the documentation provided a little more background on Unicode encodings. Specifically, that UCS-2 is not the same as UTF-16, even though their code units are both two bytes wide and most of the characters are the same. UTF-16 can encode characters outside the BMP (as four-byte surrogate pairs), while UCS-2 can't. A Py_UNICODE is either UCS-2 or UCS-4. It took me quite some time to figure that out so I could produce a patch [1]_ for PyXPCOM that fixes its unicode support. .. [1] https://bugzilla.mozilla.org/show_bug.cgi?id=281156 Shane
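The UCS-2 vs. UTF-16 distinction described here comes down to surrogate-pair arithmetic. A sketch (the helper name and the example code point U+12345 are illustrative, not from the original thread):

```python
def to_surrogate_pair(cp):
    """Split a non-BMP code point (above U+FFFF) into a UTF-16
    high/low surrogate pair -- the step UCS-2 has no answer for."""
    assert cp > 0xFFFF
    v = cp - 0x10000            # 20 bits of payload
    high = 0xD800 + (v >> 10)   # lead surrogate carries the top 10 bits
    low = 0xDC00 + (v & 0x3FF)  # trail surrogate carries the bottom 10
    return high, low

high, low = to_surrogate_pair(0x12345)   # U+12345 -> (0xD808, 0xDF45)
```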

On May 4, 2005, at 6:20 PM, Shane Hathaway wrote:
Martin v. Löwis wrote:
Nicholas Bastin wrote:
"This type represents the storage type which is used by Python internally as the basis for holding Unicode ordinals. Extension module developers should make no assumptions about the size of this type on any given platform."
But people want to know "Is Python's Unicode 16-bit or 32-bit?" So the documentation should explicitly say "it depends".
On a related note, it would help if the documentation provided a little more background on Unicode encodings. Specifically, that UCS-2 is not the same as UTF-16, even though their code units are both two bytes wide and most of the characters are the same. UTF-16 can encode characters outside the BMP (as four-byte surrogate pairs), while UCS-2 can't. A Py_UNICODE is either UCS-2 or UCS-4. It took me
I'm not sure the Python documentation is the place to teach someone about Unicode. ISO 10646 pretty clearly defines UCS-2 as only containing characters in the BMP (plane zero). On the other hand, I don't know why Python lets you choose UCS-2 anyhow, since it's almost always not what you want. -- Nick

Nicholas Bastin wrote:
On May 4, 2005, at 6:20 PM, Shane Hathaway wrote:
On a related note, it would help if the documentation provided a little more background on Unicode encodings. Specifically, that UCS-2 is not the same as UTF-16, even though their code units are both two bytes wide and most of the characters are the same. UTF-16 can encode characters outside the BMP (as four-byte surrogate pairs), while UCS-2 can't. A Py_UNICODE is either UCS-2 or UCS-4. It took me
I'm not sure the Python documentation is the place to teach someone about Unicode. ISO 10646 pretty clearly defines UCS-2 as only containing characters in the BMP (plane zero). On the other hand, I don't know why Python lets you choose UCS-2 anyhow, since it's almost always not what you want.
Then something in the Python docs ought to say why UCS-2 is not what you want. I still don't know; I've heard differing opinions on the subject. Some say you'll never need more than what UCS-2 provides. Is that incorrect? More generally, how should a non-unicode-expert writing Python extension code find out the minimum they need to know about unicode to use the Python unicode API? The API reference [1] ought to at least have a list of background links. I had to hunt everywhere. .. [1] http://docs.python.org/api/unicodeObjects.html Shane

Shane Hathaway wrote:
Then something in the Python docs ought to say why UCS-2 is not what you want. I still don't know; I've heard differing opinions on the subject. Some say you'll never need more than what UCS-2 provides. Is that incorrect?
That clearly depends on who "you" is.
More generally, how should a non-unicode-expert writing Python extension code find out the minimum they need to know about unicode to use the Python unicode API? The API reference [1] ought to at least have a list of background links. I had to hunt everywhere.
That, of course, depends on what your background is. Did you know what Latin-1 is, when you started? How it relates to code page 1252? What UTF-8 is? What an abstract character is, as opposed to a byte sequence on the one hand, and to a glyph on the other hand? Different people need different background, especially if they are writing different applications. Regards, Martin

Martin v. Löwis wrote:
Shane Hathaway wrote:
More generally, how should a non-unicode-expert writing Python extension code find out the minimum they need to know about unicode to use the Python unicode API? The API reference [1] ought to at least have a list of background links. I had to hunt everywhere.
That, of course, depends on what your background is. Did you know what Latin-1 is, when you started? How it relates to code page 1252? What UTF-8 is? What an abstract character is, as opposed to a byte sequence on the one hand, and to a glyph on the other hand?
Different people need different background, especially if they are writing different applications.
Yes, but the first few steps are the same for nearly everyone, and people need more help taking the first few steps. In particular:
- The Python docs link to unicode.org, but unicode.org is complicated, long-winded, and leaves many questions unanswered. The Wikipedia article is far better. I wish I had thought to look there instead. http://en.wikipedia.org/wiki/Unicode
- The docs should say what to expect to happen when a large unicode character winds up in a Py_UNICODE array. For instance, what is len(u'\U00012345')? 1 or 2? Does the answer depend on the UCS4 compile-time switch?
- The docs should help developers evaluate whether they need the UCS4 compile-time switch. Is UCS2 good enough for Asia? For math? For hieroglyphics? <wink>
Shane

Nicholas Bastin wrote:
On May 4, 2005, at 6:20 PM, Shane Hathaway wrote:
Nicholas Bastin wrote:
"This type represents the storage type which is used by Python internally as the basis for holding Unicode ordinals. Extension module developers should make no assumptions about the size of this type on any given platform."
But people want to know "Is Python's Unicode 16-bit or 32-bit?" So the documentation should explicitly say "it depends".
On a related note, it would help if the documentation provided a little more background on Unicode encodings. Specifically, that UCS-2 is not the same as UTF-16, even though their code units are both two bytes wide and most of the characters are the same. UTF-16 can encode characters outside the BMP (as four-byte surrogate pairs), while UCS-2 can't. A Py_UNICODE is either UCS-2 or UCS-4. It took me
I'm not sure the Python documentation is the place to teach someone about Unicode. ISO 10646 pretty clearly defines UCS-2 as only containing characters in the BMP (plane zero). On the other hand, I don't know why Python lets you choose UCS-2 anyhow, since it's almost always not what you want.
You've got that wrong: Python lets you choose UCS-4 - UCS-2 is the default. Note that Python's Unicode codecs UTF-8 and UTF-16 are surrogate aware and thus support non-BMP code points regardless of the build type: a UCS-2 build of Python will store a non-BMP code point as a UTF-16 surrogate pair in the Py_UNICODE buffer, while a UCS-4 build will store it as a single value. Decoding is surrogate aware too, so a UTF-16 surrogate pair in a UCS-2 build will get treated as a single Unicode code point. Ideally, the Python programmer should not really need to know all this, and I think we've achieved that up to a certain point (Unicode can be complicated - there's nothing to hide there). However, the C programmer using the Python C API to interface to some other Unicode implementation will need to know these details. -- Marc-Andre Lemburg
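The surrogate-aware codec behaviour described above can be observed from Python itself; a sketch in modern Python 3 (where the internal storage has since changed, but the codec behaviour still holds):

```python
s = '\U00012345'                 # one non-BMP code point
data = s.encode('utf-16-be')     # the codec emits a surrogate pair
assert len(data) == 4            # two 16-bit code units, not one
# Decoding is surrogate aware too: the pair comes back as a single
# code point rather than two lone surrogates.
assert data.decode('utf-16-be') == s
```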

On May 6, 2005, at 3:17 AM, M.-A. Lemburg wrote:
You've got that wrong: Python lets you choose UCS-4 - UCS-2 is the default.
Note that Python's Unicode codecs UTF-8 and UTF-16 are surrogate aware and thus support non-BMP code points regardless of the build type: a UCS-2 build of Python will store a non-BMP code point as a UTF-16 surrogate pair in the Py_UNICODE buffer, while a UCS-4 build will store it as a single value. Decoding is surrogate aware too, so a UTF-16 surrogate pair in a UCS-2 build will get treated as a single Unicode code point.
If this is the case, then we're clearly misleading users. If the configure script says UCS-2, then as a user I would assume that surrogate pairs would *not* be encoded, because I chose UCS-2, and it doesn't support that. I would assume that any UTF-16 string I would read would be transcoded into the internal type (UCS-2), and information would be lost. If this is not the case, then what does the configure option mean? -- Nick

On May 6, 2005, at 2:49 PM, Nicholas Bastin wrote:
If this is the case, then we're clearly misleading users. If the configure script says UCS-2, then as a user I would assume that surrogate pairs would *not* be encoded, because I chose UCS-2, and it doesn't support that. I would assume that any UTF-16 string I would read would be transcoded into the internal type (UCS-2), and information would be lost. If this is not the case, then what does the configure option mean?
It means all the string operations treat strings as if they were UCS-2, but that in actuality they are UTF-16. It's the same as with the Windows APIs and Java. That is, all string operations are essentially broken, because they're operating on encoded bytes, not characters, but claim to be operating on characters. James

After reading through the code and the comments in this thread, I propose the following in the documentation as the definition of Py_UNICODE: "This type represents the storage type which is used by Python internally as the basis for holding Unicode ordinals. Extension module developers should make no assumptions about the size or native encoding of this type on any given platform." The main point here is that extension developers cannot safely treat Py_UNICODE as a fixed-size type (which it appeared they could when the documentation stated that it was always 16 bits). I don't propose that we put this information in the doc, but the possible internal representations are:
- a 2-byte wchar_t or unsigned short, encoded as UTF-16
- a 4-byte wchar_t, encoded as UTF-32 (UCS-4)
If you do not explicitly set the configure option, you cannot guarantee which you will get. Python also does not normalize the byte order of unicode strings passed into it from C (via PyUnicode_EncodeUTF16, for example), so it is possible to have UTF-16LE and UTF-16BE strings in the system at the same time, which is a bit confusing. This may or may not be worth a mention in the doc (or a patch). -- Nick
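The byte-order point at the end of this proposal can be illustrated with the codec names Python exposes (a sketch; at the C level, PyUnicode_EncodeUTF16's byteorder argument selects among the same three behaviours):

```python
le = 'A'.encode('utf-16-le')   # b'A\x00'  -- little-endian code units
be = 'A'.encode('utf-16-be')   # b'\x00A'  -- big-endian code units
# The plain 'utf-16' codec prepends a BOM so the order is self-describing.
bom = 'A'.encode('utf-16')[:2]
assert le != be and bom in (b'\xff\xfe', b'\xfe\xff')
```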

On May 6, 2005, at 3:42 PM, James Y Knight wrote:
On May 6, 2005, at 2:49 PM, Nicholas Bastin wrote:
If this is the case, then we're clearly misleading users. If the configure script says UCS-2, then as a user I would assume that surrogate pairs would *not* be encoded, because I chose UCS-2, and it doesn't support that. I would assume that any UTF-16 string I would read would be transcoded into the internal type (UCS-2), and information would be lost. If this is not the case, then what does the configure option mean?
It means all the string operations treat strings as if they were UCS-2, but that in actuality they are UTF-16. It's the same as with the Windows APIs and Java. That is, all string operations are essentially broken, because they're operating on encoded bytes, not characters, but claim to be operating on characters.
Well, this is a completely separate issue/problem. The internal representation is UTF-16, and should be stated as such. If the built-in methods actually don't work with surrogate pairs, then that should be fixed. -- Nick

Nicholas Bastin wrote:
On May 6, 2005, at 3:42 PM, James Y Knight wrote:
It means all the string operations treat strings as if they were UCS-2, but that in actuality they are UTF-16. It's the same as with the Windows APIs and Java. That is, all string operations are essentially broken, because they're operating on encoded bytes, not characters, but claim to be operating on characters.
Well, this is a completely separate issue/problem. The internal representation is UTF-16, and should be stated as such. If the built-in methods actually don't work with surrogate pairs, then that should be fixed.
Wait... are you saying a Py_UNICODE array contains either UTF-16 or UTF-32 characters, but never UCS-2? That's a big surprise to me. I may need to change my PyXPCOM patch to fit this new understanding. I tried hard to not care how Python encodes unicode characters, but details like this are important when combining two frameworks with different unicode APIs. Shane

On May 6, 2005, at 5:21 PM, Shane Hathaway wrote:
Nicholas Bastin wrote:
On May 6, 2005, at 3:42 PM, James Y Knight wrote:
It means all the string operations treat strings as if they were UCS-2, but that in actuality they are UTF-16. It's the same as with the Windows APIs and Java. That is, all string operations are essentially broken, because they're operating on encoded bytes, not characters, but claim to be operating on characters.
Well, this is a completely separate issue/problem. The internal representation is UTF-16, and should be stated as such. If the built-in methods actually don't work with surrogate pairs, then that should be fixed.
Wait... are you saying a Py_UNICODE array contains either UTF-16 or UTF-32 characters, but never UCS-2? That's a big surprise to me. I may need to change my PyXPCOM patch to fit this new understanding. I tried hard to not care how Python encodes unicode characters, but details like this are important when combining two frameworks with different unicode APIs.
Yes. Well, inasmuch as a large part of UTF-16 directly overlaps UCS-2, sometimes unicode strings contain UCS-2 characters. However, characters which would not be legal in UCS-2 are still encoded properly in Python, in UTF-16. And yes, I feel your pain, that's how I *got* into this position. Mapping from external unicode types is an important aspect of writing extension modules, and the documentation does not help people trying to do this. The fact that Python's internal encoding is variable is a huge problem in and of itself, even if it were documented properly. This is why tools like Xerces and ICU will be happy to give you whatever form of unicode strings you want, but internally they always use UTF-16 - to avoid having to write two internal implementations of the same functionality. If you look up and down Objects/unicodeobject.c you'll see a fair amount of code written a couple of different ways (using #ifdef's) because of the variability in the internal representation. -- Nick

Nicholas Bastin wrote:
On May 6, 2005, at 5:21 PM, Shane Hathaway wrote:
Wait... are you saying a Py_UNICODE array contains either UTF-16 or UTF-32 characters, but never UCS-2? That's a big surprise to me. I may need to change my PyXPCOM patch to fit this new understanding. I tried hard to not care how Python encodes unicode characters, but details like this are important when combining two frameworks with different unicode APIs.
Yes. Well, inasmuch as a large part of UTF-16 directly overlaps UCS-2, sometimes unicode strings contain UCS-2 characters. However, characters which would not be legal in UCS-2 are still encoded properly in Python, in UTF-16.
And yes, I feel your pain, that's how I *got* into this position. Mapping from external unicode types is an important aspect of writing extension modules, and the documentation does not help people trying to do this. The fact that Python's internal encoding is variable is a huge problem in and of itself, even if it were documented properly. This is why tools like Xerces and ICU will be happy to give you whatever form of unicode strings you want, but internally they always use UTF-16 - to avoid having to write two internal implementations of the same functionality. If you look up and down Objects/unicodeobject.c you'll see a fair amount of code written a couple of different ways (using #ifdef's) because of the variability in the internal representation.
Ok. Thanks for helping me understand where Python is WRT unicode. I can work around the issues (or maybe try to help solve them) now that I know the current state of affairs. If Python correctly handled UTF-16 strings internally, we wouldn't need the UCS-4 configuration switch, would we? Shane

On May 6, 2005, at 7:05 PM, Shane Hathaway wrote:
Nicholas Bastin wrote:
On May 6, 2005, at 5:21 PM, Shane Hathaway wrote:
Wait... are you saying a Py_UNICODE array contains either UTF-16 or UTF-32 characters, but never UCS-2? That's a big surprise to me. I may need to change my PyXPCOM patch to fit this new understanding. I tried hard to not care how Python encodes unicode characters, but details like this are important when combining two frameworks with different unicode APIs.
Yes. Well, inasmuch as a large part of UTF-16 directly overlaps UCS-2, sometimes unicode strings contain UCS-2 characters. However, characters which would not be legal in UCS-2 are still encoded properly in Python, in UTF-16.
And yes, I feel your pain, that's how I *got* into this position. Mapping from external unicode types is an important aspect of writing extension modules, and the documentation does not help people trying to do this. The fact that Python's internal encoding is variable is a huge problem in and of itself, even if it were documented properly. This is why tools like Xerces and ICU will be happy to give you whatever form of unicode strings you want, but internally they always use UTF-16 - to avoid having to write two internal implementations of the same functionality. If you look up and down Objects/unicodeobject.c you'll see a fair amount of code written a couple of different ways (using #ifdef's) because of the variability in the internal representation.
Ok. Thanks for helping me understand where Python is WRT unicode. I can work around the issues (or maybe try to help solve them) now that I know the current state of affairs. If Python correctly handled UTF-16 strings internally, we wouldn't need the UCS-4 configuration switch, would we?
Personally I would rather see Python (3000) grow a new way to represent strings, more along the lines of the way it's typically done in Objective-C. I wrote a little bit about how that works here: http://bob.pythonmac.org/archives/2005/04/04/pyobjc-and-unicode/ Effectively, instead of having One And Only One Way To Store Text, you would have one and only one base class (say basestring) that has some "virtual" methods that know how to deal with text. Then, you have several concrete implementations that implement those functions for their particular backing store (and possibly encoding, but that might be implicit with the backing store, i.e. if it's an ASCII, UCS-2 or UCS-4 backing store). Currently we more or less have this at the Python level, between str and unicode, but certainly not at the C API level. -bob

Shane Hathaway wrote:
Ok. Thanks for helping me understand where Python is WRT unicode. I can work around the issues (or maybe try to help solve them) now that I know the current state of affairs. If Python correctly handled UTF-16 strings internally, we wouldn't need the UCS-4 configuration switch, would we?
Define "correctly". Python, in UCS-2 mode, will allow you to address individual surrogate codes, e.g. in indexing. So you get
>>> u"\U00012345"[0]
u'\ud808'
This will never work "correctly", and never should, because an efficient implementation isn't possible. If you want "safe" indexing and slicing, you need ucs4. Regards, Martin
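The indexing hazard described above can be reproduced at the code-unit level in modern Python 3, where a lone surrogate is an explicit decode error (a sketch; the two-byte slice stands in for indexing element 0 of a narrow build's Py_UNICODE buffer):

```python
data = '\U00012345'.encode('utf-16-be')   # b'\xd8\x08\xdf\x45'
half = data[:2]                           # just the high surrogate
try:
    half.decode('utf-16-be')
    split_is_valid = True
except UnicodeDecodeError:
    split_is_valid = False                # a lone surrogate is not a character
assert not split_is_valid
```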

Martin v. Löwis wrote:
Define "correctly". Python, in UCS-2 mode, will allow you to address individual surrogate codes, e.g. in indexing. So you get
u"\U00012345"[0]
When Python encodes characters internally in UCS-2, I would expect u"\U00012345" to produce a UnicodeError("character can not be encoded in UCS-2").
u'\ud808'
This will never work "correctly", and never should, because an efficient implementation isn't possible. If you want "safe" indexing and slicing, you need ucs4.
I agree that UCS4 is needed. There is a balancing act here; UTF-16 is widely used and takes less space, while UCS4 is easier to treat as an array of characters. Maybe we can have both: unicode objects start with an internal representation in UTF-16, but get promoted automatically to UCS4 when you index or slice them. The difference will not be visible to Python code. A compile-time switch will not be necessary. What do you think? Shane

Shane Hathaway wrote:
I agree that UCS4 is needed. There is a balancing act here; UTF-16 is widely used and takes less space, while UCS4 is easier to treat as an array of characters. Maybe we can have both: unicode objects start with an internal representation in UTF-16, but get promoted automatically to UCS4 when you index or slice them. The difference will not be visible to Python code. A compile-time switch will not be necessary. What do you think?
This breaks backwards compatibility with existing extension modules. Applications that do PyUnicode_AsUnicode get a Py_UNICODE*, and can use that to directly access the characters. Regards, Martin

Martin v. Löwis wrote:
Shane Hathaway wrote:
I agree that UCS4 is needed. There is a balancing act here; UTF-16 is widely used and takes less space, while UCS4 is easier to treat as an array of characters. Maybe we can have both: unicode objects start with an internal representation in UTF-16, but get promoted automatically to UCS4 when you index or slice them. The difference will not be visible to Python code. A compile-time switch will not be necessary. What do you think?
This breaks backwards compatibility with existing extension modules. Applications that do PyUnicode_AsUnicode get a Py_UNICODE*, and can use that to directly access the characters.
Py_UNICODE would always be 32 bits wide. PyUnicode_AsUnicode would cause the unicode object to be promoted automatically. Extensions that break as a result are technically broken already, aren't they? They're not supposed to depend on the size of Py_UNICODE. Shane

Shane Hathaway wrote:
Py_UNICODE would always be 32 bits wide.
This would break PythonWin, which relies on Py_UNICODE being the same as WCHAR_T. PythonWin is not broken, it just hasn't been ported to UCS-4, yet (and porting this is difficult and will cause a performance loss). Regards, Martin

Shane Hathaway wrote:
Martin v. Löwis wrote:
Shane Hathaway wrote:
I agree that UCS4 is needed. There is a balancing act here; UTF-16 is widely used and takes less space, while UCS4 is easier to treat as an array of characters. Maybe we can have both: unicode objects start with an internal representation in UTF-16, but get promoted automatically to UCS4 when you index or slice them. The difference will not be visible to Python code. A compile-time switch will not be necessary. What do you think?
This breaks backwards compatibility with existing extension modules. Applications that do PyUnicode_AsUnicode get a Py_UNICODE*, and can use that to directly access the characters.
Py_UNICODE would always be 32 bits wide. PyUnicode_AsUnicode would cause the unicode object to be promoted automatically. Extensions that break as a result are technically broken already, aren't they? They're not supposed to depend on the size of Py_UNICODE.
-1. You are free to compile Python with --enable-unicode=ucs4 if you prefer this setting. I don't see any reason why we should force users to invest 4 bytes of storage for each Unicode code point - 2 bytes work just fine and can represent all Unicode characters that are currently defined (using surrogates if necessary). As more and more Unicode objects are used in a process, choosing UCS2 vs. UCS4 does make a huge difference in terms of used memory. All this talk about UTF-16 vs. UCS-2 is not very useful and strikes me as purely academic. The reference to possible breakage by slicing a Unicode string and breaking a surrogate pair is valid, but the idea of UCS-4 being less prone to breakage is a myth: Unicode has many code points that are meant only for composition and don't have any standalone meaning, e.g. a combining acute accent (U+0301), yet they are perfectly valid code points - regardless of UCS-2 or UCS-4. It is easily possible to break such a combining sequence using slicing, so the most often presented argument for using UCS-4 instead of UCS-2 (+ surrogates) is rather weak if seen by daylight. Some may now say that combining sequences are not used all that often. However, they play a central role in Unicode normalization (http://www.unicode.org/reports/tr15/), which is needed whenever you want to semantically compare Unicode objects and are -- Marc-Andre Lemburg
Python/Zope Consulting and Support ... http://www.egenix.com/ mxODBC.Zope.Database.Adapter ... http://zope.egenix.com/ mxODBC, mxDateTime, mxTextTools ... http://python.egenix.com/
::: Try mxODBC.Zope.DA for Windows,Linux,Solaris,FreeBSD for free ! ::::

M.-A. Lemburg wrote:
Unicode has many code points that are meant only for composition and don't have any standalone meaning, e.g. a combining acute accent (U+0301), yet they are perfectly valid code points - regardless of UCS-2 or UCS-4. It is easily possible to break such a combining sequence using slicing, so the most often presented argument for using UCS-4 instead of UCS-2 (+ surrogates) is rather weak if seen by daylight.
I disagree. It is not just about slicing, it is also about searching for a character (either through the "in" operator, or through regular expressions). If you define an SRE character class, such a character class cannot hold a non-BMP character in UTF-16 mode, but it can in UCS-4 mode. Consequently, implementing XML's lexical classes (such as Name, NCName, etc.) is much easier in UCS-4 than it is in UCS-2. In this case, combining characters do not matter much, because the XML spec is defined in terms of Unicode coded characters, causing combining characters to appear as separate entities for lexical purposes (unlike half surrogates). Regards, Martin

Martin v. Löwis wrote:
M.-A. Lemburg wrote:
Unicode has many code points that are meant only for composition and don't have any standalone meaning, e.g. a combining acute accent (U+0301), yet they are perfectly valid code points - regardless of UCS-2 or UCS-4. It is easily possible to break such a combining sequence using slicing, so the most often presented argument for using UCS-4 instead of UCS-2 (+ surrogates) is rather weak if seen by daylight.
I disagree. It is not just about slicing, it is also about searching for a character (either through the "in" operator, or through regular expressions). If you define an SRE character class, such a character class cannot hold a non-BMP character in UTF-16 mode, but it can in UCS-4 mode. Consequently, implementing XML's lexical classes (such as Name, NCName, etc.) is much easier in UCS-4 than it is in UCS-2. In this case, combining characters do not matter much, because the XML spec is defined in terms of Unicode coded characters, causing combining characters to appear as separate entities for lexical purposes (unlike half surrogates).
Searching for a character is possible in UCS2 as well - even for surrogates with "in" now supporting multiple code point searches:
>>> len(u'\U00010000')
2
>>> u'\U00010000' in u'\U00010001\U00010002\U00010000 and some extra stuff'
True
>>> u'\U00010000' in u'\U00010001\U00010002\U00010003 and some extra stuff'
False
On sre character classes: I don't think that these provide a good approach to XML lexical classes - custom functions or methods or maybe even a codec mapping the characters to their XML lexical class are much more efficient in practice. -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, May 09 2005)

M.-A. Lemburg wrote:
On sre character classes: I don't think that these provide a good approach to XML lexical classes - custom functions or methods or maybe even a codec mapping the characters to their XML lexical class are much more efficient in practice.
That isn't my experience: functions that scan XML strings are much slower than regular expressions. I can't imagine how a custom codec could work, so I cannot comment on that. Regards, Martin

Martin v. Löwis wrote:
M.-A. Lemburg wrote:
On sre character classes: I don't think that these provide a good approach to XML lexical classes - custom functions or methods or maybe even a codec mapping the characters to their XML lexical class are much more efficient in practice.
That isn't my experience: functions that scan XML strings are much slower than regular expressions. I can't imagine how a custom codec could work, so I cannot comment on that.
If all you're interested in is the lexical class of the code points in a string, you could use such a codec to map each code point to a code point representing the lexical class. Then run re as usual on the mapped Unicode string. Since the indices of the matches found in the resulting string will be the same as in the original string, it's easy to extract the corresponding data from the original string. -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, May 10 2005)
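The mapping idea Marc-Andre describes can be sketched in pure Python. Note that the lexical classification below is a made-up stand-in (letters and underscore vs. everything else), not XML's real Name production, and `lexical_map` is a hypothetical helper invented for this illustration:

```python
import re

# Map each code point of the input to a single marker character: 'N' for a
# (simplified) name character, '-' for anything else. The mapped string has
# the same length as the original, so match indices line up exactly.
def lexical_map(s):
    return u''.join(u'N' if ch.isalpha() or ch == u'_' else u'-' for ch in s)

text = u'<doc attr="x"/>'
mapped = lexical_map(text)

# Run the regular expression on the mapped string, then use the match
# indices to extract the corresponding data from the original string.
m = re.search(u'N+', mapped)
name = text[m.start():m.end()]
```

Running this, `name` comes out as `doc`: the pattern matched the marker string, but the indices point back into the original text.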

M.-A. Lemburg wrote:
If all you're interested in is the lexical class of the code points in a string, you could use such a codec to map each code point to a code point representing the lexical class.
How can I efficiently implement such a codec? The whole point is doing that in pure Python (because if I had to write an extension module, I could just as well do the entire lexical analysis in C, without any regular expressions). Any kind of associative/indexed table for this task consumes a lot of memory, and takes quite some time to initialize. Regards, Martin

Martin v. Löwis wrote:
M.-A. Lemburg wrote:
If all you're interested in is the lexical class of the code points in a string, you could use such a codec to map each code point to a code point representing the lexical class.
How can I efficiently implement such a codec? The whole point is doing that in pure Python (because if I had to write an extension module, I could just as well do the entire lexical analysis in C, without any regular expressions).
You can write such a codec in Python, but C will of course be more efficient. The whole point is that for things that you will likely use a lot in your application, it is better to have one efficient implementation than dozens of duplicate re character sets embedded in compiled re-expressions.
Any kind of associative/indexed table for this task consumes a lot of memory, and takes quite some time to initialize.
Right - which is why an algorithmic approach will always be more efficient (in terms of speed/memory tradeoff) and these *can* support surrogates. -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, May 11 2005)

M.-A. Lemburg wrote:
All this talk about UTF-16 vs. UCS-2 is not very useful and strikes me as purely academic.
The reference to possible breakage by slicing a Unicode string and breaking a surrogate pair is valid, but the idea of UCS-4 being less prone to breakage is a myth:
Fair enough. The original point is that the documentation is unclear about what a Py_UNICODE[] contains. I deduced that it contains either UCS2 or UCS4 and implemented accordingly. Not only did I guess wrong, but others will probably guess wrong too. Something in the docs needs to spell this out. Shane

Shane Hathaway wrote:
Fair enough. The original point is that the documentation is unclear about what a Py_UNICODE[] contains. I deduced that it contains either UCS2 or UCS4 and implemented accordingly. Not only did I guess wrong, but others will probably guess wrong too. Something in the docs needs to spell this out.
Again, patches are welcome. I was opposed to Nick's proposed changes, since they explicitly said that you are not supposed to know what is in a Py_UNICODE. Integrating the essence of PEP 261 into the main documentation would be a worthwhile task. Regards, Martin

On May 8, 2005, at 1:44 PM, Martin v. Löwis wrote:
Shane Hathaway wrote:
Fair enough. The original point is that the documentation is unclear about what a Py_UNICODE[] contains. I deduced that it contains either UCS2 or UCS4 and implemented accordingly. Not only did I guess wrong, but others will probably guess wrong too. Something in the docs needs to spell this out.
Again, patches are welcome. I was opposed to Nick's proposed changes, since they explicitly said that you are not supposed to know what is in a Py_UNICODE. Integrating the essence of PEP 261 into the main documentation would be a worthwhile task.
You can't possibly assume you know specifically what's in a Py_UNICODE in any given python installation. If someone thinks this statement is untrue, please explain why. I realize you might not *want* that to be true, but it is. Users are free to configure their python however they desire, and if that means --enable-unicode=ucs2 on RH9, then that is perfectly valid. -- Nick

Nicholas Bastin wrote:
Again, patches are welcome. I was opposed to Nick's proposed changes, since they explicitly said that you are not supposed to know what is in a Py_UNICODE. Integrating the essence of PEP 261 into the main documentation would be a worthwhile task.
You can't possibly assume you know specifically what's in a Py_UNICODE in any given python installation. If someone thinks this statement is untrue, please explain why.
This is a different issue. Between saying "we don't know what installation xyz uses" and saying "we cannot say anything" is a wide range of things that you can truthfully say. Like "it can be either two bytes or four bytes" (but not one or three bytes), and so on. Also, for a given installation, you can find out by looking at sys.maxunicode from Python, or at Py_UNICODE_SIZE from C.
I realize you might not *want* that to be true, but it is. Users are free to configure their python however they desire, and if that means --enable-unicode=ucs2 on RH9, then that is perfectly valid.
Sure they can. Of course, that will mean they don't get a working _tkinter, unless they rebuild Tcl as well. Nevertheless, it is indeed likely that people do that. So if you want to support them, you need to distribute two versions of your binary module, or give them source code. Regards, Martin
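Martin's point about introspecting a given installation can be checked directly from Python; a minimal sketch (the result depends on how the interpreter was configured):

```python
import sys

# sys.maxunicode reveals the internal Unicode width of this build:
# 0xFFFF for a 2-byte (narrow) build, 0x10FFFF for a 4-byte (wide) build.
if sys.maxunicode == 0xFFFF:
    width = 2
else:
    width = 4
print("Py_UNICODE is %d bytes wide in this build" % width)
```

From C, the same information is available at compile time via the Py_UNICODE_SIZE macro.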

Nicholas Bastin wrote:
Well, this is a completely separate issue/problem. The internal representation is UTF-16, and should be stated as such. If the built-in methods actually don't work with surrogate pairs, then that should be fixed.
Yes to the former, no to the latter. PEP 261 specifies what should and shouldn't work. Regards, Martin

On May 6, 2005, at 8:11 PM, Martin v. Löwis wrote:
Nicholas Bastin wrote:
Well, this is a completely separate issue/problem. The internal representation is UTF-16, and should be stated as such. If the built-in methods actually don't work with surrogate pairs, then that should be fixed.
Yes to the former, no to the latter. PEP 261 specifies what should and shouldn't work.
This PEP has several textual errors and ambiguities (which, admittedly, may have been a necessary state given the unicode standard in 2001). However, putting that aside, I would recommend that: --enable-unicode=ucs2 be replaced with: --enable-unicode=utf16 and the docs be updated to reflect more accurately the variance of the internal storage type. I would also like the community to strongly consider standardizing on a single internal representation, but I will leave that fight for another day. -- Nick

Nicholas Bastin wrote:
--enable-unicode=ucs2
be replaced with:
--enable-unicode=utf16
and the docs be updated to reflect more accurately the variance of the internal storage type.
-1. This breaks existing documentation and usage, and provides only minimum value. With --enable-unicode=ucs2, Python's Py_UNICODE does *not* start supporting the full Unicode ccs the same way it supports UCS-2. Individual surrogate values remain accessible, and supporting non-BMP characters is left to the application (with the exception of the UTF-8 codec). Regards, Martin

On May 7, 2005, at 9:29 AM, Martin v. Löwis wrote:
Nicholas Bastin wrote:
--enable-unicode=ucs2
be replaced with:
--enable-unicode=utf16
and the docs be updated to reflect more accurately the variance of the internal storage type.
-1. This breaks existing documentation and usage, and provides only minimum value.
Have you been missing this conversation? UTF-16 is *WHAT PYTHON CURRENTLY IMPLEMENTS*. The current documentation is flat out wrong. Breaking that isn't a big problem in my book. It provides more than minimum value - it provides the truth.
With --enable-unicode=ucs2, Python's Py_UNICODE does *not* start supporting the full Unicode ccs the same way it supports UCS-2. Individual surrogate values remain accessible, and supporting non-BMP characters is left to the application (with the exception of the UTF-8 codec).
I can't understand what you mean by this. My point is that if you configure python to support UCS-2, then it SHOULD NOT support surrogate pairs. Supporting surrogate pairs is the purview of variable-width encodings, and UCS-2 is not among them. -- Nick

Nicholas Bastin wrote:
On May 7, 2005, at 9:29 AM, Martin v. Löwis wrote:
With --enable-unicode=ucs2, Python's Py_UNICODE does *not* start supporting the full Unicode ccs the same way it supports UCS-2. Individual surrogate values remain accessible, and supporting non-BMP characters is left to the application (with the exception of the UTF-8 codec).
I can't understand what you mean by this. My point is that if you configure python to support UCS-2, then it SHOULD NOT support surrogate pairs. Supporting surrogate pairs is the purview of variable-width encodings, and UCS-2 is not among them.
Surrogate pairs are only supported by the UTF-8 and UTF-16 codecs (and a few others), not the Python Unicode implementation itself - this treats surrogate code points just like any other Unicode code point. This allows us to be flexible and efficient in the implementation while guaranteeing the round-trip safety of Unicode data processed through Python. Your complaint about the documentation (which started this thread) is valid. However, I don't understand all the excitement about Py_UNICODE: if you don't like the way this Python typedef works, you are free to interface to Python using any of the supported encodings using PyUnicode_Encode() and PyUnicode_Decode(). I'm sure you'll find one that fits your needs and if not, you can even write your own codec and register it with Python, e.g. UTF-32 which we currently don't support ;-) Please upload your doc-patch to SF. Thanks, -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, May 07 2005)

On May 7, 2005, at 5:09 PM, M.-A. Lemburg wrote:
However, I don't understand all the excitement about Py_UNICODE: if you don't like the way this Python typedef works, you are free to interface to Python using any of the supported encodings using PyUnicode_Encode() and PyUnicode_Decode(). I'm sure you'll find one that fits your needs and if not, you can even write your own codec and register it with Python, e.g. UTF-32 which we currently don't support ;-)
My concerns about Py_UNICODE are completely separate from my frustration that the documentation is wrong about this type. It is much more important that the documentation be correct, first, and then we can discuss the reasons why it can be one of two values, rather than just a uniform value across all python implementations. This makes distributing binary extension modules hard. It has become clear to me that no one on this list gives a *%&^ about people attempting to distribute binary extension modules, or they would have cared about this problem, so I'll just drop that point. However, somehow, what keeps getting lost in the mix is that --enable-unicode=ucs2 is a lie, and we should change what this configure option says. Martin seems to disagree with me, for reasons that I don't understand. I would be fine with calling the option utf16, or just 2 and 4, but not ucs2, as that means things that Python doesn't intend it to mean. -- Nick

Nicholas Bastin wrote:
On May 7, 2005, at 5:09 PM, M.-A. Lemburg wrote:
However, I don't understand all the excitement about Py_UNICODE: if you don't like the way this Python typedef works, you are free to interface to Python using any of the supported encodings using PyUnicode_Encode() and PyUnicode_Decode(). I'm sure you'll find one that fits your needs and if not, you can even write your own codec and register it with Python, e.g. UTF-32 which we currently don't support ;-)
My concerns about Py_UNICODE are completely separate from my frustration that the documentation is wrong about this type. It is much more important that the documentation be correct, first, and then we can discuss the reasons why it can be one of two values, rather than just a uniform value across all python implementations. This makes distributing binary extension modules hard. It has become clear to me that no one on this list gives a *%&^ about people attempting to distribute binary extension modules, or they would have cared about this problem, so I'll just drop that point.
Actually, many of us know about the problem of having to ship UCS2 and UCS4 builds of binary extensions and the troubles this causes with users. It just adds one more dimension to the number of builds you have to make - one for the Python version, another for the platform and in the case of Linux another one for the Unicode width. Nowadays most Linux distros ship UCS4 builds (after RedHat started this quest), so things start to normalize again.
However, somehow, what keeps getting lost in the mix is that --enable-unicode=ucs2 is a lie, and we should change what this configure option says. Martin seems to disagree with me, for reasons that I don't understand. I would be fine with calling the option utf16, or just 2 and 4, but not ucs2, as that means things that Python doesn't intend it to mean.
It's not a lie: the Unicode implementation does work with UCS2 code points (surrogate values are Unicode code points as well - they happen to live in a special zone of the BMP). Only the codecs add support for surrogates in a way that allows round-trip safety regardless of whether you used UCS2 or UCS4 as compile time option. -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, May 07 2005)

On May 7, 2005, at 5:09 PM, M.-A. Lemburg wrote:
Please upload your doc-patch to SF.
All of my proposals for what to change the documentation to have been shot down by Martin. If someone has better verbiage that they'd like to see, I'd be perfectly happy to patch the doc. My last suggestion was: "This type represents the storage type which is used by Python internally as the basis for holding Unicode ordinals. Extension module developers should make no assumptions about the size of this type on any given platform." -- Nick

Nicholas Bastin wrote:
All of my proposals for what to change the documentation to have been shot down by Martin. If someone has better verbiage that they'd like to see, I'd be perfectly happy to patch the doc.
I don't look into the specific wording - you speak English much better than I do. What I care about is that this part of the documentation should be complete and precise. I.e. statements like "should not make assumptions" might be fine, as long as they are still followed by a precise description of what the code currently does. So it should mention that the representation can be either 2 or 4 bytes, that the strings "ucs2" and "ucs4" can be used to select one of them, that it is always 2 bytes on Windows, that 2 bytes means that non-BMP characters can be represented as surrogate pairs, and so on. Regards, Martin

On May 8, 2005, at 5:28 AM, Martin v. Löwis wrote:
Nicholas Bastin wrote:
All of my proposals for what to change the documentation to have been shot down by Martin. If someone has better verbiage that they'd like to see, I'd be perfectly happy to patch the doc.
I don't look into the specific wording - you speak English much better than I do. What I care about is that this part of the documentation should be complete and precise. I.e. statements like "should not make assumptions" might be fine, as long as they are still followed by a precise description of what the code currently does. So it should mention that the representation can be either 2 or 4 bytes, that the strings "ucs2" and "ucs4" can be used to select one of them, that it is always 2 bytes on Windows, that 2 bytes means that non-BMP characters can be represented as surrogate pairs, and so on.
It's not always 2 bytes on Windows. Users can alter the config options (and not unreasonably so, btw, on 64-bit Windows platforms). This goes to the issue that I think people don't understand: we have to assume that some users will build their own Python. This will result in 2-byte Pythons on RHL9 and 4-byte Pythons on Windows, both of which have already been claimed in this discussion not to happen, which is untrue. You can't build a binary extension module on Windows and assume that Py_UNICODE is 2 bytes, because that's not enforced in any way. The same is true for 4-byte Py_UNICODE on RHL9. -- Nick

Nicholas Bastin wrote:
It's not always 2 bytes on Windows. Users can alter the config options (and not unreasonably so, btw, on 64-bit windows platforms).
Did you try that? I'm not sure it even builds when you do so, but if it does, you will lose the "mbcs" codec, and the ability to use Unicode strings as file names. Without the "mbcs" codec, I would expect that quite a lot of the Unicode stuff breaks.
You can't build a binary extension module on windows and assume that Py_UNICODE is 2 bytes, because that's not enforced in any way. The same is true for 4-byte Py_UNICODE on RHL9.
Depends on how much force you want to see. That the official pydotorg Windows installer python24.dll uses a 2-byte Unicode, and that a lot of things break if you change Py_UNICODE to four bytes on Windows (including PythonWin) is a pretty strong guarantee that you won't see a Windows Python build with UCS-4 for quite some time. Regards, Martin

Nicholas Bastin wrote:
-1. This breaks existing documentation and usage, and provides only minimum value.
Have you been missing this conversation? UTF-16 is *WHAT PYTHON CURRENTLY IMPLEMENTS*. The current documentation is flat out wrong. Breaking that isn't a big problem in my book.
The documentation I refer to is the one that says the equivalent of 'configure takes an option --enable-unicode, with the possible values "ucs2", "ucs4", "yes" (equivalent to no argument), and "no" (equivalent to --disable-unicode)' *THIS* documentation would break. This documentation is factually correct at the moment (configure does indeed take these options), and people rely on them in automatic build processes. Changing configure options should not be taken lightly, even if they may result from a "wrong mental model". By that rule, --with-suffix should be renamed to --enable-suffix, --with-doc-strings to --enable-doc-strings, and so on. However, the nitpicking that underlies the desire to rename the option should be ignored in favour of backwards compatibility. Changing the documentation that goes along with the option would be fine.
It provides more than minimum value - it provides the truth.
No. It is just a command line option. It could be named --enable-quirk=(quork|quark), and would still select UTF-16. Command line options provide no truth - they don't even provide statements.
With --enable-unicode=ucs2, Python's Py_UNICODE does *not* start supporting the full Unicode ccs the same way it supports UCS-2.
I can't understand what you mean by this. My point is that if you configure python to support UCS-2, then it SHOULD NOT support surrogate pairs. Supporting surrogate pairs is the purview of variable-width encodings, and UCS-2 is not among them.
So you suggest to renaming it to --enable-unicode=utf16, right? My point is that a Unicode type with UTF-16 would correctly support all assigned Unicode code points, which the current 2-byte implementation doesn't. So --enable-unicode=utf16 would *not* be the truth. Regards, Martin

On May 8, 2005, at 5:15 AM, Martin v. Löwis wrote:
'configure takes an option --enable-unicode, with the possible values "ucs2", "ucs4", "yes" (equivalent to no argument), and "no" (equivalent to --disable-unicode)'
*THIS* documentation would break. This documentation is factually correct at the moment (configure does indeed take these options), and people rely on them in automatic build processes. Changing configure options should not be taken lightly, even if they may result from a "wrong mental model". By that rule, --with-suffix should be renamed to --enable-suffix, --with-doc-strings to --enable-doc-strings, and so on. However, the nitpicking that underlies the desire to rename the option should be ignored in favour of backwards compatibility.
Changing the documentation that goes along with the option would be fine.
That is exactly what I proposed originally, which you shot down. Please actually read the contents of my messages. What I said was "change the configure option and related documentation".
It provides more than minimum value - it provides the truth.
No. It is just a command line option. It could be named --enable-quirk=(quork|quark), and would still select UTF-16. Command line options provide no truth - they don't even provide statements.
Wow, what an inane way of looking at it. I don't know what world you live in, but in my world, users read the configure options and suppose that they mean something. In fact, they *have* to go off on their own to assume something, because even the documentation you refer to above doesn't say what happens if they choose UCS-2 or UCS-4. A logical assumption would be that python would use those CEFs internally, and that would be incorrect.
With --enable-unicode=ucs2, Python's Py_UNICODE does *not* start supporting the full Unicode ccs the same way it supports UCS-2.
I can't understand what you mean by this. My point is that if you configure python to support UCS-2, then it SHOULD NOT support surrogate pairs. Supporting surrogate pairs is the purview of variable-width encodings, and UCS-2 is not among them.
So you suggest to renaming it to --enable-unicode=utf16, right? My point is that a Unicode type with UTF-16 would correctly support all assigned Unicode code points, which the current 2-byte implementation doesn't. So --enable-unicode=utf16 would *not* be the truth.
The current implementation supports the UTF-16 CEF. i.e., it supports a variable width encoding form capable of representing all of the unicode space using surrogate pairs. Please point out a code point that the current 2 byte implementation does not support, either directly, or through the use of surrogate pairs. -- Nick

Nicholas Bastin wrote:
Changing the documentation that goes along with the option would be fine.
That is exactly what I proposed originally, which you shot down. Please actually read the contents of my messages. What I said was "change the configure option and related documentation".
What I mean is "change just the documentation, do not change the configure option". This seems to be different from your proposal, which I understand as "change both the configure option and the documentation".
Wow, what an inane way of looking at it. I don't know what world you live in, but in my world, users read the configure options and suppose that they mean something. In fact, they *have* to go off on their own to assume something, because even the documentation you refer to above doesn't say what happens if they choose UCS-2 or UCS-4. A logical assumption would be that python would use those CEFs internally, and that would be incorrect.
Certainly. That's why the documentation should be improved. Changing the option breaks existing packaging systems, and should not be done lightly.
The current implementation supports the UTF-16 CEF. i.e., it supports a variable width encoding form capable of representing all of the unicode space using surrogate pairs. Please point out a code point that the current 2 byte implementation does not support, either directly, or through the use of surrogate pairs.
Try to match regular expression classes for non-BMP characters:
>>> re.match(u"[\u1234]", u"\u1234").group()
u'\u1234'
works fine, but
>>> re.match(u"[\U00011234]", u"\U00011234").group()
u'\ud804'
gives strange results. Regards, Martin

On May 9, 2005, at 12:59 AM, Martin v. Löwis wrote:
Wow, what an inane way of looking at it. I don't know what world you live in, but in my world, users read the configure options and suppose that they mean something. In fact, they *have* to go off on their own to assume something, because even the documentation you refer to above doesn't say what happens if they choose UCS-2 or UCS-4. A logical assumption would be that python would use those CEFs internally, and that would be incorrect.
Certainly. That's why the documentation should be improved. Changing the option breaks existing packaging systems, and should not be done lightly.
I'm perfectly happy to continue supporting --enable-unicode=ucs2, but not displaying it as an option. Is that acceptable to you? -- Nick

Nicholas Bastin wrote:
I'm perfectly happy to continue supporting --enable-unicode=ucs2, but not displaying it as an option. Is that acceptable to you?
It is. Somewhere, the code should say that this is for backwards compatibility, of course (so people won't remove it too easily; if there is a plan for obsoleting this setting, it should be done in a phased manner). Regards, Martin

On May 10, 2005, at 2:48 PM, Nicholas Bastin wrote:
On May 9, 2005, at 12:59 AM, Martin v. Löwis wrote:
Wow, what an inane way of looking at it. I don't know what world you live in, but in my world, users read the configure options and suppose that they mean something. In fact, they *have* to go off on their own to assume something, because even the documentation you refer to above doesn't say what happens if they choose UCS-2 or UCS-4. A logical assumption would be that python would use those CEFs internally, and that would be incorrect.
Certainly. That's why the documentation should be improved. Changing the option breaks existing packaging systems, and should not be done lightly.
I'm perfectly happy to continue supporting --enable-unicode=ucs2, but not displaying it as an option. Is that acceptable to you?
If you're going to call python's implementation UTF-16, I'd consider all these very serious deficiencies:
- unicodedata doesn't work for 2-char strings containing a surrogate pair, nor for integers. Therefore it is impossible to get any data on chars > 0xFFFF.
- there are no methods for determining if something is a surrogate pair and turning it into an integer code point.
- Given that unicodedata doesn't work, I doubt also that .toupper/etc. work right on surrogate pairs, although I haven't tested.
- As has been noted before, the regexp engine doesn't properly treat surrogate pairs as a single unit.
- Is there a method that is like unichr but that will work for code points > 0xFFFF?
I'm sure there's more as well. I think it's a mistake to consider python to be implementing UTF-16 just because it properly encodes/decodes surrogate pairs in the UTF-8 codec. James
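The stdlib offered no such helpers at the time of this thread. A minimal sketch of the arithmetic James is asking for (both function names here are invented for this illustration):

```python
def is_surrogate_pair(hi, lo):
    # True if hi/lo are a UTF-16 high/low surrogate code unit pair.
    return 0xD800 <= ord(hi) <= 0xDBFF and 0xDC00 <= ord(lo) <= 0xDFFF

def combine_surrogates(hi, lo):
    # Combine a surrogate pair into the single code point it represents:
    # 0x10000 plus 10 bits from the high unit and 10 from the low unit.
    assert is_surrogate_pair(hi, lo)
    return 0x10000 + ((ord(hi) - 0xD800) << 10) + (ord(lo) - 0xDC00)
```

For example, the pair U+D800/U+DC00 combines to U+10000, and U+DBFF/U+DFFF combines to U+10FFFF, the top of the Unicode range.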

On May 10, 2005, at 7:34 PM, James Y Knight wrote:
If you're going to call python's implementation UTF-16, I'd consider all these very serious deficiencies:
The --enable-unicode option declares a character encoding form (CEF), not a character encoding scheme (CES). It is unfortunate that UTF-16 is a valid option for both of these things, but supporting the CEF does not imply supporting the CES. All of your complaints would be valid if we claimed that Python supported the UTF-16 CES, but the language itself only needs to support a CEF that everyone understands how to work with. It is widely recognized, I believe, that the general level of unicode support exposed to Python users is somewhat lacking when it comes to high surrogate pairs. I'd love for us to fix that problem, or, better yet, integrate something like ICU, but this isn't that discussion.
- unicodedata doesn't work for 2-char strings containing a surrogate pair, nor integers. Therefore it is impossible to get any data on chars > 0xFFFF.
- there are no methods for determining if something is a surrogate pair and turning it into an integer codepoint.
- Given that unicodedata doesn't work, I doubt also that .toupper/etc work right on surrogate pairs, although I haven't tested.
- As has been noted before, the regexp engine doesn't properly treat surrogate pairs as a single unit.
- Is there a method that is like unichr but that will work for codepoints > 0xFFFF?
I'm sure there's more as well. I think it's a mistake to consider python to be implementing UTF-16 just because it properly encodes/decodes surrogate pairs in the UTF-8 codec.
Users should understand (and we should write doc to help them understand), that using 2-byte wide unicode support in Python means that all operations will be done on Code Units, and not Code Points. Once you understand this, you can work with the data that is given to you, although it's certainly not as nice as what you would have come to expect from Python. (For example, you can correctly construct a regexp to find the surrogate pair you're looking for by using the constituent code units). -- Nick
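Nick's point about working with constituent code units can be made concrete. Below is a minimal sketch (the helper name is my own, not from the thread) of splitting a supplementary-plane code point into the two UTF-16 code units that a 2-byte-wide build stores; a regexp built from these two units will find the character on such a build:

```python
def to_surrogate_pair(cp):
    """Split a code point above the BMP into its two UTF-16 code units."""
    if not 0xFFFF < cp <= 0x10FFFF:
        raise ValueError("expected a supplementary-plane code point")
    cp -= 0x10000
    high = 0xD800 + (cp >> 10)    # high (lead) surrogate
    low = 0xDC00 + (cp & 0x3FF)   # low (trail) surrogate
    return high, low

# U+1D11E (MUSICAL SYMBOL G CLEF) is stored as the code units D834 DD1E.
pair = to_surrogate_pair(0x1D11E)
```

On a narrow build, searching for the two-code-unit sequence D834 DD1E therefore matches what a user thinks of as the single character U+1D11E.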

Nicholas Bastin wrote:
If this is the case, then we're clearly misleading users. If the configure script says UCS-2, then as a user I would assume that surrogate pairs would *not* be encoded, because I chose UCS-2, and it doesn't support that.
What do you mean by that? That the interpreter crashes if you try to store a low surrogate into a Py_UNICODE?
I would assume that any UTF-16 string I would read would be transcoded into the internal type (UCS-2), and information would be lost. If this is not the case, then what does the configure option mean?
It tells you whether you have the two-octet form of the Universal Character Set, or the four-octet form. Regards, Martin

On May 6, 2005, at 7:43 PM, Martin v. Löwis wrote:
Nicholas Bastin wrote:
If this is the case, then we're clearly misleading users. If the configure script says UCS-2, then as a user I would assume that surrogate pairs would *not* be encoded, because I chose UCS-2, and it doesn't support that.
What do you mean by that? That the interpreter crashes if you try to store a low surrogate into a Py_UNICODE?
What I mean is pretty clear. UCS-2 does *NOT* support surrogate pairs. If it did, it would be called UTF-16. If Python really supported UCS-2, then surrogate pairs from UTF-16 inputs would either get turned into two garbage characters, or the "I couldn't transcode this" UCS-2 code point (I don't remember which one that is off the top of my head).
I would assume that any UTF-16 string I would read would be transcoded into the internal type (UCS-2), and information would be lost. If this is not the case, then what does the configure option mean?
It tells you whether you have the two-octet form of the Universal Character Set, or the four-octet form.
It would, if that were the case, but it's not. Setting UCS-2 in the configure script really means UTF-16, and as such, the documentation should reflect that. -- Nick

Nicholas Bastin wrote:
What I mean is pretty clear. UCS-2 does *NOT* support surrogate pairs. If it did, it would be called UTF-16. If Python really supported UCS-2, then surrogate pairs from UTF-16 inputs would either get turned into two garbage characters, or the "I couldn't transcode this" UCS-2 code point (I don't remember which one that is off the top of my head).
OTOH, if Python really supported UTF-16, then unichr(0x10000) would work, and len(u"\U00010000") would be 1. It is primarily just the UTF-8 codec which supports UTF-16. Regards, Martin
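Martin's two tests boil down to counting code units versus code points. A small illustrative helper (not an API from the thread) computes the UTF-16 code-unit count, which is what len() reports on a 2-byte build, while a 4-byte build reports the code-point count:

```python
def utf16_length(s):
    """Number of UTF-16 code units needed to store s: characters
    above U+FFFF take two code units (a surrogate pair)."""
    return sum(2 if ord(c) > 0xFFFF else 1 for c in s)

# A UCS-4 build reports len(u"\U00010000") == 1, but the string
# still needs two UTF-16 code units.
units = utf16_length("\U00010000")
```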

On May 6, 2005, at 3:17 AM, M.-A. Lemburg wrote:
You've got that wrong: Python lets you choose UCS-4 - UCS-2 is the default.
No, that's not true. Python lets you choose UCS-4 or UCS-2. What the default is depends on your platform. If you run raw configure, some systems will choose UCS-4, and some will choose UCS-2. This is how the conversation came about in the first place - running ./configure on RHL9 gives you UCS-4. -- Nick

Nicholas Bastin wrote:
On May 6, 2005, at 3:17 AM, M.-A. Lemburg wrote:
You've got that wrong: Python lets you choose UCS-4 - UCS-2 is the default.
No, that's not true. Python lets you choose UCS-4 or UCS-2. What the default is depends on your platform. If you run raw configure, some systems will choose UCS-4, and some will choose UCS-2. This is how the conversation came about in the first place - running ./configure on RHL9 gives you UCS-4.
Hmm, looking at the configure.in script, it seems you're right. I wonder why this weird dependency on TCL was added. This was certainly not intended (see the comment):

    if test $enable_unicode = yes
    then
        # Without any arguments, Py_UNICODE defaults to two-byte mode
        case "$have_ucs4_tcl" in
        yes) enable_unicode="ucs4" ;;
        *)   enable_unicode="ucs2" ;;
        esac
    fi

The annotation suggests that Martin added this. Martin, could you please explain why the whole *Python system* should depend on what Unicode type some installed *TCL system* is using? I fail to see the connection. Thanks, -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, May 06 2005)
Python/Zope Consulting and Support ... http://www.egenix.com/ mxODBC.Zope.Database.Adapter ... http://zope.egenix.com/ mxODBC, mxDateTime, mxTextTools ... http://python.egenix.com/
::: Try mxODBC.Zope.DA for Windows,Linux,Solaris,FreeBSD for free ! ::::

Martin v. Löwis wrote:
M.-A. Lemburg wrote:
Hmm, looking at the configure.in script, it seems you're right. I wonder why this weird dependency on TCL was added.
If Python is configured for UCS-2, and Tcl for UCS-4, then Tkinter would not work out of the box. Hence the weird dependency.
I believe that it would be more appropriate to adjust the _tkinter module to adapt to the TCL Unicode size rather than forcing the complete Python system to adapt to TCL - I don't really see the point in an optional extension module defining the default for the interpreter core. At the very least, this should be a user-controlled option. Otherwise, we might as well use sizeof(wchar_t) as the basis for the default Unicode size. This at least would be a much more reasonable choice than whatever TCL uses. - Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, May 07 2005)

M.-A. Lemburg wrote:
I believe that it would be more appropriate to adjust the _tkinter module to adapt to the TCL Unicode size rather than forcing the complete Python system to adapt to TCL - I don't really see the point in an optional extension module defining the default for the interpreter core.
_tkinter currently supports, for a UCS-2 Tcl, both UCS-2 and UCS-4 Python. For an UCS-4 Tcl, it requires Python also to be UCS-4. Contributions to support the missing case are welcome.
At the very least, this should be a user controlled option.
It is: by passing --enable-unicode=ucs2, you can force Python to use UCS-2 even if Tcl is UCS-4, with the result that _tkinter cannot be built anymore (and compilation fails with an #error).
Otherwise, we might as well use sizeof(wchar_t) as basis for the default Unicode size. This at least, would be a much more reasonable choice than whatever TCL uses.
The goal of the build process is to provide as many extension modules as possible (given the set of headers and libraries installed), and _tkinter is an important extension module because IDLE depends on it. Regards, Martin

[Python used to always default to UCS2-Unicode builds; this was changed to default to whatever a possibly installed TCL system is using - hiding the choice from the user and in effect removing the notion of having a Python Unicode default configuration] Martin v. Löwis wrote:
M.-A. Lemburg wrote:
I believe that it would be more appropriate to adjust the _tkinter module to adapt to the TCL Unicode size rather than forcing the complete Python system to adapt to TCL - I don't really see the point in an optional extension module defining the default for the interpreter core.
_tkinter currently supports, for a UCS-2 Tcl, both UCS-2 and UCS-4 Python. For an UCS-4 Tcl, it requires Python also to be UCS-4. Contributions to support the missing case are welcome.
I'm no expert for _tkinter and don't use it, so I'm the wrong one to ask :-) However, making Python's own default depend on some 3rd party software on the machines is bad design.
At the very least, this should be a user controlled option.
It is: by passing --enable-unicode=ucs2, you can force Python to use UCS-2 even if Tcl is UCS-4, with the result that _tkinter cannot be built anymore (and compilation fails with an #error).
I think we should remove the defaulting to whatever TCL uses and instead warn the user about a possible problem in case TCL is found and uses a Unicode width which is incompatible with Python's choice. The user can then decide whether she finds _tkinter important enough to turn away from the standard Python default Unicode width or not (with all the consequences that go with it, e.g. memory bloat, problems installing binaries precompiled for standard Python builds, etc.). This should definitely *not* be done automatically. -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, May 09 2005)

M.-A. Lemburg wrote:
I think we should remove the defaulting to whatever TCL uses and instead warn the user about a possible problem in case TCL is found and uses a Unicode width which is incompatible with Python's choice.
-1. Regards, Martin

Martin v. Löwis wrote:
M.-A. Lemburg wrote:
I think we should remove the defaulting to whatever TCL uses and instead warn the user about a possible problem in case TCL is found and uses a Unicode width which is incompatible with Python's choice.
-1.
Martin, please reconsider... the choice is between:

a) We have a cross-platform default Unicode width setting of UCS2.

b) The default Unicode width is undefined and the only thing we can tell the user is: run the configure script and then try the interpreter to check whether you've got a UCS2 or UCS4 build.

Option b) is what the current build system implements and causes problems, since the binary interface of the interpreter changes depending on the width of Py_UNICODE, making UCS2 and UCS4 builds incompatible. I want to change the --enable-unicode switch back to always use UCS2 as default and add a new option value "tcl" which then triggers the behavior you've added to support _tkinter, ie. --enable-unicode=tcl bases the decision to use UCS2 or UCS4 on the installed TCL interpreter (if there is one). -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, May 10 2005)

M.-A. Lemburg wrote:
Martin, please reconsider... the choice is between:
The point is that this all was discussed, and decided the other way 'round. There is no point in going back and forth between the two choices: http://mail.python.org/pipermail/python-dev/2003-June/036461.html If we remove the code, people will *again* report that _tkinter stops building on Redhat (see #719880). I see no value in breaking what works now.
a) We have a cross-platform default Unicode width setting of UCS2.
It is hardly the cross-platform default anymore. Many installations on Linux are built as UCS-4 now - no matter what configure does.
b) The default Unicode width is undefined and the only thing we can tell the user is:
Run the configure script and then try the interpreter to check whether you've got a UCS2 or UCS4 build.
It's not at all undefined. There is a precise, deterministic, repeatable algorithm that determines the default, and if people want to know, we can tell them.
I want to change the --enable-unicode switch back to always use UCS2 as default and add a new option value "tcl" which then triggers the behavior you've added to support _tkinter, ie.
--enable-unicode=tcl
bases the decision to use UCS2 or UCS4 on the installed TCL interpreter (if there is one).
Please don't - unless you also go back and re-open the bug reports, change the documentation, tell the Linux packagers that settings have changed, and so on. Why deliberately break what currently works? Regards, Martin

Martin v. Löwis wrote:
M.-A. Lemburg wrote:
Martin, please reconsider... the choice is between:
The point is that this all was discussed, and decided the other way 'round. There is no point in going back and forth between the two choices:
http://mail.python.org/pipermail/python-dev/2003-June/036461.html
So you call two emails to the python-dev list a discussion? AFAICT, only Barry mildly suggested having an automatic --enable-unicode=ucs4 switch, and then Jeff Epler provided the patch, including the warning that the patch wasn't tested and that it does not attempt to make a more educated guess as to where to find tcl.h (unlike what setup.py does in order to build _tkinter.c).
If we remove the code, people will *again* report that _tkinter stops building on Redhat (see #719880). I see no value in breaking what works now.
I'm not breaking anything, I'm just correcting the way things have to be configured in an effort to bring back the cross-platform configure default.
a) We have a cross-platform default Unicode width setting of UCS2.
It is hardly the cross-platform default anymore. Many installations on Linux are built as UCS-4 now - no matter what configure does.
I'm talking about the *configure* default, not the default installation you find on any particular platform (this remains a platform decision to be made by the packagers).
b) The default Unicode width is undefined and the only thing we can tell the user is:
Run the configure script and then try the interpreter to check whether you've got a UCS2 or UCS4 build.
It's not at all undefined. There is a precise, deterministic, repeatable algorithm that determines the default, and if people want to know, we can tell them.
The outcome of the configure tests is bound to be highly random across installations, since it depends on whether TCL was installed on the system and how it was configured. Furthermore, if a user wants to build against a different TCL version, configure won't detect this change, since it's setup.py that does the _tkinter.c compilation.

The main point is that we can no longer tell users: if you run configure without any further options, you will get a UCS2 build of Python. I want to restore this fact, which was true before Jeff's patch was applied. Telling users to look at the configure script printout to determine whether they have just built a UCS2 or UCS4 version is just not right given its implications.
I want to change the --enable-unicode switch back to always use UCS2 as default and add a new option value "tcl" which then triggers the behavior you've added to support _tkinter, ie.
--enable-unicode=tcl
bases the decision to use UCS2 or UCS4 on the installed TCL interpreter (if there is one).
Please don't - unless you also go back and re-open the bug reports, change the documentation, tell the Linux packagers that settings have changed, and so on.
Why deliberately break what currently works?
It will continue to work - the only change, if any, is to add --enable-unicode=tcl or --enable-unicode=ucs4 (if you know that TCL uses UCS4) to your configure setup. The --enable-unicode=ucs4 configure setting is part of RedHat and SuSE already, so there won't be any changes necessary.

BTW, SuSE builds TCL using UCS2, which seems to be the correct choice given this comment in tcl.h:

    * At this time UCS-2 mode is the default and recommended mode.
    * UCS-4 is experimental and not recommended. It works for the core,
    * but most extensions expect UCS-2.

and _tkinter.c built for a UCS4 Python does work with a UCS2 TCL.

About the documentation: this still refers to the UCS2 default build and will need to be updated to also mention UCS4 anyway.

About the bug reports: feel free to assign them to me. We can have a canned response if necessary, but I doubt that it will be necessary. Explicit is better than implicit :-) -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, May 13 2005)

M.-A. Lemburg wrote:
I'm not breaking anything, I'm just correcting the way things have to be configured in an effort to bring back the cross-platform configure default.
Your proposed change will break the build of Python on Redhat/Fedora systems.
I'm talking about the *configure* default, not the default installation you find on any particular platform (this remains a platform decision to be made by the packagers).
Why is it good to have such a default? Why is that so good that its better than having Tkinter work by default?
The main point is that we can no longer tell users: if you run configure without any further options, you will get a UCS2 build of Python.
It's not a matter of telling the users "no longer". "We" currently don't tell that in any documentation; if you had been telling users that, you were wrong. ./configure --help says that the default for --enable-unicode is "yes".
I want to restore this fact which was true before Jeff's patch was applied.
I understand that you want that. I'm opposed.
Telling users to look at the configure script printout to determine whether they have just built a UCS2 or UCS4 is just not right given its implications.
Right. We should tell them what the procedure is that is used.
It will continue to work - the only change, if any, is to add --enable-unicode=tcl or --enable-unicode=ucs4 (if you know that TCL uses UCS4) to your configure setup. The --enable-unicode=ucs4 configure setting is part of RedHat and SuSE already, so there won't be any changes necessary.
Yes, but users of these systems need to adjust. Regards, Martin

Martin v. Löwis wrote:
M.-A. Lemburg wrote:
I'm not breaking anything, I'm just correcting the way things have to be configured in an effort to bring back the cross-platform configure default.
Your proposed change will break the build of Python on Redhat/Fedora systems.
You know that this is not true. Python will happily continue to compile on these systems.
I'm talking about the *configure* default, not the default installation you find on any particular platform (this remains a platform decision to be made by the packagers).
Why is it good to have such a default? Why is that so good that its better than having Tkinter work by default?
It is important to be able to rely on a default that is used when no special options are given. The decision to use UCS2 or UCS4 is much too important to be left to a configure script.
The main point is that we can no longer tell users: if you run configure without any further options, you will get a UCS2 build of Python.
It's not a matter of telling the users "no longer". "We" currently don't tell that in any documentation; if you had been telling that users, you were wrong.
./configure --help says that the default for --enable-unicode is "yes".
Let's see:

http://www.python.org/peps/pep-0100.html
http://www.python.org/peps/pep-0261.html
http://www.python.org/doc/2.2.3/whatsnew/node8.html

Apart from the mention in the What's New document for Python 2.2 and a FAQ entry, the documentation doesn't mention UCS4 at all. However, you're right: the configure script should print "(default is ucs2)".
I want to restore this fact which was true before Jeff's patch was applied.
I understand that you want that. I'm opposed.
Noted.
Telling users to look at the configure script printout to determine whether they have just built a UCS2 or UCS4 is just not right given its implications.
Right. We should tell them what the procedure is that is used.
No, we should make it an explicit decision by the user running the configure script. BTW, a UCS4 TCL is just as non-standard as a UCS4 Python build. Non-standard build options should never be selected by a configure script all by itself.
It will continue to work - the only change, if any, is to add --enable-unicode=tcl or --enable-unicode=ucs4 (if you know that TCL uses UCS4) to your configure setup. The --enable-unicode=ucs4 configure setting is part of RedHat and SuSE already, so there won't be any changes necessary.
Yes, but users of these systems need to adjust.
Not really: they won't even notice the change in the configure script if they use the system-provided Python versions. Or am I missing something?

Regardless of all this discussion, I think we should try to get _tkinter.c to work with a UCS4 TCL version as well. The conversion from UCS4 (Python) to UCS2 (TCL) is already integrated, so adding support for the other way around should be rather straightforward. Any takers? Regards, -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, May 13 2005)

M.-A. Lemburg wrote:
It is important to be able to rely on a default that is used when no special options are given. The decision to use UCS2 or UCS4 is much too important to be left to a configure script.
Should the choice be a runtime decision? I think it should be. That could mean two unicode types, a call similar to sys.setdefaultencoding(), a new unicode extension module, or something else. BTW, thanks for discussing these issues. I tried to write a patch to the unicode API documentation, but it's hard to know just what to write. I think I can say this: "sometimes your strings are UTF-16, so you're working with code units that are not necessarily complete code points; sometimes your strings are UCS4, so you're working with code units that are also complete code points. The choice between UTF-16 and UCS4 is made at the time the Python interpreter is compiled and the default choice varies by operating system and configuration." Shane

On May 14, 2005, at 3:05 PM, Shane Hathaway wrote:
M.-A. Lemburg wrote:
It is important to be able to rely on a default that is used when no special options are given. The decision to use UCS2 or UCS4 is much too important to be left to a configure script.
Should the choice be a runtime decision? I think it should be. That could mean two unicode types, a call similar to sys.setdefaultencoding(), a new unicode extension module, or something else.
BTW, thanks for discussing these issues. I tried to write a patch to the unicode API documentation, but it's hard to know just what to write. I think I can say this: "sometimes your strings are UTF-16, so you're working with code units that are not necessarily complete code points; sometimes your strings are UCS4, so you're working with code units that are also complete code points. The choice between UTF-16 and UCS4 is made at the time the Python interpreter is compiled and the default choice varies by operating system and configuration."
Well, if you're going to make it runtime, you might as well do it right. Take away the restriction that the unicode type backing store is forced to be a particular encoding (i.e. get rid of PyUnicode_AS_UNICODE) and give it more flexibility. The implementation of NSString in OpenDarwin's libFoundation <http://libfoundation.opendarwin.org/> (BSD license), or the CFString implementation in Apple's CoreFoundation <http://developer.apple.com/darwin/cflite.html> (APSL) would be an excellent place to look for how this can be done. Of course, for backwards compatibility reasons, this would have to be a new type that descends from basestring. text would probably be a good name for it. This would be an abstract implementation, where you can make concrete subclasses that actually implement the various operations as necessary. For example, you could have text_ucs2, text_ucs4, text_ascii, text_codec, etc. The bonus here is you can get people to shut up about space efficient representations, because you can use whatever makes sense. -bob
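Bob's proposal is only a design sketch; the class and method names below are hypothetical, but they show the shape of an abstract text type whose concrete subclasses pick their own backing store:

```python
class Text:
    """Hypothetical abstract string type; subclasses choose the storage."""
    def code_points(self):
        raise NotImplementedError

class AsciiText(Text):
    """Space-efficient subclass: one byte per character."""
    def __init__(self, data):
        self._data = bytes(data)
    def code_points(self):
        return list(self._data)

class Ucs4Text(Text):
    """Subclass storing one full code point per character."""
    def __init__(self, s):
        self._points = [ord(c) for c in s]
    def code_points(self):
        return self._points
```

Callers program against Text and never see which representation they got, which is the space-efficiency argument Bob is making.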

Nicholas Bastin wrote:
No, that's not true. Python lets you choose UCS-4 or UCS-2. What the default is depends on your platform.
The truth is more complicated. If your Tcl is built for UCS-4, then Python will also be built for UCS-4 (unless overridden by command line). Otherwise, Python will default to UCS-2. Regards, Martin

Nicholas Bastin wrote:
I'm not sure the Python documentation is the place to teach someone about unicode. ISO 10646 pretty clearly defines UCS-2 as only containing characters in the BMP (plane zero). On the other hand, I don't know why python lets you choose UCS-2 anyhow, since it's almost always not what you want.
It certainly is, in most cases. On Windows, it is the only way to get reasonable interoperability with the platform's WCHAR (i.e. just cast a Py_UNICODE* into a WCHAR*). To a limited degree, in UCS-2 mode, Python has support for surrogate characters (e.g. in UTF-8 codec), so it is not "pure" UCS-2, but this is a minor issue. Regards, Martin

On May 4, 2005, at 6:03 PM, Martin v. Löwis wrote:
Nicholas Bastin wrote:
"This type represents the storage type which is used by Python internally as the basis for holding Unicode ordinals. Extension module developers should make no assumptions about the size of this type on any given platform."
But people want to know "Is Python's Unicode 16-bit or 32-bit?" So the documentation should explicitly say "it depends".
The important piece of information is that it is not guaranteed to be a particular one of those sizes. Once you can't guarantee the size, no one really cares what size it is. The documentation should discourage developers from attempting to manipulate Py_UNICODE directly, which, other than trivia, is the only reason why someone would care what size the internal representation is. -- Nick

Nicholas Bastin wrote:
On May 4, 2005, at 6:03 PM, Martin v. Löwis wrote:
Nicholas Bastin wrote:
"This type represents the storage type which is used by Python internally as the basis for holding Unicode ordinals. Extension module developers should make no assumptions about the size of this type on any given platform."
But people want to know "Is Python's Unicode 16-bit or 32-bit?" So the documentation should explicitly say "it depends".
The important piece of information is that it is not guaranteed to be a particular one of those sizes. Once you can't guarantee the size, no one really cares what size it is. The documentation should discourage developers from attempting to manipulate Py_UNICODE directly, which, other than trivia, is the only reason why someone would care what size the internal representation is.
I don't see why you shouldn't use Py_UNICODE buffer directly. After all, the reason why we have that typedef is to make it possible to program against an abstract type - regardless of its size on the given platform. In that respect it is similar to wchar_t (and all the other *_t typedefs in C). -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, May 06 2005)

On May 6, 2005, at 3:25 AM, M.-A. Lemburg wrote:
I don't see why you shouldn't use Py_UNICODE buffer directly. After all, the reason why we have that typedef is to make it possible to program against an abstract type - regardless of its size on the given platform.
Because the encoding of that buffer appears to be different depending on the configure options. If that isn't true, then someone needs to change the doc, and the configure options. Right now, it seems *very* clear that Py_UNICODE may either be UCS-2 or UCS-4 encoded if you read the configure help, and you can't use the buffer directly if the encoding is variable. However, you seem to be saying that this isn't true. -- Nick

Nicholas Bastin wrote:
Because the encoding of that buffer appears to be different depending on the configure options.
What makes it appear so? sizeof(Py_UNICODE) changes when you change the option - does that, in your mind, mean that the encoding changes?
If that isn't true, then someone needs to change the doc, and the configure options. Right now, it seems *very* clear that Py_UNICODE may either be UCS-2 or UCS-4 encoded if you read the configure help, and you can't use the buffer directly if the encoding is variable. However, you seem to be saying that this isn't true.
It's a compile-time option (as all configure options). So at run-time, it isn't variable. Regards, Martin

On May 6, 2005, at 7:45 PM, Martin v. Löwis wrote:
Nicholas Bastin wrote:
Because the encoding of that buffer appears to be different depending on the configure options.
What makes it appear so? sizeof(Py_UNICODE) changes when you change the option - does that, in your mind, mean that the encoding changes?
Yes. Not only in my mind, but in the Python source code. If Py_UNICODE is 4 bytes wide, then the encoding is UTF-32 (UCS-4), otherwise the encoding is UTF-16 (*not* UCS-2).
If that isn't true, then someone needs to change the doc, and the configure options. Right now, it seems *very* clear that Py_UNICODE may either be UCS-2 or UCS-4 encoded if you read the configure help, and you can't use the buffer directly if the encoding is variable. However, you seem to be saying that this isn't true.
It's a compile-time option (as all configure options). So at run-time, it isn't variable.
What I mean by 'variable' is that you can't make any assumption as to what the size will be in any given python when you're writing (and building) an extension module. This breaks binary compatibility of extension modules on the same platform and same version of python across interpreters which may have been built with different configure options. -- Nick

Nicholas Bastin wrote:
Yes. Not only in my mind, but in the Python source code. If Py_UNICODE is 4 bytes wide, then the encoding is UTF-32 (UCS-4), otherwise the encoding is UTF-16 (*not* UCS-2).
I see. Some people equate "encoding" with "encoding scheme"; neither UTF-32 nor UTF-16 is an encoding scheme. You were apparently talking about encoding forms.
What I mean by 'variable' is that you can't make any assumption as to what the size will be in any given Python when you're writing (and building) an extension module. This breaks binary compatibility of extension modules on the same platform and the same version of Python across interpreters which may have been built with different configure options.
True. The breakage will be quite obvious, in most cases: the module fails to load because not only sizeof(Py_UNICODE) changes, but also the names of all symbols change. Regards, Martin

On May 6, 2005, at 8:25 PM, Martin v. Löwis wrote:
Nicholas Bastin wrote:
Yes. Not only in my mind, but in the Python source code. If Py_UNICODE is 4 bytes wide, then the encoding is UTF-32 (UCS-4), otherwise the encoding is UTF-16 (*not* UCS-2).
I see. Some people equate "encoding" with "encoding scheme"; neither UTF-32 nor UTF-16 is an encoding scheme. You were
That's not true. UTF-16 and UTF-32 are each both a CES and a CEF (although this is not true of UTF-16LE and UTF-16BE). UTF-32 is a fixed-width encoding form covering the code space (0..10FFFF), and UTF-16 is a variable-width encoding form which uses one or two 16-bit code units (each in the range 0..FFFF) to cover that same code space. However, you are perhaps right to point out that people should be more explicit as to which they are referring to. UCS-2, however, is only a CEF, and thus I thought it was obvious that I was referring to UTF-16 as a CEF. I would point anyone who is confused at this point to Unicode Technical Report #17 on the Character Encoding Model, which is much clearer than trying to piece together the relevant parts out of the entire standard.

In any event, Python's use of the term UCS-2 is incorrect. I quote from the TR: "The UCS-2 encoding form, which is associated with ISO/IEC 10646 and can only express characters in the BMP, is a fixed-width encoding form." immediately followed by: "In contrast, UTF-16 uses either one or two code units and is able to cover the entire code space of Unicode." If Python is capable of representing the entire code space of Unicode when you choose --unicode=ucs2, then that is a bug: it either should not be called UCS-2, or the interpreter should be bound by the limitations of the UCS-2 CEF.
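The UTF-16 vs. UCS-2 distinction drawn here comes down to surrogate pairs: UTF-16 reaches code points above U+FFFF by splitting them across two 16-bit code units, which UCS-2 cannot express. A sketch of the arithmetic from the Unicode standard (the function name is mine, not from the thread):

```python
def to_utf16_units(cp):
    """Encode one code point as a list of 16-bit UTF-16 code units."""
    if cp > 0x10FFFF:
        raise ValueError("beyond the Unicode code space")
    if cp < 0x10000:
        return [cp]           # BMP: one code unit, identical to UCS-2
    cp -= 0x10000             # 20 bits remain
    return [0xD800 + (cp >> 10),      # high (lead) surrogate
            0xDC00 + (cp & 0x3FF)]    # low (trail) surrogate

# U+1D11E MUSICAL SYMBOL G CLEF needs a surrogate pair:
print([hex(u) for u in to_utf16_units(0x1D11E)])  # ['0xd834', '0xdd1e']
```

A build that stores such a pair in two 16-bit Py_UNICODE slots is behaving as UTF-16, not UCS-2, which is exactly the naming complaint above.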
What I mean by 'variable' is that you can't make any assumption as to what the size will be in any given python when you're writing (and building) an extension module. This breaks binary compatibility of extensions modules on the same platform and same version of python across interpreters which may have been built with different configure options.
True. The breakage will be quite obvious, in most cases: the module fails to load because not only sizeof(Py_UNICODE) changes, but also the names of all symbols change.
Yes, but the important question here is why would we want that? Why doesn't Python just have *one* internal representation of a Unicode character? Having more than one possible definition just creates problems, and provides no value. -- Nick

Nicholas Bastin wrote:
Yes, but the important question here is why would we want that? Why doesn't Python just have *one* internal representation of a Unicode character? Having more than one possible definition just creates problems, and provides no value.
It does provide value, there are good reasons for each setting. Which of the two alternatives do you consider useless? Regards, Martin

On May 7, 2005, at 9:24 AM, Martin v. Löwis wrote:
Nicholas Bastin wrote:
Yes, but the important question here is why would we want that? Why doesn't Python just have *one* internal representation of a Unicode character? Having more than one possible definition just creates problems, and provides no value.
It does provide value, there are good reasons for each setting. Which of the two alternatives do you consider useless?
I don't consider either alternative useless (well, I consider UCS-2 to be largely useless in the general case, but as we've already discussed here, Python isn't really UCS-2). However, I would be a lot happier if we just chose *one*, and all Pythons used that one. This would make extension module distribution a lot easier. I'd prefer UTF-16, but I would be perfectly happy with UCS-4. -- Nick

Nicholas Bastin wrote:
I don't consider either alternative useless (well, I consider UCS-2 to be largely useless in the general case, but as we've already discussed here, Python isn't really UCS-2). However, I would be a lot happier if we just chose *one*, and all Pythons used that one. This would make extension module distribution a lot easier.
Why is that? For a binary distribution, you have to know the target system in advance, so you also know what size the Unicode type has. For example, on Redhat 9.x, and on Debian Sarge, /usr/bin/python uses a UCS-4 Unicode type. As you have to build binaries specifically for these target systems (because of dependencies on the C library, and perhaps other libraries), building the extension module *on* the target system will just do the right thing.
I'd prefer UTF-16, but I would be perfectly happy with UCS-4.
-1 on the idea of dropping one alternative. They are both used (on different systems), and people rely on both being supported. Regards, Martin

Nicholas Bastin wrote:
The important piece of information is that it is not guaranteed to be a particular one of those sizes. Once you can't guarantee the size, no one really cares what size it is.
Please trust many years of experience: This is just not true. People do care, and they want to know. If we tell them "it depends", they ask "how can I find out".
The documentation should discourage developers from attempting to manipulate Py_UNICODE directly, which, other than trivia, is the only reason why someone would care what size the internal representation is.
Why is that? Of *course* people will have to manipulate Py_UNICODE* buffers directly. What else can they use? Regards, Martin
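Martin's point cuts both ways: an extension walking a Py_UNICODE* buffer directly on a narrow build must pair surrogates itself to recover code points, while a wide build makes the pairing unnecessary. A sketch in Python of the loop such C code would perform (names are mine, not from the thread):

```python
def decode_utf16_units(units):
    """Combine 16-bit code units into code points, pairing surrogates.

    Mirrors the loop a C extension walking a narrow-build Py_UNICODE*
    buffer would need; unpaired surrogates pass through unchanged.
    """
    out, i = [], 0
    while i < len(units):
        u = units[i]
        if (0xD800 <= u <= 0xDBFF and i + 1 < len(units)
                and 0xDC00 <= units[i + 1] <= 0xDFFF):
            out.append(0x10000 + ((u - 0xD800) << 10)
                       + (units[i + 1] - 0xDC00))
            i += 2
        else:
            out.append(u)
            i += 1
    return out

print([hex(c) for c in decode_utf16_units([0x48, 0xD834, 0xDD1E])])
# ['0x48', '0x1d11e']
```

Whether the extension must do this depends on the configure option of the interpreter it loads into, which is the crux of the portability complaint.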