Re: [Python-Dev] please consider changing --enable-unicode default to ucs4

Zooko O'Whielacronx wrote:
-1 Please note that we did not choose to ship Python as UCS4 binary on Linux - the Linux distributions did. The Python default is UCS2 for a good reason: it's a good trade-off between memory consumption, functionality and performance. As already mentioned, I also don't understand how changing the Python default on Linux would help your users in any way - if you let distutils compile your extensions, it's automatically going to use the right Unicode setting for you (as well as your users). Unfortunately, this automatic support doesn't help you when shipping e.g. setuptools eggs, but this is a tool problem, not a Python one: setuptools completely ignores the fact that there are two ways to build Python. I'd suggest you ask the tool maintainers to adjust their tools to support the Python Unicode option.
People building their own Python version will usually also build their own extensions, so I don't really believe that the above scenario is very common. Also note that Python will complain loudly when you try to load a UCS2 extension in a UCS4 build and vice-versa. We've made sure that any extension using the Python Unicode C API has to be built for the same UCS version of Python. This is done by using different names for the C APIs at the C level.
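Because the C API names differ per build, the missing symbol in the error itself reveals which build an extension expects. A minimal sketch, assuming the loader's message contains a symbol of the form PyUnicodeUCS2_... or PyUnicodeUCS4_...; the sample message and the helper name expected_unicode_build are illustrative, not taken from a real traceback:

```python
import re

# Python 2.x renames the Unicode C API per build, so the symbols look like
# PyUnicodeUCS2_FromUnicode or PyUnicodeUCS4_FromUnicode.
_UCS_SYMBOL = re.compile(r"PyUnicode(UCS[24])_\w+")


def expected_unicode_build(error_message):
    """Return 'UCS2' or 'UCS4' if the message names a mangled symbol, else None."""
    match = _UCS_SYMBOL.search(error_message)
    return match.group(1) if match else None


# Illustrative message only; the real text depends on the platform's loader.
sample = "ImportError: undefined symbol: PyUnicodeUCS4_FromUnicode"
print(expected_unicode_build(sample))
```

If the result differs from the interpreter's own build type, the extension was compiled against the other kind of Python.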
Perhaps we should add a FAQ entry for these linker errors (which are caused by the mentioned C API changes to prevent mixing UCS versions)?! Here's a quick way to determine your Python Unicode build type:
    python -c "import sys;print((sys.maxunicode<66000)and'UCS2'or'UCS4')"
Perhaps we should include this info as well as a 32/64-bit indicator and the processor type in the Python startup line:
    # python
    Python 2.6 (r26:66714, Feb 3 2009, 20:49:49, UCS4, 64-bit, x86_64)
    [GCC 4.3.2 [gcc-4_3-branch revision 141291]] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
This would help users find the right binaries to install as extensions.
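A slightly longer version of the one-liner, as a sketch of the three pieces of information proposed for the startup line. The function names unicode_build and build_summary are made up for this example. Note that on Python 3.3 and later sys.maxunicode is always 0x10FFFF, so the UCS2/UCS4 distinction only shows up on older builds.

```python
import platform
import struct
import sys


def unicode_build(maxunicode=None):
    """Classify a build as 'UCS2' (narrow) or 'UCS4' (wide) from sys.maxunicode."""
    if maxunicode is None:
        maxunicode = sys.maxunicode
    return "UCS2" if maxunicode <= 0xFFFF else "UCS4"


def build_summary():
    """Return a string such as 'UCS4, 64-bit, x86_64' for the running interpreter."""
    pointer_bits = struct.calcsize("P") * 8  # size of a C pointer, in bits
    return "%s, %d-bit, %s" % (unicode_build(), pointer_bits, platform.machine())


print(build_summary())
```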
Which is IMHO what all Linux distributions should have done. Distributions should really not be put in charge of upstream coding design decisions. Regards, -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Sep 28 2009)
::: Try our new mxODBC.Connect Python Database Interface for free ! :::: eGenix.com Software, Skills and Services GmbH Pastor-Loeh-Str.48 D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg Registered at Amtsgericht Duesseldorf: HRB 46611 http://www.egenix.com/company/contact/

M.-A. Lemburg wrote:
There already is one: http://www.python.org/doc/faq/extending/#when-importing-module-x-why-do-i-ge... I wonder why it doesn't show up in the Google searches.
Regards, -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Sep 28 2009)

On Sep 28, 2009, at 4:25 AM, M.-A. Lemburg wrote:
Distributions should really not be put in charge of upstream coding design decisions.
I don't think you can blame distros for this one.... From PEP 0261: "It is also proposed that one day --enable-unicode will just default to the width of your platforms wchar_t." On Linux, wchar_t is 4 bytes. If there's a consensus amongst Python upstream that all the distros should be shipping Python with UCS2 Unicode strings, you should reach out to them and say this, rather more clearly. Currently, most signs point towards UCS4 builds as being the better option. Or, one might reasonably wonder why UCS-4 is an option at all, if nobody should enable it.
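The wchar_t width the PEP refers to can be checked from Python itself. A small sketch using ctypes; the mapping in default_build_for is only the rule the PEP proposes for some future default, not something configure is known to do, and both function names are invented for this example:

```python
import ctypes


def wchar_width():
    """Return sizeof(wchar_t) in bytes on the platform running this script."""
    return ctypes.sizeof(ctypes.c_wchar)


def default_build_for(width):
    """Hypothetical rule: 2-byte wchar_t -> UCS2, 4-byte wchar_t -> UCS4."""
    builds = {2: "UCS2", 4: "UCS4"}
    if width not in builds:
        raise ValueError("unexpected wchar_t width: %r" % (width,))
    return builds[width]


print(wchar_width(), default_build_for(wchar_width()))
```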
I'd just like to note that I've run into this trap multiple times. I built a custom python, and expected it to work with all the existing, installed extensions (same major version as the system install, just patched). And then had to build it again with UCS4, for it to actually work. Of course building twice isn't the end of the world, and I'm certainly used to having to twiddle build options on software to get it working, but this *does* happen, and *is* a tiny bit irritating. James

James Y Knight wrote:
The PEP also has this to say: "This has the effect of doubling the size of most Unicode strings. In order to avoid imposing this cost on every user, Python 2.2 will allow the 4-byte implementation as a build-time option. Users can choose whether they care about wide characters or prefer to preserve memory." And that's still true today. It was the main reason for not making it the default in those days. Today, Python 3.x uses Unicode for all strings, so while the RAM situation has changed somewhat since Python 2.2, the change has a much wider effect on the Python memory footprint than in late 2001.
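To make the doubling concrete, here is a rough sketch that compares the raw payload of a string at 2 and 4 bytes per code unit. It uses the utf-16-le and utf-32-le codecs as stand-ins for the two internal layouts; that is an assumption for illustration, and it ignores per-object overhead.

```python
def payload_bytes(text):
    """Return (narrow, wide) payload sizes in bytes for the given text."""
    narrow = len(text.encode("utf-16-le"))  # 2 bytes per UTF-16 code unit
    wide = len(text.encode("utf-32-le"))    # 4 bytes per code point
    return narrow, wide


narrow, wide = payload_bytes(u"hello world")
print(narrow, wide)  # 22 44
```

For text made up only of non-BMP code points the two sizes come out equal, since each such code point needs a surrogate pair in the narrow layout.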
UCS4 is the better option if you use lots of non-BMP code points and if you have to regularly interface with C APIs using wchar_t on Unix.
Or, one might reasonably wonder why UCS-4 is an option at all, if nobody should enable it.
See above: there are use cases where this does make a lot of sense. E.g. non-BMP code points can only be represented using surrogates on UCS2 builds and these can be tricky to deal with (or at least many people feel like it's tricky to deal with them ;-).
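For reference, this is the arithmetic behind a surrogate pair, sketched as a small function (the name to_surrogate_pair is made up here). U+1D11E, MUSICAL SYMBOL G CLEF, serves as the example non-BMP code point.

```python
def to_surrogate_pair(code_point):
    """Split a non-BMP code point into its UTF-16 (high, low) surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a non-BMP code point: %#x" % code_point)
    offset = code_point - 0x10000       # 20-bit value
    high = 0xD800 + (offset >> 10)      # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)     # bottom 10 bits
    return high, low


print("%04X %04X" % to_surrogate_pair(0x1D11E))  # D834 DD1E
```

On a UCS2 build such a character occupies two positions in a unicode string, which is where indexing and slicing get tricky.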
Which is why I think that Python should include some more information on the type of build being used, e.g. by placing the information prominently on the startup line. I still don't believe the above use case is a common one, though. That said, Zooko's original motivation for the proposed change is making installation of extensions easier for users. That's a tools question much more than a Python Unicode one. Aside: --enable-unicode is gone in Python 3.x. You now only have the choice to use the default (which is UCS2) or switch on the optional support for UCS4 by using --with-wide-unicode. -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Sep 28 2009)

2009/9/28 James Y Knight <foom@fuhm.net>:
I've also encountered this trap multiple times. Obviously, the problem is not rebuilding Python, which is quick, but figuring out the correct configure option to use (--enable-unicode=ucs4). Others have also spent some time scratching their heads over the strange PyUnicodeUCS4_FromUnicode error the misconfiguration results in, as Zooko's links show. If Python can't infer the Unicode setting from the width of the platform's wchar_t, then perhaps it should be mandatory to specify to configure whether you want UCS2 or UCS4? For someone clueless like me, it would be easier to deal with the problem upfront than (much) further down the line. Explicit being better than implicit and all that. -- mvh Björn

Hello,
Isn't this overrated? First, if you have a Python build with the wrong Unicode setting, just print out its sys.maxunicode and choose the right setting according to that (if sys.maxunicode == 65535, you need to compile a UCS-4 version, otherwise a UCS-2 version). Second, once you have encountered this issue, you know what you need the subsequent times. There are only two possibilities after all.
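The same check can be turned around: query the interpreter whose extensions you want to reuse (for example the distribution's) and build with its setting. A sketch for Python 2.x configure flags only; the function names are invented for this example, and sys.executable merely stands in for the reference interpreter's path.

```python
import subprocess
import sys


def query_maxunicode(python_executable):
    """Ask another interpreter for its sys.maxunicode."""
    output = subprocess.check_output(
        [python_executable, "-c", "import sys; print(sys.maxunicode)"]
    )
    return int(output.decode("ascii").strip())


def matching_configure_flag(maxunicode):
    """Map the reference interpreter's sys.maxunicode to a matching 2.x flag."""
    if maxunicode == 65535:
        return "--enable-unicode=ucs2"
    return "--enable-unicode=ucs4"


# sys.executable stands in for the interpreter you want to stay compatible with.
print(matching_configure_flag(query_maxunicode(sys.executable)))
```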
I'm not sure why someone "clueless" (your word :-)) wants to compile his own Python, though. Regards Antoine.

To do so, you have to know that there is such a configuration option in the first place, and that the error message you get (missing symbols) has anything to do with it. This is quite puzzling to users. Once people know what the problem is, fixing it is indeed easy.
I'm not sure why someone "clueless" (your word :-)) wants to compile his own Python, though.
People install a lot of software that they don't understand. In fact, most people who ever install software don't know how it is written, and cannot enumerate the meaning of all configuration options that the software possesses. In Unix, there is a long tradition that "installing software" means "building from source"; if you have a configure script, you expect that it either works out of the box, or gives an error message if it finds that something is wrong with the environment. So it is quite normal that people who don't understand how the Python interpreter works (or that it has a Unicode type) install Python. Regards, Martin

Dear MAL and python-dev: I failed to explain the problem that users are having. I will try again, and this time I will omit my ideas about how to improve things and just focus on describing the problem.
Some users are having trouble using Python packages containing binary extensions on Linux. I want to provide such binary Python packages for Linux for the pycryptopp project (http://allmydata.org/trac/pycryptopp ) and the zfec project (http://allmydata.org/trac/zfec ). I also want to make it possible for users to install the Tahoe-LAFS project (http://allmydata.org ) without having a compiler or Python header files. (You'd be surprised at how often Tahoe-LAFS users try to do this on Linux. Linux is no longer only for people who have the knowledge and patience to compile software themselves.) Tahoe-LAFS also depends on many packages that are maintained by other people and are not packaged or distributed by me -- pyOpenSSL, simplejson, etc.
There have been several hurdles in the way that we've overcome, and no doubt there will be more, but the current hurdle is that there are two "formats" for Python extension modules that are used on Linux -- UCS2 and UCS4. If a user gets a Python package containing a compiled extension module which was built for the wrong UCS2/4 setting, he will get mysterious (to him) "undefined symbol" errors at import time.
On Mon, Sep 28, 2009 at 2:25 AM, M.-A. Lemburg <mal@egenix.com> wrote:
The Python default is UCS2 for a good reason: it's a good trade-off between memory consumption, functionality and performance.
I'm sure you are right about this. At some point I will try to measure the performance implications in the context of our application. I don't think it will be an issue for us, as so far no users have complained about any performance or functionality problems that were traceable to the choice of UCS2/4.
My users are using some Python packages built by me and some built by others. The binary packages they get from others could have the incompatible UCS2/4 setting. Also some of my users might be using a python configured with the opposite setting of the python interpreter that I use to build packages.
This is the setuptools/distribute issue that I mentioned: http://bugs.python.org/setuptools/issue78 . If that issue were solved then if a user tried to install a specific package, for example with a command-line like "easy_install http://allmydata.org/source/tahoe/deps/tahoe-dep-eggs/pyOpenSSL-0.8-py2.5-li...", then instead of getting an undefined symbol error at import time, they would get an error message to the effect of "This package is not compatible with your Python interpreter." at install time. That would be good because it would be less confusing to the users.
However, if they were using the default setuptools/distribute dependency-satisfaction feature, e.g. because they are installing a package and that package is marked as "install_requires=['pyOpenSSL']", then setuptools/distribute would do its fallback behavior in which it attempts to compile the package from source when it can't find a compatible binary package. This would probably confuse the users at least as much as the undefined symbol error currently does.
In any case, improving the tools to handle incompatible packages nicely would not make more packages compatible. Let's do both! Improve tools to handle incompatible packages nicely, and encourage everyone who compiles python on Linux to use the same UCS2/4 setting.
Thank you for your attention. Regards, Zooko
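What such an install-time check might look like, as a purely hypothetical sketch: it assumes a binary package declares the build it was compiled for as 'UCS2' or 'UCS4', which is not something the thread says eggs record. The class and function names are invented and are not setuptools APIs.

```python
import sys


class IncompatiblePackage(Exception):
    """Raised at install time instead of failing later at import time."""


def interpreter_build():
    """Return the running interpreter's build type from sys.maxunicode."""
    return "UCS2" if sys.maxunicode == 0xFFFF else "UCS4"


def check_package(package_build, python_build=None):
    """Refuse a binary package whose declared build differs from the interpreter's.

    package_build is hypothetical metadata ('UCS2' or 'UCS4').
    """
    if python_build is None:
        python_build = interpreter_build()
    if package_build != python_build:
        raise IncompatiblePackage(
            "This package is not compatible with your Python interpreter."
        )
    return True
```

Under this assumption, an installer would call check_package before unpacking a binary and either stop with that message or fall back to another candidate.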

Zooko O'Whielacronx wrote:
Zooko, I really fail to see the reasoning here: Why would people who know how to build their own Python interpreter on Linux, and expect it to work like the distribution-provided one, have a problem looking up the distribution-used configuration settings? This is like compiling your own Linux kernel without using the same configuration as the distribution kernel and still expecting the distribution kernel modules to load without problems.
Note that this has nothing to do with compiling your own Python extensions. Python's distutils will automatically use the right settings for compiling those, based on the configuration of the Python interpreter used for running the compilation - which will usually be the distribution interpreter.
Your argument doesn't really live up to the consequences of switching to UCS4. Just as a data point: eGenix has been shipping binaries for Python packages for several years and while we do occasionally get reports about UCS2/UCS4 mismatches, those are really in the minority.
I'd also question using the UCS4 default only on Linux. If we do go for a change, we should use sizeof(wchar_t) as basis for the new default - on all platforms that provide a wchar_t type. However, before we can make such a decision, we need more data about the consequences. That is:
* memory footprint changes
* performance changes
For both Python 2.x and 3.x. After all, UCS4 uses twice as much memory for all Unicode objects as UCS2. Since Python 3.x uses Unicode for all strings, I'd expect such a change to have more impact there.
We'd also need to look into possible problems with different compilers using different wchar_t sizes on the same platform (I doubt that there are any). On Windows, the default is fixed since Windows uses UTF-16 for everything Unicode, so UCS2 will for a long time be the only option on that platform.
That said, it'll take a while for distributions to upgrade, so you're always better off getting the tools you're using to deal with the problem for you and your users, since those are easier to upgrade. -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Oct 07 2009)

Ronald Oussoren wrote:
Is that true for non-Carbon APIs as well ? This is what I found on the web (in summary): Apple chose to go with UTF-16 at about the same time as Microsoft did and used sizeof(wchar_t) == 2 for Mac OS. When they moved to Mac OS X, they switched wchar_t to sizeof(wchar_t) == 4. -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Oct 07 2009)

On 7 Oct, 2009, at 22:13, M.-A. Lemburg wrote:
Both Carbon and the modern APIs use UTF-16. What I don't quite get in the UTF-16 vs. UTF-32 discussion is why UTF-32 would be useful, because if you want to do generic Unicode processing you have to look at sequences of composed characters (base characters + composing marks) anyway instead of separate code points. Not that I'm a unicode expert in any way... Ronald

Ronald Oussoren wrote:
Thanks for that data point. So UTF-16 would be the more natural choice on Mac OS X, despite the choice of sizeof(wchar_t).
Very true. It's one of the reasons why I'm not much of a UCS4-fan - it only helps with surrogates and that's about it. Combining characters, various types of control code points (e.g. joiners, bidirectional marks, breaks, non-breaks, annotations), context-sensitive casing and other such features found in scripts cause very similar problems - often much harder to solve, since they are not as easily identifiable as surrogate high and low code points. -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Oct 07 2009)
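A small illustration of the combining-character point, which applies on both UCS2 and UCS4 builds. This is a deliberately simplified sketch: it only attaches combining marks to the preceding base character and is not a full grapheme cluster segmentation. The function name is invented for the example.

```python
import unicodedata


def split_combining_sequences(text):
    """Group each base character with the combining marks that follow it."""
    clusters = []
    for ch in text:
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch      # attach the mark to the previous base
        else:
            clusters.append(ch)     # start a new sequence
    return clusters


word = u"e\u0301a"  # 'e' + COMBINING ACUTE ACCENT, then 'a'
print(len(word), len(split_combining_sequences(word)))  # 3 2
```

The string has three code points but only two user-perceived characters, regardless of how wide the internal code units are.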

Ronald Oussoren:
Both Carbon and the modern APIs use UTF-16.
If Unicode size standardization is seen as sufficiently beneficial then UTF-16 would be more widely applicable than UTF-32. Unix mostly uses 8-bit APIs which are either explicitly UTF-8 (such as GTK+) or can accept UTF-8 when the locale is set to UTF-8. They don't accept UTF-32. It is possible that Unix could move towards UTF-32 but that hasn't been the case up to now and with both OS X and Windows being UTF-16, it is more likely that UTF-16 APIs will become more popular on Unix. Neil

participants (8): "Martin v. Löwis", Antoine Pitrou, Björn Lindqvist, James Y Knight, M.-A. Lemburg, Neil Hodgson, Ronald Oussoren, Zooko O'Whielacronx