[Python-Dev] Python-3.0, unicode, and os.environ

Adam Olsen rhamph at gmail.com
Fri Dec 12 10:12:26 CET 2008

On Fri, Dec 12, 2008 at 1:31 AM, Ulrich Eckhardt
<eckhardt at satorlaser.com> wrote:
> On Thursday 11 December 2008, Adam Olsen wrote:
>> The simplest solution there is to have Windows bytes APIs that return
>> raw UTF-16 bytes (note that on Windows these are NOT guaranteed to be
>> valid Unicode, despite that being much more likely than on Linux).
> Actually, I'm not aware of this case. I only know that the OS refuses to mount
> media it can't decode, but that is on the OS-level. Can you give me a hint?

Only pages like this, which indicate the underlying API is an array of WCHAR:


>> The only real issue I see is that UTF-16 isn't an ASCII superset, so it
>> won't print nicely.
> True, but I personally couldn't care less. Actually, I would even prefer if
> printing a byte string always produced \x escaped byte values, that way it
> would at least be consistent.
>> In other words, bytes can be your special type.
> That would actually be a lot of work to do, but I do agree that it would be a
> way.
> The problem though is that I have seen quite a few places in Python where such
> a byte string is passed as 'char*' and treated with the assumption that
> strlen() would yield a meaningful value there, so this calls at least for a
> distinct 'Py_Byte' type. Also, this still doesn't even remotely handle the
> problem that you do have two valid encodings on win32, even though the MBCS
> one could be called deprecated. People will try to interface to other
> libraries that use win32 CHAR strings and that will be much harder or even
> impossible. Further, and that is IMHO the worst part of it, things will fail
> too silently and programmers aren't encouraged to write portable code, but
> maybe I'm just too pessimistic.

char * is just fine.  You need only pass a length along with it.  All
internal APIs *must* already do this, as they support nul bytes.  Also
note that the underlying POSIX APIs prohibit nul bytes in filenames,
so it's irrelevant for them.
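
To make the strlen() point concrete, here is a minimal sketch in Python. The filename is made up, and ctypes is only standing in for C code that treats a buffer as a nul-terminated string; none of this is taken from the CPython source.

```python
import ctypes

# Hypothetical filename, encoded the way a raw UTF-16 bytes API might return it.
raw = "abc".encode("utf-16-le")      # six bytes, every second one is nul

# A C-style buffer holding those bytes (ctypes appends one extra nul byte).
buf = ctypes.create_string_buffer(raw)

# What strlen()-style code sees: it stops at the first nul byte.
c_view = buf.value

# What pointer-plus-length code sees: all of the original bytes.
full_view = buf.raw[:len(raw)]

print(c_view, len(c_view))
print(full_view, len(full_view))
```

The nul-terminated view keeps only the first byte, while the view that carries an explicit length keeps all six, which is why passing the length along is enough.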

If your concern is that people will use MBCS byte strings (produced
how?) in a WCHAR API, I agree it would be confusing, but not nearly
enough to warrant a special type (which would probably get passed an
MBCS byte string anyway).

Although I haven't found an official claim that MBCS is deprecated, I
see no reason why it wouldn't be effectively obsoleted by the UTF-16
APIs.  (Certain outdated APIs may be the exception.)  We could have a
way to convert (locale-dependent codec?), but that's as much as we
should care.
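
A rough sketch of what such a conversion could look like, assuming the code page is known. The function name and the cp1252 default are hypothetical, chosen only so the example runs anywhere; on Windows the "mbcs" codec, which follows the active ANSI code page, would be the locale-dependent choice.

```python
def ansi_to_utf16_bytes(data: bytes, codepage: str = "cp1252") -> bytes:
    """Re-encode code-page bytes as raw UTF-16-LE bytes (hypothetical helper)."""
    # Decode with the assumed code page, then encode without a BOM.
    text = data.decode(codepage)
    return text.encode("utf-16-le")


# b"caf\xe9" is "café" in cp1252.
converted = ansi_to_utf16_bytes(b"caf\xe9")
print(converted)
```

Decoding raises UnicodeDecodeError if the bytes are not valid in the assumed code page, so a wrong guess fails loudly rather than silently for code pages with unassigned bytes.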

Adam Olsen, aka Rhamphoryncus
