Re: [Python-Dev] Python-3.0, unicode, and os.environ

There has been some discussion here that users should use the str or byte function variant based on what is relevant to their system, for example when getting a list of file names or opening a file. That thought process really doesn't do much for those of us that write code that needs to run on any platform type, without alteration or the addition of complex if-statements and/or exceptions.
Whatever the resolution here, and those of you addressing this thorny issue have my admiration, the solution should be such that it gives consistent behavior regardless of platform type and doesn't require the programmer to know of all the minute details of each possible target platform.
That may not be possible for a while, so interim solutions should be designed to minimize later pain. If that means hiding "implementation details" behind a new function, so be it. Then, at least, the body of one's app is not burdened with this problem later when conditions change.
I'm glad I'm not the only one with hard problems. ;-)
Larry

On Fri, Dec 5, 2008 at 10:18 PM, Bugbee, Larry larry.bugbee@boeing.com wrote:
There has been some discussion here that users should use the str or byte function variant based on what is relevant to their system, for example when getting a list of file names or opening a file. That thought process really doesn't do much for those of us that write code that needs to run on any platform type, without alteration or the addition of complex if-statements and/or exceptions.
Whatever the resolution here, and those of you addressing this thorny issue have my admiration, the solution should be such that it gives consistent behavior regardless of platform type and doesn't require the programmer to know of all the minute details of each possible target platform.
My prediction is that it won't ever be possible to completely hide this difference between platforms. The platforms differ fundamentally in how they see filenames. An elaborate abstraction can certainly be created that smooths out most of the differences, but at some point useful functionality will have to be lost in order to maintain strict platform independence. This is the fate of most platform-independence abstractions by the way. For example, there are many elaborate packages for platform-independent I/O, but they generally don't provide access to all functionality that is available on a platform. Where they do, the application is once again placed in the position of having to use complex if-statements and/or exceptions.
Consider just this example. Many programs need to ask their user for the name of a file the program will create. On systems where filenames are raw byte strings, do you want to provide the user with a way to specify an arbitrary byte string? (That is, in addition to the normal case of entering a text string that will be transformed into a filename using some encoding.) Your choices are: not to support byte sequences that aren't valid in the current encoding, to add a UI element to select an encoding, or to add a UI element to enter raw bytes. An abstraction package is likely to support only the first option (this is what Java does BTW), but this is not acceptable to all applications.
That may not be possible for a while, so interim solutions should be designed to minimize later pain. If that means hiding "implementation details" behind a new function, so be it. Then, at least, the body of one's app is not burdened with this problem later when conditions change.
I believe the problem's severity is actually overstated. The interim solution with the least amount of pain that will work for almost all apps is to treat filenames as text strings encoded in some default encoding, and ignore filenames that aren't valid encodings of any text string. Yes, it is possible that you'll find that you can't completely remove or traverse certain directory trees. But that's a fact of life anyway (filesystems have many hidden failure modes), so you're better off dealing with *that* possibility than worrying over the issue of undecodable filenames.
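
A minimal sketch of the interim approach described above: read names through the bytes API, decode them with a default encoding, and skip anything that does not decode. The helper names `decode_names` and `list_text_names` and the UTF-8 default are assumptions made for illustration; they are not part of any proposal in this thread.

```python
import os


def decode_names(raw_names, encoding="utf-8"):
    """Keep only the names that decode cleanly; drop the rest."""
    decoded = []
    for raw in raw_names:
        try:
            decoded.append(raw.decode(encoding))
        except UnicodeDecodeError:
            # Undecodable filename: ignore it, as suggested above.
            continue
    return decoded


def list_text_names(directory, encoding="utf-8"):
    """List `directory` through the bytes API and return decodable names."""
    # Passing a bytes path makes os.listdir return bytes names.
    return decode_names(os.listdir(os.fsencode(directory)), encoding)
```

The cost is the one the message already names: a file whose name is skipped here is invisible to the rest of the program, so a recursive delete or traversal built on this helper can leave such entries behind.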

Bugbee, Larry wrote:
There has been some discussion here that users should use the str or byte function variant based on what is relevant to their system, for example when getting a list of file names or opening a file. That thought process really doesn't do much for those of us that write code that needs to run on any platform type, without alteration or the addition of complex if-statements and/or exceptions.
Whatever the resolution here, and those of you addressing this thorny issue have my admiration, the solution should be such that it gives consistent behavior regardless of platform type and doesn't require the programmer to know of all the minute details of each possible target platform.
I've been thinking about this and I can only see one option. I don't think that it really makes less work for the programmer, though -- it just shifts the problem and makes it more apparent what your code is doing.
To avoid exceptions and if-thens in program code when accessing filenames, environment variables, etc., you would need to access each of these resources via the byte API. Then, to avoid having to keep track of what's a string and what's a byte in your other code, you probably want to convert those bytes to strings. This is where the burden gets shifted. You'll have your own routine(s) to do the conversion, along with exception handling code to deal with undecodable filenames.
Note 1: your particular app might be able to get away without doing the conversion from bytes to string -- it depends on what you're planning on doing with the filename/environment data.
Note 2: If there isn't a parallel API on all platforms, for instance, Guido's proposal to not have os.environb on Windows, then you'll still have to have a platform specific check. (Likely you should try to access os.environb in this instance and if it doesn't exist, use os.environ instead... and remember that you need to either change os.environ's data into str type or change os.environb's data into byte type.)
-Toshio
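
The check described in Note 2 might look roughly like the sketch below. The function name `get_env_bytes` and the UTF-8 default are assumptions for illustration. In current Python 3 releases, `os.environb` is only present where the native environment type is bytes, so probing for the attribute is one way to do the platform check in a single place.

```python
import os


def get_env_bytes(name, encoding="utf-8"):
    """Return environment variable `name` as bytes, or None if unset."""
    environb = getattr(os, "environb", None)
    if environb is not None:
        # Bytes API available: keys and values are already bytes.
        return environb.get(name.encode(encoding))
    # No bytes API (e.g. Windows): fall back to os.environ and
    # convert the str value to bytes ourselves.
    value = os.environ.get(name)
    return None if value is None else value.encode(encoding)
```

Callers always receive bytes, so the platform check lives in this one function rather than being spread through the application.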

Toshio Kuratomi wrote:
Note 2: If there isn't a parallel API on all platforms, for instance, Guido's proposal to not have os.environb on Windows, then you'll still have to have a platform specific check. (Likely you should try to access os.environb in this instance and if it doesn't exist, use os.environ instead... and remember that you need to either change os.environ's data into str type or change os.environb's data into byte type.)
Note that this is why I personally think the binary API variants *should* exist on Windows, just with the sense of the system encoding flipped around.
That is, on *nix:
- underlying OS API uses bytes
- binary API just passes values straight through
- Unicode API uses the system encoding to encode Unicode names and values to be passed to the OS API and to decode bytes names and values received from the OS API
While on Windows:
- underlying OS API uses Unicode
- Unicode API just passes values straight through
- binary API uses the system encoding to decode bytes names and values to be passed to the OS API and to encode Unicode names and values received from the OS API
Cheers, Nick.
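
The symmetry in Nick's two lists can be sketched as a pair of conversion helpers parameterised by the OS's native type. The names `to_native` and `from_native` are hypothetical, and the `encoding` argument stands in for whatever encoding the platform wrapper would use; nothing here is an actual `os` module API.

```python
def to_native(value, native_type, encoding="utf-8"):
    """Convert a str or bytes value to the type the OS API expects."""
    if isinstance(value, native_type):
        return value  # matching API: pass straight through
    if native_type is bytes:
        return value.encode(encoding)  # Unicode API on a bytes-based OS
    return value.decode(encoding)  # binary API on a Unicode-based OS


def from_native(value, wanted_type, encoding="utf-8"):
    """Convert a value received from the OS to the type the caller asked for."""
    if isinstance(value, wanted_type):
        return value  # matching API: pass straight through
    if wanted_type is bytes:
        return value.encode(encoding)  # binary API on a Unicode-based OS
    return value.decode(encoding)  # Unicode API on a bytes-based OS
```

On a bytes-based system `native_type` would be `bytes`, and on Windows it would be `str`. Only the non-native API ever goes through the codec, which is where conversion errors can arise.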

* Nick Coghlan wrote:
Toshio Kuratomi wrote:
Note 2: If there isn't a parallel API on all platforms, for instance, Guido's proposal to not have os.environb on Windows, then you'll still have to have a platform specific check. (Likely you should try to access os.environb in this instance and if it doesn't exist, use os.environ instead... and remember that you need to either change os.environ's data into str type or change os.environb's data into byte type.)
Note that this is why I personally think the binary API variants *should* exist on Windows, just with the sense of the system encoding flipped around.
That is, on *nix:
- underlying OS API uses bytes
- binary API just passes values straight through
- Unicode API uses the system encoding to encode Unicode names and
values to be passed to the OS API and to decode bytes names and values received from the OS API
While on Windows:
- underlying OS API uses Unicode
- Unicode API just passes values straight through
- binary API uses the system encoding to decode bytes names and values
to be passed to the OS API and to encode Unicode names and values received from the OS API
Now that is somewhat strange. That way you'll have two unreliable APIs and need to switch depending on the platform again.
nd

André Malo wrote:
While on Windows:
- underlying OS API uses Unicode
- Unicode API just passes values straight through
- binary API uses the system encoding to decode bytes names and values
to be passed to the OS API and to encode Unicode names and values received from the OS API
Now that is somewhat strange. That way you'll have two unreliable APIs and need to switch depending on the platform again.
Sorry, system encoding was probably a poor choice of words there, since that generally means mbcs when talking about Windows (which would indeed be a very poor choice of encoding).
For binary wrappers around the Windows Unicode APIs, I was thinking specifically of using UTF-8, since that should be able to encode anything the Unicode APIs can handle.
Cheers, Nick.

On Sat, Dec 6, 2008 at 6:51 PM, Nick Coghlan ncoghlan@gmail.com wrote:
André Malo wrote:
While on Windows:
- underlying OS API uses Unicode
- Unicode API just passes values straight through
- binary API uses the system encoding to decode bytes names and values
to be passed to the OS API and to encode Unicode names and values received from the OS API
Now that is somewhat strange. That way you'll have two unreliable APIs and need to switch depending on the platform again.
Sorry, system encoding was probably a poor choice of words there, since that generally means mbcs when talking about Windows (which would indeed be a very poor choice of encoding).
For binary wrappers around the Windows Unicode APIs, I was thinking specifically of using UTF-8, since that should be able to encode anything the Unicode APIs can handle.
If the Unicode APIs only have correct unicode, sure. If not you'll get errors translating to UTF-8 (and the byte APIs are supposed to pass bad names through unaltered.) Kinda ironic, no?

If the Unicode APIs only have correct unicode, sure. If not you'll get errors translating to UTF-8 (and the byte APIs are supposed to pass bad names through unaltered.) Kinda ironic, no?
As far as I can see all Python Unicode strings can be encoded to UTF-8, even things like lone surrogates because Python doesn't care about them. So both the Unicode API and the binary API would be fail-safe on Windows.
- Hagen
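
A small check of the lone-surrogate point. One caveat for readers trying this today: in current Python 3 releases the strict UTF-8 codec rejects lone surrogates, so the behavior Hagen describes requires the `surrogatepass` error handler. The variable names below are illustrative only.

```python
lone = "\ud800"  # a lone high surrogate, not valid in well-formed Unicode

# With the default (strict) error handler, current Python 3 refuses
# to encode a lone surrogate to UTF-8.
try:
    lone.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# The surrogatepass handler encodes the surrogate code point as if it
# were an ordinary character, and decodes it back again.
encoded = lone.encode("utf-8", "surrogatepass")
round_trip = encoded.decode("utf-8", "surrogatepass")
```

With `surrogatepass` on both sides the round trip is lossless, which is the property a UTF-8 based binary wrapper around the Windows Unicode APIs would need.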

Nick Coghlan wrote:
For binary wrappers around the Windows Unicode APIs, I was thinking specifically of using UTF-8, since that should be able to encode anything the Unicode APIs can handle.
Why shouldn't the binary interface just expose the raw utf16 as bytes?
participants (8)
- Adam Olsen
- André Malo
- Bugbee, Larry
- Greg Ewing
- Guido van Rossum
- Hagen Fürstenau
- Nick Coghlan
- Toshio Kuratomi