[RELEASED] Python 3.1 final

Mon Jun 29 12:17:17 EDT 2009

On Mon, 29 Jun 2009 13:05:51 +0100, Paul Moore wrote:

>> As for a bytes version of sys.argv and os.environ, you're welcome to
>> propose a patch (this would be a separate issue on the aforementioned
>> issue tracker).
> 
> But please be aware that such a proposal would have to consider:
> 
> 1. That on Windows, the native form is the character version, and the
> bytes version would have to address all the same sorts of encoding
> issues that the OP is complaining about in the character versions. [1]

A bytes version doesn't make sense on Windows (at least, not on the
NT-based versions, and the DOS-based branch isn't worth bothering about,
IMHO).

Also, Windows *needs* to deal with characters, because filenames,
environment variables, etc. are case-insensitive, and case-insensitive
comparison is only well defined on characters, not on raw bytes.
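
To illustrate the case-insensitivity point, here is a minimal sketch. It
uses Python's own str.lower() and bytes.lower() purely as stand-ins;
Windows' actual case-folding rules are not modelled here, and the
filename is made up.

```python
# Hypothetical filename "É.TXT", stored as ISO-8859-1 bytes.
name_latin1 = "\u00c9.TXT".encode("iso-8859-1")  # b'\xc9.TXT'

# bytes.lower() only maps ASCII A-Z, so the 0xC9 byte is left alone.
bytes_lowered = name_latin1.lower()

# Lowercasing as characters needs the encoding to be known first.
chars_lowered = (
    name_latin1.decode("iso-8859-1").lower().encode("iso-8859-1")
)

print(bytes_lowered)  # the non-ASCII byte is unchanged
print(chars_lowered)  # the non-ASCII byte is lowercased too
```

The two results differ in the first byte, which is why a byte-oriented
comparison can't be case-insensitive without knowing the encoding.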

> 2. That the proposal address the question of how to write portable,
> robust, code (given that choosing argv vs argv_bytes based on
> sys.platform is unlikely to count as a good option...)

There is a tension here between robustness and portability. In my
situation, robustness means getting the "unadulterated" data. I can always
adulterate it myself if I need to.

> 3. Why defining your own argv_bytes as argv_bytes =
> [a.encode("iso-8859-1", "surrogateescape") for a in sys.argv] is
> insufficient (excluding issues with bugs, which will be fixed
> regardless) for the occasional cases where it's needed.

Other than the bug, it appears to be sufficient. I don't need to support
a locale where nl_langinfo(CODESET) is ISO-2022. I *do* need to support
lossless round-trip of ISO-2022 filenames, possibly stored in argv and
maybe even in environ, but that's a different matter; the code only
really needs to run with LANG=C.
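
As a rough sketch of why the quoted recipe is enough under LANG=C but
not in general: the example below assumes the argument was decoded as
ASCII with the surrogateescape handler. The helper name argv_to_bytes
and the sample byte strings are made up for illustration.

```python
def argv_to_bytes(args, encoding):
    """Hypothetical helper: undo a surrogateescape decode of argv-style strings."""
    return [a.encode(encoding, "surrogateescape") for a in args]

# Hypothetical raw argument: a Latin-1 byte followed by an ESC sequence
# of the kind ISO-2022 encodings use.
raw = b"caf\xe9-\x1b$B"

# Assumed decoding step under LANG=C: ASCII with surrogateescape.
as_str = raw.decode("ascii", "surrogateescape")

# Encoding with the same codec restores the bytes exactly.
round_trip = argv_to_bytes([as_str], "ascii")

# The quoted iso-8859-1 recipe gives the same result here, because the
# string holds only ASCII characters and escaped surrogates.
via_latin1 = argv_to_bytes([as_str], "iso-8859-1")

# If the decode had used UTF-8 instead, a validly decoded character would
# be re-encoded as one Latin-1 byte rather than the original two bytes.
utf8_raw = b"caf\xc3\xa9"
utf8_str = utf8_raw.decode("utf-8", "surrogateescape")
mismatch = argv_to_bytes([utf8_str], "iso-8859-1")

print(round_trip == [raw], via_latin1 == [raw], mismatch == [utf8_raw])
```

So the recipe round-trips only when the encode step is compatible with
whatever codec did the decoding, which holds for the LANG=C case.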

> [1] And my understanding, from the PEP, is that even on POSIX, the
> argv and environ data is intended to be character data, even though
> the native C APIs expose a byte-oriented interface. So conceptually,
> character format is "correct" on POSIX as well... (But I don't write
> code for POSIX systems, so I'll leave it to the POSIX users to debate
> this point further).

Even if it's "intended" to be character data, it isn't *required* to be.
In particular, it's not required to be in the locale's encoding.

A common example of what I need to handle is:

	find /www ... -print0 | xargs -0 myscript

where the filenames can be in a wide variety of different encodings
(sometimes even within a single directory).
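
For a pipeline like that, a script could take its arguments back to
bytes before touching the filesystem. This is only a sketch: it relies
on os.fsencode(), which was added in Python 3.2 (so it is not available
in 3.1), and it assumes a POSIX system where sys.argv was decoded with
the filesystem encoding and surrogateescape. The name argv_as_bytes is
made up.

```python
import os
import sys

def argv_as_bytes(argv):
    """Hypothetical helper: return the arguments after argv[0] as bytes."""
    # os.fsencode() encodes with the filesystem encoding; on POSIX it
    # uses surrogateescape, so undecodable bytes come back unchanged.
    return [os.fsencode(arg) for arg in argv[1:]]

if __name__ == "__main__":
    for name in argv_as_bytes(sys.argv):
        # repr() keeps odd bytes visible instead of guessing an encoding.
        print(repr(name))
```

The resulting bytes objects can be passed directly to functions such as
open() or os.stat() on POSIX, without ever deciding on an encoding.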