sys.argv as a list of bytes

Thu Jan 19 00:05:32 EST 2012

On Wed, 18 Jan 2012 09:05:42 +0100, Peter Otten wrote:

>> Python has a special errorhandler, "surrogateescape" to deal with
>> bytes that are not valid UTF-8.

On Wed, 18 Jan 2012 11:16:27 +0100, Olive wrote:

> But is it safe even if the locale is not UTF-8?

Yes. Peter's reference to UTF-8 is misleading. The surrogateescape
mechanism is used to represent anything which cannot be decoded according
to the locale's encoding. E.g. in the "C" locale, any byte >= 128 will be
decoded to a lone surrogate (byte 0xNN becomes U+DCNN).
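
A minimal sketch of that behaviour, assuming the "ascii" codec as a
stand-in for the "C" locale's encoding (the real decoding of sys.argv is
done at startup, not by this call):

```python
# Bytes as they might arrive on the command line: "caf" plus 0xE9.
raw = b"caf\xe9"

# "ascii" stands in for the "C" locale's encoding (an assumption).
# The undecodable byte 0xE9 becomes the lone surrogate U+DCE9.
text = raw.decode("ascii", "surrogateescape")
print(ascii(text))  # 'caf\udce9'

# Encoding with the same codec and error handler restores the bytes.
restored = text.encode("ascii", "surrogateescape")
print(restored == raw)  # True
```

The round trip only works because the same codec is used in both
directions.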

On Wed, 18 Jan 2012 09:05:42 +0100, Peter Otten wrote:

> It is still possible to get the original bytes:
> 
> python3 -c'import sys; print(sys.argv[1].encode("utf-8", "surrogateescape"))'

Except it isn't, because the Python devs can't make up their minds which
encoding sys.argv uses, or even document it.

AFAICT:

On Windows, there never was a bytes version of sys.argv to start with
(the OS supplies the command line using wide strings).

On Mac OS X, the command line is always decoded using UTF-8.

On Unix, the command line is decoded using mbstowcs(). There isn't a
Python function to query which encoding this uses (if there even _is_ a
corresponding Python encoding).

Except on Windows (where OS APIs take wide string parameters), if a
library function needs to pass a Unicode string to an API function, it
will normally encode it using sys.getfilesystemencoding(), which isn't
guaranteed to be the encoding which was used to fabricate sys.argv in
the first place.
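
To illustrate why such a mismatch matters, here is a rough sketch. The
two codecs (UTF-8 for the initial decode, Latin-1 for the later
re-encode) are picked purely for illustration; which pair applies on a
given system is exactly what can't be relied upon.

```python
import sys

# The encoding library functions would normally use (system-dependent).
print(sys.getfilesystemencoding())

raw = b"\xc3\xa9"  # the UTF-8 encoding of U+00E9

# Suppose the argument was decoded as UTF-8 when sys.argv was built ...
arg = raw.decode("utf-8", "surrogateescape")

# ... but is later re-encoded with a different codec (Latin-1 here).
wrong = arg.encode("latin-1", "surrogateescape")
right = arg.encode("utf-8", "surrogateescape")

print(wrong)         # b'\xe9'
print(right == raw)  # True
print(wrong == raw)  # False
```

No exception is raised in the mismatched case; the bytes are silently
different from what the user typed.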

In short: if you need to write "system" scripts on Unix, and you need them
to work reliably, you need to stick with Python 2.x.