[Python-3000] Unicode and OS strings
Gregory P. Smith
greg at krypto.org
Sat Sep 15 22:36:49 CEST 2007
On 9/14/07, Greg Ewing <greg.ewing at canterbury.ac.nz> wrote:
> Hagen Fürstenau wrote:
> > sys.argv could be of type bytes and sys.arguments (or whatever) could be
> > a function taking an encoding parameter (which defaults to UTF-8) and
> > returning strings.
> > Of course that's backwards incompatible and I'm not sure if it's too
> > late for something like this now.
> It would be pretty disruptive to ask everyone to change
> their habit of thinking of sys.argv as a list of strings.
Would it? We're already asking them to convert between bytes and
unicode strings anywhere else I/O is done. I see the command line and
environment as merely more forms of input. The only way to parse them
into data structures automatically is to keep them as bytes. They are
C concepts and can't imply an encoding. As it is, it's entirely
possible to have -multiple- encodings on a command line at once as
well as in environment variables. They're all context sensitive.
This isn't going to change.
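
To make the mixed-encoding point concrete, here is a rough sketch. The
argument list, option names and the decode_option helper are all made up
for illustration; the only assumption is that argv arrives as raw bytes
and the program knows from context which encoding each argument uses.

```python
# Hypothetical raw argv, as the C runtime would hand it over: plain bytes.
# The two options deliberately use different encodings.
raw_argv = [
    b"prog",
    b"--name=" + "Fürstenau".encode("utf-8"),
    b"--file=" + "café.txt".encode("latin-1"),
]


def decode_option(arg, encoding):
    """Split b'--key=value' and decode the value with a caller-chosen encoding."""
    key, _, value = arg.partition(b"=")
    return key.decode("ascii"), value.decode(encoding)


# Only the program knows which encoding applies to which argument.
name = decode_option(raw_argv[1], "utf-8")
path = decode_option(raw_argv[2], "latin-1")

# A single blanket encoding cannot work for the whole command line.
try:
    decode_option(raw_argv[2], "utf-8")
    blanket_utf8_ok = True
except UnicodeDecodeError:
    blanket_utf8_ok = False
```

Decoding everything up front with one encoding either fails, as in the
last step, or silently produces the wrong text.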
> I would suggest doing it the other way around -- have
> sys.argv be an object that automatically converts to
> unicode on access, and something else, such as
> sys.argbytes, for getting the raw bytes if that fails.
I'd leave sys.argv as bytes and make sys.args/arguments/argstrs be some
best-effort parsing. argv is the C/C++ name for bytes; let's not
confuse people. Similarly for the environment: the os.environ dict
should have bytes objects as keys and values (or perhaps a bytes
subclass that refuses null bytes). The os.getenv and os.putenv
functions should take care of any best-effort decoding/encoding, and
os.getenv should take an optional encoding= parameter to specify the
encoding explicitly.
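
Something along these lines is what I have in mind. This is only a sketch:
environ_bytes, getenv and putenv below are stand-ins written for this
message, not the real os module, and the choice of UTF-8 with replacement
characters as the "best effort" default is my assumption.

```python
# Hypothetical bytes-keyed environment standing in for os.environ.
environ_bytes = {
    b"HOME": b"/home/greg",
    b"NAME": "Fürstenau".encode("latin-1"),
}


def getenv(name, default=None, encoding=None):
    """Look up a variable; decode strictly if an encoding is given,
    otherwise make a best-effort UTF-8 decode (assumed default)."""
    raw = environ_bytes.get(name.encode(encoding or "utf-8"))
    if raw is None:
        return default
    if encoding is not None:
        return raw.decode(encoding)
    return raw.decode("utf-8", errors="replace")


def putenv(name, value, encoding="utf-8"):
    """Encode and store a variable, refusing embedded null bytes."""
    key, val = name.encode(encoding), value.encode(encoding)
    if b"\0" in key or b"\0" in val:
        raise ValueError("embedded null byte")
    environ_bytes[key] = val


home = getenv("HOME")
explicit = getenv("NAME", encoding="latin-1")
best_effort = getenv("NAME")  # undecodable byte becomes U+FFFD
```

The raw bytes stay available in the dict for anyone who needs them, and
the lossy decode only happens when the caller does not say what encoding
to expect.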