[Python-3000] Unicode and OS strings

Marcin 'Qrczak' Kowalczyk qrczak at knm.org.pl
Sat Sep 22 10:18:34 CEST 2007


On Fri, 21-09-2007 at 10:00 -0400, Jim Jewett wrote:

> Is it reasonable to expose sys.argv.buffer?
> (Since this would be bytes rather than text, I assume this would be a
> single array, rather than a list of already separated arguments.)

On Unix the arguments are already separated at the OS level. It is
usually the shell that splits them, if they were typed with spaces
between them (and it also interprets quotes and other syntax). The
execve() system call receives them as separate strings, and the program
receives them the same way.
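
As a rough illustration of that division of labour, Python's standard
shlex module approximates POSIX shell word splitting. This is only a
sketch: the command string is made up, and a real shell also performs
expansions (variables, globs) that shlex does not.

```python
import shlex

# A hypothetical command line, as a user might type it at a shell prompt.
typed = 'grep "hello world" notes.txt'

# shlex.split() approximates the word splitting and quote removal that a
# POSIX shell performs before calling execve(). The quotes are consumed
# here; the program only ever sees the separated arguments.
argv = shlex.split(typed)
print(argv)
```

The resulting list is the shape in which execve() and then the program
receive the arguments: three separate strings, with the space inside the
quoted argument preserved.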

Each Unix argument is a null-terminated array of bytes, i.e. only 0
bytes are disallowed, and the OS does not mangle the contents.

Of course people typically interpret these bytes as characters in a
guessed encoding, and the encoding is always a superset of ASCII.
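
A minimal sketch of what this means for Python, assuming the arguments
are available as one buffer of NUL-terminated byte strings (the layout
Linux exposes in /proc/<pid>/cmdline). The helper names are hypothetical,
and UTF-8 is only a guessed encoding; the "surrogateescape" error handler
is one way to keep undecodable bytes recoverable.

```python
from typing import List


def split_argv_block(block: bytes) -> List[bytes]:
    """Split a buffer of NUL-terminated arguments into byte strings."""
    if not block:
        return []
    if not block.endswith(b"\0"):
        raise ValueError("argument block must end with a NUL byte")
    # Drop the final terminator, then split on the remaining ones.
    # Empty arguments are legal and are preserved.
    return block[:-1].split(b"\0")


def arg_to_text(arg: bytes, encoding: str = "utf-8") -> str:
    """Decode with a guessed encoding, keeping undecodable bytes recoverable."""
    return arg.decode(encoding, "surrogateescape")


def text_to_arg(text: str, encoding: str = "utf-8") -> bytes:
    """Inverse of arg_to_text(): restores the original bytes exactly."""
    return text.encode(encoding, "surrogateescape")


# 0xE9 is "é" in Latin-1 but is not valid UTF-8 on its own.
block = b"prog\0caf\xe9\0a b\0\0"
args = split_argv_block(block)
texts = [arg_to_text(a) for a in args]
print([ascii(t) for t in texts])  # ascii() avoids printing lone surrogates
```

The point of the round trip is that a wrong guess about the encoding
does not destroy the argument: text_to_arg(arg_to_text(a)) gives back the
same bytes the OS delivered.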

On Windows the arguments are not separated: the whole command line is a
single string, with spaces and any quotes left for the program to split
into separate arguments if it chooses (unless something has changed in
the last 10 years). I believe it's an array of 16-bit code units,
typically meant to be interpreted as UTF-16, but with no check that it
is a well-formed UTF-16 sequence. I suppose that any 16-bit code unit
except 0 is allowed, but I'm not sure.
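
To make the "not necessarily well-formed UTF-16" point concrete, here is
a small sketch that works on a list of 16-bit code units. The function
names are hypothetical, and the sample values are invented rather than
taken from a real Windows command line. It checks surrogate pairing by
hand and uses Python's "surrogatepass" error handler as one possible way
to decode such data without loss.

```python
import struct
from typing import Sequence


def is_well_formed_utf16(units: Sequence[int]) -> bool:
    """Return True if every surrogate code unit is part of a valid pair."""
    i, n = 0, len(units)
    while i < n:
        u = units[i]
        if 0xD800 <= u <= 0xDBFF:
            # A high surrogate must be followed by a low surrogate.
            if i + 1 < n and 0xDC00 <= units[i + 1] <= 0xDFFF:
                i += 2
                continue
            return False
        if 0xDC00 <= u <= 0xDFFF:
            # A low surrogate with no preceding high surrogate.
            return False
        i += 1
    return True


def decode_units_lossless(units: Sequence[int]) -> str:
    """Decode 16-bit code units, letting lone surrogates pass through."""
    raw = struct.pack("<%dH" % len(units), *units)
    return raw.decode("utf-16-le", "surrogatepass")


paired = [0x61, 0xD83D, 0xDE00]  # "a" followed by a valid surrogate pair
lone = [0x61, 0xD800, 0x62]      # "a", an unpaired high surrogate, "b"

print(is_well_formed_utf16(paired), is_well_formed_utf16(lone))
print(ascii(decode_units_lossless(lone)))
```

A strict UTF-16 decoder would reject the second sequence, so a program
that wants to see every possible command line has to either work on the
code units directly or use a lossless decoding of this kind.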

-- 
   __("<         Marcin Kowalczyk
   \__/       qrczak at knm.org.pl
    ^^     http://qrnik.knm.org.pl/~qrczak/


