[Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

Sat Apr 25 14:07:44 CEST 2009

Cameron Simpson wrote:
> On 22Apr2009 08:50, Martin v. Löwis <martin at v.loewis.de> wrote:
> | File names, environment variables, and command line arguments are
> | defined as being character data in POSIX;
> 
> Specific citation please? I'd like to check the specifics of this.

For example, on environment variables:

http://opengroup.org/onlinepubs/007908799/xbd/envvar.html

# For values to be portable across XSI-conformant systems, the value
# must be composed of characters from the portable character set (except
# NUL and as indicated below).

# Environment variable names used by the utilities in the XCU
# specification consist solely of upper-case letters, digits and the "_"
# (underscore) from the characters defined in Portable Character Set.
# Other characters may be permitted by an implementation;

Or, on command line arguments:

http://opengroup.org/onlinepubs/007908799/xsh/execve.html

# The arguments represented by arg0, ... are pointers to null-terminated
# character strings

where a character string is "A contiguous sequence of characters
terminated by and including the first null byte.", and a character
is

# A sequence of one or more bytes representing a single graphic symbol
# or control code. This term corresponds to the ISO C standard term
# multibyte character (multi-byte character), where a single-byte
# character is a special case of a multi-byte character. Unlike the
# usage in the ISO C standard, character here has no necessary
# relationship with storage space, and byte is used when storage space
# is discussed.

> So you're proposing that all POSIX OS interfaces (which use byte strings)
> interpret those byte strings into Python3 str objects, with a codec
> that will accept arbitrary byte sequences losslessly and is totally
> reversible, yes?

Correct.

> And, I hope, that the os.* interfaces silently use it by default.

Correct.
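
A minimal sketch of such a lossless round trip follows. It assumes the
handler name "surrogateescape", which is what released Python (3.1 and
later) calls the handler referred to as "python-escape" in this thread;
the file name is made up for illustration.

```python
# Bytes that are valid Latin-1 but not valid UTF-8 (hypothetical file name).
raw = b"caf\xe9.txt"

# Decoding maps each undecodable byte 0xNN to the lone surrogate U+DCNN
# instead of raising UnicodeDecodeError.
name = raw.decode("utf-8", errors="surrogateescape")

# Encoding with the same handler turns the surrogate back into the byte.
restored = name.encode("utf-8", errors="surrogateescape")

print(ascii(name))
print(restored == raw)
```

The decodable part of the name stays ordinary text; only the offending
byte is carried through as an escape.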

> | Applications that need to process the original byte
> | strings can obtain them by encoding the character strings with the
> | file system encoding, passing "python-escape" as the error handler
> | name.
> 
> -1
> 
> This last sentence kills the idea for me, unless I'm missing something.
> Which I may be, of course.
> 
> POSIX filesystems _do_not_ have a file system encoding.

Why is that a problem for the PEP?

> If I'm writing a general purpose UNIX tool like chmod or find, I expect
> it to work reliably on _any_ UNIX pathname. It must be totally encoding
> blind. If I speak to the os.* interface to open a file, I expect to hand
> it bytes and have it behave.

See the other messages in this thread. If you want to pass bytes to the
os.* interfaces, you can continue to do so.

> I'm very much in favour of being able to work in strings for most
> purposes, but if I use the os.* interfaces on a UNIX system it is
> necessary to be _able_ to work in bytes, because UNIX file pathnames
> are bytes.

Please re-read the PEP. It provides a way of being able to access any
POSIX file name correctly, and still pass strings.
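
As a rough illustration, the helpers below convert between a str file
name and its original bytes using the file system encoding. The helper
names are hypothetical, and the handler is spelled "surrogateescape" as
in released Python 3.1+ (the thread calls it "python-escape"). On POSIX
systems, os.fsencode() and os.fsdecode() (Python 3.2+) perform the same
conversion.

```python
import sys


def as_string(raw, encoding=None):
    """Decode file-name bytes to str, escaping undecodable bytes (hypothetical helper)."""
    if encoding is None:
        encoding = sys.getfilesystemencoding()
    return raw.decode(encoding, "surrogateescape")


def original_bytes(name, encoding=None):
    """Recover the exact bytes behind a str file name (hypothetical helper)."""
    if encoding is None:
        encoding = sys.getfilesystemencoding()
    return name.encode(encoding, "surrogateescape")


# A made-up name containing a byte that is invalid in UTF-8.
raw_name = b"report-\xff.dat"
text_name = as_string(raw_name, "utf-8")
print(original_bytes(text_name, "utf-8") == raw_name)
```

A tool written this way can work with strings throughout and still hand
the operating system exactly the bytes it was given.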

> If there isn't a byte-safe os.* facility in Python3, it will simply be
> unsuitable for writing low level UNIX tools.

Why is that? The mechanism in the PEP is precisely defined to allow
writing low level UNIX tools.

> Finally, I have a small python program whose whole purpose in life
> is to transcode UNIX filenames before transfer to a MacOSX HFS
> directory, because of HFS's enforced particular encoding. What approach
> should a Python app take to transcode UNIX pathnames under your scheme?

Compute the corresponding character strings, and use them.
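
One way this could look is sketched below, under several assumptions:
the function name is hypothetical, the source names are taken to be
Latin-1, NFD normalization stands in for the decomposed form HFS
expects, and the handler is spelled "surrogateescape" as in released
Python 3.1+ (called "python-escape" in this thread).

```python
import unicodedata


def to_hfs_name(name, fs_encoding, actual_encoding):
    """Turn a str file name from os.* into the text it really represents.

    name            -- str as returned by the os.* interfaces
    fs_encoding     -- encoding Python used to decode the name
    actual_encoding -- encoding the bytes are known to be in (an assumption
                       the application has to supply)
    """
    # Undo Python's decoding to get the original bytes back.
    raw = name.encode(fs_encoding, "surrogateescape")
    # Decode with the encoding the application knows to be correct.
    text = raw.decode(actual_encoding)
    # Decompose accented characters (approximation of the HFS form).
    return unicodedata.normalize("NFD", text)


# Made-up example: Latin-1 bytes seen through a UTF-8 file system encoding.
listed = b"caf\xe9".decode("utf-8", "surrogateescape")
target = to_hfs_name(listed, "utf-8", "iso-8859-1")
print(ascii(target))
```

The resulting string can then be passed to the os.* interfaces when
creating the file on the HFS volume.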

Regards,
Martin