[Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
"Martin v. Löwis"
martin at v.loewis.de
Sat Apr 25 14:07:44 CEST 2009
Cameron Simpson wrote:
> On 22Apr2009 08:50, Martin v. Löwis <martin at v.loewis.de> wrote:
> | File names, environment variables, and command line arguments are
> | defined as being character data in POSIX;
> Specific citation please? I'd like to check the specifics of this.
For example, on environment variables:
# For values to be portable across XSI-conformant systems, the value
# must be composed of characters from the portable character set (except
# NUL and as indicated below).
# Environment variable names used by the utilities in the XCU
# specification consist solely of upper-case letters, digits and the "_"
# (underscore) from the characters defined in Portable Character Set.
# Other characters may be permitted by an implementation;
Or, on command line arguments:
# The arguments represented by arg0, ... are pointers to null-terminated
# character strings
where a character string is "A contiguous sequence of characters
terminated by and including the first null byte.", and a character is
# A sequence of one or more bytes representing a single graphic symbol
# or control code. This term corresponds to the ISO C standard term
# multibyte character (multi-byte character), where a single-byte
# character is a special case of a multi-byte character. Unlike the
# usage in the ISO C standard, character here has no necessary
# relationship with storage space, and byte is used when storage space
# is discussed.
> So you're proposing that all POSIX OS interfaces (which use byte strings)
> interpret those byte strings into Python3 str objects, with a codec
> that will accept arbitrary byte sequences losslessly and is totally
> reversible, yes?
> And, I hope, that the os.* interfaces silently use it by default.
> | Applications that need to process the original byte
> | strings can obtain them by encoding the character strings with the
> | file system encoding, passing "python-escape" as the error handler
> | name.
> This last sentence kills the idea for me, unless I'm missing something.
> Which I may be, of course.
> POSIX filesystems _do_not_ have a file system encoding.
Why is that a problem for the PEP?
> If I'm writing a general purpose UNIX tool like chmod or find, I expect
> it to work reliably on _any_ UNIX pathname. It must be totally encoding
> blind. If I speak to the os.* interface to open a file, I expect to hand
> it bytes and have it behave.
See the other messages. If you want to do that, you can continue to do so.
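
As an illustration of the bytes route, here is a minimal sketch. It assumes a Python 3 release in which the os functions accept bytes paths and os.fsencode is available (3.2 or later, as far as I know). The helper name list_raw and the file name are made up for the example.

```python
import os
import tempfile


def list_raw(directory: bytes) -> list:
    """Return the directory entries as bytes, with no decoding applied.

    Hypothetical helper: passing a bytes path to os.listdir makes it
    return bytes names, so the tool never has to pick an encoding.
    """
    return sorted(os.listdir(directory))


with tempfile.TemporaryDirectory() as tmp:
    # tempfile hands back a str path; convert it once to bytes.
    raw_dir = os.fsencode(tmp)

    # Create a file whose name is passed to the OS as bytes.
    name = b"plain-name.txt"
    with open(os.path.join(raw_dir, name), "wb") as f:
        f.write(b"data")

    entries = list_raw(raw_dir)
```

The name used here is plain ASCII only so that the sketch runs on any file system; on a typical POSIX file system the same calls should work for arbitrary byte sequences other than NUL and "/".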
> I'm very much in favour of being able to work in strings for most
> purposes, but if I use the os.* interfaces on a UNIX system it is
> necessary to be _able_ to work in bytes, because UNIX file pathnames
> are bytes.
Please re-read the PEP. It provides a way of being able to access any
POSIX file name correctly, and still pass strings.
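
A minimal sketch of the round trip the PEP describes. It assumes the error handler under the name it carries in released Python 3 versions, "surrogateescape" (the quoted text above calls it "python-escape"), and it fixes the encoding to UTF-8 instead of asking for the file system encoding so that the result does not depend on the locale. The function names are hypothetical.

```python
def name_to_str(raw: bytes, encoding: str = "utf-8") -> str:
    """Decode a raw file name; undecodable bytes become lone surrogates."""
    return raw.decode(encoding, errors="surrogateescape")


def name_to_bytes(name: str, encoding: str = "utf-8") -> bytes:
    """Recover the original bytes from a string produced by name_to_str."""
    return name.encode(encoding, errors="surrogateescape")


# A Latin-1 encoded name: the byte 0xE9 is not valid UTF-8 in this position.
raw = b"caf\xe9.txt"

as_str = name_to_str(raw)        # the bad byte maps to U+DCE9
restored = name_to_bytes(as_str)  # exactly the bytes we started with
```

The string can be passed around, sliced and joined like any other str; only an attempt to encode it strictly (for example, to print it to a UTF-8 terminal) will fail on the escaped character.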
> If there isn't a byte-safe os.* facility in Python3, it will simply be
> unsuitable for writing low level UNIX tools.
Why is that? The mechanism in the PEP is precisely defined to allow
writing low level UNIX tools.
> Finally, I have a small python program whose whole purpose in life
> is to transcode UNIX filenames before transfer to a MacOSX HFS
> directory, because of HFS's enforced particular encoding. What approach
> should a Python app take to transcode UNIX pathnames under your scheme?
Compute the corresponding character strings, and use them.
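
A sketch of what that could look like, under assumptions the thread does not state: that the source names are known to be Latin-1, and that the target wants UTF-8 with characters in decomposed (NFD-style) form. Both are placeholders for whatever the real source and target require, and transcode_name is a hypothetical helper.

```python
import unicodedata


def transcode_name(raw: bytes, source_encoding: str = "latin-1") -> str:
    """Turn a raw source file name into the character string to use.

    Assumption: the caller knows the source encoding. The result is
    normalized to NFD as a stand-in for the target's preferred form.
    """
    text = raw.decode(source_encoding)
    return unicodedata.normalize("NFD", text)


# 0xE9 is "e with acute" in Latin-1; NFD splits it into "e" + U+0301.
new_name = transcode_name(b"caf\xe9.txt")
target_bytes = new_name.encode("utf-8")
```

Once the character string is computed, it can be handed to the os functions as a str and the interpreter takes care of encoding it for the target file system.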