PEP 383: Non-decodable Bytes in System Character Interfaces

I'm proposing the following PEP for inclusion into Python 3.1. Please comment. Regards, Martin

PEP: 383
Title: Non-decodable Bytes in System Character Interfaces
Version: $Revision: 71793 $
Last-Modified: $Date: 2009-04-22 08:42:06 +0200 (Mi, 22. Apr 2009) $
Author: Martin v. Löwis <martin@v.loewis.de>
Status: Draft
Type: Standards Track
Content-Type: text/x-rst
Created: 22-Apr-2009
Python-Version: 3.1
Post-History:

Abstract
========

File names, environment variables, and command line arguments are defined as being character data in POSIX; the C APIs, however, allow passing arbitrary bytes - whether these conform to a certain encoding or not. This PEP proposes a means of dealing with such irregularities by embedding the bytes in character strings in such a way that allows recreation of the original byte string.

Rationale
=========

The C char type is a data type that is commonly used to represent both character data and bytes. Certain POSIX interfaces are specified and widely understood as operating on character data; however, the system call interfaces make no assumption about the encoding of these data, and pass them on as-is. With Python 3, character strings use a Unicode-based internal representation, making it difficult to ignore the encoding of byte strings in the same way that the C interfaces can ignore the encoding.

On the other hand, Microsoft Windows NT has correct the original design limitation of Unix, and made it explicit in its system interfaces that these data (file names, environment variables, command line arguments) are indeed character data, by providing a Unicode-based API (keeping a C-char-based one for backwards compatibility).

For Python 3, one proposed solution is to provide two sets of APIs: a byte-oriented one, and a character-oriented one, where the character-oriented one would be limited to not being able to represent all data accurately. Unfortunately, for Windows, the situation would be exactly the opposite: the byte-oriented interface cannot represent all data; only the character-oriented API can. As a consequence, libraries and applications that want to support all user data in a cross-platform manner have to accept a mish-mash of bytes and characters exactly in the way that caused endless troubles for Python 2.x. With this PEP, a uniform treatment of these data as characters becomes possible. The uniformity is achieved by using specific encoding algorithms, meaning that the data can be converted back to bytes on POSIX systems only if the same encoding is used.

Specification
=============

On Windows, Python uses the wide character APIs to access character-oriented APIs, allowing direct conversion of the environmental data to Python str objects.

On POSIX systems, Python currently applies the locale's encoding to convert the byte data to Unicode. If the locale's encoding is UTF-8, it can represent the full set of Unicode characters; otherwise, only a subset is representable. In the latter case, using private-use characters to represent these bytes would be an option. For UTF-8, doing so would create an ambiguity, as the private-use characters may regularly occur in the input also.

To convert non-decodable bytes, a new error handler "python-escape" is introduced, which decodes non-decodable bytes into a private-use character U+F01xx, which is believed not to conflict with private-use characters that currently exist in Python codecs.

The error handler interface is extended to allow the encode error handler to return byte strings immediately, in addition to returning Unicode strings which then get encoded again.

If the locale's encoding is UTF-8, the file system encoding is set to a new encoding "utf-8b". The UTF-8b codec decodes non-decodable bytes (which must be >= 0x80) into half surrogate codes U+DC80..U+DCFF.

Discussion
==========

While providing a uniform API to non-decodable bytes, this interface has the limitation that the chosen representation only "works" if the data get converted back to bytes with the python-escape error handler also. Encoding the data with the locale's encoding and the (default) strict error handler will raise an exception; encoding them with UTF-8 will produce nonsensical data.

For most applications, we assume that they eventually pass data received from a system interface back into the same system interfaces. For example, and application invoking os.listdir() will likely pass the result strings back into APIs like os.stat() or open(), which then encode them back into their original byte representation. Applications that need to process the original byte strings can obtain them by encoding the character strings with the file system encoding, passing "python-escape" as the error handler name.

Copyright
=========

This document has been placed in the public domain.
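Sketched in codec terms, the half-surrogate mapping described in the Specification could look roughly like this. The handler name "python-escape-sketch" is invented for illustration only; returning bytes from an encode error handler relies on the interface extension the PEP itself proposes (essentially the mechanism that later shipped in Python 3.1 as "surrogateescape"):

    import codecs

    def python_escape_sketch(exc):
        if isinstance(exc, UnicodeDecodeError):
            # Decode direction: map each undecodable byte (>= 0x80) to a
            # lone half surrogate U+DC80..U+DCFF, as utf-8b would.
            return chr(0xDC00 + exc.object[exc.start]), exc.start + 1
        if isinstance(exc, UnicodeEncodeError):
            # Encode direction: map U+DC80..U+DCFF back to the original
            # byte (this bytes return is the PEP's proposed extension).
            cp = ord(exc.object[exc.start])
            if 0xDC80 <= cp <= 0xDCFF:
                return bytes([cp - 0xDC00]), exc.start + 1
        raise exc

    codecs.register_error("python-escape-sketch", python_escape_sketch)

    # An undecodable byte round-trips unchanged:
    name = b"caf\xe9".decode("utf-8", "python-escape-sketch")  # 'caf\udce9'
    assert name.encode("utf-8", "python-escape-sketch") == b"caf\xe9"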

Martin v. Löwis wrote:
I'm proposing the following PEP for inclusion into Python 3.1. Please comment.
That seems like a much nicer solution than having parallel bytes/Unicode APIs everywhere. When the locale encoding is UTF-8, would UTF-8b also be used for the command line decoding and environment variable encoding/decoding? (the PEP currently only states that the encoding switch will be done for the file system encoding - it is silent regarding the other two system interfaces). Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia ---------------------------------------------------------------

Martin v. Löwis wrote:
"correct" -> "corrected"
Would this mean that real private use characters in the file name would raise an exception? How? The UTF-8 decoder doesn't pass those bytes to any error handler.
Then the error callback for encoding would become specific to the target encoding. Would this mean that the handler checks which encoding is used and behaves like "strict" if it doesn't recognize the encoding?
Is this done by the codec, or the error handler? If it's done by the codec I don't see a reason for the "python-escape" error handler.
I thought the error handler would be used for decoding.
"and" -> "an"
Servus, Walter

"correct" -> "corrected"
Thanks, fixed.
The python-escape codec is only used/meaningful if the env encoding is not UTF-8. For any other encoding, it is assumed that no character actually maps to the private-use characters.
Why would it become specific? It can work the same way for any encoding: take U+F01xx, and generate the byte xx.
utf-8b is a new codec. However, the utf-8b codec is only used if the env encoding would otherwise be utf-8. For utf-8b, the error handler is indeed unnecessary.
It's used in both directions: for decoding, it converts \xXX to U+F01XX. For encoding, U+F01XX will trigger an error, which is then handled by the handler to produce \xXX.
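Sketched the same way, the private-use branch Martin describes could look roughly like this; the handler name is invented, and ASCII merely stands in for an arbitrary non-UTF-8 locale encoding:

    import codecs

    def pua_escape_sketch(exc):
        if isinstance(exc, UnicodeDecodeError):
            # \xXX -> private-use code point U+F01XX
            return chr(0xF0100 + exc.object[exc.start]), exc.start + 1
        if isinstance(exc, UnicodeEncodeError):
            # U+F01XX -> \xXX (relies on the PEP's extension letting
            # encode handlers return bytes)
            cp = ord(exc.object[exc.start])
            if 0xF0100 <= cp <= 0xF01FF:
                return bytes([cp - 0xF0100]), exc.start + 1
        raise exc

    codecs.register_error("pua-escape-sketch", pua_escape_sketch)

    s = b"\xff".decode("ascii", "pua-escape-sketch")   # '\U000f01ff'
    assert s.encode("ascii", "pua-escape-sketch") == b"\xff"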
"and" -> "an"
Thanks, fixed. Regards, Martin

Martin v. Löwis wrote:
Which should be true for any encoding from the pre-unicode era, but not for UTF-16/32 and variants.
If any error callback emits bytes, these byte sequences must be legal in the target encoding, which depends on the target encoding itself. However, for the normal use of this error handler this might be irrelevant, because those filenames that get encoded were constructed in such a way that reencoding them regenerates the original byte sequence.
Wouldn't it make more sense to be consistent in how non-decodable bytes get decoded? I.e. should the utf-8b codec decode those bytes to PUA characters too (and refuse to encode them, so the error handler outputs them)?
But only for non-UTF8 encodings? Servus, Walter

On 2009-04-22 22:06, Walter Dörwald wrote:
Actually it's not even true for the pre-Unicode codecs. It was and is common for Asian companies to use company-specific symbols in private use areas or extended versions of CJK character sets. Microsoft even published an editor that lets Asian users create their own glyphs as needed:

http://msdn.microsoft.com/en-us/library/cc194861.aspx

Here's an overview of some US companies using such extensions (it's no surprise that most of these actually defined their own charsets):

http://scripts.sil.org/cms/SCRIPTs/page.php?site_id=nrsi&item_id=VendorUseOfPUA

SIL even started a registry for the private use areas (PUAs):

http://scripts.sil.org/cms/SCRIPTs/page.php?site_id=nrsi&cat_id=UnicodePUA

This is their current list of assignments:

http://scripts.sil.org/cms/SCRIPTs/page.php?site_id=nrsi&item_id=SILPUAassignments

and here's how to register:

http://scripts.sil.org/cms/SCRIPTs/page.php?site_id=nrsi&cat_id=UnicodePUA#404a261e

-- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Apr 22 2009)

Right. However, these can't appear as environment/file system encodings, because they use null bytes.
No. The whole process started with data having an *invalid* encoding in the source encoding (which, after the roundtrip, is now the target encoding). So the python-escape error handler deliberately produces byte sequences that are invalid in the environment encoding (hence the additional permission of having it produce bytes instead of characters).
Exactly so. The error handler is not of much use outside this specific scenario.
Unfortunately, that won't work. If the original encoding is UTF-8, and uses PUA characters, then, on re-encoding, it's not possible to tell whether to encode as a PUA character, or as an invalid byte. This was my original proposal a year ago, and people immediately suggested that it is not at all acceptable if there is the slightest chance of information loss. Hence the current PEP.
Right. For ease of use, the implementation will specify the error handler regardless, and the recommended use for applications will be to use the error handler regardless. For utf-8b, the error handler will never be invoked, since all input can always be converted. Regards, Martin

MRAB wrote:
I apparently have not expressed it clearly, so please help me improve the text. What I mean is this:

- if the environment encoding (for lack of a better name) is UTF-8, Python stops using the utf-8 codec under this PEP, and switches to the utf-8b codec.

- otherwise (env encoding is not utf-8), undecodable bytes get decoded with the error handler. In this case, U+F01xx will not occur in the byte stream, since no other codec ever produces this PUA character (this is not fully true - UTF-16 may also produce PUA characters, but they can't appear as env encodings).

So the case you are referring to should not happen. Regards, Martin

Martin v. Löwis wrote:
I think what's confusing me is that you talk about mapping non-decodable bytes to U+F01xx, but you also talk about decoding to half surrogate codes U+DC80..U+DCFF. If the bytes are mapped to single half surrogate codes instead of the normal pairs (low+high), then I can see that decoding could never be ambiguous and encoding could produce the original bytes.

Martin v. Löwis wrote:
I find the PEP easier to understand now. In detail I'd say that if a sequence of bytes >= 0x80 is found which is not valid UTF-8, then the first byte is mapped to a half surrogate and then decoding is continued from the next byte. The only drawback I can see is if the UTF-8 bytes actually decode to a half surrogate. However, half surrogates should really only occur in UTF-16 (as I understand it), so they shouldn't be encoded in UTF-8 anyway! As for handling this case, you could either:

1. Raise an exception (which is what you're trying to avoid), or:

2. Treat it as invalid UTF-8 and map the bytes to half surrogates (encoding would produce the original bytes).

I'd prefer option 2. Anyway, +1 from me.

Right: that's the rationale for UTF-8b. Encoding half surrogates violates parts of the Unicode spec, so UTF-8b is "safe".
I hadn't thought of this case, but you are right - they *are* illegal bytes, after all. Raising an exception would be useless since the whole point of this codec is to never raise unicode errors. Regards, Martin
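The decoding rule agreed on here can be sketched in pure Python; this is an illustration of the rule only, not the codec's actual implementation, and it assumes a modern strict UTF-8 decoder that also rejects encoded surrogates (giving option 2's behaviour):

    def utf8b_decode_sketch(data: bytes) -> str:
        # Greedily decode the longest valid UTF-8 chunk (a character is
        # at most 4 bytes); on failure, map the offending byte to a lone
        # half surrogate and resume at the next byte.
        out, i = [], 0
        while i < len(data):
            for j in range(min(len(data), i + 4), i, -1):
                try:
                    out.append(data[i:j].decode("utf-8", "strict"))
                    i = j
                    break
                except UnicodeDecodeError:
                    continue
            else:
                out.append(chr(0xDC00 + data[i]))  # undecodable byte
                i += 1
        return "".join(out)

    assert utf8b_decode_sketch(b"a\xffb") == "a\udcffb"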

On 06:50 am, martin@v.loewis.de wrote:
I'm proposing the following PEP for inclusion into Python 3.1. Please comment.
-1. On UNIX, character data is not sufficient to represent paths. We must, must, must continue to have a simple bytes interface to these APIs. Covering it up in layers of obscure encoding hacks will not make the problem go away, it will just make it harder to understand.

To make matters worse, Linux and GNOME use the PUA for some printable characters. If you open up charmap on an ubuntu system and select "view by unicode character block", then click on "private use area", you'll see many of these. I know that Apple uses at least a few PUA codepoints for the apple logo and the propeller/option icons as well.

I am still -1 on any turn-non-decodable-bytes-into-text scheme, because it makes life harder for those of us trying to keep bytes and text straight, but if you absolutely must represent POSIX filenames as mojibake rather than bytes, the only workable solution is to use NUL as your escape character. That's the only code point which _actually_ can't show up in a filename somehow. As we discussed last time, this is what Mono does with System.IO.Path. As a bonus, it's _much_ easier to detect a NUL from random application code than to try to figure out if a string has any half-surrogates or magic PUA characters which shouldn't be interpreted according to platform PUA rules.

On 22/04/2009 14:20, glyph@divmod.com wrote:
As a hg developer, I have to concur. Keeping bytes-based APIs intact would make porting hg to py3k much, much easier. You may be able to imagine that dealing with paths correctly cross-platform on a VCS is a major PITA, and py3k is currently not helping the situation. Cheers, Dirkjan

Yeah, but IIRC a complete set of bytes APIs doesn't exist yet in py3k.
Define complete. I'm not aware of any interfaces wrt. file IO that are lacking, so which ones were you thinking of? Python doesn't currently provide a way to access environment variables and command line arguments as bytes. With the PEP, such a way would actually become available for applications that desire it. Regards, Martin

On Wed, 22 Apr 2009 at 21:21, "Martin v. Löwis" wrote:
Those are the two that I'm thinking of. I think I understand your proposal better now after your example of implementing listdir(bytes). Putting it in the PEP would probably be a good idea. I personally don't have enough practice in actually working with various encodings (or any understanding of unicode escapes) to comment further. --David

Dirkjan Ochtman <dirkjan <at> ochtman.nl> writes:
bytes-based APIs are certainly more bullet-proof under Unix, but it's the reverse under Windows. Martin's proposal aims to bridge the gap and proposes something that makes text-based APIs as bullet-proof under Unix as they already are under Windows. Regards Antoine.

Dirkjan Ochtman wrote:
I find these statements contradicting: py3k *is* keeping the byte-based APIs for file names intact, so why is it not helping the situation, when this is what is needed to make porting much, much easier? Regards, Martin

I'd like to respond to this concern in three ways:

1. The PEP doesn't remove any of the existing interfaces. So if the interfaces for byte-oriented file names in 3.0 work fine for you, feel free to continue to use them.

2. Even if they were taken away (which the PEP does not propose to do), it would be easy to emulate them for applications that want them. For example, listdir could be wrapped as

    def listdir_b(bytestring):
        fse = sys.getfilesystemencoding()
        string = bytestring.decode(fse, "python-escape")
        for fn in os.listdir(string):
            yield fn.encode(fse, "python-escape")

3. I still disagree that we must, must, must continue to provide these interfaces. I don't understand from the rest of your message what would *actually* break if people would use the proposed interfaces.

Regards, Martin

On 07:17 pm, martin@v.loewis.de wrote:
It's good to know this. It would be good if the PEP made it clear that it is proposing an additional way to work with undecodable bytes, not replacing the existing one. For me, this PEP isn't an acceptable substitute for direct bytes-based access to command-line arguments and environment variables on UNIX. To my knowledge *those* APIs still don't exist yet. I would like it if this PEP were not used as an excuse to avoid adding them.
2. Even if they were taken away (which the PEP does not propose to do), it would be easy to emulate them for applications that want them.
I think this is a pretty clear abstraction inversion. Luckily nobody is proposing it :).
3. I still disagree that we must, must, must continue to provide these interfaces.
You do have a point; if there is a clean, defined mapping between str and bytes in terms of all path/argv/environ APIs, then we don't *need* those APIs, since we can just implement them in terms of characters. But I still think that's a bad idea, since mixing the returned strings with *other* APIs remains problematic. However, I still think the mapping you propose is problematic...
I don't understand from the rest of your message what would *actually* break if people would use the proposed interfaces.
As far as more concrete problems: the utf-8 codec currently in python 2.5 and 2.6, and 3.0 will happily encode half-surrogates, at least in the builds I have.

    >>> '\udc81'.encode('utf-8').decode('utf-8')
    '\udc81'

So there's an ambiguity when passing U+DC81 to this codec: do you mean \xed\xb2\x81 or do you just mean \x81? Of course it would be possible to make UTF-8B consistent in this regard, but it is still going to interact with code that thinks in terms of actual UTF-8, and the failure mode here is very difficult to inspect. A major problem here is that it's very difficult to puzzle out whether anything *will* actually break. I might be wrong about the above for some subtlety of unicode that I don't quite understand, but I don't want to spend all day experimenting with every possible set of build options, python versions, and unicode specifications. Neither, I wager, do most people who want to call listdir().

Another specific problem: looking at the Character Map application on my desktop, U+F0126 and U+F0127 are considered printable characters. I'm not sure what they're supposed to be, exactly, but there are glyphs there. This is running Ubuntu 8.04; there may be more of these in use in more recent versions of GNOME. There is nothing "private" about the "private use" area; Python can never use any of these characters for *anything*, except possibly internally in ways which are never exposed to application code, because the operating system (or window system, or libraries) might use them. If I pass a string with those printable PUA/A characters in it to listdir(), what happens? Do they get turned into bytes, do they only get turned into bytes if my filesystem encoding happens to be something other than UTF-8...?

The PEP seems a bit ambiguous to me as far as how the PUA hack and the half-surrogate hack interact. I could be wrong, but it seems to me to be an either-or proposition, in which case there would be *four* bytes types in python 3.1: bytes, bytearray, str-with-PUA/A-junk, str-with-half-surrogate-junk. Detecting the difference would be an expensive and subtle affair; the simplest solution I could think of would be to use an error-prone regex. If the encoding hack used were simply NULL, then the detection would be straightforward: "if '\u0000' in thingy:".

Ultimately I think I'm only -0 on all of this now, as long as we get bytes versions of environ and argv. Even if these corner-case issues aren't fixed, those of us who want to have correct handling of undecodable filenames can do so.

On 22Apr2009 21:17, Martin v. Löwis <martin@v.loewis.de> wrote:
| > -1. On UNIX, character data is not sufficient to represent paths. We
| > must, must, must continue to have a simple bytes interface to these
| > APIs.
|
| I'd like to respond to this concern in three ways:
|
| 1. The PEP doesn't remove any of the existing interfaces. So if the
| interfaces for byte-oriented file names in 3.0 work fine for you,
| feel free to continue to use them.

Ok. I think I had read things as supplanting byte-oriented interfaces with this exciting new strings-can-do-it-all approach.

| 2. Even if they were taken away (which the PEP does not propose to do),
| it would be easy to emulate them for applications that want them.
| For example, listdir could be wrapped as
|
| def listdir_b(bytestring):
| fse = sys.getfilesystemencoding()

Alas, no, because there is no sys.getfilesystemencoding() at the POSIX level. It's only the user's current locale stuff on a UNIX system, and has _nothing_ to do with the filesystem because UNIX filesystems don't have encodings. In particular, because the "best" (or to my mind "misleading") you can do for this is report what the current user thinks:

http://docs.python.org/library/sys.html#sys.getfilesystemencoding

then there's no guarantee that what is chosen has any relationship to what was in use when the files being consulted were made.

Now, if I were writing listdir_b() I'd want to be able to do something along these lines:

- set LC_ALL=C (or some equivalent mechanism)
- have os.listdir() read bytes as numeric values and transcode their values _directly_ into the corresponding Unicode code points.
- yield bytes( ord(c) for c in os_listdir_string )
- have os.open() et al transcode unicode code points back into bytes.

i.e. a straight one-to-one mapping, using only codepoints in the range 1..255. Then I'd have some confidence that I had got hold of the bytes as they had come from the underlying UNIX system call, and a way to get those bytes _back_ to a UNIX system call intact.

| string = bytestring.decode(fse, "python-escape")
| for fn in os.listdir(string):
| yield fn.encode(fse, "python-escape")
|
| 3. I still disagree that we must, must, must continue to provide these
| interfaces. I don't understand from the rest of your message what
| would *actually* break if people would use the proposed interfaces.

My other longer message describes what would break, if I understand your proposal. -- Cameron Simpson <cs@zip.com.au> DoD#743 http://www.cskk.ezoshosting.com/cs/

No, what? No, that algorithm would be incorrect?
So can you produce a specific example where my proposed listdir_b function would fail to work correctly? For it to work, it is not necessary that POSIX has no notion of character sets on the file system level (which is actually not true - POSIX very well recognizes the notion of character sets for file names, and recommends that you restrict yourself to the portable character set).
For this PEP, it's irrelevant. It will work even if the chosen encoding is a bad choice.
That would be an alternative approach to the same problem (and one that I think will fail more badly than the one I'm proposing). Regards, Martin

On 22Apr2009 08:50, Martin v. Löwis <martin@v.loewis.de> wrote:
| File names, environment variables, and command line arguments are
| defined as being character data in POSIX;

Specific citation please? I'd like to check the specifics of this.

| the C APIs however allow
| passing arbitrary bytes - whether these conform to a certain encoding
| or not.

Indeed.

| This PEP proposes a means of dealing with such irregularities
| by embedding the bytes in character strings in such a way that allows
| recreation of the original byte string.
[...]

So you're proposing that all POSIX OS interfaces (which use byte strings) interpret those byte strings into Python3 str objects, with a codec that will accept arbitrary byte sequences losslessly and is totally reversible, yes? And, I hope, that the os.* interfaces silently use it by default.

| For most applications, we assume that they eventually pass data
| received from a system interface back into the same system
| interfaces. For example, and application invoking os.listdir() will
| likely pass the result strings back into APIs like os.stat() or
| open(), which then encode them back into their original byte
| representation. Applications that need to process the original byte
| strings can obtain them by encoding the character strings with the
| file system encoding, passing "python-escape" as the error handler
| name.

-1

This last sentence kills the idea for me, unless I'm missing something. Which I may be, of course.

POSIX filesystems _do_not_ have a file system encoding. The user's environment suggests a preferred encoding via the locale stuff, and apps honouring that will make nice looking byte strings as filenames for that user. (Some platforms, like MacOSX' HFS filesystems, _do_ enforce an encoding, and a quite specific variety of UTF-8 it is; I would say they're not a full UNIX filesystem _precisely_ because they reject certain byte strings that are valid on other UNIX filesystems. What will your proposal do here? I can imagine it might cope with existing names, but what happens when the user creates a new name?)

Further, different users can use different locales and encodings. If they do it in different work areas they'll be perfectly happy; if they do it in a shared area doubtless confusion will reign, but only in the users' minds, not in the filesystem.

If I'm writing a general purpose UNIX tool like chmod or find, I expect it to work reliably on _any_ UNIX pathname. It must be totally encoding blind. If I speak to the os.* interface to open a file, I expect to hand it bytes and have it behave. As an explicit example, I would be just fine with python's open(filename, "w") to take a string and encode it for use, but _not_ ok for os.open() to require me to supply a string and cross my fingers and hope something sane happens when it is turned into bytes for the UNIX system call.

I'm very much in favour of being able to work in strings for most purposes, but if I use the os.* interfaces on a UNIX system it is necessary to be _able_ to work in bytes, because UNIX file pathnames are bytes. If there isn't a byte-safe os.* facility in Python3, it will simply be unsuitable for writing low level UNIX tools. And I very much like using Python2 for that.

Finally, I have a small python program whose whole purpose in life is to transcode UNIX filenames before transfer to a MacOSX HFS directory, because of HFS's enforced particular encoding. What approach should a Python app take to transcode UNIX pathnames under your scheme?
Cheers, -- Cameron Simpson <cs@zip.com.au> DoD#743 http://www.cskk.ezoshosting.com/cs/ The nice thing about standards is that you have so many to choose from; furthermore, if you do not like any of them, you can just wait for next year's model. - Andrew S. Tanenbaum

On 24Apr2009 09:27, I wrote:
| If I'm writing a general purpose UNIX tool like chmod or find, I expect
| it to work reliably on _any_ UNIX pathname. It must be totally encoding
| blind. If I speak to the os.* interface to open a file, I expect to hand
| it bytes and have it behave. As an explicit example, I would be just fine
| with python's open(filename, "w") to take a string and encode it for use,
| but _not_ ok for os.open() to require me to supply a string and cross
| my fingers and hope something sane happens when it is turned into bytes
| for the UNIX system call.
|
| I'm very much in favour of being able to work in strings for most
| purposes, but if I use the os.* interfaces on a UNIX system it is
| necessary to be _able_ to work in bytes, because UNIX file pathnames
| are bytes.

Just to follow up to my own words here, I would be ok for all the pure-byte stuff to be off in the "posix" module if os.* goes pure character instead of bytes or bytes+strings. -- Cameron Simpson <cs@zip.com.au> DoD#743 http://www.cskk.ezoshosting.com/cs/ ... that, in a few years, all great physical constants will have been approximately estimated, and that the only occupation which will be left to men of science will be to carry these measurements to another place of decimals. - James Clerk Maxwell (1813-1879) Scientific Papers 2, 244, October 1871

Cameron Simpson wrote:
For example, on environment variables:

http://opengroup.org/onlinepubs/007908799/xbd/envvar.html

# For values to be portable across XSI-conformant systems, the value
# must be composed of characters from the portable character set (except
# NUL and as indicated below).

# Environment variable names used by the utilities in the XCU
# specification consist solely of upper-case letters, digits and the "_"
# (underscore) from the characters defined in Portable Character Set .
# Other characters may be permitted by an implementation;

Or, on command line arguments:

http://opengroup.org/onlinepubs/007908799/xsh/execve.html

# The arguments represented by arg0, ... are pointers to null-terminated
# character strings

where a character string is "A contiguous sequence of characters terminated by and including the first null byte.", and a character is

# A sequence of one or more bytes representing a single graphic symbol
# or control code. This term corresponds to the ISO C standard term
# multibyte character (multi-byte character), where a single-byte
# character is a special case of a multi-byte character. Unlike the
# usage in the ISO C standard, character here has no necessary
# relationship with storage space, and byte is used when storage space
# is discussed.
Correct.
And, I hope, that the os.* interfaces silently use it by default.
Correct.
Why is that a problem for the PEP?
See the other messages. If you want to do that, you can continue to.
Please re-read the PEP. It provides a way of being able to access any POSIX file name correctly, and still pass strings.
If there isn't a byte-safe os.* facility in Python3, it will simply be unsuitable for writing low level UNIX tools.
Why is that? The mechanism in the PEP is precisely defined to allow writing low level UNIX tools.
Compute the corresponding character strings, and use them. Regards, Martin

On 25Apr2009 14:07, "Martin v. Löwis" <martin@v.loewis.de> wrote:
| Cameron Simpson wrote:
| > On 22Apr2009 08:50, Martin v. Löwis <martin@v.loewis.de> wrote:
| > | File names, environment variables, and command line arguments are
| > | defined as being character data in POSIX;
| >
| > Specific citation please? I'd like to check the specifics of this.
| For example, on environment variables:
| http://opengroup.org/onlinepubs/007908799/xbd/envvar.html
[...]
| http://opengroup.org/onlinepubs/007908799/xsh/execve.html
[...]

Thanks.

| > So you're proposing that all POSIX OS interfaces (which use byte strings)
| > interpret those byte strings into Python3 str objects, with a codec
| > that will accept arbitrary byte sequences losslessly and is totally
| > reversible, yes?
|
| Correct.
|
| > And, I hope, that the os.* interfaces silently use it by default.
|
| Correct.

Ok, then I'm probably good with the PEP. Though I have a quite strong desire to be able to work in bytes at need without doing multiple encode/decode steps.

| > | Applications that need to process the original byte
| > | strings can obtain them by encoding the character strings with the
| > | file system encoding, passing "python-escape" as the error handler
| > | name.
| >
| > -1
| > This last sentence kills the idea for me, unless I'm missing something.
| > Which I may be, of course.
| > POSIX filesystems _do_not_ have a file system encoding.
|
| Why is that a problem for the PEP?

Because you said above "by encoding the character strings with the file system encoding", which is a fiction.

| > If I'm writing a general purpose UNIX tool like chmod or find, I expect
| > it to work reliably on _any_ UNIX pathname. It must be totally encoding
| > blind. If I speak to the os.* interface to open a file, I expect to hand
| > it bytes and have it behave.
|
| See the other messages. If you want to do that, you can continue to.
|
| > I'm very much in favour of being able to work in strings for most
| > purposes, but if I use the os.* interfaces on a UNIX system it is
| > necessary to be _able_ to work in bytes, because UNIX file pathnames
| > are bytes.
|
| Please re-read the PEP. It provides a way of being able to access any
| POSIX file name correctly, and still pass strings.
|
| > If there isn't a byte-safe os.* facility in Python3, it will simply be
| > unsuitable for writing low level UNIX tools.
|
| Why is that? The mechanism in the PEP is precisely defined to allow
| writing low level UNIX tools.

Then implicitly it's byte safe. Clearly I'm being unclear; I mean original OS-level byte strings must be obtainable undamaged, and it must be possible to create/work on OS objects starting with a byte string as the pathname.

| > Finally, I have a small python program whose whole purpose in life
| > is to transcode UNIX filenames before transfer to a MacOSX HFS
| > directory, because of HFS's enforced particular encoding. What approach
| > should a Python app take to transcode UNIX pathnames under your scheme?
|
| Compute the corresponding character strings, and use them.

In Python2 I've been going (ignoring checks for unchanged names):

- Obtain the old name and interpret it into a str() "correctly". I mean here that I go:

    unicode_name = unicode(name, srcencoding)

  in old Python2 speak. name is a bytes string obtained from listdir() and srcencoding is the encoding known to have been used when the old name was constructed. Eg iso8859-1.

- Compute the new name in the desired encoding. For MacOSX HFS, that's:

    utf8_name = unicodedata.normalize('NFD', unicode_name).encode('utf8')

  Still in Python2 speak, that's a byte string.

- os.rename(name, utf8_name)

Under your scheme I imagine this is amended. I would change your listdir_b() function as follows:

    def listdir_b(bytestring, fse=None):
        if fse is None:
            fse = sys.getfilesystemencoding()
        string = bytestring.decode(fse, "python-escape")
        for fn in os.listdir(string):
            yield fn.encode(fse, "python-escape")

So, internally, os.listdir() takes a string and encodes it to an _unspecified_ encoding in bytes, and opens the directory with that byte string using POSIX opendir(3). How does listdir() ensure that the byte string it passes to the underlying opendir(3) is identical to 'bytestring' as passed to listdir_b()?

It seems from the PEP that "On POSIX systems, Python currently applies the locale's encoding to convert the byte data to Unicode". Your extension is to augment that by expressing the non-decodable byte sequences in a non-conflicting way for reversal later, yes? That seems to double the complexity of my example application, since it wants to interpret the original bytes in a caller-specified fashion, not using the locale defaults. So I must go:

    def macify(dirname, srcencoding):
        # I need this to reverse your encoding scheme
        fse = sys.getfilesystemencoding()
        # I'll pretend dirname is ready for use
        # it possibly has had to undergo the inverse of what happens inside
        # the loop below
        for fn in os.listdir(dirname):
            # listdir reads POSIX-bytes from readdir(3)
            # then encodes using the locale encoding, with your escape addition
            bytename = fn.encode(fse, "python-escape")
            oldname = unicode(bytename, srcencoding)
            newbytename = unicodedata.normalize('NFD', oldname).encode('utf8')
            newname = newbytename.decode(fse, "python-escape")
            if fn != newname:
                os.rename(fn, newname)

And I'm sure there's some os.path.join() complexity I have omitted. Is that correct? You'll note I need to recode the oldname unicode string because I don't know that fse is the same as the required target MacOSX UTF8 NFD encoding.

So if my changes above are correct WRT the PEP, I grant that this is still doable in your scheme. But it would be far far easier with a bytes API. And let us not consider threads or other effects from locale changes during the loop run.

I forget what was decided with the pure-bytes interfaces (out of scope for your PEP). Would there be a posix module with a bytes API?

Cheers, -- Cameron Simpson <cs@zip.com.au> DoD#743 http://www.cskk.ezoshosting.com/cs/ The old day of Perl's try-it-before-you-use-it are long as gone. Nowadays you can write as many as 20..100 lines of Perl without hitting a bug in the perl implementation. - Ilya Zakharevich <ilya@math.ohio-state.edu>, in the perl-porters list, 22sep1998

On Apr 22, 2009, at 2:50 AM, Martin v. Löwis wrote:
I'm proposing the following PEP for inclusion into Python 3.1. Please comment.
+1. Even if some people still want a low-level bytes API, it's important that the easy case be easy. That is: the majority of Python applications should *just work, damnit* even with not-properly-encoded-in-current-LC_CTYPE filenames. It looks like this proposal accomplishes that, and does so in a relatively nice fashion. James

On Wed, Apr 22, 2009 at 8:50 AM, "Martin v. Löwis" <martin@v.loewis.de> wrote:
Is the second part of this actually true? My understanding may be flawed, but surely all Unicode data can be converted to and from bytes using UTF-8? Obviously not all byte sequences are valid UTF-8, but this doesn't prevent one from creating an arbitrary Unicode string using "utf-8 bytes".decode("utf-8"). Given this, can't people who must have access to all files / environment data just use the bytes interface? Disclosure: My gut reaction is that the solution described in the PEP is a hack, but I'm hardly a character encoding expert. My feeling is that the correct solution is to either standardise on the bytes interface as the lowest common denominator, or to add a Path type (and I guess an EnvironmentalData type) and use the new type to attempt to hide the differences. Schiavo Simon

On approximately 4/24/2009 12:59 AM, came the following characters from the keyboard of Simon Cross:
Oh clearly it is a hack. The right solution of a Path type (and friends) was discarded in earlier discussion, because it would impact too much existing code. The use of bytes would be annoying in the context of py3, where things that you want to display are in str (Unicode). So there is no solution that allows the use of str, and the robustness of bytes, and is 100% compatible with existing practice. Hence the desire is to find a hack that is "good enough". At least, that is my understanding and synopsis.

I never saw MvL's original message with the PEP delivered to my mailbox, but some of the replies came there, so I found and extensively replied to it using the Google group / usenet. My reply never showed up here and no one has commented on it either... Should I repost via the mailing list? I think so... I'll just paste it in here, with one tweak I noticed after I sent it fixed... (Sorry Simon, but it is still the same thread, anyway.) (Sorry to others, if my original reply was seen, and just wasn't worth replying to.)

On Apr 21, 11:50 pm, "Martin v. Löwis" <mar...@v.loewis.de> wrote:
I'm proposing the following PEP for inclusion into Python 3.1. Please comment.
Basically the scheme doesn't work. Aside from that, it is very close. There are tons of encoding schemes that could work... they don't have to include half-surrogates or bytes. What they have to do, is make sure that they are uniformly applied to all appropriate strings.

The problem with this, and other preceding schemes that have been discussed here, is that there is no means of ascertaining whether a particular file name str was obtained from a str API, or was funny-decoded from a bytes API... and thus, there is no means of reliably ascertaining whether a particular filename str should be passed to a str API, or funny-encoded back to bytes. The assumption in the 2nd Discussion paragraph may hold for a large percentage of cases, maybe even including some number of 9s, but it is not guaranteed, and cannot be enforced, therefore there are cases that could fail. Whether those failure cases are a concern or not is an open question.

Picking a character (I don't find U+F01xx in the Unicode standard, so I don't know what it is) that is obscure, and unlikely to be used in "real" file names, might help the heuristic nature of the encoding and decoding avoid most conflicts, but provides no guarantee that data puns will not occur in practice. Today's obscure character is tomorrow's commonly used character, perhaps. Someone not on this list may be happily using that character for their own nefarious, incompatible purpose.

As I realized in the email-sig, in talking about decoding corrupted headers, there is only one way to guarantee this... to encode _all_ character sequences, from _all_ interfaces. Basically it requires reserving an escape character (I'll use ? in these examples -- yes, an ASCII question mark -- happens to be illegal in Windows filenames so all the better on that platform, but the specific character doesn't matter... avoiding / \ and . is probably good, though).

So the rules would be, when obtaining a file name from the bytes OS interface, that doesn't properly decode according to UTF-8, decode it by placing a ? at the beginning, then for each decodable UTF-8 sequence, add a Unicode character -- unless the character is ?, in which case you add two ??, and for each non-decodable byte sequence, place a ? and two hex digits, or a ? and a half surrogate code, or a ? and whatever gibberish you like. Two hex digits are fine by me, and will serve for this discussion.

ALSO, when obtaining a file name from the str OS interfaces, encode it too... if it contains any ?, then place a ? at the front, and then any other ? in the name must be doubled. Then you have a string that can/must be encoded to be used on either str or bytes OS interfaces... or any other interfaces that want str or bytes... but whichever they want, you can do a decode, or determine that you can't, into that form.

The encode and decode functions should be available for coders to use, that code to external interfaces, either OS or 3rd party packages, that do not use this encoding scheme. This encoding scheme would be used throughout all Python APIs (most of which would need very little change to accommodate it). However, programs would have to keep track of whether they were dealing with encoded or unencoded strings, if they use both types in their program (an example, is hard-coded file names or file name parts). The initial ? is not strictly necessary for this scheme to work, but I think it would be a good flag to the user that this name has been altered.

This scheme does not depend on assumptions about the use of file names. This scheme would be enhanced if the file name APIs returned a subtype of str for the encoded names, but that should be considered only a hint, not a requirement.

When encoding file name strings to pass to bytes APIs, the ? followed by two hex digits would be converted to a byte. Leading ? would be dropped, and ?? would convert to ?. I don't believe failures are possible when encoding to bytes.

When encoding file name strings to pass to str APIs, the discovery of ? followed by two hex digits would raise an exception, the file name is not acceptable to a str API. However, leading ? would be dropped, and ?? would convert to ?, and if no ? followed by two hex digits were found, the file name would be successfully converted for use on the str API.

Note that not even on Unix/Posix is it particularly easy nor useful to place a ? into file names from command lines due to shell escapes, etc. The use of ? in file names also interferes with easy ability to specifically match them in globs, etc.

Anything short of such an encoding of both types of interfaces, such that it is known that all python-manipulated filenames will be encoded, will have data puns that provide a potential for failure in edge cases. Note that in this scheme, no file names that are fully Unicode and do not contain ? characters are altered by the decoding or the encoding process. That will probably reach quite a few 9s of likelihood that the scheme will go unnoticed by most people and programs and filenames. But the scheme will work reliably if implemented correctly and completely, and will have no edge cases of failure due to not having data puns.

-- Glenn -- http://nevcal.com/ =========================== A protocol is complete when there is nothing left to remove. -- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking

On Fri, Apr 24, 2009 at 11:22 AM, Glenn Linderman <v+python@g.nevcal.com> wrote:
What about keeping the bytes interface (utf-8 encoded Unicode on Windows) and adding a Path type (and friends) interface that mirrors it?
(Sorry Simon, but it is still the same thread, anyway.)
Python discussions do seem to womble through a rather large set of mailing lists and news groups. :) Schiavo Simon

Why is it necessary that you are able to make this distinction?
Picking a character (I don't find U+F01xx in the Unicode standard, so I don't know what it is)
It's a private use area. It will never carry an official character assignment.
I think you'll have to write an alternative PEP if you want to see something like this implemented throughout Python. Regards, Martin

On approximately 4/25/2009 5:22 AM, came the following characters from the keyboard of Martin v. Löwis:
It is necessary that programs (not me) can make the distinction, so that it knows whether or not to do the funny-encoding or not. If a name is funny-decoded when the name is accessed by a directory listing, it needs to be funny-encoded in order to open the file.
I know that U+F0000 - U+FFFFF is a private use area. I don't find a definition of U+F01xx to know what the notation means. Are you picking a particular character within the private use area, or a particular range, or what?
I'm certainly not experienced enough in Python development processes or internals to attempt such, as yet. But somewhere in 25 years of programming, I picked up the knowledge that if you want to have a 1-to-1 reversible mapping, you have to avoid data puns, mappings of two different data values into a single data value. Your PEP, as first written, didn't seem to do that... since there are two interfaces from which to obtain data values, one performing a mapping from bytes to "funny invalid" Unicode, and the other performing no mapping, but accepting any sort of Unicode, possibly including "funny invalid" Unicode, the possibility of data puns seems to exist. I may be misunderstanding something about the use cases that prevent these two sources of "funny invalid" Unicode from ever coexisting, but if so, perhaps you could point it out, or clarify the PEP. I'll try to reread it again... could you post a URL to the most up-to-date version of the PEP, since I haven't seen such appear here, and the version I found via a Google search seems to be the original? -- Glenn -- http://nevcal.com/ =========================== A protocol is complete when there is nothing left to remove. -- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking

On approximately 4/27/2009 12:42 PM, came the following characters from the keyboard of Martin v. Löwis:
So you only need 128 code points, so there is something else unclear. -- Glenn -- http://nevcal.com/ =========================== A protocol is complete when there is nothing left to remove. -- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking

Glenn Linderman wrote:
(please understand that this is history now, since the PEP has stopped using PUA characters). No. You seem to assume that all bytes < 128 decode successfully always. I believe this assumption is wrong, in general:

    py> "\x1b$B' \x1b(B".decode("iso-2022-jp") #2.x syntax
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    UnicodeDecodeError: 'iso2022_jp' codec can't decode bytes in position 3-4: illegal multibyte sequence

All bytes are below 128, yet it fails to decode. Regards, Martin

On Apr 27, 2009, at 11:35 PM, Martin v. Löwis wrote:
Surely nobody uses iso2022 as an LC_CTYPE encoding. That's expressly forbidden by POSIX, if I'm not mistaken... and I can't see how it would work, considering that it uses all the bytes from 0x20-0x7f, including 0x2f ("/"), to represent non-ascii characters. Hopefully it can be assumed that your locale encoding really is a non-overlapping superset of ASCII, as is required by POSIX... I'm a bit scared at the prospect that U+DCAF could turn into "/", that just screams security vulnerability to me. So I'd like to propose that only 0x80-0xFF <-> U+DC80-U+DCFF should ever be allowed to be encoded/decoded via the error handler. James

James Y Knight wrote:
Hopefully it can be assumed that your locale encoding really is a non-overlapping superset of ASCII, as is required by POSIX...
Can you please point to the part of the POSIX spec that says that such overlapping is forbidden?
It would be actually U+DC2f that would turn into /. I'm happy to exclude that range from the mapping if POSIX really requires an encoding not to be overlapping with ASCII. Regards, Martin

On Apr 28, 2009, at 2:50 AM, Martin v. Löwis wrote:
I can't find it... I would've thought it would be on this page:

http://opengroup.org/onlinepubs/007908775/xbd/charset.html

but it's not (at least, not obviously). That does say (effectively) that all encodings must be supersets of ASCII and use the same codepoints, though. However, ISO-2022 being inappropriate for LC_CTYPE usage is the entire reason why EUC-JP was created, so I'm pretty sure that it is in fact inappropriate, and I cannot find any evidence of it ever being used on any system.

From http://en.wikipedia.org/wiki/EUC-JP: "To get the EUC form of an ISO-2022 character, the most significant bit of each 7-bit byte of the original ISO 2022 codes is set (by adding 128 to each of these original 7-bit codes); this allows software to easily distinguish whether a particular byte in a character string belongs to the ISO-646 code or the ISO-2022 (EUC) code."

Also: http://www.cl.cam.ac.uk/~mgk25/ucs/iso2022-wc.html
Yes, I meant to say DC2F, sorry for the confusion.
I'm happy to exclude that range from the mapping if POSIX really requires an encoding not to be overlapping with ASCII.
I think it has to be excluded from mapping in order to not introduce security issues. However... There's also SHIFT-JIS to worry about... which apparently some people actually want to use as their default encoding, despite it being broken to do so. RedHat apparently refuses to provide it as a locale charset (due to its brokenness), and it's also not available by default on my Debian system. People do unfortunately seem to actually use it in real life.

https://bugzilla.redhat.com/show_bug.cgi?id=136290

So, I'd like to propose this: The "python-escape" error handler when given a non-decodable byte from 0x80 to 0xFF will produce values of U+DC80 to U+DCFF. When given a non-decodable byte from 0x00 to 0x7F, it will be converted to U+0000-U+007F. On the encoding side, values from U+DC80 to U+DCFF are encoded into 0x80 to 0xFF, and all other characters are treated in whatever way the encoding would normally treat them.

This proposal obviously works for all non-overlapping ASCII supersets, where 0x00 to 0x7F always decode to U+00 to U+7F. But it also works for Shift-JIS and other similar ASCII-supersets with overlaps in trailing bytes of a multibyte sequence. So, a sequence like "\x81\xFD".decode("shift-jis", "python-escape") will turn into u"\uDC81\u00fd". Which will then properly encode back into "\x81\xFD".

The character sets this *doesn't* work for are: ebcdic code pages (obviously completely unsuitable for a locale encoding on unix), iso2022-* (covered above), and shift-jisx0213 (because it has replaced \ with yen, and - with overline).

If it's desirable to work with shift_jisx0213, a modification of the proposal can be made: Change the second sentence to: "When given a non-decodable byte from 0x00 to 0x7F, that byte must be the second or later byte in a multibyte sequence. In such a case, the error handler will produce the encoding of that byte if it was standing alone (thus in most encodings, \x00-\x7f turn into U+00-U+7F)."

It sounds from https://bugzilla.novell.com/show_bug.cgi?id=162501 like some people do actually use shift_jisx0213, unfortunately. James
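James's proposal corresponds roughly to an error handler along these lines (a sketch only: the handler name is invented, and whether a given codec accepts bytes returned from an encode handler depends on the interface extension the PEP discusses):

    import codecs

    def ascii_superset_escape(exc):
        if isinstance(exc, UnicodeDecodeError):
            b = exc.object[exc.start]
            # 0x80-0xFF escape to U+DC80-U+DCFF; a non-decodable byte
            # below 0x80 (e.g. a stray trail byte in an overlapping
            # encoding like Shift-JIS) passes through as U+0000-U+007F.
            return (chr(0xDC00 + b) if b >= 0x80 else chr(b)), exc.start + 1
        if isinstance(exc, UnicodeEncodeError):
            cp = ord(exc.object[exc.start])
            if 0xDC80 <= cp <= 0xDCFF:
                return bytes([cp - 0xDC00]), exc.start + 1
        raise exc

    codecs.register_error("ascii-superset-escape", ascii_superset_escape)

    # Illustrative use: each rejected byte is escaped instead of raising.
    # b"\x81\xfd".decode("shift_jis", "ascii-superset-escape")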

James Y Knight wrote:
I've been thinking of "python-escape" only in terms of UTF-8, the only encoding mentioned in the PEP. In UTF-8, bytes 0x00 to 0x7F are decodable. But if you're talking about using it with other encodings, eg shift-jisx0213, then I'd suggest the following:

1. Bytes 0x00 to 0xFF which can't normally be decoded are decoded to half surrogates U+DC00 to U+DCFF.

2. Bytes which would have decoded to half surrogates U+DC00 to U+DCFF are treated as though they are undecodable bytes.

3. Half surrogates U+DC00 to U+DCFF which can be produced by decoding are encoded to bytes 0x00 to 0xFF.

4. Codepoints, including half surrogates U+DC00 to U+DCFF, which can't be produced by decoding raise an exception.

I think I've covered all the possibilities. :-)

On approximately 4/28/2009 11:55 AM, came the following characters from the keyboard of MRAB:
UTF-8 is only mentioned in the sense of having special handling for re-encoding; all the other locales/encodings are implicit. But I also went down that path to some extent.
This makes 256 different escape codes.
2. Bytes which would have decoded to half surrogates U+DC00 to U+DCFF are treated as though they are undecodable bytes.
This provides escaping for the 256 different escape codes, which is lacking from the PEP.
3. Half surrogates U+DC00 to U+DCFF which can be produced by decoding are encoded to bytes 0x00 to 0xFF.
This reverses the escaping.
4. Codepoints, including half surrogates U+DC00 to U+DCFF, which can't be produced by decoding raise an exception.
This is confusing. Did you mean "excluding" instead of "including"?
I think I've covered all the possibilities. :-)
You might have. Seems like there could be a simpler scheme, though...

1. Define an escape codepoint. It could be U+003F or U+DC00 or U+F817 or pretty much any defined Unicode codepoint outside the range U+0100 to U+01FF (see rule 3 for why). Only one escape codepoint is needed, which is easier for humans to comprehend.

2. When the escape codepoint is decoded from the byte stream for a bytes interface or found in a str on the str interface, double it.

3. When an undecodable byte 0xPQ is found, decode to the escape codepoint, followed by codepoint U+01PQ, where P and Q are hex digits.

4. When encoding, a sequence of two escape codepoints would be encoded as one escape codepoint, and a sequence of the escape codepoint followed by codepoint U+01PQ would be encoded as byte 0xPQ. Escape codepoints not followed by the escape codepoint, or by a codepoint in the range U+0100 to U+01FF, would raise an exception.

5. Provide functions that will perform the same decoding and encoding as would be done by the system calls, for both bytes and str interfaces.

This differs from my previous proposal in three ways:

A. Doesn't put a marker at the beginning of the string (which I said wasn't necessary even then).

B. Allows for a choice of escape codepoint; the previous proposal suggested a specific one. But the final solution will only have a single one, not a user choice, but an implementation choice.

C. Uses the range U+0100 to U+01FF for the escape codes, rather than U+0000 to U+00FF. This avoids introducing the NULL character and escape characters into the decoded str representation, yet still uses characters for which glyphs are commonly available, are non-combining, and are easily distinguishable one from another.

Rationale: The use of codepoints with visible glyphs makes the escaped string friendlier to display systems, and to people. I still recommend using U+003F as the escape codepoint, but certainly one with a typically visible glyph available. This avoids what I consider to be an annoyance with the PEP, that the codepoints used are not ones that are easily displayed, so undecodable names could easily result in long strings of indistinguishable substitution characters. It, like MRAB's proposal, also avoids data puns, which is a major problem with the PEP. I consider this proposal to be easier to understand than MRAB's proposal, or the PEP, because of the single escape codepoint and the use of visible characters.

This proposal, like my initial one, also decodes and encodes (just the escape codes) values on the str interfaces. This is necessary to avoid data puns on systems that provide both types of interfaces. This proposal could be used for programs that use str values, and easily migrates to a solution that provides an object that provides an abstraction for system interfaces that have two forms.

-- Glenn -- http://nevcal.com/ =========================== A protocol is complete when there is nothing left to remove. -- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking
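A rough sketch of rules 1-4 above, with '?' as the escape codepoint and UTF-8 standing in for the locale encoding (function names are illustrative assumptions, not part of the proposal):

    ESC = "?"  # rule 1: one agreed escape codepoint

    def escape_decode(data: bytes) -> str:
        out, i = [], 0
        while i < len(data):
            # Greedily decode the longest valid UTF-8 chunk (<= 4 bytes).
            for j in range(min(len(data), i + 4), i, -1):
                try:
                    chunk = data[i:j].decode("utf-8")
                    out.append(chunk.replace(ESC, ESC + ESC))  # rule 2
                    i = j
                    break
                except UnicodeDecodeError:
                    continue
            else:
                out.append(ESC + chr(0x0100 + data[i]))        # rule 3
                i += 1
        return "".join(out)

    def escape_encode(name: str) -> bytes:
        out, i = bytearray(), 0
        while i < len(name):
            if name[i] == ESC:                                 # rule 4
                nxt = name[i + 1]  # a trailing ESC raises IndexError
                if nxt == ESC:
                    out.extend(ESC.encode("utf-8"))
                elif 0x0100 <= ord(nxt) <= 0x01FF:
                    out.append(ord(nxt) - 0x0100)
                else:
                    raise ValueError("stray escape codepoint")
                i += 2
            else:
                out.extend(name[i].encode("utf-8"))
                i += 1
        return bytes(out)

    assert escape_encode(escape_decode(b"a?\xffb")) == b"a?\xffb"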

Glenn Linderman wrote:
Speaking personally, I won't call them 'escape codes'. I'd use the term 'escape code' to mean a character that changes the interpretation of the next character(s).
Perhaps I should've said "Any codepoint which can't be produced by decoding should raise an exception". For example, decoding with UTF-8b will never produce U+DC00, therefore attempting to encode U+DC00 should raise an exception and not produce 0x00.
Perhaps the escape character should be U+005C. ;-)

On approximately 4/28/2009 2:01 PM, came the following characters from the keyboard of MRAB:
OK, I won't be offended if you don't call them 'escape codes'. :) But what else to call them? My use of that term is a bit backwards, perhaps... what happens is that because these 256 half surrogates are used to decode otherwise undecodable bytes, they themselves must be "escaped" or translated into something different, when they appear in the byte sequence. The process described reserves a set of codepoints for use, and requires that that same set of codepoints be translated using a similar mechanism to avoid their untranslated appearance in the resulting str. Escape codes have the same sort of characteristic... by replacing their normal use for some other use, they must themselves have a replacement. Anyway, I think we are communicating successfully.
Yes, your rephrasing is clearer, regarding your intention.
Decoding with UTF-8b might never produce U+DC00, but then again, it won't handle the random byte string, either.
Windows users everywhere would love you for that one :)
-- Glenn -- http://nevcal.com/ =========================== A protocol is complete when there is nothing left to remove. -- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking

Glenn Linderman a écrit :
3. When an undecodable byte 0xPQ is found, decode to the escape codepoint, followed by codepoint U+01PQ, where P and Q are hex digits.
The problem with this strategy is: paths are often sliced, so your 2 codepoints could get separated. The good thing with the PEP's strategy is that 1 character stays 1 character. Baptiste

On approximately 4/29/2009 12:38 AM, came the following characters from the keyboard of Baptiste Carvello:
Except for half-surrogates that are in the file names already, which get converted to 3 characters. -- Glenn -- http://nevcal.com/ =========================== A protocol is complete when there is nothing left to remove. -- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking

On approximately 4/28/2009 10:53 AM, came the following characters from the keyboard of James Y Knight:
It would seem from the definition of ISO-2022 that what it calls "escape sequences" is in your POSIX spec called "locking-shift encoding". Therefore, the second bullet item under the "Character Encoding" heading prohibits use of ISO-2022, for whatever uses that document defines (which, since you referenced it, I assume means locales, and possibly file system encodings, but I'm not familiar with the structure of all the POSIX standards documents). A locking-shift encoding (where the state of the character is determined by a shift code that may affect more than the single character following it) cannot be defined with the current character set description file format. Use of a locking-shift encoding with any of the standard utilities in the XCU specification or with any of the functions in the XSH specification that do not specifically mention the effects of state-dependent encoding is implementation-dependent.
Why is that obvious? The only thing I saw that could exclude EBCDIC would be the requirement that the codes be positive in a char, but on a system where the C compiler treats char as unsigned, EBCDIC would qualify. Of course, the use of EBCDIC would also restrict the other possible code pages to those derived from EBCDIC (rather than the bulk of code pages that are derived from ASCII), due to: If the encoded values associated with each member of the portable character set are not invariant across all locales supported by the implementation, the results achieved by an application accessing those locales are unspecified.
-- Glenn -- http://nevcal.com/ =========================== A protocol is complete when there is nothing left to remove. -- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking

I think it has to be excluded from mapping in order to not introduce security issues.
I think you are right. I have now excluded ASCII bytes from being mapped, effectively not supporting any encodings that are not ASCII compatible. Does that sound ok? Regards, Martin

On Apr 30, 2009, at 5:42 AM, Martin v. Löwis wrote:
Yes. The practical upshot of this is that users who brokenly use "ja_JP.SJIS" as their locale (which, note, first requires editing some files in /var/lib/locales manually to enable its use..) may still have python not work with invalid-in-shift-jis filenames. Since that locale is widely recognized as a bad idea to use, and is not supported by any distros, it certainly doesn't bother me that it isn't 100% supported in python. It seems like the most common reason why people want to use SJIS is to make old pre-unicode apps work right in WINE -- in which case it doesn't actually affect unix python at all. I'd personally be fine with python just declaring that the filesystem-encoding will *always* be utf-8b and ignore the locale... but I expect some other people might complain about that. Of course, application authors can decide to do that themselves by calling sys.setfilesystemencoding('utf-8b') at the start of their program. James

James Y Knight writes:
Mounting external drives, especially USB memory sticks which tend to be FAT-initialized by the manufacturers, is another common case. But I don't understand why PEP 383 needs to care at all.

On approximately 4/27/2009 8:35 PM, came the following characters from the keyboard of Martin v. Löwis:
Yes, but having found the latest PEP finally (at least I hope the one at python.org is the latest, it has quit using PUA anyway), I confirm it is history. But the same issue applies to the range of half-surrogates.
Indeed, that was the missing piece. I'd forgotten about the encodings that use escape sequences, rather than UTF-8, and DBCS. I don't think those encodings are permitted by POSIX file systems, but I suppose they could sneak in via environment variable values, and the like. The switch from PUA to half-surrogates does not resolve the issues with the encoding not being a 1-to-1 mapping, though. The very fact that you think you can get away with use of lone surrogates means that other people might, accidentally or intentionally, also use lone surrogates for some other purpose. Even in file names. -- Glenn -- http://nevcal.com/ =========================== A protocol is complete when there is nothing left to remove. -- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking

On Mon, 2009-04-27 at 22:25 -0700, Glenn Linderman wrote:
This may already have been discussed, and if so I apologise for the noise. Does the PEP take into consideration the normalising behaviour of Mac OSX? We've had some ongoing challenges related to this in bzr. -Rob

Does the PEP take into consideration the normalising behaviour of Mac OSX? We've had some ongoing challenges related to this in bzr.
No, that's completely out of scope, AFAICT. I don't even know what the issues are, so I'm not able to propose a solution, at the moment. Regards, Martin

2009/4/28 Glenn Linderman <v+python@g.nevcal.com>:
It does solve this issue, because (unlike e.g. U+F01FF) '\udcff' is not a valid Unicode character (not a character at all, really) and the only way you can put this in a POSIX filename is if you use a very lenient UTF-8 encoder that gives you b'\xed\xb3\xbf'. Since this byte sequence doesn't represent a valid character when decoded with UTF-8, it should simply be considered an invalid UTF-8 sequence of three bytes and decoded to '\udced\udcb3\udcbf' (*not* '\udcff'). Martin: maybe the PEP should say this explicitly? Note that the round-trip works without ambiguity between '\udcff' in the filename:

b'\xed\xb3\xbf' -> '\udced\udcb3\udcbf' -> b'\xed\xb3\xbf'

and b'\xff' in the filename, decoded by Python to '\udcff':

b'\xff' -> '\udcff' -> b'\xff'

-- Lino Mastrodomenico
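[Lino's two round-trips are easy to verify against the error handler as it eventually shipped in Python 3.1, under the name 'surrogateescape' rather than "python-escape"; a small check:]

# A lone 0xff byte is invalid UTF-8, so it decodes to the half
# surrogate U+DCFF and encodes back to the same byte:
assert b'\xff'.decode('utf-8', 'surrogateescape') == '\udcff'
assert '\udcff'.encode('utf-8', 'surrogateescape') == b'\xff'

# b'\xed\xb3\xbf' (a lenient encoder's rendering of U+DCFF) is *also*
# invalid UTF-8; each of its three bytes escapes separately, so the
# two filenames never collide:
assert b'\xed\xb3\xbf'.decode('utf-8', 'surrogateescape') == '\udced\udcb3\udcbf'
assert '\udced\udcb3\udcbf'.encode('utf-8', 'surrogateescape') == b'\xed\xb3\xbf'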

On approximately 4/28/2009 6:01 AM, came the following characters from the keyboard of Lino Mastrodomenico:
Wrong. An 8859-1 locale allows any byte sequence to be placed into a POSIX filename. And while U+DCFF is illegal alone in Unicode, it is not illegal in Python str values. And from my testing, Python 3's current UTF-8 encoder will happily provide exactly the bytes value you mention when given U+DCFF.
Others have made this suggestion, and it is helpful to the PEP, but not sufficient. As implemented as an error handler, I'm not sure that the b'\xed\xb3\xbf' sequence would trigger the error handler, if the UTF-8 decoder is happy with it. Which, in my testing, it is. -- Glenn -- http://nevcal.com/ =========================== A protocol is complete when there is nothing left to remove. -- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking

On 27Apr2009 00:07, Glenn Linderman <v+python@g.nevcal.com> wrote:
I would say this isn't so. It's important that programs know if they're dealing with strings-for-filenames, but not that they be able to figure that out "a priori" if handed a bare string (especially since they can't:-)
Hmm. I had thought that legitimate unicode strings already get transcoded to bytes via the mapping specified by sys.getfilesystemencoding() (the user's locale). That already happens I believe, and Martin's scheme doesn't change this. He's just funny-encoding non-decodable byte sequences, not the decoded stuff that surrounds them. So it is already the case that strings get decoded to bytes by calls like open(). Martin isn't changing that. I suppose if your program carefully constructs a unicode string riddled with half-surrogates etc and imagines something specific should happen to them on the way to being POSIX bytes then you might have a problem... I think the advantage to Martin's choice of encoding-for-undecodable-bytes is that it _doesn't_ use normal characters for the special bits. This means that _all_ normal characters are left unmangled in both "bare" and "funny-encoded" strings. Because of that, I now think I'm -1 on your "use printable characters for the encoding". I think presentation of the special characters _should_ look bogus in an app (eg little rectangles or whatever in a GUI); it's a fine flashing red light to the user. Also, by avoiding reuse of legitimate characters in the encoding we can avoid your issue with losing track of where a string came from; legitimate characters are currently untouched by Martin's scheme, except for the normal "bytes<->string via the user's locale" translation that must already happen, and there you're aided by bytes and strings being different types.
Please elucidate the "second source" of strings. I'm presuming you mean strings generated from scratch rather than obtained by something like listdir(). Given such a string with "funny invalid" stuff in it, and _absent_ Martin's scheme, what do you expect the source of the strings to _expect_ to happen to them if passed to open()? They still have to be converted to bytes at the POSIX layer anyway. Cheers, -- Cameron Simpson <cs@zip.com.au> DoD#743 http://www.cskk.ezoshosting.com/cs/ Heaven could change from chocolate to vanilla without violating perfection. - arromdee@jyusenkyou.cs.jhu.edu (Ken Arromdee)

On approximately 4/27/2009 2:14 PM, came the following characters from the keyboard of Cameron Simpson:
So you agree they can't... that there are data puns. (OK, you may not have thought that through)
So assume a non-decodable sequence in a name. That puts us into Martin's funny-decode scheme. His funny-decode scheme produces a bare string, indistinguishable from a bare string that would be produced by a str API that happens to contain that same sequence. Data puns. So when open is handed the string, should it open the file with the name that matches the string, or the file with the name that funny-decodes to the same string? It can't know, unless it knows that the string is a funny-decoded string or not.
So it is already the case that strings get decoded to bytes by calls like open(). Martin isn't changing that.
I thought the process of converting strings to bytes is called encoding. You seem to be calling it decoding?
Right. Or someone else's program does that. I only want to use Unicode file names. But if those other file names exist, I want to be able to access them, and not accidentally get a different file.
Whether the characters used for funny decoding are normal or abnormal, unless they are prevented from also appearing in filenames when they are obtained from or passed to other APIs, there is the possibility that a file whose actual name contains the funny-decoded sequence also exists in the filesystem... a data pun on the name. Whether the characters used for funny decoding are normal or abnormal, if they are not prevented from also appearing in filenames when they are obtained from or passed to other APIs, then in order to prevent data puns, *all* names must be passed through the decoder, and the decoder must perform a 1-to-1 reversible mapping. Martin's funny-decode process does not perform a 1-to-1 reversible mapping (unless he's changed it from the version of the PEP I found to read). This is why some people have suggested using the null character for the decoding, because it and / can't appear in POSIX file names, but everything else can. But that makes it really hard to display the funny-decoded characters.
The reason I picked an ASCII printable character is just to make it easier for humans to see the encoding. The scheme would also work with a non-ASCII non-printable character... but I fail to see how that would help a human compare the strings on a display of file names. Having a bunch of abnormal characters in a row, displayed using a single replacement glyph, just makes an annoying mess in the file open dialog.
There are abnormal characters, but there are no illegal characters. NTFS permits any 16-bit "character" code, including abnormal ones, including half-surrogates, and including full surrogate sequences that decode to PUA characters. POSIX permits all byte sequences, including things that look like UTF-8, things that don't look like UTF-8, things that look like half-surrogates, and things that look like full surrogate sequences that decode to PUA characters. So whether the decoding/encoding scheme uses common characters, or uncommon characters, you still have the issue of data puns, unless you use a 1-to-1 transformation, that is reversible. With ASCII strings, I think no one questions that you need to escape the escape characters. C uses \ as an escape character... Everyone understands that if you want to use a \ in a C string, you have to use \\ instead... and that scheme has escaped the boundaries of C to other use cases. But it seems that you think that if we could just find one more character that no one else uses, that we wouldn't have to escape it.... and that could be true, but there aren't any characters that no one else uses. So whatever character (and a range makes it worse) you pick, someone else uses it. So in order for the scheme to work, you have to escape the escape character(s), even in names that wouldn't otherwise need to be funny-decoded.
POSIX has byte APIs for strings, that's one source, that is most under discussion. Windows has both bytes and 16-bit APIs for strings... the 16-bit APIs are generally mapped directly to UTF-16, but are not checked for UTF-16 validity, so all of Martin's funny-decoded files could be used for Windows file names on the 16-bit APIs. And yes, strings can be generated from scratch.
There is a fine encoding scheme that can take any str and encode to bytes: UTF-8. The problem is that UTF-8 doesn't work to take any byte sequence and decode to str, and that means that special handling has to happen when such byte sequences are encountered. But there is no str that can be generated that can't be generated in other ways, which would be properly encoded to a different byte sequence. Hence there are data puns, no 1-to-1 mapping. Hence it seems obvious to me that the only complete solution is to have an escape character, and ensure that all strings are decoded and encoded. As soon as you have an escape character, then you can decode anything into displayable, standard Unicode, and you can create the reverse encoding unambiguously. Without an escape character, you just have a heuristic that will work sometimes, and break sometimes. If you believe non-UTF-8-decodable byte sequences are rare, you can ignore them. That's what we do now, but people squawk. If you believe that you can invent an encoding that has data puns, and that because the character or characters involved are rare, the problems that result can be ignored, fine... but people will squawk when they hit the problem... I'm just trying to squawk now, to point out that this is complexity for complexity's sake; it adds complexity to trade one problem for a different problem, under the belief that the other problem is somehow rarer than the first. And maybe it is, today. I'd much rather have a solution that actually solves the problem. If you don't like ? as the escape character, then pick U+10F01, and anytime a U+10F01 is encountered in a file name, double it. And anytime there is an undecodable byte sequence, emit U+10F01, and then U+80 through U+FF as a subsequent character for the first byte in the undecodable sequence, and restart the decoder with the next byte. That'll work too. But use of rare, abnormal characters to take the place of undecodable bytes can never work, because of data puns, and valid use of the rare, abnormal characters. Someone suggested treating the byte sequences of the rare, abnormal characters as undecodable bytes, and decoding them using the same substitution rules. That would work too, if applied consistently, because then the rare, abnormal characters would each be escaped. But having 128 escape characters seems more complex than necessary, also. -- Glenn -- http://nevcal.com/ =========================== A protocol is complete when there is nothing left to remove. -- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking

On 27Apr2009 18:15, Glenn Linderman <v+python@g.nevcal.com> wrote:
I agree you can't examine a string and know if it came from the os.* munging or from someone else's munging. I totally disagree that this is a problem. There may be puns. So what? Use the right strings for the right purpose and all will be well. I think what is missing here, and missing from Martin's PEP, is some utility functions for the os.* namespace.

PROPOSAL: add to the PEP the following functions:

os.fsdecode(bytes) -> funny-encoded Unicode
This is what os.listdir() does to produce the strings it hands out.

os.fsencode(funny-string) -> bytes
This is what open(filename,..) does to turn the filename into bytes for the POSIX open.

os.pathencode(your-string) -> funny-encoded-Unicode
This is what you must do to a de novo string to turn it into a string suitable for use by open. Importantly, for most strings not hand crafted to have weird sequences in them, it is a no-op. But it will recode your puns for survival. (A sketch of these functions appears below.)

and for me, I would like to see:

os.setfilesystemencoding(coding)

Currently os.getfilesystemencoding() returns you the encoding based on the current locale, and (I trust) the os.* stuff encodes on that basis. setfilesystemencoding() would override that, unless coding==None, in which case it reverts to the former "use the user's current locale" behaviour. (We have locale "C" for what one might otherwise expect None to mean:-) The idea here is to let the program control the codec used for filenames for special purposes, without working indirectly through the locale.
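[A sketch of what the proposed helpers could look like, assuming the PEP's surrogate funny-encoding (spelled 'surrogateescape' here, the name the handler eventually shipped under); the function names follow the proposal above and are not an existing API at the time of this thread:]

import sys

def fsdecode(raw: bytes) -> str:
    # what os.listdir() would do to produce the strings it hands out
    return raw.decode(sys.getfilesystemencoding(), 'surrogateescape')

def fsencode(name: str) -> bytes:
    # what open(filename, ...) would do to reach the POSIX open()
    return name.encode(sys.getfilesystemencoding(), 'surrogateescape')

def pathencode(name: str) -> str:
    # A no-op for well-formed text; re-codes a de novo string into the
    # funny-encoded form.  Note: code points the codec cannot encode at
    # all (e.g. a lone high surrogate) still raise here, so a real
    # version would need an explicit policy for them.
    return fsdecode(fsencode(name))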
See my proposal above. Does it address your concerns? A program still must know the provenance of the string, and _if_ you're working with non-decodable sequences in a name then you should transmute them into the funny encoding using the os.pathencode() function described above. In this way the punning issue can be avoided. _Lacking_ such a function, your punning concern is valid.
True. open() should always expect a funny-encoded name.
My head must be standing in the wrong place. Yes, I probably mean encoding here. I'm trying to accompany these terms with little pictures like "string->bytes" to avoid confusion.
Point taken. And I think addressed by the utility function proposed above. [...snip normal versus odd chars for the funny-encoding ...]
I thought half-surrogates were illegal in well formed Unicode. I confess to being weak in this area. By "legitimate" above I meant things like half-surrogates which, like quarks, should not occur alone?
Sure. I'm not really talking about what filesystem will accept at the native layer, I was talking in the python funny-encoded space. [..."escaping is necessary"... I agree...]
These are existing file objects, I'll take them as source 1. They get encoded for release by os.listdir() et al.
And yes, strings can be generated from scratch.
I take this to be source 2. I think I agree with all the discussion that followed, and think the real problem is lack of utility functions to funny-encode source 2 strings for use; hence the proposal above. Cheers, -- Cameron Simpson <cs@zip.com.au> DoD#743 http://www.cskk.ezoshosting.com/cs/ Be smart, be safe, be paranoid. - Ryan Cousineau, courier@compdyn.com DoD#863, KotRB, KotKWaWCRH

2009/4/27 Cameron Simpson <cs@zip.com.au>:
Time machine! http://docs.python.org/dev/py3k/library/sys.html#sys.setfilesystemencoding -- Regards, Benjamin

On 27Apr2009 21:58, Benjamin Peterson <benjamin@python.org> wrote: | 2009/4/27 Cameron Simpson <cs@zip.com.au>: | > PROPOSAL: add to the PEP the following functions: [...] | > and for me, I would like to see: | > os.setfilesystemencoding(coding) | > | > Currently os.getfilesystemencoding() returns you the encoding based on | > the current locale, and (I trust) the os.* stuff encodes on that basis. | > setfilesystemencoding() would override that, unless coding==None, in which | > case it reverts to the former "use the user's current locale" behaviour. | > (We have locale "C" for what one might otherwise expect None to mean:-) | | Time machine! http://docs.python.org/dev/py3k/library/sys.html#sys.setfilesystemencoding How embarrassing. I thought I'd looked. It doesn't have the None->return-to-default mode, and I'd like to see the word "overwritten" replaced by "overridden". And of course if Martin's PEP gets adopted then the "e.g." clause needs replacing:-) -- Cameron Simpson <cs@zip.com.au> DoD#743 http://www.cskk.ezoshosting.com/cs/ Do not taunt Happy Fun Coder.

On approximately 4/27/2009 7:11 PM, came the following characters from the keyboard of Cameron Simpson:
Seems like one would also desire os.pathdecode to do the reverse. And also versions that take or produce bytes from funny-encoded strings. Then, if programs were re-coded to perform these transformations on what you call de novo strings, the scheme would work. But I think a large part of the incentive for the PEP is to try to invent a scheme that intentionally allows for the puns, so that programs do not need to be recoded in this manner, and yet still work. I don't think such a scheme exists. If there is going to be a required transformation from de novo strings to funny-encoded strings, then why not make one that people can actually see and compare and decode from the displayable form, by using displayable characters instead of lone surrogates?
"Illegal" just means violating the accepted rules. In this case, the accepted rules are those enforced by the file system (at the bytes or str API levels), and by Python (for the str manipulations). None of those rules outlaw lone surrogates. Hence, while all of the systems under discussion can handle all Unicode characters in one way or another, none of them require that all Unicode rules are followed. Yes, you are correct that lone surrogates are illegal in Unicode. No, none of the accepted rules for these systems require Unicode.
One variation of source 2 is reading output from other programs, such as ls (POSIX) or dir (Windows).
I think we understand each other now. I think your proposal could work, Cameron, although when recoding applications to use your proposal, I'd find it easier to use the "file name object" that others have proposed. I think that because either your proposal or the object proposals require recoding the application, they will not be accepted. I think that because PEP 383 allows data puns, it should not be accepted in its present form.

I think that if your proposal is accepted, it then becomes possible to use an encoding that uses visible characters, which makes it easier for people to understand and verify. An encoding such as the one I suggested, but perhaps using a more obscure character, if there is one, that doesn't violate true Unicode. I think it should transform all data, from str and bytes interfaces, and produce only str values containing conforming Unicode, escaping all the non-conforming sequences in some manner. This would make the strings truly readable, as long as fonts for all the characters are available.

And I had already suggested the utility functions you are suggesting, actually, in my first tirade against PEP 383 (search for "The encode and decode functions should be available for coders to use, that code to external interfaces, either OS or 3rd party packages, that do not use this encoding scheme"). I really don't care if you or who gets the credit for the idea, others may have suggested it before me, but I do care that the solution should provide functionality that works without ambiguity/data puns.

The solution that was proposed in the lead-up to releasing Python 3.0 was to offer both bytes and str interfaces (so we have those), and then, for those that want a single portable implementation that can access all data, an object that encapsulates the differences and the variant system APIs. (The file system is one, the command line is another, the environment is another; I'm not sure if there are more.) I haven't heard of any progress on such an encapsulating object; the people that proposed it have been rather quiet about this PEP. I would expect that an object implementation would provide display strings, and APIs to submit de novo str and bytes values to an object, which would run the appropriate encoding on them.

Programs that want to use str interfaces on POSIX will see a subset of files on systems that contain files whose bytes filenames are not decodable. If a sysadmin wants to standardize on UTF-8 names universally, they can use something like convmv to clean up existing file names that don't conform. Programs that use str interfaces on POSIX systems will work fine, but with a subset of the files. When that is unacceptable, they can either be recoded to use the bytes interfaces, or the hopefully forthcoming object encapsulation. The issue then will be what technique will be used to transform bytes into display names, but since the display names would never be fed back to the objects directly (the object would have an interface to accept de novo str and de novo bytes), it is just a display issue, and one that uses visible characters would seem more useful in my mind than one that uses half-surrogates or PUAs. -- Glenn -- http://nevcal.com/ =========================== A protocol is complete when there is nothing left to remove. -- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking

I think I may be able to resolve Glenn's issues with the scheme lower down (through careful use of definitions and hand waving). On 27Apr2009 23:52, Glenn Linderman <v+python@g.nevcal.com> wrote:
Yes.
And also versions that take or produce bytes from funny-encoded strings.
Isn't that the first two functions above?
I agree no such scheme exists. I don't think it can, just using strings. But _unless_ you have made a de novo handcrafted string with ill-formed sequences in it, you don't need to bother because you won't _have_ puns. If Martin's using half surrogates to encode "undecodable" bytes, then no normal string should conflict because a normal string will contain _only_ Unicode scalar values. Half surrogate code points are not such. The advantage here is that unless you've deliberately constructed an ill-formed unicode string, you _do_not_ need to recode into funny-encoding, because you are already compatible. Somewhat like one doesn't need to recode ASCII into UTF-8, because ASCII is unchanged.
Because that would _not_ be a no-op for well formed Unicode strings. That reason is sufficient for me. I consider the fact that well-formed Unicode -> funny-encoded is a no-op to be an enormous feature of Martin's scheme. Unless I'm missing something, there _are_no_puns_ between funny-encoded strings and well formed unicode strings.
I've just spent a cosy 20 minutes with my copy of Unicode 5.0 and a coffee, reading section 3.9 (Unicode Encoding Forms). I now do not believe your scenario makes sense. Someone can construct a Python3 string containing code points that includes surrogates. Granted. However such a string is not meaningful because it is not well-formed (D85). It's ill-formed (D84). It is not sane to expect it to translate into a POSIX byte sequence, be it UTF-8 or anything else, unless it is accompanied by some kind of explicit mapping provided by the programmer. Absent that mapping, it's nonsense in much the same way that a non-decodable UTF-8 byte sequence is nonsense. For example, Martin's funny-encoding is such an explicit mapping.
But those other names _don't_ exist.
I think that either we've lost track of what each other is saying, or you're wrong here. And my poor terminology hasn't been helping. What we've got:

(1) Byte sequence file names in the POSIX file system. It doesn't matter whether the underlying storage is a real POSIX filesystem or a mostly-POSIX one like MacOSX HFS or a remotely attached non-POSIX filesystem like a Windows one, because we're talking through the POSIX API, and it is handing us byte sequences, which we expect may contain anything except a NUL.

(2) Under Martin's scheme, os.listdir() et al hand us (and accept) funny-encoded Python3 strings, which are strings of Unicode code units (D77). Particularly, if there were bytes in the POSIX byte string that did not decode into Unicode scalar values (D76) then each such byte is encoded as a surrogate (D71,72,73,74). It is important to note here that because surrogates are _not_ Unicode scalar values, there is no punning between the two sets of values.

(3) Other Python3 strings that have not been through Martin's mangler in either direction. Ordinary strings.

Your concern is that, handed a string, a programmer could misuse (3) as (2) or vice versa because of punning. In a well-formed unicode string there are no surrogates; surrogates only occur in UTF-16 _encodings_ of Unicode strings (D75). Therefore, it _is_ possible to inspect a string, if one cared, to see if it is funny-encoded or "raw". One may get two different answers:

- If there are surrogate code units then it must be funny-encoded and will therefore work perfectly if handed to an os.* interface.

- If there are no surrogate code units then it may be funny-encoded or it may not have been through Martin's funny-encoder; you can't tell. However, this doesn't matter, because the encoder is a no-op for such strings. Therefore it will work perfectly if handed to an os.* interface.

The only gap in this is a specially crafted string containing surrogate code points that did not come via Martin's encoder. But such a string cannot come from a user interface, which will accept only characters, and those include only Unicode scalar values. Such a string can only be explicitly constructed (eg with a \uD802 code point). And if something constructs such a string, it must have in mind an explicit interpretation of those code points, which means it is the _constructor_ on whom the burden of translation lies. Does this make sense to you, or have you a counter example in mind?
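[The two-answer inspection above is cheap to express in code; a sketch, where has_surrogates is a hypothetical helper and 'surrogateescape' stands in for the PEP's funny-encoder:]

def has_surrogates(s: str) -> bool:
    # Surrogate code units (U+D800..U+DFFF) never occur in well-formed
    # Unicode text, so their presence proves the string is funny-encoded
    # (or was deliberately hand-crafted).
    return any('\ud800' <= c <= '\udfff' for c in s)

assert not has_surrogates('an ordinary filename')
assert has_surrogates(b'bad\xffname'.decode('utf-8', 'surrogateescape'))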
However, Martin's scheme explicitly translates these ill-formed sequences into Python3 strings and back, losslessly. You can have surrogates in the filesystem storage/API on Windows. You can have non-UTF-8-decodable sequences in the POSIX filesystem layer too. They're all taken in and handled. In Python3 space, one might have a bytes object with a raw POSIX byte filename in it. Presumably one can also have a byte string with a raw (UTF-16) Windows filename in it. They're not strings, so no confusion. But there's no _string_ for these things without a matching string<->bytestring mapping associated with it. If you have a Python3 string which is well-formed Unicode, then you can hand it to the os.* interfaces and the Right Thing will happen (on Windows just because it stored Unicode and on POSIX provided you agree that your locale/getfilesystemencoding() is the right thing). If you have a string that isn't well-formed, then the meaning of any code points which are not Unicode scalar values is not well defined without some auxiliary stuff in the app.
See above. I think this is addressed. [...]
Sure. But that is reading byte sequences, and one must again know the encoding. If that is known and the input decoded happily into Unicode scalar values, then there is no issue. If the input didn't decode, then one must make some decision about what the non-decodable bits mean.
I'm of the opinion now that the puns can only occur when the source 2 string has surrogates, and either those surrogates are chosen to match the funny-encoding, in which case the pun is not a pun, or the surrogates are chosen according to a different scheme, in which case source 2 is obliged to provide a mapping. A source 2 string of only Unicode scalar values doesn't need remapping.
I think any scheme that uses any Unicode scalar value as an escape character _inherently_ introduces puns, and puns that are easier to encounter. I think the real strength of Martin's scheme is exactly that bytes strings that needed the funny-encoding _do_ produce ill-formed Unicode strings, because such strings _cannot_ conflict with well-formed strings. I think your desire for a human readable encoding is valid, but it should be a further purely "presentation" step, somewhat like quoted-printable encoding in MIME, and not the scheme used by Martin.
But I think it would just move the punning. A human readable string with readable escapes in it may be funny-encoded. _Or_ it may be "raw", with funny-encoding yet to happen; after all, one might weirdly be dealing with a filename which contained post-funny-encode visible sequences in it. So you're right back to _guessing_ what you're looking at. With the surrogate scheme you only have to guess if there are surrogates, but then you _know_ that you're dealing with a special encoding scheme; it is certain - the guess is about which scheme. If you're working in a domain with no ill-formed strings you never need to worry at all. With a visible/printable encoding such as you advocate, the guess is about whether the scheme has even been used, which is why I think it is worse.
I must have missed that sentence. But it sounds like we want the same facilities at least.
I think covering these other cases is quite messy, if only because there's not even agreement amongst existing command line apps about all that stuff. Regarding "APIs to submit de novo str and bytes values to an object, which would run the appropriate encoding on them", I think such a facility for de novo strings must require the caller to provide a handler/mapper for the not-well-formed parts of such strings, if they occur.
Not under Martin's scheme, because all bytes filenames _are_ decoded.
I agree it might be handy to have a display function, but isn't repr() exactly that, now I think of it? Cheers, -- Cameron Simpson <cs@zip.com.au> DoD#743 http://www.cskk.ezoshosting.com/cs/ "waste cycles drawing trendy 3D junk" - Mac Eudora v3 config option

On approximately 4/28/2009 4:06 PM, came the following characters from the keyboard of Cameron Simpson:
I think I may be able to resolve Glenn's issues with the scheme lower down (through careful use of definitions and hand waving).
Close. You at least resolved what you thought my issue was. And, you did make me more comfortable with the idea that I, in programs I write, would not be adversely affected by the PEP if implemented. While I can see that the PEP no doubt solves the os.listdir / open problem on POSIX systems for Python 3 + PEP programs that don't use 3rd party libraries, it does require programs that do use 3rd party libraries to be recoded with your functions -- which so far the PEP hasn't embraced. Or, to use the bytes APIs directly to get file names for 3rd party libraries -- but the directly ported, filenames-as-strings type of applications that could call 3rd party filenames-as-bytes libraries in 2.x must be tweaked to do something different than they did before.
Yes, sorry.
Right. And I don't intend to generate ill-formed Unicode strings, in my programs. But I might well read their names from other sources. It is nice, and thank you for emphasizing (although I already did realize it, back there in the far reaches of the brain) that all the data puns are between ill-formed Unicode strings, and undecodable bytes strings. That is a nice property of the PEP's encoding/decoding method. I'm not sure it outweighs the disadvantage of taking unreadable gibberish, and producing indecipherable gibberish (codepoints with no glyphs), though, when there are ways to produce decipherable gibberish instead... or at least mostly-decipherable gibberish. Another idea forms.... described below.
I think you are correct regarding where the puns are. I agree that not perturbing well-formed Unicode is a benefit.
Such a string can be meaningful if it is used as a file name... it is the name of the file. I will agree that it would not be a word in any language, because it is composed of things that are not characters / codepoints, if that is what you meant.
They do if someone constructs them.
Lots of configuration systems permit schemes like C's \x to be used to create strings. Whether you perceive that to be a user interface or not, or believe that such things should be part of a user interface or not, they exist. Whether they validate that such strings are properly constructed Unicode text or should or should not do such validation, is open for discussion, but I'd be surprised if there are not some such schemes that don't do such checking, and consider it a feature. Why make the file name longer than necessary, when you can just use all these nice illegal codepoints to keep it shorter instead? Instead of 5 characters for a filename sequence counter, someone might stuff it in 1 character, in binary, and think they were clever. I've seen such techniques, although not specifically in Python, since I'm fairly new to reading Python code. So I consider it not beyond the realm of possibility to encounter lone surrogate code units in strings that haven't been through Martin's funny-encoder. Hence, I disbelieve that the gap you mention can be ignored.
It is still not clear whether the PEP (1) would be implemented on Windows, and (2) if it is, whether it prevents lone surrogates from being obtained from the str APIs by transcoding them into 3 lone surrogates. If it doesn't transcode from the str APIs, but does funny-decode from the bytes APIs, then it would seem there is still the possibility of data puns on Windows.
Without transcoding on the str APIs, which I haven't seen mentioned, I don't think so.
Sure. So the PEP needs your functions, or the equivalent. Last I checked, they weren't there.
A correct translation of source 2 strings would be obliged to call one of your functions, that doesn't exist in the PEP, because it appears the PEP wants to assume that such strings don't exist, unless it creates them. So this takes porting effort for programs generating and consuming such strings, to avoid being mangled by the PEP. That isn't necessary today, only post-PEP.
Another step? Even more porting effort? For a PEP that is trying to avoid porting effort? But maybe there is a compromise that mostly meets both goals: use U+DC10 as a (high-flying) escape character. It is not printable, so the substitution glyph will likely get displayed by display functions. Then transcode illegal bytes to the range U+0100 to U+01FF, and transcode existing U+DC10 to U+DC10 U+DC10.

1) This is an easy to understand scheme, and illegal byte values would become displayable, but each would be preceded by the substitution glyph for the U+DC10.

2) There would be no need to transcode other lone surrogates... on the other hand, any illegal code values could be treated as illegal bytes and transcoded, making the strings more nearly legal, and more uniformly displayable.

3) The property that all potential data puns are among ill-formed Unicode strings is still retained.

4) Because the result string is nearly legal Unicode (except for the escape characters U+DC10), it becomes uniformly comparable, and different strings can be visibly different.

5) It is still necessary to transcode names from str interfaces, to escape any U+DC10 characters at least, which is also required by this PEP to avoid data puns on systems that have both str and bytes interfaces.
I think you mean you don't have to guess if there are lone surrogates... you can look and see.
So the above scheme, using a U+DC10 escape character, meets your desirable truisms about lone surrogates being the trigger for knowing that you are dealing with bizarro names, but being uncertain about which kind, and also makes the results lots more readable. I still think there is a need to provide the encoding and decoding functions, for both bytes and de novo strings.
The caller shouldn't have to supply anything. The same encoding that is applied to str system interfaces that supply strings should be applied to de novo strings. It is just a matter of transcoding a de novo string into the "right form" so that it can then be encoded by the system encoder to produce the original string again, if it goes to a str interface, or an equivalent bytes string, if it goes to a bytes interface.
I think I was speaking of the status quo, here, not with the PEP.
repr is a display function that produces rather ugly results in most non-ASCII cases. But then again, one could use repr as the funny-encoding scheme, too... I don't think we want to use repr for either case, actually. Of course, with Py 3, if the file names were objects, and could have reprlib customizations... :) :) -- Glenn -- http://nevcal.com/ =========================== A protocol is complete when there is nothing left to remove. -- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking

Glenn Linderman a écrit :
The problem with your "escape character" scheme is that the meaning is lost with slicing of the strings, which is a very common operation.
Python could as well *specify* that lone surrogates are illegal, as their meaning is undefined by Unicode. If this rule is respected language-wise, there is no ambiguity. It might be unrealistic on windows, though. This rule could even be specified only for strings that represent filesystem paths. Sure, they are the same type as other strings, but the programmer usually knows if a given string is intended to be a path or not. Baptiste

2009/4/28 Glenn Linderman <v+python@g.nevcal.com>:
Sorry for picking on Glenn's comment - it's only one of many in this thread. But it seems to me that there is an assumption that problems will arise when code gets a potentially funny-decoded string and doesn't know where it came from. Is that a real concern? How many programs really don't know where their data came from? Maybe a general-purpose library routine *might* just need to document explicitly how it handles funny-encoded data (I can't actually imagine anything that would, but I'll concede it may be possible) but that's just a matter of documenting your assumptions - no better or worse than many other cases. This all sounds similar to the idea of "tainted" data in security - if you lose track of untrusted data from the environment, you expose yourself to potential security issues. So the same techniques should be relevant here (including ignoring it if your application isn't such that it's a concern!) I've yet to hear anyone claim that they would have an actual problem with a specific piece of code they have written. (NB, if such a claim has been made, feel free to point me to it - I admit I've been skimming this thread at times). Paul.

Paul Moore <p.f.moore <at> gmail.com> writes:
I've yet to hear anyone claim that they would have an actual problem with a specific piece of code they have written.
Yep, that's the problem. Lots of theoretical problems no one has ever encountered brought up against a PEP which resolves some actual problems people encounter on a regular basis. For the record, I'm +1 on the PEP being accepted and implemented as soon as possible (preferably before 3.1). Regards Antoine.


For what it's worth, the OSX APIs seem to behave as follows:

* If you create a file with a non-UTF8 name on an HFS+ filesystem the system automatically encodes the name. That is, open(chr(255), 'w') will silently create a file named '%FF' instead of the name you'd expect on a unix system.

* If you mount an NFS filesystem from a linux host and that directory contains a file named chr(255):
- unix-level tools will see a file with the expected name (just like on linux)
- Cocoa's NSFileManager returns u"?" as the filename; that is, when the filename cannot be decoded using UTF-8 the name returned by the high-level API is mangled. This is regardless of the setting of LANG.
- I haven't found a way yet to access files whose names are not valid UTF-8 using the high-level Cocoa APIs.

The latter two are interesting because Cocoa has a unicode filesystem API on top of a POSIX C-API, just like Python 3.x. I guess the chosen behaviour works out on OSX (where users are unlikely to run into this issue), but could be more problematic on other POSIX systems. Ronald On 28 Apr, 2009, at 14:03, Michael Foord wrote:

Ronald Oussoren <ronaldoussoren@mac.com> (RO) wrote:
RO> That is, open(chr(255), 'w') will silently create a file named '%FF' RO> instead of the name you'd expect on a unix system.
Not for me (I am using Python 2.6.2).
I once got a tar file from a Linux system which contained a file with a non-ASCII, ISO-8859-1 encoded filename. The tar file refused to be unpacked on a HFS+ filesystem. -- Piet van Oostrum <piet@cs.uu.nl> URL: http://pietvanoostrum.com [PGP 8DAE142BE17999C4] Private email: piet@vanoostrum.org

Ned Deily <nad@acm.org> (ND) wrote:
ND> What version of OSX are you using? On Tiger 10.4.11 I see the failure ND> you see but on Leopard 10.5.6 the behavior Ronald reports.
Yes, I am using Tiger (10.4.11). Interesting that it has changed on Leopard. -- Piet van Oostrum <piet@cs.uu.nl> URL: http://pietvanoostrum.com [PGP 8DAE142BE17999C4] Private email: piet@vanoostrum.org

You can get the same error on Linux: $ python Python 2.6.2 (release26-maint, Apr 19 2009, 01:56:41) [GCC 4.3.3] on linux2 Type "help", "copyright", "credits" or "license" for more information.
(Some file system drivers do not enforce valid utf8 yet, but I suspect they will in the future.) Tom

Thomas Breuel wrote:
Do you suspect that from discussing the issue with kernel developers or reading a thread on lkml? If not, then your suspicion seems to be pretty groundless.... The fact that VFAT enforces an encoding does not lend itself to your argument, for two reasons:

1) VFAT is not a Unix filesystem. It's a filesystem that's compatible with Windows/DOS. If Windows and DOS have filesystem encodings, then it makes sense for that driver to enforce that as well. Filesystems intended to be used natively on Linux/Unix do not necessarily make this design decision.

2) The encoding is specified when mounting the filesystem. This means that you can still mix encodings in a number of ways. If you mount with an encoding that has full byte coverage, for instance, each user can put filenames from different encodings on there. If you mount with utf8 on a system which uses euc-jp as the default encoding, you can have full paths that contain a mix of utf-8 and euc-jp. Etc. -Toshio

On Fri, 1 May 2009 06:55:48 am Thomas Breuel wrote:
Works for me under Fedora using ext3 as the file system. $ python2.6 Python 2.6.1 (r261:67515, Dec 24 2008, 00:33:13) [GCC 4.1.2 20070502 (Red Hat 4.1.2-12)] on linux2 Type "help", "copyright", "credits" or "license" for more information.
Given that chr(255) is a valid filename on my file system, I would consider it a bug if Python couldn't deal with a file with that name. -- Steven D'Aprano

On 30 Apr, 2009, at 21:33, Piet van Oostrum wrote:
That's odd. Which version of OSX do you use? ronald@Rivendell-2[0]$ sw_vers ProductName: Mac OS X ProductVersion: 10.5.6 BuildVersion: 9G55 [~/testdir] ronald@Rivendell-2[0]$ /usr/bin/python Python 2.5.1 (r251:54863, Jan 13 2009, 10:26:13) [GCC 4.0.1 (Apple Inc. build 5465)] on darwin Type "help", "copyright", "credits" or "license" for more information.
And likewise with python 2.6.1+ (after cleaning the directory): [~/testdir] ronald@Rivendell-2[0]$ python2.6 Python 2.6.1+ (release26-maint:70603, Mar 26 2009, 08:38:03) [GCC 4.0.1 (Apple Inc. build 5493)] on darwin Type "help", "copyright", "credits" or "license" for more information.

How can you bring up practical problems against something that hasn't been implemented? The fact that no other language or library does this is perhaps an indication that it isn't the right thing to do. But the biggest problem with the proposal is that it isn't needed: if you want to be able to turn arbitrary byte sequences into unicode strings and back, just set your encoding to iso8859-15. That already works and it doesn't require any changes. Tom

Thomas Breuel <tmbdev <at> gmail.com> writes:
How can you bring up practical problems against something that hasn't been implemented?

The PEP is simple enough that you can simulate its effect by manually computing the resulting unicode string for a hypothetical broken filename. Several people have already done so in this thread.

The fact that no other language or library does this is perhaps an indication that it isn't the right thing to do.

According to some messages, it seems Java and Mono actually use this kind of workaround. Though I haven't checked (I don't use those languages).
That doesn't work at all. With your proposal, any non-ASCII filename will be unreadable; not only the broken ones. Antoine.
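[A quick sketch of Antoine's point: a full-byte-coverage codec like iso8859-15 does round-trip the bytes, but every non-ASCII name turns to mojibake on the way. The filename is illustrative:]

# The raw bytes a UTF-8 system hands us for the filename 'héllo':
raw = 'héllo'.encode('utf-8')        # b'h\xc3\xa9llo'

# iso8859-15 round-trips the bytes losslessly...
name = raw.decode('iso8859-15')
assert name.encode('iso8859-15') == raw

# ...but what the user sees is gibberish, not the real name:
print(name)                           # hÃ©llo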

On 28Apr2009 14:37, Thomas Breuel <tmbdev@gmail.com> wrote: | But the biggest problem with the proposal is that it isn't needed: if you | want to be able to turn arbitrary byte sequences into unicode strings and | back, just set your encoding to iso8859-15. That already works and it | doesn't require any changes. No it doesn't. It does transcode without throwing exceptions. On POSIX. (On Windows? I doubt it - Windows isn't using an 8-bit scheme, I believe.) But it utterly destroys any hope of working nicely in any other locale. The PEP lets you work losslessly in other locales. It _may_ require some app care for particular very weird strings that don't come from the filesystem, but as far as I can see only in circumstances where such care would be needed anyway, i.e. you've got to do special stuff for weirdness in the first place. Weird == "ill-formed unicode string" here. Cheers, -- Cameron Simpson <cs@zip.com.au> DoD#743 http://www.cskk.ezoshosting.com/cs/ I just kept it wide-open thinking it would correct itself. Then I ran out of talent. - C. Fittipaldi

On 28Apr2009 11:49, Antoine Pitrou <solipsis@pitrou.net> wrote: | Paul Moore <p.f.moore <at> gmail.com> writes: | > | > I've yet to hear anyone claim that they would have an actual problem | > with a specific piece of code they have written. | | Yep, that's the problem. Lots of theoretical problems noone has ever encountered | brought up against a PEP which resolves some actual problems people encounter on | a regular basis. | | For the record, I'm +1 on the PEP being accepted and implemented as soon as | possible (preferably before 3.1). I am also +1 on this. I would like utility functions to perform: os-bytes->funny-encoded funny-encoded->os-bytes or explicit example code snippets for same in the PEP text. -- Cameron Simpson <cs@zip.com.au> DoD#743 http://www.cskk.ezoshosting.com/cs/ This person is currently undergoing electric shock therapy at Agnews Developmental Center in San Jose, California. All his opinions are static, please ignore him. Thank you, Nurse Ratched - the sig quote of Bob "Another beer, please" Christ <bhatch@netcom.com>

On 29Apr2009 08:27, Martin v. Löwis <martin@v.loewis.de> wrote: | > I would like utility functions to perform: | > os-bytes->funny-encoded | > funny-encoded->os-bytes | > or explicit example code snippets for same in the PEP text. | | Done! Thanks! -- Cameron Simpson <cs@zip.com.au> DoD#743 http://www.cskk.ezoshosting.com/cs/

Paul Moore writes:
Yes, it's a real concern. I don't think it's possible to show a small piece of code one could point at and say "without a better API I bet you can't write this correctly," though. Rather, my experience with Emacs and various mail packages is that without type information it is impossible to keep track of the myriad bits and pieces of text that are recombining like pig flu, and eventually one breaks out and causes an error. It's usually easy to fix, but so are the next hundred similar regressions, and in the meantime a hundred users have suffered more or less damage or at least annoyance. There's no question that dealing with escapes of funny-decoded strings to unprepared code paths is mission creep compared to Martin's stated purpose for PEP 383, but it is also a real problem.

Simon Cross wrote:
[I hope, by "second part", you refer to the part that I left] It's true that UTF-8 could represent all Windows file names. However, the byte-oriented APIs of Windows do not use UTF-8, but instead, they use the Windows ANSI code page (which varies with the installation).
No, because the Windows API would interpret the bytes differently, and not find the right file. Regards, Martin

Why not use U+DCxx for non-UTF-8 encodings too?
I thought of that, and was tricked into believing that only U+DC8x is a half surrogate. Now I see that you are right, and have fixed the PEP accordingly. Regards, Martin

Thanks for writing this PEP 383, MvL. I recently ran into this problem in Python 2.x in the Tahoe project [1]. The Tahoe project should be considered a good use case showing what some people need. For example, the assumption that a file will later be written back into the same local filesystem (and thus luckily use the same encoding) from which it originally came doesn't hold for us, because Tahoe is used for file-sharing as well as for backup-and-restore. One of my first conclusions in pursuing this issue is that we can never use the Python 2.x unicode APIs on Linux, just as we can never use the Python 2.x str APIs on Windows [2]. (You mentioned this ugliness in your PEP.) My next conclusion was that the Linux way of doing encoding of filenames really sucks compared to, for example, the Mac OS X way. I'm heartened to see what David Wheeler is trying to persuade the maintainers of Linux filesystems to improve some of this: [3]. My final conclusion was that we needed to have two kinds of workaround for the Linux suckage: first, if decoding using the suggested filesystem encoding fails, then we fall back to mojibake [4] by decoding with iso-8859-1 (or else with windows-1252 -- I'm not sure if it matters and I haven't yet understood if utf-8b offers another alternative for this case). Second, if decoding succeeds using the suggested filesystem encoding on Linux, then write down the encoding that we used and include that with the filename. This expands the size of our filenames significantly, but it is the only way to allow some future programmer to undo the damage of a falsely-successful decoding. Here's our whole plan: [5]. Regards, Zooko [1] http://allmydata.org [2] http://allmydata.org/pipermail/tahoe-dev/2009-March/001379.html # see the footnote of this message [3] http://www.dwheeler.com/essays/fixing-unix-linux-filenames.html [4] http://en.wikipedia.org/wiki/Mojibake [5] http://allmydata.org/trac/tahoe/ticket/534#comment:47
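[A sketch of the fallback Zooko describes; decode_filename is a hypothetical helper, and Tahoe's actual plan is in [5] above:]

import sys

def decode_filename(raw: bytes):
    """Return (text, encoding_used) so the decoding can be undone later."""
    fs_encoding = sys.getfilesystemencoding()
    try:
        return raw.decode(fs_encoding), fs_encoding
    except UnicodeDecodeError:
        # Fall back to mojibake: iso-8859-1 decodes any byte, so the
        # original bytes stay recoverable even though the text is garbled.
        return raw.decode('iso-8859-1'), 'iso-8859-1'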

How about another str-like type, a sequence of char-or-bytes? Could be called strbytes or stringwithinvalidcharacters. It would support whatever subset of str functionality makes sense / is easy to implement, plus a to_escaped_str() method (that does the escaping the PEP talks about) for people who want to use regexes or other str-only stuff. Here is a description by example:

    os.listdir('.') -> [strbytes('normal_file'), strbytes('bad', 128, 'file')]
    strbytes('a')[0] -> strbytes('a')
    strbytes('bad', 128, 'file')[3] -> strbytes(128)
    strbytes('bad', 128, 'file').to_escaped_str() -> 'bad?128file'

Having a separate type is cleaner than a "str that isn't exactly what it represents". And making the escaping an explicit (but rarely-needed) step would be less surprising for users. Anyway, I don't know a whole lot about this issue so there may be an obvious reason this is a bad idea.
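A rough sketch of what such a type might look like (purely illustrative; the class and method names follow the example above, everything else is invented here):

    # Illustrative sketch only: a sequence whose items are either
    # one-character strings or raw byte values (ints).
    class strbytes:
        def __init__(self, *parts):
            self._items = []
            for part in parts:
                if isinstance(part, str):
                    self._items.extend(part)    # one entry per character
                else:
                    self._items.append(part)    # an undecodable byte value

        def __getitem__(self, index):
            return strbytes(self._items[index])

        def to_escaped_str(self):
            # Replace each raw byte with a visible escape, matching the
            # example: strbytes('bad', 128, 'file') -> 'bad?128file'
            return "".join(item if isinstance(item, str) else "?%d" % item
                           for item in self._items)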

On 22 Apr 2009, at 07:50, Martin v. Löwis wrote:
Forgive me if this has been covered. I've been reading this thread for a long time and still have a 100-odd replies to go... How do I get a printable Unicode version of these path strings if they contain non-Unicode data? I'm guessing that an app has to understand that filenames come in two forms, Unicode and bytes, if it's not UTF-8 data. Why not simply return a string if it's valid UTF-8, and otherwise return bytes? Then in the app you check the type of the object, string or bytes, and deal with reporting errors appropriately. Barry

On 29Apr2009 23:41, Barry Scott <barry@barrys-emacs.org> wrote:
Personally, I'd use repr(). One might ask, what would you expect to see if you were printing such a string?
Because it complicates the app enormously, for every app. It would be _nice_ to just call os.listdir() et al with strings, get strings, and not worry.

With strings becoming unicode in Python3, on POSIX you have an issue of deciding how to get its filenames-are-bytes into a string and the reverse. One could naively map the byte values to the same Unicode code points, but that results in strings that do not contain the same characters as the user/app expects for byte values above 127. Since POSIX does not really have a filesystem level character encoding, just a user environment setting that says how the current user encodes characters into bytes (UTF-8 is increasingly common and useful, but it is not universal), it is more useful to decode filenames on the assumption that they represent characters in the user's (current) encoding convention; that way when things are displayed they are meaningful, and they interoperate well with strings made by the user/app.

If all the filenames were actually encoded that way when made, that works. But different users may adopt different conventions, and indeed a user may have used ASCII or an ISO8859-* coding in the past and be transitioning to something else now, so they will have a bunch of files in different encodings.

The PEP uses the user's current encoding with a handler for byte sequences that don't decode to valid Unicode scalar values, in a fashion that is reversible. That is, you get "strings" out of listdir() and those strings will go back in (eg to open()) perfectly robustly. Previous approaches would either silently hide non-decodable names in listdir() results, or throw exceptions when the decode failed, or mangle things non-reversibly. I believe Python3 went with the first option there. The PEP at least lets programs naively access all files that exist, and create a filename from any well-formed unicode string provided that the filesystem encoding permits the name to be encoded.

The lengthy discussion mostly revolves around:

- Glenn points out that strings that came _not_ from listdir, and that are _not_ well-formed unicode (== "have bare surrogates in them") but that were intended for use as filenames will conflict with the PEP's scheme - programs must know that these strings came from outside and must be translated into the PEP's funny-encoding before use in the os.* functions. Previous to the PEP they would get used directly and encode differently after the PEP, thus producing different POSIX filenames. Breakage.

- Glenn would like the encoding to use Unicode scalar values only, using a rare-in-filenames character. That would avoid the issue with "outside" strings that contain surrogates. To my mind it just moves the punning from rare illegal strings to merely uncommon but legal characters.

- Some parties think it would be better to not return strings from os.listdir but a subclass of string (or at least a duck-type of string) that knows where it came from and is also handily recognisable as not-really-a-string for purposes of deciding whether it is PEP-funny-encoded by direct inspection.

Cheers, -- Cameron Simpson <cs@zip.com.au> DoD#743 http://www.cskk.ezoshosting.com/cs/ The peever can look at the best day in his life and sneer at it. - Jim Hill, JennyGfest '95

On Thu, Apr 30, 2009, Cameron Simpson wrote:
Assuming people agree that this is an accurate summary, it should be incorporated into the PEP. -- Aahz (aahz@pythoncraft.com) <*> http://www.pythoncraft.com/ "If you think it's expensive to hire a professional to do the job, wait until you hire an amateur." --Red Adair

On approximately 4/29/2009 7:50 PM, came the following characters from the keyboard of Aahz:
I'll agree that once other misconceptions were explained away, that the remaining issues are those Cameron summarized. Thanks for the summary! Point two could be modified because I've changed my opinion; I like the invariant Cameron first (I think) explicitly stated about the PEP as it stands, and that I just reworded in another message, that the strings that are altered by the PEP in either direction are in the subset of strings that contain fake (from a strict Unicode viewpoint) characters. I still think an encoding that uses mostly real characters that have assigned glyphs would be better than the encoding in the PEP; but would now suggest that an escape character be a fake character. I'll note here that while the PEP encoding causes illegal bytes to be translated to one fake character, the 3-byte sequence that looks like the range of fake characters would also be translated to a sequence of 3 fake characters. This is 512 combinations that must be translated, and understood by the user (or at least by the programmer). The "escape sequence" approach requires changing only 257 combinations, and each altered combination would result in exactly 2 characters. Hence, this seems simpler to understand, and to manually encode and decode for debugging purposes. -- Glenn -- http://nevcal.com/ =========================== A protocol is complete when there is nothing left to remove. -- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking

How do I get a printable Unicode version of these path strings if they contain non-Unicode data?
Define "printable". One way would be to use a regular expression, replacing all codes in a certain range with a question mark.
That would have been an alternative solution, and the one that 2.x uses for listdir. People didn't like it. Regards, Martin

On 30 Apr 2009, at 05:52, Martin v. Löwis wrote:
What I mean by printable is that the string must be valid Unicode that I can print to a UTF-8 console or place as text in a UTF-8 web page. I think your PEP gives me a string that will not encode to the valid UTF-8 that the world outside Python expects. Did I get this point wrong?
In our application we are running Fedora with the assumption that the filenames are UTF-8. When Windows systems FTP files to our system the files are in CP-1251(?) and not valid UTF-8. What we have to do is detect these non-UTF-8 filenames and get the users to rename them. Having an algorithm that says "if it's a string, no problem; if it's bytes, deal with the exceptions" seems simple. How do I do this detection with the PEP proposal? Do I end up using the byte interface and doing the UTF-8 decode myself? Barry

You are right. However, if your *only* requirement is that it should be printable, then this is fairly underspecified. One way to get a printable string would be this function:

    def printable_string(unprintable):
        return ""

This will always return a printable version of the input string...
That would be a bug in your FTP server, no? If you want all file names to be UTF-8, then your FTP server should arrange for that.
No, you should encode using the "strict" error handler, with the locale encoding. If the file name encodes successfully, it's correct, otherwise, it's broken. Regards, Martin

On 30 Apr 2009, at 21:06, Martin v. Löwis wrote:
Ha ha! Indeed this works, but I would have to try to turn enough of the string into a reasonable hint at the name of the file so the user has some chance of knowing what is being reported.
Not a bug, it's the lack of a feature. We use ProFTPd, which has just implemented what is required. I forget the exact details - they are at work - but when the FTP client asks for the FEAT of the FTP server, the server can say "use UTF-8". Supporting that in the server was apparently non-trivial.
O.k. I understand. Barry

Barry Scott wrote:
What do you do currently? The PEP just offers a way of reading all filenames as Unicode, if that's what you want. So what if the strings can't be encoded to normal UTF-8! The filenames aren't valid UTF-8 anyway! :-)

Martin v. Löwis wrote:
I'm proposing the following PEP for inclusion into Python 3.1. Please comment.
That seems like a much nicer solution than having parallel bytes/Unicode APIs everywhere. When the locale encoding is UTF-8, would UTF-8b also be used for the command line decoding and environment variable encoding/decoding? (the PEP currently only states that the encoding switch will be done for the file system encoding - it is silent regarding the other two system interfaces). Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia ---------------------------------------------------------------

Martin v. Löwis wrote:
"correct" -> "corrected"
Would this mean that real private use characters in the file name would raise an exception? How? The UTF-8 decoder doesn't pass those bytes to any error handler.
Then the error callback for encoding would become specific to the target encoding. Would this mean that the handler checks which encoding is used and behaves like "strict" if it doesn't recognize the encoding?
Is this done by the codec, or the error handler? If it's done by the codec I don't see a reason for the "python-escape" error handler.
I thought the error handler would be used for decoding.
"and" -> "an"
Servus, Walter

"correct" -> "corrected"
Thanks, fixed.
The python-escape codec is only used/meaningful if the env encoding is not UTF-8. For any other encoding, it is assumed that no character actually maps to the private-use characters.
Why would it become specific? It can work the same way for any encoding: take U+F01xx, and generate the byte xx.
utf-8b is a new codec. However, the utf-8b codec is only used if the env encoding would otherwise be utf-8. For utf-8b, the error handler is indeed unnecessary.
It's used in both directions: for decoding, it converts \xXX to U+F01XX. For encoding, U+F01XX will trigger an error, which is then handled by the handler to produce \xXX.
"and" -> "an"
Thanks, fixed. Regards, Martin
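For concreteness, a minimal sketch of how such a handler could be registered (illustrative code only, not the PEP's implementation; note that returning bytes from the encode side relies on the extended error handler interface the PEP proposes):

    import codecs

    def python_escape(exc):
        # Decoding: map each undecodable byte 0xXX to private-use U+F01XX.
        if isinstance(exc, UnicodeDecodeError):
            byte = exc.object[exc.start]
            return (chr(0xF0100 + byte), exc.start + 1)
        # Encoding: map U+F01XX back to the single byte 0xXX.
        if isinstance(exc, UnicodeEncodeError):
            code = ord(exc.object[exc.start])
            if 0xF0100 <= code <= 0xF01FF:
                return (bytes([code - 0xF0100]), exc.start + 1)
        raise exc

    codecs.register_error("python-escape", python_escape)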

Martin v. Löwis wrote:
Which should be true for any encoding from the pre-unicode era, but not for UTF-16/32 and variants.
If any error callback emits bytes these byte sequences must be legal in the target encoding, which depends on the target encoding itself. However for the normal use of this error handler this might be irrelevant, because those filenames that get encoded were constructed in such a way that reencoding them regenerates the original byte sequence.
Wouldn't it make more sense to be consistent in how non-decodable bytes get decoded? I.e. should the utf-8b codec decode those bytes to PUA characters too (and refuse to encode them, so the error handler outputs them)?
But only for non-UTF8 encodings? Servus, Walter

On 2009-04-22 22:06, Walter Dörwald wrote:
Actually it's not even true for the pre-Unicode codecs. It was and is common for Asian companies to use company-specific symbols in private use areas or extended versions of CJK character sets. Microsoft even published an editor for Asian users to create their own glyphs as needed:
http://msdn.microsoft.com/en-us/library/cc194861.aspx
Here's an overview for some US companies using such extensions:
http://scripts.sil.org/cms/SCRIPTs/page.php?site_id=nrsi&item_id=VendorUseOfPUA
(it's no surprise that most of these actually defined their own charsets) SIL even started a registry for the private use areas (PUAs):
http://scripts.sil.org/cms/SCRIPTs/page.php?site_id=nrsi&cat_id=UnicodePUA
This is their current list of assignments:
http://scripts.sil.org/cms/SCRIPTs/page.php?site_id=nrsi&item_id=SILPUAassignments
and here's how to register:
http://scripts.sil.org/cms/SCRIPTs/page.php?site_id=nrsi&cat_id=UnicodePUA#404a261e
-- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Apr 22 2009)

Right. However, these can't appear as environment/file system encodings, because they use null bytes.
No. The whole process started with data having an *invalid* encoding in the source encoding (which, after the roundtrip, is now the target encoding). So the python-escape error handler deliberately produces byte sequences that are invalid in the environment encoding (hence the additional permission of having it produce bytes instead of characters).
Exactly so. The error handler is not of much use outside this specific scenario.
Unfortunately, that won't work. If the original encoding is UTF-8, and uses PUA characters, then, on re-encoding, it's not possible to tell whether to encode as a PUA character, or as an invalid byte. This was my original proposal a year ago, and people immediately suggested that it is not at all acceptable if there is the slightest chance of information loss. Hence the current PEP.
Right. For ease of use, the implementation will specify the error handler regardless, and the recommended use for applications will be to use the error handler regardless. For utf-8b, the error handler will never be invoked, since all input can be converted always. Regards, Martin

MRAB wrote:
I apparently have not expressed it clearly, so please help me improve the text. What I mean is this:

- if the environment encoding (for lack of a better name) is UTF-8, Python stops using the utf-8 codec under this PEP, and switches to the utf-8b codec.

- otherwise (env encoding is not utf-8), undecodable bytes get decoded with the error handler. In this case, U+F01xx will not occur in the byte stream, since no other codec ever produces this PUA character (this is not fully true - UTF-16 may also produce PUA characters, but they can't appear as env encodings).

So the case you are referring to should not happen. Regards, Martin
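A sketch of that dispatch in code (illustrative only; "utf-8b" is the codec name the PEP proposes, not an existing codec, and the variable names are invented):

    import codecs, locale

    env_encoding = locale.getpreferredencoding()
    if codecs.lookup(env_encoding).name == "utf-8":
        # UTF-8 locale: switch to the proposed utf-8b codec; the error
        # handler is then never needed.
        fs_encoding, fs_errors = "utf-8b", "strict"
    else:
        # Other locales: keep the locale's codec, and decode undecodable
        # bytes with the python-escape error handler.
        fs_encoding, fs_errors = env_encoding, "python-escape"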

Martin v. Löwis wrote:
I think what's confusing me is that you talk about mapping non-decodable bytes to U+F01xx, but you also talk about decoding to half surrogate codes U+DC80..U+DCFF. If the bytes are mapped to single half surrogate codes instead of the normal pairs (high+low), then I can see that decoding could never be ambiguous and encoding could produce the original bytes.

Martin v. Löwis wrote:
I find the PEP easier to understand now. In detail I'd say that if a sequence of bytes >= 0x80 is found which is not valid UTF-8, then the first byte is mapped to a half surrogate and then decoding is continued from the next byte. The only drawback I can see is if the UTF-8 bytes actually decode to a half surrogate. However, half surrogates should really only occur in UTF-16 (as I understand it), so they shouldn't be encoded in UTF-8 anyway! As for handling this case, you could either:

1. Raise an exception (which is what you're trying to avoid), or
2. Treat it as invalid UTF-8 and map the bytes to half surrogates (encoding would produce the original bytes).

I'd prefer option 2. Anyway, +1 from me.

Right: that's the rationale for UTF-8b. Encoding half surrogates violates parts of the Unicode spec, so UTF-8b is "safe".
I hadn't thought of this case, but you are right - they *are* illegal bytes, after all. Raising an exception would be useless since the whole point of this codec is to never raise unicode errors. Regards, Martin
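To illustrate the intended round-trip with an invented file name (shown here with "surrogateescape", the name this error handler eventually shipped under in Python 3.1; the PEP calls the UTF-8 variant "utf-8b"):

    >>> b"g\xfcrkin".decode("utf-8", "surrogateescape")   # 0xFC is invalid UTF-8
    'g\udcfcrkin'
    >>> 'g\udcfcrkin'.encode("utf-8", "surrogateescape")  # and back, losslessly
    b'g\xfcrkin'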

On 06:50 am, martin@v.loewis.de wrote:
I'm proposing the following PEP for inclusion into Python 3.1. Please comment.
-1. On UNIX, character data is not sufficient to represent paths. We must, must, must continue to have a simple bytes interface to these APIs. Covering it up in layers of obscure encoding hacks will not make the problem go away, it will just make it harder to understand. To make matters worse, Linux and GNOME use the PUA for some printable characters. If you open up charmap on an ubuntu system and select "view by unicode character block", then click on "private use area", you'll see many of these. I know that Apple uses at least a few PUA codepoints for the apple logo and the propeller/option icons as well. I am still -1 on any turn-non-decodable-bytes-into-text, because it makes life harder for those of us trying to keep bytes and text straight, but if you absolutely must represent POSIX filenames as mojibake rather than bytes, the only workable solution is to use NUL as your escape character. That's the only code point which _actually_ can't show up in a filename somehow. As we discussed last time, this is what Mono does with System.IO.Path. As a bonus, it's _much_ easier to detect a NUL from random application code than to try to figure out if a string has any half-surrogates or magic PUA characters which shouldn't be interpreted according to platform PUA rules.

On 22/04/2009 14:20, glyph@divmod.com wrote:
As a hg developer, I have to concur. Keeping bytes-based APIs intact would make porting hg to py3k much, much easier. You may be able to imagine that dealing with paths correctly cross-platform on a VCS is a major PITA, and py3k is currently not helping the situation. Cheers, Dirkjan

Yeah, but IIRC a complete set of bytes APIs doesn't exist yet in py3k.
Define complete. I'm not aware of any interfaces wrt. file IO that are lacking, so which ones were you thinking of? Python doesn't currently provide a way to access environment variables and command line arguments as bytes. With the PEP, such a way would actually become available for applications that desire it. Regards, Martin

On Wed, 22 Apr 2009 at 21:21, "Martin v. Löwis" wrote:
Those are the two that I'm thinking of. I think I understand your proposal better now after your example of implementing listdir(bytes). Putting it in the PEP would probably be a good idea. I personally don't have enough practice in actually working with various encodings (or any understanding of unicode escapes) to comment further. --David

Dirkjan Ochtman <dirkjan <at> ochtman.nl> writes:
bytes-based APIs are certainly more bullet-proof under Unix, but it's the reverse under Windows. Martin's proposal aims to bridge the gap and propose something that makes text-based APIs as bullet-proof under Unix as they already are under Windows. Regards Antoine.

Dirkjan Ochtman wrote:
I find these statements contradicting: py3k *is* keeping the byte-based APIs for file names intact, so why is it not helping the situation, when this is what is needed to make porting much, much easier? Regards, Martin

I'd like to respond to this concern in three ways:

1. The PEP doesn't remove any of the existing interfaces. So if the interfaces for byte-oriented file names in 3.0 work fine for you, feel free to continue to use them.

2. Even if they were taken away (which the PEP does not propose to do), it would be easy to emulate them for applications that want them. For example, listdir could be wrapped as

    def listdir_b(bytestring):
        fse = sys.getfilesystemencoding()
        string = bytestring.decode(fse, "python-escape")
        for fn in os.listdir(string):
            yield fn.encode(fse, "python-escape")

3. I still disagree that we must, must, must continue to provide these interfaces. I don't understand from the rest of your message what would *actually* break if people would use the proposed interfaces.

Regards, Martin

On 07:17 pm, martin@v.loewis.de wrote:
It's good to know this. It would be good if the PEP made it clear that it is proposing an additional way to work with undecodable bytes, not replacing the existing one. For me, this PEP isn't an acceptable substitute for direct bytes-based access to command-line arguments and environment variables on UNIX. To my knowledge *those* APIs still don't exist yet. I would like it if this PEP were not used as an excuse to avoid adding them.
2. Even if they were taken away (which the PEP does not propose to do), it would be easy to emulate them for applications that want them.
I think this is a pretty clear abstraction inversion. Luckily nobody is proposing it :).
3. I still disagree that we must, must, must continue to provide these interfaces.
You do have a point; if there is a clean, defined mapping between str and bytes in terms of all path/argv/environ APIs, then we don't *need* those APIs, since we can just implement them in terms of characters. But I still think that's a bad idea, since mixing the returned strings with *other* APIs remains problematic. However, I still think the mapping you propose is problematic...
I don't understand from the rest of your message what would *actually* break if people would use the proposed interfaces.
As far as more concrete problems: the utf-8 codec currently in Python 2.5, 2.6, and 3.0 will happily encode half-surrogates, at least in the builds I have:

    >>> '\udc81'.encode('utf-8').decode('utf-8')
    '\udc81'

So there's an ambiguity when passing U+DC81 to this codec: do you mean \xed\xb2\x81 or do you just mean \x81? Of course it would be possible to make UTF-8B consistent in this regard, but it is still going to interact with code that thinks in terms of actual UTF-8, and the failure mode here is very difficult to inspect. A major problem here is that it's very difficult to puzzle out whether anything *will* actually break. I might be wrong about the above for some subtlety of unicode that I don't quite understand, but I don't want to spend all day experimenting with every possible set of build options, python versions, and unicode specifications. Neither, I wager, do most people who want to call listdir().

Another specific problem: looking at the Character Map application on my desktop, U+F0126 and U+F0127 are considered printable characters. I'm not sure what they're supposed to be, exactly, but there are glyphs there. This is running Ubuntu 8.04; there may be more of these in use in more recent versions of GNOME. There is nothing "private" about the "private use" area; Python can never use any of these characters for *anything*, except possibly internally in ways which are never exposed to application code, because the operating system (or window system, or libraries) might use them. If I pass a string with those printable PUA-A characters in it to listdir(), what happens? Do they get turned into bytes, or do they only get turned into bytes if my filesystem encoding happens to be something other than UTF-8...?

The PEP seems a bit ambiguous to me as far as how the PUA hack and the half-surrogate hack interact. I could be wrong, but it seems to me to be an either-or proposition, in which case there would be *four* bytes types in python 3.1: bytes, bytearray, str-with-PUA-A-junk, and str-with-half-surrogate-junk. Detecting the difference would be an expensive and subtle affair; the simplest solution I could think of would be to use an error-prone regex. If the encoding hack used were simply NUL, then the detection would be straightforward: "if '\u0000' in thingy:".

Ultimately I think I'm only -0 on all of this now, as long as we get bytes versions of environ and argv. Even if these corner-case issues aren't fixed, those of us who want to have correct handling of undecodable filenames can do so.

On 22Apr2009 21:17, Martin v. Löwis <martin@v.loewis.de> wrote:
| > -1. On UNIX, character data is not sufficient to represent paths. We
| > must, must, must continue to have a simple bytes interface to these
| > APIs.
|
| I'd like to respond to this concern in three ways:
|
| 1. The PEP doesn't remove any of the existing interfaces. So if the
| interfaces for byte-oriented file names in 3.0 work fine for you,
| feel free to continue to use them.

Ok. I think I had read things as supplanting byte-oriented interfaces with this exciting new strings-can-do-it-all approach.

| 2. Even if they were taken away (which the PEP does not propose to do),
| it would be easy to emulate them for applications that want them.
| For example, listdir could be wrapped as
|
|     def listdir_b(bytestring):
|         fse = sys.getfilesystemencoding()

Alas, no, because there is no sys.getfilesystemencoding() at the POSIX level. It's only the user's current locale stuff on a UNIX system, and has _nothing_ to do with the filesystem, because UNIX filesystems don't have encodings. In particular, because the "best" (or to my mind "misleading") you can do for this is report what the current user thinks:
  http://docs.python.org/library/sys.html#sys.getfilesystemencoding
there's no guarantee that what is chosen has any relationship to what was in use when the files being consulted were made.

Now, if I were writing listdir_b() I'd want to be able to do something along these lines:
- set LC_ALL=C (or some equivalent mechanism)
- have os.listdir() read bytes as numeric values and transcode their values _directly_ into the corresponding Unicode code points
- yield bytes( ord(c) for c in os_listdir_string )
- have os.open() et al transcode unicode code points back into bytes

i.e. a straight one-to-one mapping, using only codepoints in the range 1..255. Then I'd have some confidence that I had got hold of the bytes as they had come from the underlying UNIX system call, and a way to get those bytes _back_ to a UNIX system call intact.

|         string = bytestring.decode(fse, "python-escape")
|         for fn in os.listdir(string):
|             yield fn.encode(fse, "python-escape")
|
| 3. I still disagree that we must, must, must continue to provide these
| interfaces. I don't understand from the rest of your message what
| would *actually* break if people would use the proposed interfaces.

My other longer message describes what would break, if I understand your proposal.
-- Cameron Simpson <cs@zip.com.au> DoD#743 http://www.cskk.ezoshosting.com/cs/
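The straight one-to-one mapping described here is, modulo the exclusion of NUL, what the latin-1 codec already provides; a tiny sketch with invented data:

    >>> raw = b"\xc3\x28"              # not valid UTF-8
    >>> s = raw.decode("latin-1")      # bytes 0x00-0xFF -> U+0000-U+00FF
    >>> s.encode("latin-1") == raw     # perfectly reversible
    True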

No, what? No, that algorithm would be incorrect?
So can you produce a specific example where my proposed listdir_b function would fail to work correctly? For it to work, it is not necessary that POSIX has no notion of character sets on the file system level (which is actually not true - POSIX very well recognizes the notion of character sets for file names, and recommends that you restrict yourself to the portable character set).
For this PEP, it's irrelevant. It will work even if the chosen encoding is a bad choice.
That would be an alternative approach to the same problem (and one that I think will fail more badly than the one I'm proposing). Regards, Martin

On 22Apr2009 08:50, Martin v. Löwis <martin@v.loewis.de> wrote:
| File names, environment variables, and command line arguments are
| defined as being character data in POSIX;

Specific citation please? I'd like to check the specifics of this.

| the C APIs however allow passing arbitrary bytes - whether these
| conform to a certain encoding or not.

Indeed.

| This PEP proposes a means of dealing with such irregularities by
| embedding the bytes in character strings in such a way that allows
| recreation of the original byte string.
[...]

So you're proposing that all POSIX OS interfaces (which use byte strings) interpret those byte strings into Python3 str objects, with a codec that will accept arbitrary byte sequences losslessly and is totally reversible, yes?

And, I hope, that the os.* interfaces silently use it by default.

| For most applications, we assume that they eventually pass data
| received from a system interface back into the same system
| interfaces. For example, and application invoking os.listdir() will
| likely pass the result strings back into APIs like os.stat() or
| open(), which then encodes them back into their original byte
| representation. Applications that need to process the original byte
| strings can obtain them by encoding the character strings with the
| file system encoding, passing "python-escape" as the error handler
| name.

-1

This last sentence kills the idea for me, unless I'm missing something. Which I may be, of course.

POSIX filesystems _do_not_ have a file system encoding. The user's environment suggests a preferred encoding via the locale stuff, and apps honouring that will make nice looking byte strings as filenames for that user.

(Some platforms, like MacOSX' HFS filesystems, _do_ enforce an encoding, and a quite specific variety of UTF-8 it is; I would say they're not a full UNIX filesystem _precisely_ because they reject certain byte strings that are valid on other UNIX filesystems. What will your proposal do here? I can imagine it might cope with existing names, but what happens when the user creates a new name?)

Further, different users can use different locales and encodings. If they do it in different work areas they'll be perfectly happy; if they do it in a shared area doubtless confusion will reign, but only in the users' minds, not in the filesystem.

If I'm writing a general purpose UNIX tool like chmod or find, I expect it to work reliably on _any_ UNIX pathname. It must be totally encoding blind. If I speak to the os.* interface to open a file, I expect to hand it bytes and have it behave. As an explicit example, I would be just fine with python's open(filename, "w") taking a string and encoding it for use, but _not_ ok for os.open() to require me to supply a string and cross my fingers and hope something sane happens when it is turned into bytes for the UNIX system call.

I'm very much in favour of being able to work in strings for most purposes, but if I use the os.* interfaces on a UNIX system it is necessary to be _able_ to work in bytes, because UNIX file pathnames are bytes.

If there isn't a byte-safe os.* facility in Python3, it will simply be unsuitable for writing low level UNIX tools. And I very much like using Python2 for that.

Finally, I have a small python program whose whole purpose in life is to transcode UNIX filenames before transfer to a MacOSX HFS directory, because of HFS's enforced particular encoding. What approach should a Python app take to transcode UNIX pathnames under your scheme?
Cheers, -- Cameron Simpson <cs@zip.com.au> DoD#743 http://www.cskk.ezoshosting.com/cs/ The nice thing about standards is that you have so many to choose from; furthermore, if you do not like any of them, you can just wait for next year's model. - Andrew S. Tanenbaum

On 24Apr2009 09:27, I wrote:
| If I'm writing a general purpose UNIX tool like chmod or find, I expect
| it to work reliably on _any_ UNIX pathname. It must be totally encoding
| blind. If I speak to the os.* interface to open a file, I expect to hand
| it bytes and have it behave. As an explicit example, I would be just fine
| with python's open(filename, "w") taking a string and encoding it for use,
| but _not_ ok for os.open() to require me to supply a string and cross
| my fingers and hope something sane happens when it is turned into bytes
| for the UNIX system call.
|
| I'm very much in favour of being able to work in strings for most
| purposes, but if I use the os.* interfaces on a UNIX system it is
| necessary to be _able_ to work in bytes, because UNIX file pathnames
| are bytes.

Just to follow up to my own words here, I would be ok for all the pure-byte stuff to be off in the "posix" module if os.* goes pure character instead of bytes or bytes+strings.
-- Cameron Simpson <cs@zip.com.au> DoD#743 http://www.cskk.ezoshosting.com/cs/ ... that, in a few years, all great physical constants will have been approximately estimated, and that the only occupation which will be left to men of science will be to carry these measurements to another place of decimals. - James Clerk Maxwell (1813-1879) Scientific Papers 2, 244, October 1871

Cameron Simpson wrote:
For example, on environment variables:
http://opengroup.org/onlinepubs/007908799/xbd/envvar.html

# For values to be portable across XSI-conformant systems, the value
# must be composed of characters from the portable character set (except
# NUL and as indicated below).

# Environment variable names used by the utilities in the XCU
# specification consist solely of upper-case letters, digits and the "_"
# (underscore) from the characters defined in Portable Character Set.
# Other characters may be permitted by an implementation;

Or, on command line arguments:
http://opengroup.org/onlinepubs/007908799/xsh/execve.html

# The arguments represented by arg0, ... are pointers to null-terminated
# character strings

where a character string is "A contiguous sequence of characters terminated by and including the first null byte.", and a character is

# A sequence of one or more bytes representing a single graphic symbol
# or control code. This term corresponds to the ISO C standard term
# multibyte character (multi-byte character), where a single-byte
# character is a special case of a multi-byte character. Unlike the
# usage in the ISO C standard, character here has no necessary
# relationship with storage space, and byte is used when storage space
# is discussed.
Correct.
And, I hope, that the os.* interfaces silently use it by default.
Correct.
Why is that a problem for the PEP?
See the other messages. If you want to do that, you can continue to.
Please re-read the PEP. It provides a way of being able to access any POSIX file name correctly, and still pass strings.
If there isn't a byte-safe os.* facility in Python3, it will simply be unsuitable for writing low level UNIX tools.
Why is that? The mechanism in the PEP is precisely defined to allow writing low level UNIX tools.
Compute the corresponding character strings, and use them. Regards, Martin

On 25Apr2009 14:07, "Martin v. Löwis" <martin@v.loewis.de> wrote: | Cameron Simpson wrote: | > On 22Apr2009 08:50, Martin v. Löwis <martin@v.loewis.de> wrote: | > | File names, environment variables, and command line arguments are | > | defined as being character data in POSIX; | > | > Specific citation please? I'd like to check the specifics of this. | For example, on environment variables: | http://opengroup.org/onlinepubs/007908799/xbd/envvar.html [...] | http://opengroup.org/onlinepubs/007908799/xsh/execve.html [...] Thanks. | > So you're proposing that all POSIX OS interfaces (which use byte strings) | > interpret those byte strings into Python3 str objects, with a codec | > that will accept arbitrary byte sequences losslessly and is totally | > reversible, yes? | | Correct. | | > And, I hope, that the os.* interfaces silently use it by default. | | Correct. Ok, then I'm probably good with the PEP. Though I have a quite strong desire to be able to work in bytes at need without doing multiple encode/decode steps. | > | Applications that need to process the original byte | > | strings can obtain them by encoding the character strings with the | > | file system encoding, passing "python-escape" as the error handler | > | name. | > | > -1 | > This last sentence kills the idea for me, unless I'm missing something. | > Which I may be, of course. | > POSIX filesystems _do_not_ have a file system encoding. | | Why is that a problem for the PEP? Because you said above "by encoding the character strings with the file system encoding", which is a fiction. | > If I'm writing a general purpose UNIX tool like chmod or find, I expect | > it to work reliably on _any_ UNIX pathname. It must be totally encoding | > blind. If I speak to the os.* interface to open a file, I expect to hand | > it bytes and have it behave. | | See the other messages. If you want to do that, you can continue to. | | > I'm very much in favour of being able to work in strings for most | > purposes, but if I use the os.* interfaces on a UNIX system it is | > necessary to be _able_ to work in bytes, because UNIX file pathnames | > are bytes. | | Please re-read the PEP. It provides a way of being able to access any | POSIX file name correctly, and still pass strings. | | > If there isn't a byte-safe os.* facility in Python3, it will simply be | > unsuitable for writing low level UNIX tools. | | Why is that? The mechanism in the PEP is precisely defined to allow | writing low level UNIX tools. Then implicitly it's byte safe. Clearly I'm being unclear; I mean original OS-level byte strings must be obtainable undamaged, and it must be possible to create/work on OS objects starting with a byte string as the pathname. | > Finally, I have a small python program whose whole purpose in life | > is to transcode UNIX filenames before transfer to a MacOSX HFS | > directory, because of HFS's enforced particular encoding. What approach | > should a Python app take to transcode UNIX pathnames under your scheme? | | Compute the corresponding character strings, and use them. In Python2 I've been going (ignoring checks for unchanged names): - Obtain the old name and interpret it into a str() "correctly". I mean here that I go: unicode_name = unicode(name, srcencoding) in old Python2 speak. name is a bytes string obtained from listdir() and srcencoding is the encoding known to have been used when the old name was constructed. Eg iso8859-1. - Compute the new name in the desired encoding. 
For MacOSX HFS, that's:

    utf8_name = unicodedata.normalize('NFD', unicode_name).encode('utf8')

  Still in Python2 speak, that's a byte string.

- os.rename(name, utf8_name)

Under your scheme I imagine this is amended. I would change your listdir_b() function as follows:

    def listdir_b(bytestring, fse=None):
        if fse is None:
            fse = sys.getfilesystemencoding()
        string = bytestring.decode(fse, "python-escape")
        for fn in os.listdir(string):
            yield fn.encode(fse, "python-escape")

So, internally, os.listdir() takes a string and encodes it to an _unspecified_ encoding in bytes, and opens the directory with that byte string using POSIX opendir(3). How does listdir() ensure that the byte string it passes to the underlying opendir(3) is identical to 'bytestring' as passed to listdir_b()?

It seems from the PEP that "On POSIX systems, Python currently applies the locale's encoding to convert the byte data to Unicode". Your extension is to augment that by expressing the non-decodable byte sequences in a non-conflicting way for reversal later, yes?

That seems to double the complexity of my example application, since it wants to interpret the original bytes in a caller-specified fashion, not using the locale defaults. So I must go:

    def macify(dirname, srcencoding):
        # I need this to reverse your encoding scheme
        fse = sys.getfilesystemencoding()
        # I'll pretend dirname is ready for use; it has possibly had to
        # undergo the inverse of what happens inside the loop below
        for fn in os.listdir(dirname):
            # listdir reads POSIX bytes from readdir(3), then encodes
            # using the locale encoding, with your escape addition
            bytename = fn.encode(fse, "python-escape")
            oldname = unicode(bytename, srcencoding)
            newbytename = unicodedata.normalize('NFD', oldname).encode('utf8')
            newname = newbytename.decode(fse, "python-escape")
            if fn != newname:
                os.rename(fn, newname)

And I'm sure there's some os.path.join() complexity I have omitted. Is that correct? You'll note I need to recode the oldname unicode string because I don't know that fse is the same as the required target MacOSX UTF-8 NFD encoding.

So if my changes above are correct WRT the PEP, I grant that this is still doable in your scheme. But it would be far far easier with a bytes API. And let us not consider threads or other effects from locale changes during the loop run.

I forget what was decided with the pure-bytes interfaces (out of scope for your PEP). Would there be a posix module with a bytes API?

Cheers, -- Cameron Simpson <cs@zip.com.au> DoD#743 http://www.cskk.ezoshosting.com/cs/ The old day of Perl's try-it-before-you-use-it are long as gone. Nowadays you can write as many as 20..100 lines of Perl without hitting a bug in the perl implementation. - Ilya Zakharevich <ilya@math.ohio-state.edu>, in the perl-porters list, 22sep1998

On Apr 22, 2009, at 2:50 AM, Martin v. Löwis wrote:
I'm proposing the following PEP for inclusion into Python 3.1. Please comment.
+1. Even if some people still want a low-level bytes API, it's important that the easy case be easy. That is: the majority of Python applications should *just work, damnit* even with not-properly-encoded-in-current-LC_CTYPE filenames. It looks like this proposal accomplishes that, and does so in a relatively nice fashion. James

On Wed, Apr 22, 2009 at 8:50 AM, "Martin v. Löwis" <martin@v.loewis.de> wrote:
Is the second part of this actually true? My understanding may be flawed, but surely all Unicode data can be converted to and from bytes using UTF-8? Obviously not all byte sequences are valid UTF-8, but this doesn't prevent one from creating an arbitrary Unicode string using "utf-8 bytes".decode("utf-8"). Given this, can't people who must have access to all files / environment data just use the bytes interface? Disclosure: My gut reaction is that the solution described in the PEP is a hack, but I'm hardly a character encoding expert. My feeling is that the correct solution is to either standardise on the bytes interface as the lowest common denominator, or to add a Path type (and I guess an EnvironmentalData type) and use the new type to attempt to hide the differences. Schiavo Simon
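The asymmetry in question is easy to demonstrate: every str encodes to UTF-8 bytes, but not every byte sequence decodes from UTF-8 (invented example):

    >>> "\u20ac".encode("utf-8")       # any Unicode string encodes cleanly
    b'\xe2\x82\xac'
    >>> b"\xff".decode("utf-8")        # but arbitrary bytes may not decode
    Traceback (most recent call last):
      ...
    UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte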

On approximately 4/24/2009 12:59 AM, came the following characters from the keyboard of Simon Cross:
Oh clearly it is a hack. The right solution of a Path type (and friends) was discarded in earlier discussion, because it would impact too much existing code. The use of bytes would be annoying in the context of py3, where things that you want to display are in str (Unicode). So there is no solution that allows the use of str, and the robustness of bytes, and is 100% compatible with existing practice. Hence the desire is to find a hack that is "good enough". At least, that is my understanding and synopsis. I never saw MvL's original message with the PEP delivered to my mailbox, but some of the replies came there, so I found and extensively replied to it using the Google group / usenet. My reply never showed up here and no one has commented on it either... Should I repost via the mailing list? I think so... I'll just paste it in here, with one tweak fixed that I noticed after I sent it... (Sorry Simon, but it is still the same thread, anyway.) (Sorry to others, if my original reply was seen, and just wasn't worth replying to.) On Apr 21, 11:50 pm, "Martin v. Löwis" <mar...@v.loewis.de> wrote:
I'm proposing the following PEP for inclusion into Python 3.1. Please comment.
Basically the scheme doesn't work. Aside from that, it is very close. There are tons of encoding schemes that could work... they don't have to include half-surrogates or bytes. What they have to do is make sure that they are uniformly applied to all appropriate strings.

The problem with this, and other preceding schemes that have been discussed here, is that there is no means of ascertaining whether a particular file name str was obtained from a str API, or was funny-decoded from a bytes API... and thus there is no means of reliably ascertaining whether a particular filename str should be passed to a str API, or funny-encoded back to bytes. The assumption in the 2nd Discussion paragraph may hold for a large percentage of cases, maybe even including some number of 9s, but it is not guaranteed, and cannot be enforced; therefore there are cases that could fail. Whether those failure cases are a concern or not is an open question.

Picking a character (I don't find U+F01xx in the Unicode standard, so I don't know what it is) that is obscure, and unlikely to be used in "real" file names, might help the heuristic nature of the encoding and decoding avoid most conflicts, but provides no guarantee that data puns will not occur in practice. Today's obscure character is tomorrow's commonly used character, perhaps. Someone not on this list may be happily using that character for their own nefarious, incompatible purpose.

As I realized in the email-sig, in talking about decoding corrupted headers, there is only one way to guarantee this... to encode _all_ character sequences, from _all_ interfaces. Basically it requires reserving an escape character (I'll use ? in these examples -- yes, an ASCII question mark -- it happens to be illegal in Windows filenames so all the better on that platform, but the specific character doesn't matter... avoiding / \ and . is probably good, though).

So the rules would be: when obtaining a file name from the bytes OS interface that doesn't properly decode according to UTF-8, decode it by placing a ? at the beginning, then for each decodable UTF-8 sequence, add a Unicode character -- unless the character is ?, in which case you add two ?? -- and for each non-decodable byte sequence, place a ? and two hex digits, or a ? and a half surrogate code, or a ? and whatever gibberish you like. Two hex digits are fine by me, and will serve for this discussion.

ALSO, when obtaining a file name from the str OS interfaces, encode it too... if it contains any ?, then place a ? at the front, and then any other ? in the name must be doubled.

Then you have a string that can/must be encoded to be used on either str or bytes OS interfaces... or any other interfaces that want str or bytes... but whichever they want, you can do a decode, or determine that you can't, into that form. The encode and decode functions should be available for coders to use, that code to external interfaces, either OS or 3rd party packages, that do not use this encoding scheme.

This encoding scheme would be used throughout all Python APIs (most of which would need very little change to accommodate it). However, programs would have to keep track of whether they were dealing with encoded or unencoded strings, if they use both types in their program (an example is hard-coded file names or file name parts). The initial ? is not strictly necessary for this scheme to work, but I think it would be a good flag to the user that this name has been altered. This scheme does not depend on assumptions about the use of file names.
This scheme would be enhanced if the file name APIs returned a subtype of str for the encoded names, but that should be considered only a hint, not a requirement.

When encoding file name strings to pass to bytes APIs, the ? followed by two hex digits would be converted to a byte. Leading ? would be dropped, and ?? would convert to ?. I don't believe failures are possible when encoding to bytes.

When encoding file name strings to pass to str APIs, the discovery of ? followed by two hex digits would raise an exception: the file name is not acceptable to a str API. However, leading ? would be dropped, ?? would convert to ?, and if no ? followed by two hex digits were found, the file name would be successfully converted for use on the str API.

Note that not even on Unix/Posix is it particularly easy or useful to place a ? into file names from command lines, due to shell escapes, etc. The use of ? in file names also interferes with the easy ability to specifically match them in globs, etc.

Anything short of such an encoding of both types of interfaces, such that it is known that all python-manipulated filenames will be encoded, will have data puns that provide a potential for failure in edge cases. Note that in this scheme, no file names that are fully Unicode and do not contain ? characters are altered by the decoding or the encoding process. That will probably reach quite a few 9s of likelihood that the scheme will go unnoticed by most people and programs and filenames. But the scheme will work reliably if implemented correctly and completely, and will have no edge cases of failure due to not having data puns.

-- Glenn -- http://nevcal.com/ =========================== A protocol is complete when there is nothing left to remove. -- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking

On Fri, Apr 24, 2009 at 11:22 AM, Glenn Linderman <v+python@g.nevcal.com> wrote:
What about keeping the bytes interface (utf-8 encoded Unicode on Windows) and adding a Path type (and friends) interface that mirrors it?
(Sorry Simon, but it is still the same thread, anyway.)
Python discussions do seem to womble through a rather large set of mailing lists and news groups. :) Schiavo Simon

Why is it necessary that you are able to make this distinction?
Picking a character (I don't find U+F01xx in the Unicode standard, so I don't know what it is)
It's a private use area. It will never carry an official character assignment.
I think you'll have to write an alternative PEP if you want to see something like this implemented throughout Python. Regards, Martin

On approximately 4/25/2009 5:22 AM, came the following characters from the keyboard of Martin v. Löwis:
It is necessary that programs (not me) can make the distinction, so that they know whether or not to do the funny-encoding. If a name is funny-decoded when the name is accessed by a directory listing, it needs to be funny-encoded in order to open the file.
I know that U+F0000 - U+FFFFF is a private use area. I don't find a definition of U+F01xx to know what the notation means. Are you picking a particular character within the private use area, or a particular range, or what?
I'm certainly not experienced enough in Python development processes or internals to attempt such, as yet. But somewhere in 25 years of programming, I picked up the knowledge that if you want to have a 1-to-1 reversible mapping, you have to avoid data puns, mappings of two different data values into a single data value. Your PEP, as first written, didn't seem to do that... since there are two interfaces from which to obtain data values, one performing a mapping from bytes to "funny invalid" Unicode, and the other performing no mapping, but accepting any sort of Unicode, possibly including "funny invalid" Unicode, the possibility of data puns seems to exist. I may be misunderstanding something about the use cases that prevent these two sources of "funny invalid" Unicode from ever coexisting, but if so, perhaps you could point it out, or clarify the PEP. I'll try to reread it again... could you post a URL to the most up-to-date version of the PEP, since I haven't seen such appear here, and the version I found via a Google search seems to be the original? -- Glenn -- http://nevcal.com/ =========================== A protocol is complete when there is nothing left to remove. -- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking

On approximately 4/27/2009 12:42 PM, came the following characters from the keyboard of Martin v. Löwis:
So you only need 128 code points, so there is something else unclear. -- Glenn -- http://nevcal.com/ =========================== A protocol is complete when there is nothing left to remove. -- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking

Glenn Linderman wrote:
(please understand that this is history now, since the PEP has stopped using PUA characters). No. You seem to assume that all bytes < 128 decode successfully always. I believe this assumption is wrong, in general:

    py> "\x1b$B' \x1b(B".decode("iso-2022-jp") # 2.x syntax
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    UnicodeDecodeError: 'iso2022_jp' codec can't decode bytes in position 3-4: illegal multibyte sequence

All bytes are below 128, yet it fails to decode. Regards, Martin

On Apr 27, 2009, at 11:35 PM, Martin v. Löwis wrote:
Surely nobody uses iso2022 as an LC_CTYPE encoding. That's expressly forbidden by POSIX, if I'm not mistaken... and I can't see how it would work, considering that it uses all the bytes from 0x20-0x7f, including 0x2f ("/"), to represent non-ascii characters. Hopefully it can be assumed that your locale encoding really is a non-overlapping superset of ASCII, as is required by POSIX... I'm a bit scared at the prospect that U+DCAF could turn into "/"; that just screams security vulnerability to me. So I'd like to propose that only 0x80-0xFF <-> U+DC80-U+DCFF should ever be allowed to be encoded/decoded via the error handler. James

James Y Knight wrote:
Hopefully it can be assumed that your locale encoding really is a non-overlapping superset of ASCII, as is required by POSIX...
Can you please point to the part of the POSIX spec that says that such overlapping is forbidden?
Actually, it would be U+DC2F that would turn into /. I'm happy to exclude that range from the mapping if POSIX really requires an encoding not to be overlapping with ASCII. Regards, Martin

On Apr 28, 2009, at 2:50 AM, Martin v. Löwis wrote:
I can't find it...I would've thought it would be on this page: http://opengroup.org/onlinepubs/007908775/xbd/charset.html but it's not (at least, not obviously). That does say (effectively) that all encodings must be supersets of ASCII and use the same codepoints, though. However, ISO-2022 being inappropriate for LC_CTYPE usage is the entire reason why EUC-JP was created, so I'm pretty sure that it is in fact inappropriate, and I cannot find any evidence of it ever being used on any system. From http://en.wikipedia.org/wiki/EUC-JP: "To get the EUC form of an ISO-2022 character, the most significant bit of each 7-bit byte of the original ISO 2022 codes is set (by adding 128 to each of these original 7-bit codes); this allows software to easily distinguish whether a particular byte in a character string belongs to the ISO-646 code or the ISO-2022 (EUC) code." Also: http://www.cl.cam.ac.uk/~mgk25/ucs/iso2022-wc.html
Yes, I meant to say DC2F, sorry for the confusion.
I'm happy to exclude that range from the mapping if POSIX really requires an encoding not to be overlapping with ASCII.
I think it has to be excluded from mapping in order to not introduce security issues. However... There's also SHIFT-JIS to worry about... which apparently some people actually want to use as their default encoding, despite it being broken to do so. RedHat apparently refuses to provide it as a locale charset (due to its brokenness), and it's also not available by default on my Debian system. People do unfortunately seem to actually use it in real life. https://bugzilla.redhat.com/show_bug.cgi?id=136290

So, I'd like to propose this: The "python-escape" error handler, when given a non-decodable byte from 0x80 to 0xFF, will produce values of U+DC80 to U+DCFF. When given a non-decodable byte from 0x00 to 0x7F, it will be converted to U+0000-U+007F. On the encoding side, values from U+DC80 to U+DCFF are encoded into 0x80 to 0xFF, and all other characters are treated in whatever way the encoding would normally treat them.

This proposal obviously works for all non-overlapping ASCII supersets, where 0x00 to 0x7F always decode to U+0000 to U+007F. But it also works for Shift-JIS and other similar ASCII-supersets with overlaps in trailing bytes of a multibyte sequence. So, a sequence like "\x81\xFD".decode("shift-jis", "python-escape") will turn into u"\uDC81\u00fd". Which will then properly encode back into "\x81\xFD".

The character sets this *doesn't* work for are: ebcdic code pages (obviously completely unsuitable for a locale encoding on unix), iso2022-* (covered above), and shift-jisx0213 (because it has replaced \ with yen, and - with overline). If it's desirable to work with shift_jisx0213, a modification of the proposal can be made: change the second sentence to: "When given a non-decodable byte from 0x00 to 0x7F, that byte must be the second or later byte in a multibyte sequence. In such a case, the error handler will produce the encoding of that byte if it was standing alone (thus in most encodings, \x00-\x7f turn into U+0000-U+007F)." It sounds from https://bugzilla.novell.com/show_bug.cgi?id=162501 like some people do actually use shift_jisx0213, unfortunately. James
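A sketch of the restricted handler proposed here (hypothetical name and code; returning bytes from the encode side again assumes the PEP's extended error handler interface):

    import codecs

    def escape_high_bytes(exc):
        if isinstance(exc, UnicodeDecodeError):
            byte = exc.object[exc.start]
            if byte >= 0x80:
                # Only 0x80-0xFF may be escaped, to U+DC80-U+DCFF; this
                # closes the U+DC2F -> "/" hole discussed above.
                return (chr(0xDC00 + byte), exc.start + 1)
            # Non-decodable bytes 0x00-0x7F pass through as U+0000-U+007F.
            return (chr(byte), exc.start + 1)
        if isinstance(exc, UnicodeEncodeError):
            code = ord(exc.object[exc.start])
            if 0xDC80 <= code <= 0xDCFF:
                return (bytes([code - 0xDC00]), exc.start + 1)
        raise exc

    codecs.register_error("escape-high-bytes", escape_high_bytes)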

James Y Knight wrote:
I've been thinking of "python-escape" only in terms of UTF-8, the only encoding mentioned in the PEP. In UTF-8, bytes 0x00 to 0x7F are decodable. But if you're talking about using it with other encodings, eg shift-jisx0213, then I'd suggest the following:

1. Bytes 0x00 to 0xFF which can't normally be decoded are decoded to half surrogates U+DC00 to U+DCFF.

2. Bytes which would have decoded to half surrogates U+DC00 to U+DCFF are treated as though they are undecodable bytes.

3. Half surrogates U+DC00 to U+DCFF which can be produced by decoding are encoded to bytes 0x00 to 0xFF.

4. Codepoints, including half surrogates U+DC00 to U+DCFF, which can't be produced by decoding raise an exception.

I think I've covered all the possibilities. :-)
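MRAB's rules 1 and 3 can be sketched as a codec error handler, relying on the PEP's extension that lets an encode error handler return bytes directly (so this assumes a Python with that extension). Rules 2 and 4 need cooperation from the codec itself, since a byte that decodes successfully never reaches an error handler. The handler name is made up for illustration:

    import codecs

    def _mrab_escape(exc):
        if isinstance(exc, UnicodeDecodeError):       # rule 1
            bad = exc.object[exc.start:exc.end]
            return ''.join(chr(0xDC00 + b) for b in bad), exc.end
        if isinstance(exc, UnicodeEncodeError):       # rule 3
            chars = exc.object[exc.start:exc.end]
            if all(0xDC00 <= ord(c) <= 0xDCFF for c in chars):
                return bytes(ord(c) - 0xDC00 for c in chars), exc.end
        raise exc                                     # rule 4: anything else fails

    codecs.register_error('mrab-escape', _mrab_escape)

    # b'\xff'.decode('ascii', 'mrab-escape') == '\udcff'
    # '\udcff'.encode('ascii', 'mrab-escape') == b'\xff'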

On approximately 4/28/2009 11:55 AM, came the following characters from the keyboard of MRAB:
UTF-8 is only mentioned in the sense of having special handling for re-encoding; all the other locales/encodings are implicit. But I also went down that path to some extent.
This makes 256 different escape codes.
2. Bytes which would have decoded to half surrogates U+DC00 to U+DCFF are treated as though they are undecodable bytes.
This provides escaping for the 256 different escape codes, which is lacking from the PEP.
3. Half surrogates U+DC00 to U+DCFF which can be produced by decoding are encoded to bytes 0x00 to 0xFF.
This reverses the escaping.
4. Codepoints, including half surrogates U+DC00 to U+DCFF, which can't be produced by decoding raise an exception.
This is confusing. Did you mean "excluding" instead of "including"?
I think I've covered all the possibilities. :-)
You might have. Seems like there could be a simpler scheme, though...

1. Define an escape codepoint. It could be U+003F or U+DC00 or U+F817 or pretty much any defined Unicode codepoint outside the range U+0100 to U+01FF (see rule 3 for why). Only one escape codepoint is needed; this is easier for humans to comprehend.

2. When the escape codepoint is decoded from the byte stream for a bytes interface or found in a str on the str interface, double it.

3. When an undecodable byte 0xPQ is found, decode to the escape codepoint, followed by codepoint U+01PQ, where P and Q are hex digits.

4. When encoding, a sequence of two escape codepoints would be encoded as one escape codepoint, and a sequence of the escape codepoint followed by codepoint U+01PQ would be encoded as byte 0xPQ. Escape codepoints not followed by the escape codepoint, or by a codepoint in the range U+0100 to U+01FF, would raise an exception.

5. Provide functions that will perform the same decoding and encoding as would be done by the system calls, for both bytes and str interfaces.

This differs from my previous proposal in three ways:

A. Doesn't put a marker at the beginning of the string (which I said wasn't necessary even then).

B. Allows for a choice of escape codepoint; the previous proposal suggested a specific one. But the final solution will only have a single one, not a user choice, but an implementation choice.

C. Uses the range U+0100 to U+01FF for the escape codes, rather than U+0000 to U+00FF. This avoids introducing the NULL character and escape characters into the decoded str representation, yet still uses characters for which glyphs are commonly available, are non-combining, and are easily distinguishable one from another.

Rationale: the use of codepoints with visible glyphs makes the escaped string friendlier to display systems, and to people. I still recommend using U+003F as the escape codepoint, but certainly one with a typically visible glyph available. This avoids what I consider to be an annoyance with the PEP: the codepoints used are not ones that are easily displayed, so undecodable names could easily result in long strings of indistinguishable substitution characters. It, like MRAB's proposal, also avoids data puns, which is a major problem with the PEP.

I consider this proposal to be easier to understand than MRAB's proposal, or the PEP, because of the single escape codepoint and the use of visible characters. This proposal, like my initial one, also decodes and encodes (just the escape codes) values on the str interfaces. This is necessary to avoid data puns on systems that provide both types of interfaces. This proposal could be used for programs that use str values, and easily migrates to a solution that provides an object that provides an abstraction for system interfaces that have two forms.

-- Glenn -- http://nevcal.com/ =========================== A protocol is complete when there is nothing left to remove. -- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking
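A rough sketch of rules 1-3 on the decode side, under stated assumptions: '?' (U+003F) as the escape codepoint, a stateless codec (locking-shift encodings are excluded from the discussion anyway), and a naive quadratic longest-decodable-prefix loop; the function name is made up:

    ESCAPE = '?'   # hypothetical escape codepoint (U+003F) from the proposal

    def escape_decode(raw: bytes, encoding: str) -> str:
        out = []
        i = 0
        while i < len(raw):
            j = len(raw)
            while j > i:
                try:
                    chunk = raw[i:j].decode(encoding)
                except UnicodeDecodeError:
                    j -= 1          # shrink until the prefix decodes
                else:
                    out.append(chunk.replace(ESCAPE, ESCAPE * 2))  # rule 2
                    break
            if j == i:              # raw[i] is undecodable: rule 3
                out.append(ESCAPE + chr(0x0100 + raw[i]))
                j = i + 1
            i = j
        return ''.join(out)

    # escape_decode(b'a?\xffb', 'utf-8') == 'a???\u01ffb'
    # ('??' carries a literal '?'; '?\u01ff' carries the byte 0xFF)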

Glenn Linderman wrote:
Speaking personally, I wouldn't call them 'escape codes'. I'd use the term 'escape code' to mean a character that changes the interpretation of the next character(s).
Perhaps I should've said "Any codepoint which can't be produced by decoding should raise an exception". For example, decoding with UTF-8b will never produce U+DC00, therefore attempting to encode U+DC00 should raise an exception and not produce 0x00.
Perhaps the escape character should be U+005C. ;-)

On approximately 4/28/2009 2:01 PM, came the following characters from the keyboard of MRAB:
OK, I won't be offended if you don't call them 'escape codes'. :) But what else to call them? My use of that term is a bit backwards, perhaps... what happens is that because these 256 half surrogates are used to decode otherwise undecodable bytes, they themselves must be "escaped" or translated into something different, when they appear in the byte sequence. The process described reserves a set of codepoints for use, and requires that that same set of codepoints be translated using a similar mechanism to avoid their untranslated appearance in the resulting str. Escape codes have the same sort of characteristic... by replacing their normal use for some other use, they must themselves have a replacement. Anyway, I think we are communicating successfully.
Yes, your rephrasing is clearer, regarding your intention.
Decoding with UTF-8b might never produce U+DC00, but then again, it won't handle the random byte string, either.
Windows users everywhere would love you for that one :)
-- Glenn -- http://nevcal.com/ =========================== A protocol is complete when there is nothing left to remove. -- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking

Glenn Linderman a écrit :
3. When an undecodable byte 0xPQ is found, decode to the escape codepoint, followed by codepoint U+01PQ, where P and Q are hex digits.
The problem with this strategy is: paths are often sliced, so your 2 codepoints could get separated. The good thing with the PEP's strategy is that 1 character stays 1 character. Baptiste
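A tiny illustration of the slicing hazard, assuming the '?' escape and U+01PQ convention sketched above (all names hypothetical):

    ESCAPE = '?'
    name = 'ab' + ESCAPE + '\u01ff' + '.txt'   # escaped form of b'ab\xff.txt'
    head, tail = name[:3], name[3:]
    # head == 'ab?' ends in a bare escape; tail == '\u01ff.txt' starts with
    # the byte-carrier codepoint.  Each half now re-encodes to something
    # different from the bytes it was sliced out of.  Under the PEP's scheme,
    # b'\xff' is the single code unit '\udcff', which any slice keeps whole.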

On approximately 4/29/2009 12:38 AM, came the following characters from the keyboard of Baptiste Carvello:
Except for half-surrogates that are in the file names already, which get converted to 3 characters. -- Glenn -- http://nevcal.com/ =========================== A protocol is complete when there is nothing left to remove. -- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking

On approximately 4/28/2009 10:53 AM, came the following characters from the keyboard of James Y Knight:
It would seem from the definition of ISO-2022 that what it calls "escape sequences" is in your POSIX spec called "locking-shift encoding". Therefore, the second bullet item under the "Character Encoding" heading prohibits use of ISO-2022, for whatever uses that document defines (which, since you referenced it, I assume means locales, and possibly file system encodings, but I'm not familiar with the structure of all the POSIX standards documents). A locking-shift encoding (where the state of the character is determined by a shift code that may affect more than the single character following it) cannot be defined with the current character set description file format. Use of a locking-shift encoding with any of the standard utilities in the XCU specification or with any of the functions in the XSH specification that do not specifically mention the effects of state-dependent encoding is implementation-dependent.
Why is that obvious? The only thing I saw that could exclude EBCDIC would be the requirement that the codes be positive in a char, but on a system where the C compiler treats char as unsigned, EBCDIC would qualify. Of course, the use of EBCDIC would also restrict the other possible code pages to those derived from EBCDIC (rather than the bulk of code pages that are derived from ASCII), due to: If the encoded values associated with each member of the portable character set are not invariant across all locales supported by the implementation, the results achieved by an application accessing those locales are unspecified.
-- Glenn -- http://nevcal.com/ =========================== A protocol is complete when there is nothing left to remove. -- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking

I think it has to be excluded from mapping in order to not introduce security issues.
I think you are right. I have now excluded ASCII bytes from being mapped, effectively not supporting any encodings that are not ASCII compatible. Does that sound ok? Regards, Martin

On Apr 30, 2009, at 5:42 AM, Martin v. Löwis wrote:
Yes. The practical upshot of this is that users who brokenly use "ja_JP.SJIS" as their locale (which, note, first requires editing some files in /var/lib/locales manually to enable its use..) may still have python not work with invalid-in-shift-jis filenames. Since that locale is widely recognized as a bad idea to use, and is not supported by any distros, it certainly doesn't bother me that it isn't 100% supported in python. It seems like the most common reason why people want to use SJIS is to make old pre-unicode apps work right in WINE -- in which case it doesn't actually affect unix python at all. I'd personally be fine with python just declaring that the filesystem encoding will *always* be utf-8b and ignore the locale... but I expect some other people might complain about that. Of course, application authors can decide to do that themselves by calling sys.setfilesystemencoding('utf-8b') at the start of their program. James

James Y Knight writes:
Mounting external drives, especially USB memory sticks which tend to be FAT-initialized by the manufacturers, is another common case. But I don't understand why PEP 383 needs to care at all.

On approximately 4/27/2009 8:35 PM, came the following characters from the keyboard of Martin v. Löwis:
Yes, but having found the latest PEP finally (at least I hope the one at python.org is the latest, it has quit using PUA anyway), I confirm it is history. But the same issue applies to the range of half-surrogates.
Indeed, that was the missing piece. I'd forgotten about the encodings that use escape sequences, rather than UTF-8, and DBCS. I don't think those encodings are permitted by POSIX file systems, but I suppose they could sneak in via Environment variable values, and the like. The switch from PUA to half-surrogates does not resolve the issues with the encoding not being a 1-to-1 mapping, though. The very fact that you think you can get away with use of lone surrogates means that other people might, accidentally or intentionally, also use lone surrogates for some other purpose. Even in file names. -- Glenn -- http://nevcal.com/ =========================== A protocol is complete when there is nothing left to remove. -- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking

On Mon, 2009-04-27 at 22:25 -0700, Glenn Linderman wrote:
This may already have been discussed, and if so I apologise for the noise. Does the PEP take into consideration the normalising behaviour of Mac OS X? We've had some ongoing challenges related to this in bzr. -Rob

Does the PEP take into consideration the normalising behaviour of Mac OS X? We've had some ongoing challenges related to this in bzr.
No, that's completely out of scope, AFAICT. I don't even know what the issues are, so I'm not able to propose a solution, at the moment. Regards, Martin

2009/4/28 Glenn Linderman <v+python@g.nevcal.com>:
It does solve this issue, because (unlike e.g. U+F01FF) '\udcff' is not a valid Unicode character (not a character at all, really) and the only way you can put this in a POSIX filename is if you use a very lenient UTF-8 encoder that gives you b'\xed\xb3\xbf'. Since this byte sequence doesn't represent a valid character when decoded with UTF-8, it should simply be considered an invalid UTF-8 sequence of three bytes and decoded to '\udced\udcb3\udcbf' (*not* '\udcff'). Martin: maybe the PEP should say this explicitly? Note that the round-trip works without ambiguities between '\udcff' in the filename: b'\xed\xb3\xbf' -> '\udced\udcb3\udcbf' -> b'\xed\xb3\xbf' and b'\xff' in the filename, decoded by Python to '\udcff': b'\xff' -> '\udcff' -> b'\xff' -- Lino Mastrodomenico
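The round-trips Lino describes can be checked directly. A small demonstration; note that the handler the PEP draft calls "python-escape"/utf-8b ended up spelled "surrogateescape" in released Python 3, which is what this snippet assumes:

    raw = b'\xed\xb3\xbf'   # ill-formed UTF-8: the encoded form of a lone surrogate
    name = raw.decode('utf-8', 'surrogateescape')
    assert name == '\udced\udcb3\udcbf'   # three escaped bytes, not '\udcff'
    assert name.encode('utf-8', 'surrogateescape') == raw

    raw2 = b'\xff'
    name2 = raw2.decode('utf-8', 'surrogateescape')
    assert name2 == '\udcff'
    assert name2.encode('utf-8', 'surrogateescape') == raw2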

On approximately 4/28/2009 6:01 AM, came the following characters from the keyboard of Lino Mastrodomenico:
Wrong. An 8859-1 locale allows any byte sequence to be placed into a POSIX filename. And while U+DCFF is illegal alone in Unicode, it is not illegal in Python str values. And from my testing, Python 3's current UTF-8 encoder will happily provide exactly the bytes value you mention when given U+DCFF.
Others have made this suggestion, and it is helpful to the PEP, but not sufficient. As implemented as an error handler, I'm not sure that the b'\xed\xb3\xbf' sequence would trigger the error handler, if the UTF-8 decoder is happy with it. Which, in my testing, it is. -- Glenn -- http://nevcal.com/ =========================== A protocol is complete when there is nothing left to remove. -- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking

On 27Apr2009 00:07, Glenn Linderman <v+python@g.nevcal.com> wrote:
I would say this isn't so. It's important that programs know if they're dealing with strings-for-filenames, but not that they be able to figure that out "a priori" if handed a bare string (especially since they can't:-)
Hmm. I had thought that legitimate unicode strings already get transcoded to bytes via the mapping specified by sys.getfilesystemencoding() (the user's locale). That already happens, I believe, and Martin's scheme doesn't change this. He's just funny-encoding non-decodable byte sequences, not the decoded stuff that surrounds them. So it is already the case that strings get decoded to bytes by calls like open(). Martin isn't changing that.

I suppose if your program carefully constructs a unicode string riddled with half-surrogates etc and imagines something specific should happen to them on the way to being POSIX bytes then you might have a problem...

I think the advantage to Martin's choice of encoding-for-undecodable-bytes is that it _doesn't_ use normal characters for the special bits. This means that _all_ normal characters are left unmangled in both "bare" and "funny-encoded" strings. Because of that, I now think I'm -1 on your "use printable characters for the encoding". I think presentation of the special characters _should_ look bogus in an app (eg little rectangles or whatever in a GUI); it's a fine flashing red light to the user.

Also, by avoiding reuse of legitimate characters in the encoding we can avoid your issue with losing track of where a string came from; legitimate characters are currently untouched by Martin's scheme, except for the normal "bytes<->string via the user's locale" translation that must already happen, and there you're aided by bytes and strings being different types.
Please elucidate the "second source" of strings. I'm presuming you mean strings generated from scratch rather than obtained by something like listdir(). Given such a string with "funny invalid" stuff in it, and _absent_ Martin's scheme, what do you expect the source of the strings to _expect_ to happen to them if passed to open()? They still have to be converted to bytes at the POSIX layer anyway. Cheers, -- Cameron Simpson <cs@zip.com.au> DoD#743 http://www.cskk.ezoshosting.com/cs/ Heaven could change from chocolate to vanilla without violating perfection. - arromdee@jyusenkyou.cs.jhu.edu (Ken Arromdee)

On approximately 4/27/2009 2:14 PM, came the following characters from the keyboard of Cameron Simpson:
So you agree they can't... that there are data puns. (OK, you may not have thought that through)
So assume a non-decodable sequence in a name. That puts us into Martin's funny-decode scheme. His funny-decode scheme produces a bare string, indistinguishable from a bare string that would be produced by a str API that happens to contain that same sequence. Data puns. So when open is handed the string, should it open the file with the name that matches the string, or the file with the name that funny-decodes to the same string? It can't know, unless it knows that the string is a funny-decoded string or not.
So it is already the case that strings get decoded to bytes by calls like open(). Martin isn't changing that.
I thought the process of converting strings to bytes is called encoding. You seem to be calling it decoding?
Right. Or someone else's program does that. I only want to use Unicode file names. But if those other file names exist, I want to be able to access them, and not accidentally get a different file.
Whether the characters used for funny decoding are normal or abnormal, unless they are prevented from also appearing in filenames when they are obtained from or passed to other APIs, there is the possibility that the funny-decoded name also exists in the filesystem by the funny-decoded name... a data pun on the name. Whether the characters used for funny decoding are normal or abnormal, if they are not prevented from also appearing in filenames when they are obtained from or passed to other APIs, then in order to prevent data puns, *all* names must be passed through the decoder, and the decoder must perform a 1-to-1 reversible mapping. Martin's funny-decode process does not perform a 1-to-1 reversible mapping (unless he's changed it from the version of the PEP I found to read). This is why some people have suggested using the null character for the decoding, because it and / can't appear in POSIX file names, but everything else can. But that makes it really hard to display the funny-decoded characters.
The reason I picked a ASCII printable character is just to make it easier for humans to see the encoding. The scheme would also work with a non-ASCII non-printable character... but I fail to see how that would help a human compare the strings on a display of file names. Having a bunch of abnormal characters in a row, displayed using a single replacement glyph, just makes an annoying mess in the file open dialog.
There are abnormal characters, but there are no illegal characters. NTFS permits any 16-bit "character" code, including abnormal ones, including half-surrogates, and including full surrogate sequences that decode to PUA characters. POSIX permits all byte sequences, including things that look like UTF-8, things that don't look like UTF-8, things that look like half-surrogates, and things that look like full surrogate sequences that decode to PUA characters. So whether the decoding/encoding scheme uses common characters, or uncommon characters, you still have the issue of data puns, unless you use a 1-to-1 transformation, that is reversible. With ASCII strings, I think no one questions that you need to escape the escape characters. C uses \ as an escape character... Everyone understands that if you want to use a \ in a C string, you have to use \\ instead... and that scheme has escaped the boundaries of C to other use cases. But it seems that you think that if we could just find one more character that no one else uses, that we wouldn't have to escape it.... and that could be true, but there aren't any characters that no one else uses. So whatever character (and a range makes it worse) you pick, someone else uses it. So in order for the scheme to work, you have to escape the escape character(s), even in names that wouldn't otherwise need to be funny-decoded.
POSIX has byte APIs for strings, that's one source, that is most under discussion. Windows has both bytes and 16-bit APIs for strings... the 16-bit APIs are generally mapped directly to UTF-16, but are not checked for UTF-16 validity, so all of Martin's funny-decoded files could be used for Windows file names on the 16-bit APIs. And yes, strings can be generated from scratch.
There is a fine encoding scheme that can take any str and encode to bytes: UTF-8. The problem is that UTF-8 doesn't work to take any byte sequence and decode to str, and that means that special handling has to happen when such byte sequences are encountered. But there is no str that can be generated that can't be generated in other ways, which would be properly encoded to a different byte sequence. Hence there are data puns, no 1-to-1 mapping.

Hence it seems obvious to me that the only complete solution is to have an escape character, and ensure that all strings are decoded and encoded. As soon as you have an escape character, then you can decode anything into displayable, standard Unicode, and you can create the reverse encoding unambiguously. Without an escape character, you just have a heuristic that will work sometimes, and break sometimes.

If you believe non-UTF-8-decodable byte sequences are rare, you can ignore them. That's what we do now, but people squawk. If you believe that you can invent an encoding that has data puns, and that because the character or characters involved are rare, the problems that result can be ignored, fine... but people will squawk when they hit the problem... I'm just trying to squawk now, to point out that this is complexity for complexity's sake; it adds complexity to trade one problem for a different problem, under the belief that the other problem is somehow rarer than the first. And maybe it is, today. I'd much rather have a solution that actually solves the problem.

If you don't like ? as the escape character, then pick U+10F01, and anytime a U+10F01 is encountered in a file name, double it. And anytime there is an undecodable byte sequence, emit U+10F01, and then U+80 through U+FF as a subsequent character for the first byte in the undecodable sequence, and restart the decoder with the next byte. That'll work too.

But use of rare, abnormal characters to take the place of undecodable bytes can never work, because of data puns, and valid use of the rare, abnormal characters. Someone suggested treating the byte sequences of the rare, abnormal characters as undecodable bytes, and decoding them using the same substitution rules. That would work too, if applied consistently, because then the rare, abnormal characters would each be escaped. But having 128 escape characters seems more complex than necessary, also.

-- Glenn -- http://nevcal.com/ =========================== A protocol is complete when there is nothing left to remove. -- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking

On 27Apr2009 18:15, Glenn Linderman <v+python@g.nevcal.com> wrote:
I agree you can't examine a string and know if it came from the os.* munging or from someone else's munging. I totally disagree that this is a problem. There may be puns. So what? Use the right strings for the right purpose and all will be well.

I think what is missing here, and missing from Martin's PEP, is some utility functions for the os.* namespace.

PROPOSAL: add to the PEP the following functions:

os.fsdecode(bytes) -> funny-encoded Unicode
    This is what os.listdir() does to produce the strings it hands out.

os.fsencode(funny-string) -> bytes
    This is what open(filename,..) does to turn the filename into bytes for the POSIX open.

os.pathencode(your-string) -> funny-encoded-Unicode
    This is what you must do to a de novo string to turn it into a string suitable for use by open. Importantly, for most strings not hand crafted to have weird sequences in them, it is a no-op. But it will recode your puns for survival.

and for me, I would like to see:

os.setfilesystemencoding(coding)

Currently os.getfilesystemencoding() returns you the encoding based on the current locale, and (I trust) the os.* stuff encodes on that basis. setfilesystemencoding() would override that, unless coding==None, in which case it reverts to the former "use the user's current locale" behaviour. (We have locale "C" for what one might otherwise expect None to mean:-) The idea here is to let the program control the codec used for filenames for special purposes, without working indirectly through the locale.
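A minimal sketch of what the first two helpers might look like, assuming the funny-encoding is the PEP's error handler (spelled "surrogateescape" in the Python that eventually shipped); os.fsencode and os.fsdecode did later land in the standard library in essentially this shape:

    import sys

    def fsdecode(raw: bytes) -> str:
        # OS bytes -> funny-encoded str, as os.listdir() would hand out
        return raw.decode(sys.getfilesystemencoding(), 'surrogateescape')

    def fsencode(name: str) -> bytes:
        # funny-encoded str -> OS bytes, as open() would consume
        return name.encode(sys.getfilesystemencoding(), 'surrogateescape')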
See my proposal above. Does it address your concerns? A program still must know the provenance of the string, and _if_ you're working with non-decodable sequences in names then you should transmute them into the funny encoding using the os.pathencode() function described above. In this way the punning issue can be avoided. _Lacking_ such a function, your punning concern is valid.
True. open() should always expect a funny-encoded name.
My head must be standing in the wrong place. Yes, I probably mean encoding here. I'm trying to accompany these terms with little pictures like "string->bytes" to avoid confusion.
Point taken. And I think addressed by the utility function proposed above. [...snip normal versus odd chars for the funny-encoding ...]
I thought half-surrogates were illegal in well-formed Unicode. I confess to being weak in this area. By "legitimate" above I meant things like half-surrogates which, like quarks, should not occur alone?
Sure. I'm not really talking about what filesystem will accept at the native layer, I was talking in the python funny-encoded space. [..."escaping is necessary"... I agree...]
These are existing file objects, I'll take them as source 1. They get encoded for release by os.listdir() et al.
And yes, strings can be generated from scratch.
I take this to be source 2. I think I agree with all the discussion that followed, and think the real problem is lack of utility functions to funny-encode source 2 strings for use; hence the proposal above. Cheers, -- Cameron Simpson <cs@zip.com.au> DoD#743 http://www.cskk.ezoshosting.com/cs/ Be smart, be safe, be paranoid. - Ryan Cousineau, courier@compdyn.com DoD#863, KotRB, KotKWaWCRH

2009/4/27 Cameron Simpson <cs@zip.com.au>:
Time machine! http://docs.python.org/dev/py3k/library/sys.html#sys.setfilesystemencoding -- Regards, Benjamin

On 27Apr2009 21:58, Benjamin Peterson <benjamin@python.org> wrote:
| 2009/4/27 Cameron Simpson <cs@zip.com.au>:
| > PROPOSAL: add to the PEP the following functions: [...]
| > and for me, I would like to see:
| > os.setfilesystemencoding(coding)
| >
| > Currently os.getfilesystemencoding() returns you the encoding based on
| > the current locale, and (I trust) the os.* stuff encodes on that basis.
| > setfilesystemencoding() would override that, unless coding==None, in which
| > case it reverts to the former "use the user's current locale" behaviour.
| > (We have locale "C" for what one might otherwise expect None to mean:-)
|
| Time machine! http://docs.python.org/dev/py3k/library/sys.html#sys.setfilesystemencoding

How embarrassing. I thought I'd looked. It doesn't have the None->return-to-default mode, and I'd like to see the word "overwritten" replaced by "overridden". And of course if Martin's PEP gets adopted then the "e.g." clause needs replacing:-)

-- Cameron Simpson <cs@zip.com.au> DoD#743 http://www.cskk.ezoshosting.com/cs/ Do not taunt Happy Fun Coder.

On approximately 4/27/2009 7:11 PM, came the following characters from the keyboard of Cameron Simpson:
Seems like one would also desire os.pathdecode to do the reverse. And also versions that take or produce bytes from funny-encoded strings. If programs were re-coded to perform these transformations on what you call de novo strings, the scheme would work. But I think a large part of the incentive for the PEP is to try to invent a scheme that intentionally allows for the puns, so that programs do not need to be recoded in this manner, and yet still work. I don't think such a scheme exists. If there is going to be a required transformation from de novo strings to funny-encoded strings, then why not make one that people can actually see and compare and decode from the displayable form, by using displayable characters instead of lone surrogates?
"Illegal" just means violating the accepted rules. In this case, the accepted rules are those enforced by the file system (at the bytes or str API levels), and by Python (for the str manipulations). None of those rules outlaw lone surrogates. Hence, while all of the systems under discussion can handle all Unicode characters in one way or another, none of them require that all Unicode rules are followed. Yes, you are correct that lone surrogates are illegal in Unicode. No, none of the accepted rules for these systems require Unicode.
One variation of source 2 is reading output from other programs, such as ls (POSIX) or dir (Windows).
I think we understand each other now. I think your proposal could work, Cameron, although when recoding applications to use your proposal, I'd find it easier to use the "file name object" that others have proposed. I think that because either your proposal or the object proposals require recoding the application, they will not be accepted. I think that because PEP 383 allows data puns, it should not be accepted in its present form.

I think that if your proposal is accepted, it then becomes possible to use an encoding that uses visible characters, which makes it easier for people to understand and verify. An encoding such as the one I suggested, but perhaps using a more obscure character, if there is one, that yet doesn't violate true Unicode. I think it should transform all data, from str and bytes interfaces, and produce only str values containing conforming Unicode, escaping all the non-conforming sequences in some manner. This would make the strings truly readable, as long as fonts for all the characters are available.

And I had already suggested the utility functions you are suggesting, actually, in my first tirade against PEP 383 (search for "The encode and decode functions should be available for coders to use, that code to external interfaces, either OS or 3rd party packages, that do not use this encoding scheme"). I really don't care if you or who gets the credit for the idea; others may have suggested it before me. But I do care that the solution should provide functionality that works without ambiguity/data puns.

The solution that was proposed in the lead-up to releasing Python 3.0 was to offer both bytes and str interfaces (so we have those), and then, for those that want to have a single portable implementation that can access all data, an object that encapsulates the differences, and the variant system APIs. (File system is one, command line is another, environment is another; I'm not sure if there are more.) I haven't heard of any progress on such an encapsulating object; the people that proposed such have been rather quiet about this PEP. I would expect that an object implementation would provide display strings, and APIs to submit de novo str and bytes values to an object, which would run the appropriate encoding on them.

Programs that want to use str interfaces on POSIX will see a subset of files on systems that contain files whose bytes filenames are not decodable. If a sysadmin wants to standardize on UTF-8 names universally, they can use something like convmv to clean up existing file names that don't conform. Programs that use str interfaces on POSIX systems will work fine, but with a subset of the files. When that is unacceptable, they can either be recoded to use the bytes interfaces, or the hopefully forthcoming object encapsulation.

The issue then will be what technique will be used to transform bytes into display names, but since the display names would never be fed back to the objects directly (but the object would have an interface to accept de novo str and de novo bytes), it is just a display issue, and one that uses visible characters would seem more useful in my mind than one that uses half-surrogates or PUAs.

-- Glenn -- http://nevcal.com/ =========================== A protocol is complete when there is nothing left to remove. -- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking

I think I may be able to resolve Glenn's issues with the scheme lower down (through careful use of definitions and hand waving). On 27Apr2009 23:52, Glenn Linderman <v+python@g.nevcal.com> wrote:
Yes.
And also versions that take or produce bytes from funny-encoded strings.
Isn't that the first two functions above?
I agree no such scheme exists. I don't think it can, just using strings. But _unless_ you have made a de novo handcrafted string with ill-formed sequences in it, you don't need to bother because you won't _have_ puns. If Martin's using half surrogates to encode "undecodable" bytes, then no normal string should conflict because a normal string will contain _only_ Unicode scalar values. Half surrogate code points are not such. The advantage here is that unless you've deliberately constructed an ill-formed unicode string, you _do_not_ need to recode into funny-encoding, because you are already compatible. Somewhat like one doesn't need to recode ASCII into UTF-8, because ASCII is unchanged.
Because that would _not_ be a no-op for well formed Unicode strings. That reason is sufficient for me. I consider the fact that well-formed Unicode -> funny-encoded is a no-op to be an enormous feature of Martin's scheme. Unless I'm missing something, there _are_no_puns_ between funny-encoded strings and well formed unicode strings.
I've just spent a cosy 20 minutes with my copy of Unicode 5.0 and a coffee, reading section 3.9 (Unicode Encoding Forms). I now do not believe your scenario makes sense. Someone can construct a Python3 string containing code points that includes surrogates. Granted. However such a string is not meaningful because it is not well-formed (D85). It's ill-formed (D84). It is not sane to expect it to translate into a POSIX byte sequence, be it UTF-8 or anything else, unless it is accompanied by some kind of explicit mapping provided by the programmer. Absent that mapping, it's nonsense in much the same way that a non-decodable UTF-8 byte sequence is nonsense. For example, Martin's funny-encoding is such an explicit mapping.
But those other names _don't_ exist.
I think that either we've lost track of what each other is saying, or you're wrong here. And my poor terminology hasn't been helping. What we've got:

(1) Byte sequence file names in the POSIX file system. It doesn't matter whether the underlying storage is a real POSIX filesystem or a mostly-POSIX one like MacOSX HFS or a remotely attached non-POSIX filesystem like a Windows one, because we're talking through the POSIX API, and it is handing us byte sequences, which we expect may contain anything except a NUL.

(2) Under Martin's scheme, os.listdir() et al hand us (and accept) funny-encoded Python3 strings, which are strings of Unicode code units (D77). Particularly, if there were bytes in the POSIX byte string that did not decode into Unicode scalar values (D76) then each such byte is encoded as a surrogate (D71,72,73,74). It is important to note here that because surrogates are _not_ Unicode scalar values, there is no punning between the two sets of values.

(3) Other Python3 strings that have not been through Martin's mangler in either direction. Ordinary strings.

Your concern is that, handed a string, a programmer could misuse (3) as (2) or vice versa because of punning. In a well-formed unicode string there are no surrogates; surrogates only occur in UTF-16 _encodings_ of Unicode strings (D75). Therefore, it _is_ possible to inspect a string, if one cared, to see if it is funny-encoded or "raw". One may get two different answers:

- If there are surrogate code units then it must be funny-encoded and will therefore work perfectly if handed to an os.* interface.

- If there are no surrogate code units then it may be funny-encoded or it may not have been through Martin's funny-encoder; you can't tell. However, this doesn't matter, because the encoder is a no-op for such strings. Therefore it will work perfectly if handed to an os.* interface.

The only gap in this is a specially crafted string containing surrogate code points that did not come via Martin's encoder. But such a string cannot come from a user interface, which will accept only characters, and these only include Unicode scalar values. Such a string can only be explicitly constructed (eg with a \uD802 code point). And if something constructs such a string, it must have in mind an explicit interpretation of those code points, which means it is the _constructor_ on whom the burden of translation lies.

Does this make sense to you, or have you a counter example in mind?
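The inspection described in the two bullets above fits in a few lines; a sketch (the helper name is illustrative, not an existing API):

    def may_be_funny_encoded(s: str) -> bool:
        # True: the string contains surrogate code units, so it is either
        # funny-encoded or deliberately ill-formed.  False: it holds only
        # Unicode scalar values, and the funny-encoder is a no-op on it,
        # so it is safe to hand to the os.* interfaces either way.
        return any(0xD800 <= ord(c) <= 0xDFFF for c in s)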
However, Martin's scheme explicitly translates these ill-formed sequences into Python3 strings and back, losslessly. You can have surrogates in the filesystem storage/API on Windows. You can have non-UTF-8-decodable sequences in the POSIX filesystem layer too. They're all taken in and handled. In Python3 space, one might have a bytes object with a raw POSIX byte filename in it. Presumably one can also have a byte string with a raw (UTF-16) Windows filename in it. They're not strings, so no confusion. But there's no _string_ for these things without a matching string<->bytestring mapping associated with it. If you have a Python3 string which is well-formed Unicode, then you can hand it to the os.* interfaces and the Right Thing will happen (on Windows just because it stored Unicode and on POSIX provided you agree that your locale/getfilesystemencoding() is the right thing). If you have a string that isn't well-formed, then the meaning of any code points which are not Unicode scalar values is not well defined without some auxiliary stuff in the app.
See above. I think this is addressed. [...]
Sure. But that is reading byte sequences, and one must again know the encoding. If that is known and the input decoded happily into Unicode scalar values, then there is no issue. If the input didn't decode, then one must make some decision about what the non-decodable bits mean.
I'm of the opinion now that the puns can only occur when the source 2 string has surrogates, and either those surrogates are chosen to match the funny-encoding, in which case the pun is not a pun, or the surrogates are chosen according to a different scheme, in which case source 2 is obliged to provide a mapping. A source 2 string of only Unicode scalar values doesn't need remapping.
I think any scheme that uses any Unicode scalar value as an escape character _inherently_ introduces puns, and puns that are easier to encounter. I think the real strength of Martin's scheme is exactly that bytes strings that needed the funny-encoding _do_ produce ill-formed Unicode strings, because such strings _cannot_ conflict with well-formed strings. I think your desire for a human readable encoding is valid, but it should be a further purely "presentation" step, somewhat like quoted-printable encoding in MIME, and not the scheme used by Martin.
But I think it would just move the punning. A human-readable string with readable escapes in it may be funny-encoded. _Or_ it may be "raw", with funny-encoding yet to happen; after all, one might weirdly be dealing with a filename which contained post-funny-encode visible sequences in it. So you're right back to _guessing_ what you're looking at. With the surrogate scheme you only have to guess if there are surrogates, but then you _know_ that you're dealing with a special encoding scheme; it is certain - the guess is about which scheme. If you're working in a domain with no ill-formed strings you never need to worry at all. With a visible/printable-encoding such as you advocate, the guess is about whether the scheme has even been used, which is why I think it is worse.
I must have missed that sentence. But it sounds like we want the same facilities at least.
I think covering these other cases is quite messy, if only because there's not even agreement amongst existing command line apps about all that stuff. Regarding "APIs to submit de novo str and bytes values to an object, which would run the appropriate encoding on them", I think such a facility for de novo strings must require the caller to provide a handler/mapper for the not-well-formed parts of such strings if they occur.
Not under Martin's scheme, because all bytes filenames _are_ decoded.
I agree it might be handy to have a display function, but isn't repr() exactly that, now I think of it? Cheers, -- Cameron Simpson <cs@zip.com.au> DoD#743 http://www.cskk.ezoshosting.com/cs/ "waste cycles drawing trendy 3D junk" - Mac Eudora v3 config option

On approximately 4/28/2009 4:06 PM, came the following characters from the keyboard of Cameron Simpson:
I think I may be able to resolve Glenn's issues with the scheme lower down (through careful use of definitions and hand waving).
Close. You at least resolved what you thought my issue was. And, you did make me more comfortable with the idea that I, in programs I write, would not be adversely affected by the PEP if implemented. While I can see that the PEP no doubt solves the os.listdir / open problem on POSIX systems for Python 3 + PEP programs that don't use 3rd party libraries, it does require programs that do use 3rd party libraries to be recoded with your functions -- which so far the PEP hasn't embraced. Or, to use the bytes APIs directly to get file names for 3rd party libraries -- but the directly ported, filenames-as-strings type of applications that could call 3rd party filenames-as-bytes libraries in 2.x must be tweaked to do something different than they did before.
Yes, sorry.
Right. And I don't intend to generate ill-formed Unicode strings, in my programs. But I might well read their names from other sources. It is nice, and thank you for emphasizing (although I already did realize it, back there in the far reaches of the brain) that all the data puns are between ill-formed Unicode strings, and undecodable bytes strings. That is a nice property of the PEP's encoding/decoding method. I'm not sure it outweighs the disadvantage of taking unreadable gibberish, and producing indecipherable gibberish (codepoints with no glyphs), though, when there are ways to produce decipherable gibberish instead... or at least mostly-decipherable gibberish. Another idea forms.... described below.
I think you are correct regarding where the puns are. I agree that not perturbing well-formed Unicode is a benefit.
Such a string can be meaningful if it is used as a file name... it is the name of the file. I will agree that it would not be a word in any language, because it is composed of things that are not characters / codepoints, if that is what you meant.
They do if someone constructs them.
Lots of configuration systems permit schemes like C's \x to be used to create strings. Whether you perceive that to be a user interface or not, or believe that such things should be part of a user interface or not, they exist. Whether they validate that such strings are properly constructed Unicode text or should or should not do such validation, is open for discussion, but I'd be surprised if there are not some such schemes that don't do such checking, and consider it a feature. Why make the file name longer than necessary, when you can just use all these nice illegal codepoints to keep it shorter instead? Instead of 5 characters for a filename sequence counter, someone might stuff it in 1 character, in binary, and think they were clever. I've seen such techniques, although not specifically in Python, since I'm fairly new to reading Python code. So I consider it not beyond the realm of possibility to encounter lone surrogate code units in strings that haven't been through Martin's funny-encoder. Hence, I disbelieve that the gap you mention can be ignored.
It is still not clear whether the PEP (1) would be implemented on Windows (2) if it is, if it prevents lone surrogates from being obtained from the str APIs, by transcoding them into 3 lone surrogates, and if doesn't transcode from the str APIs, but does funny-decode from the bytes APIs, then it would seem there is still the possibility of data puns on Windows.
Without transcoding on the str APIs, which I haven't seen mentioned, I don't think so.
Sure. So the PEP needs your functions, or the equivalent. Last I checked, they weren't there.
A correct translation of source 2 strings would be obliged to call one of your functions, that doesn't exist in the PEP, because it appears the PEP wants to assume that such strings don't exist, unless it creates them. So this takes porting effort for programs generating and consuming such strings, to avoid being mangled by the PEP. That isn't necessary today, only post-PEP.
Another step? Even more porting effort? For a PEP that is trying to avoid porting effort?

But maybe there is a compromise that mostly meets both goals: use U+DC10 as a (high-flying) escape character. It is not printable, so the substitution glyph will likely get displayed by display functions. Then transcode illegal bytes to the range U+0100 to U+01FF, and transcode existing U+DC10 to U+DC10 U+DC10.

1) This is an easy to understand scheme, and illegal byte values would become displayable, but would each be preceded by the substitution glyph for the U+DC10.

2) There would be no need to transcode other lone surrogates... on the other hand, any illegal code values could be treated as illegal bytes and transcoded, making the strings more nearly legal, and more uniformly displayable.

3) The property that all potential data puns are among ill-formed Unicode strings is still retained.

4) Because the result string is nearly legal Unicode (except for the escape characters U+DC10), it becomes uniformly comparable and different strings can be visibly different.

5) It is still necessary to transcode names from str interfaces, to escape any U+DC10 characters, at least, which is also required by this PEP to avoid data puns on systems that have both str and bytes interfaces.
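Points 1 and 5 of this compromise, sketched under the assumption that U+DC10 is the escape value; the helper names are made up:

    ESC = '\udc10'   # the proposed (high-flying) escape character

    def escape_byte(b: int) -> str:
        # point 1: an illegal byte 0xPQ becomes ESC + U+01PQ
        return ESC + chr(0x0100 + b)

    def escape_str_from_str_api(s: str) -> str:
        # point 5: double any pre-existing U+DC10 so the mapping reverses
        return s.replace(ESC, ESC + ESC)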
I think you mean you don't have to guess if there are lone surrogates... you can look and see.
So the above scheme, using a U+DC10 escape character, meets your desirable truisms about lone surrogates being the trigger for knowing that you are dealing with bizarro names, but being uncertain about which kind, and also makes the results lots more readable. I still think there is a need to provide the encoding and decoding functions, for both bytes and de novo strings.
The caller shouldn't have to supply anything. The same encoding that is applied to str system interfaces that supply strings should be applied to de novo strings. It is just a matter of transcoding a de novo string into the "right form" that it can then be encoded by the system encoder to produce the original string again, if it goes to a str interface, or to an equivalent bytes string, if it goes to a bytes interface.
I think I was speaking of the status quo, here, not with the PEP.
repr is a display function that produces rather ugly results in most non-ASCII cases. But then again, one could use repr as the funny-encoding scheme, too... I don't think we want to use repr for either case, actually. Of course, with Py 3, if the file names were objects, and could have reprlib customizations... :) :) -- Glenn -- http://nevcal.com/ =========================== A protocol is complete when there is nothing left to remove. -- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking

Glenn Linderman a écrit :
The problem with your "escape character" scheme is that the meaning is lost with slicing of the strings, which is a very common operation.
Python could as well *specify* that lone surrogates are illegal, as their meaning is undefined by Unicode. If this rule is respected language-wise, there is no ambiguity. It might be unrealistic on Windows, though. This rule could even be specified only for strings that represent filesystem paths. Sure, they are the same type as other strings, but the programmer usually knows if a given string is intended to be a path or not. Baptiste

2009/4/28 Glenn Linderman <v+python@g.nevcal.com>:
Sorry for picking on Glenn's comment - it's only one of many in this thread. But it seems to me that there is an assumption that problems will arise when code gets a potentially funny-decoded string and doesn't know where it came from. Is that a real concern? How many programs really don't know where their data came from? Maybe a general-purpose library routine *might* just need to document explicitly how it handles funny-encoded data (I can't actually imagine anything that would, but I'll concede it may be possible) but that's just a matter of documenting your assumptions - no better or worse than many other cases. This all sounds similar to the idea of "tainted" data in security - if you lose track of untrusted data from the environment, you expose yourself to potential security issues. So the same techniques should be relevant here (including ignoring it if your application isn't such that it's a concern!) I've yet to hear anyone claim that they would have an actual problem with a specific piece of code they have written. (NB, if such a claim has been made, feel free to point me to it - I admit I've been skimming this thread at times). Paul.

Paul Moore <p.f.moore <at> gmail.com> writes:
I've yet to hear anyone claim that they would have an actual problem with a specific piece of code they have written.
Yep, that's the problem. Lots of theoretical problems no one has ever encountered brought up against a PEP which resolves some actual problems people encounter on a regular basis. For the record, I'm +1 on the PEP being accepted and implemented as soon as possible (preferably before 3.1). Regards Antoine.


For what it's worth, the OSX API's seem to behave as follows:

* If you create a file with a non-UTF-8 name on an HFS+ filesystem, the system automatically encodes the name. That is, open(chr(255), 'w') will silently create a file named '%FF' instead of the name you'd expect on a unix system.

* If you mount an NFS filesystem from a linux host and that directory contains a file named chr(255):
  - unix-level tools will see a file with the expected name (just like on linux)
  - Cocoa's NSFileManager returns u"?" as the filename; that is, when the filename cannot be decoded using UTF-8, the name returned by the high-level API is mangled. This is regardless of the setting of LANG.
  - I haven't found a way yet to access files whose names are not valid UTF-8 using the high-level Cocoa API's.

The latter two are interesting because Cocoa has a unicode filesystem API on top of a POSIX C-API, just like Python 3.x. I guess the chosen behaviour works out on OSX (where users are unlikely to run into this issue), but could be more problematic on other POSIX systems. Ronald

Ronald Oussoren <ronaldoussoren@mac.com> (RO) wrote:
RO> That is, open(chr(255), 'w') will silently create a file named '%FF'
RO> instead of the name you'd expect on a unix system.
Not for me (I am using Python 2.6.2).
I once got a tar file from a Linux system which contained a file with a non-ASCII, ISO-8859-1 encoded filename. The tar file refused to be unpacked on a HFS+ filesystem. -- Piet van Oostrum <piet@cs.uu.nl> URL: http://pietvanoostrum.com [PGP 8DAE142BE17999C4] Private email: piet@vanoostrum.org

Ned Deily <nad@acm.org> (ND) wrote:
ND> What version of OSX are you using? On Tiger 10.4.11 I see the failure
ND> you see but on Leopard 10.5.6 the behavior Ronald reports.
Yes, I am using Tiger (10.4.11). Interesting that it has changed on Leopard. -- Piet van Oostrum <piet@cs.uu.nl> URL: http://pietvanoostrum.com [PGP 8DAE142BE17999C4] Private email: piet@vanoostrum.org

You can get the same error on Linux:

$ python
Python 2.6.2 (release26-maint, Apr 19 2009, 01:56:41)
[GCC 4.3.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
(Some file system drivers do not enforce valid utf8 yet, but I suspect they will in the future.) Tom

Thomas Breuel wrote:
Do you suspect that from discussing the issue with kernel developers or reading a thread on lkml? If not, then your suspicion seems to be pretty groundless.... The fact that VFAT enforces an encoding does not lend itself to your argument for two reasons:

1) VFAT is not a Unix filesystem. It's a filesystem that's compatible with Windows/DOS. If Windows and DOS have filesystem encodings, then it makes sense for that driver to enforce that as well. Filesystems intended to be used natively on Linux/Unix do not necessarily make this design decision.

2) The encoding is specified when mounting the filesystem. This means that you can still mix encodings in a number of ways. If you mount with an encoding that has full byte coverage, for instance, each user can put filenames from different encodings on there. If you mount with utf8 on a system which uses euc-jp as the default encoding, you can have full paths that contain a mix of utf-8 and euc-jp. Etc.

-Toshio

On Fri, 1 May 2009 06:55:48 am Thomas Breuel wrote:
Works for me under Fedora using ext3 as the file system.

$ python2.6
Python 2.6.1 (r261:67515, Dec 24 2008, 00:33:13)
[GCC 4.1.2 20070502 (Red Hat 4.1.2-12)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
Given that chr(255) is a valid filename on my file system, I would consider it a bug if Python couldn't deal with a file with that name. -- Steven D'Aprano

On 30 Apr, 2009, at 21:33, Piet van Oostrum wrote:
That's odd. Which version of OSX do you use?

ronald@Rivendell-2[0]$ sw_vers
ProductName: Mac OS X
ProductVersion: 10.5.6
BuildVersion: 9G55
[~/testdir] ronald@Rivendell-2[0]$ /usr/bin/python
Python 2.5.1 (r251:54863, Jan 13 2009, 10:26:13)
[GCC 4.0.1 (Apple Inc. build 5465)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
And likewise with python 2.6.1+ (after cleaning the directory):

[~/testdir] ronald@Rivendell-2[0]$ python2.6
Python 2.6.1+ (release26-maint:70603, Mar 26 2009, 08:38:03)
[GCC 4.0.1 (Apple Inc. build 5493)] on darwin
Type "help", "copyright", "credits" or "license" for more information.

How can you bring up practical problems against something that hasn't been implemented? The fact that no other language or library does this is perhaps an indication that it isn't the right thing to do. But the biggest problem with the proposal is that it isn't needed: if you want to be able to turn arbitrary byte sequences into unicode strings and back, just set your encoding to iso8859-15. That already works and it doesn't require any changes. Tom

Thomas Breuel <tmbdev <at> gmail.com> writes:
How can you bring up practical problems against something that hasn't been implemented?

The PEP is simple enough that you can simulate its effect by manually computing the resulting unicode string for a hypothetical broken filename. Several people have already done so in this thread.

The fact that no other language or library does this is perhaps an indication that it isn't the right thing to do.

According to some messages, it seems Java and Mono actually use this kind of workaround. Though I haven't checked (I don't use those languages).
That doesn't work at all. With your proposal, any non-ASCII filename will be unreadable; not only the broken ones. Antoine.
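Antoine's objection is easy to demonstrate: decoding UTF-8 filename bytes as iso8859-15 round-trips losslessly but renders every non-ASCII name as mojibake:

    raw = 'héllo'.encode('utf-8')              # b'h\xc3\xa9llo'
    shown = raw.decode('iso8859-15')           # 'hÃ©llo' -- unreadable
    assert shown.encode('iso8859-15') == raw   # lossless, but mangled for display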

On 28Apr2009 14:37, Thomas Breuel <tmbdev@gmail.com> wrote:
| But the biggest problem with the proposal is that it isn't needed: if you
| want to be able to turn arbitrary byte sequences into unicode strings and
| back, just set your encoding to iso8859-15. That already works and it
| doesn't require any changes.

No it doesn't. It does transcode without throwing exceptions. On POSIX. (On Windows? I doubt it - windows isn't using an 8-bit scheme, I believe.) But it utterly destroys any hope of working in any other locale nicely.

The PEP lets you work losslessly in other locales. It _may_ require some app care for particular very weird strings that don't come from the filesystem, but as far as I can see only in circumstances where such care would be needed anyway, i.e. you've got to do special stuff for weirdness in the first place. Weird == "ill-formed unicode string" here.

Cheers, -- Cameron Simpson <cs@zip.com.au> DoD#743 http://www.cskk.ezoshosting.com/cs/ I just kept it wide-open thinking it would correct itself. Then I ran out of talent. - C. Fittipaldi

On 28Apr2009 11:49, Antoine Pitrou <solipsis@pitrou.net> wrote: | Paul Moore <p.f.moore <at> gmail.com> writes: | > | > I've yet to hear anyone claim that they would have an actual problem | > with a specific piece of code they have written. | | Yep, that's the problem. Lots of theoretical problems no one has ever encountered | brought up against a PEP which resolves some actual problems people encounter on | a regular basis. | | For the record, I'm +1 on the PEP being accepted and implemented as soon as | possible (preferably before 3.1). I am also +1 on this. I would like utility functions to perform: os-bytes->funny-encoded funny-encoded->os-bytes or explicit example code snippets for same in the PEP text. -- Cameron Simpson <cs@zip.com.au> DoD#743 http://www.cskk.ezoshosting.com/cs/ This person is currently undergoing electric shock therapy at Agnews Developmental Center in San Jose, California. All his opinions are static, please ignore him. Thank you, Nurse Ratched - the sig quote of Bob "Another beer, please" Christ <bhatch@netcom.com>

On 29Apr2009 08:27, Martin v. L?wis <martin@v.loewis.de> wrote: | > I would like utility functions to perform: | > os-bytes->funny-encoded | > funny-encoded->os-bytes | > or explicit example code snippets for same in the PEP text. | | Done! Thanks! -- Cameron Simpson <cs@zip.com.au> DoD#743 http://www.cskk.ezoshosting.com/cs/
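For the record, the two conversions presumably reduce to something like this sketch, assuming the error handler lands under the name the draft uses ("python-escape"); until a build ships with the handler registered, treat it as pseudocode:

    import sys

    def os_bytes_to_funny(raw):
        # bytes from the OS -> str; undecodable bytes become U+DCxx
        return raw.decode(sys.getfilesystemencoding(), 'python-escape')

    def funny_to_os_bytes(name):
        # str -> the exact original bytes; U+DCxx turns back into its byte
        return name.encode(sys.getfilesystemencoding(), 'python-escape')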

Paul Moore writes:
Yes, it's a real concern. I don't think it's possible to show a small piece of code one could point at and say "without a better API I bet you can't write this correctly," though. Rather, my experience with Emacs and various mail packages is that without type information it is impossible to keep track of the myriad bits and pieces of text that are recombining like pig flu, and eventually one breaks out and causes an error. It's usually easy to fix, but so are the next hundred similar regressions, and in the meantime a hundred users have suffered more or less damage or at least annoyance. There's no question that dealing with escapes of funny-decoded strings to unprepared code paths is mission creep compared to Martin's stated purpose for PEP 383, but it is also a real problem.

Simon Cross wrote:
[I hope, by "second part", you refer to the part that I left] It's true that UTF-8 could represent all Windows file names. However, the byte-oriented APIs of Windows do not use UTF-8, but instead, they use the Windows ANSI code page (which varies with the installation).
No, because the Windows API would interpret the bytes differently, and not find the right file. Regards, Martin

Why not use U+DCxx for non-UTF-8 encodings too?
I thought of that, and was tricked into believing that only U+DC8x is a half surrogate. Now I see that you are right, and have fixed the PEP accordingly. Regards, Martin
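The correction is easy to check mechanically: the whole range U+DC00..U+DFFF consists of low surrogates, so all 128 code points U+DC80..U+DCFF are available. A quick Python 3 sketch:

    import unicodedata

    # every code point in the PEP's escape range has category "Cs" (surrogate)
    assert all(unicodedata.category(chr(cp)) == 'Cs'
               for cp in range(0xDC80, 0xDD00))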

Thanks for writing this PEP 383, MvL. I recently ran into this problem in Python 2.x in the Tahoe project [1]. The Tahoe project should be considered a good use case showing what some people need. For example, the assumption that a file will later be written back into the same local filesystem (and thus luckily use the same encoding) from which it originally came doesn't hold for us, because Tahoe is used for file-sharing as well as for backup-and-restore. One of my first conclusions in pursuing this issue is that we can never use the Python 2.x unicode APIs on Linux, just as we can never use the Python 2.x str APIs on Windows [2]. (You mentioned this ugliness in your PEP.) My next conclusion was that the Linux way of doing encoding of filenames really sucks compared to, for example, the Mac OS X way. I'm heartened to see that David Wheeler is trying to persuade the maintainers of Linux filesystems to improve some of this: [3]. My final conclusion was that we needed to have two kinds of workaround for the Linux suckage: first, if decoding using the suggested filesystem encoding fails, then we fall back to mojibake [4] by decoding with iso-8859-1 (or else with windows-1252 -- I'm not sure if it matters, and I haven't yet understood whether utf-8b offers another alternative for this case). Second, if decoding succeeds using the suggested filesystem encoding on Linux, then we write down the encoding that we used and include that with the filename. This expands the size of our filenames significantly, but it is the only way to allow some future programmer to undo the damage of a falsely-successful decoding. Here's our whole plan: [5]. Regards, Zooko [1] http://allmydata.org [2] http://allmydata.org/pipermail/tahoe-dev/2009-March/001379.html # see the footnote of this message [3] http://www.dwheeler.com/essays/fixing-unix-linux-filenames.html [4] http://en.wikipedia.org/wiki/Mojibake [5] http://allmydata.org/trac/tahoe/ticket/534#comment:47
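The first half of that workaround might look like this sketch (decode_filename is an invented name; iso-8859-1 is used as the fallback because it can decode any byte sequence):

    import sys

    def decode_filename(raw):
        # returns (text, encoding-actually-used); callers store both
        fs_enc = sys.getfilesystemencoding()
        try:
            return raw.decode(fs_enc), fs_enc
        except UnicodeDecodeError:
            return raw.decode('iso-8859-1'), 'iso-8859-1'  # never fails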

How about another str-like type, a sequence of char-or-bytes? Could be called strbytes or stringwithinvalidcharacters. It would support whatever subset of str functionality makes sense / is easy to implement, plus a to_escaped_str() method (that does the escaping the PEP talks about) for people who want to use regexes or other str-only stuff. Here is a description by example:

os.listdir('.') -> [strbytes('normal_file'), strbytes('bad', 128, 'file')]
strbytes('a')[0] -> strbytes('a')
strbytes('bad', 128, 'file')[3] -> strbytes(128)
strbytes('bad', 128, 'file').to_escaped_str() -> 'bad?128file'

Having a separate type is cleaner than a "str that isn't exactly what it represents". And making the escaping an explicit (but rarely-needed) step would be less surprising for users. Anyway, I don't know a whole lot about this issue, so there may be an obvious reason this is a bad idea.
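A minimal sketch of the hypothetical strbytes type, just to make the semantics above concrete (every name here is invented, and the escape format is only illustrative):

    class strbytes(object):
        def __init__(self, *parts):
            # parts are str fragments or int byte values,
            # e.g. strbytes('bad', 128, 'file')
            self.parts = parts

        def __getitem__(self, i):
            flat = []
            for p in self.parts:
                if isinstance(p, str):
                    flat.extend(p)   # one entry per character
                else:
                    flat.append(p)   # a raw byte value
            return strbytes(flat[i])

        def to_escaped_str(self):
            # lossy, display-only escaping: byte 128 -> '?128'
            return ''.join(p if isinstance(p, str) else '?%d' % p
                           for p in self.parts)

This reproduces the examples above, e.g. strbytes('bad', 128, 'file').to_escaped_str() gives 'bad?128file'.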

On 22 Apr 2009, at 07:50, Martin v. Löwis wrote:
Forgive me if this has been covered. I've been reading this thread for a long time and still have a 100-odd replies to go... How do I get a printable unicode version of these path strings if they contain non-unicode data? I'm guessing that an app has to understand that filenames come in two forms, unicode and bytes, if it's not utf-8 data. Why not simply return a string if it's valid utf-8, and otherwise return bytes? Then in the app you check the type of the object, string or bytes, and deal with reporting errors appropriately. Barry
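Barry's str-or-bytes scheme, as a sketch (listdir_mixed is an invented name; it leans on the fact that passing bytes to os.listdir in Python 3 returns the names as bytes, undecoded):

    import os

    def listdir_mixed(path=b'.'):
        names = []
        for raw in os.listdir(path):       # bytes in -> bytes out
            try:
                names.append(raw.decode('utf-8'))
            except UnicodeDecodeError:
                names.append(raw)          # leave broken names as bytes
        return names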

On 29Apr2009 23:41, Barry Scott <barry@barrys-emacs.org> wrote:
Personally, I'd use repr(). One might ask, what would you expect to see if you were printing such a string?
Because it complicates the app enormously, for every app. It would be _nice_ to just call os.listdir() et al with strings, get strings, and not worry. With strings becoming unicode in Python3, on POSIX you have an issue of deciding how to get its filenames-are-bytes into a string and the reverse.

One could naively map the byte values to the same Unicode code points, but that results in strings that do not contain the same characters as the user/app expects for byte values above 127. Since POSIX does not really have a filesystem-level character encoding, just a user environment setting that says how the current user encodes characters into bytes (UTF-8 is increasingly common and useful, but it is not universal), it is more useful to decode filenames on the assumption that they represent characters in the user's (current) encoding convention; that way when things are displayed they are meaningful, and they interoperate well with strings made by the user/app.

If all the filenames were actually encoded that way when made, that works. But different users may adopt different conventions, and indeed a user may have used ASCII or an ISO8859-* coding in the past and be transitioning to something else now, so they will have a bunch of files in different encodings.

The PEP uses the user's current encoding with a handler for byte sequences that don't decode to valid Unicode scalar values, in a fashion that is reversible. That is, you get "strings" out of listdir() and those strings will go back in (eg to open()) perfectly robustly. Previous approaches would either silently hide non-decodable names in listdir() results, or throw exceptions when the decode failed, or mangle things non-reversibly. I believe Python3 went with the first option there. The PEP at least lets programs naively access all files that exist, and create a filename from any well-formed unicode string provided that the filesystem encoding permits the name to be encoded.

The lengthy discussion mostly revolves around:

- Glenn points out that strings that came _not_ from listdir, and that are _not_ well-formed unicode (== "have bare surrogates in them") but that were intended for use as filenames will conflict with the PEP's scheme - programs must know that these strings came from outside and must be translated into the PEP's funny-encoding before use in the os.* functions. Previous to the PEP they would get used directly, and they encode differently after the PEP, thus producing different POSIX filenames. Breakage.

- Glenn would like the encoding to use Unicode scalar values only, using a rare-in-filenames character. That would avoid the issue with "outside" strings that contain surrogates. To my mind it just moves the punning from rare illegal strings to merely uncommon but legal characters.

- Some parties think it would be better to not return strings from os.listdir but a subclass of string (or at least a duck-type of string) that knows where it came from and is also handily recognisable as not-really-a-string for purposes of deciding whether it is PEP-funny-encoded by direct inspection.

Cheers, -- Cameron Simpson <cs@zip.com.au> DoD#743 http://www.cskk.ezoshosting.com/cs/ The peever can look at the best day in his life and sneer at it. - Jim Hill, JennyGfest '95
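The round-trip invariant at the heart of this summary can be shown in miniature (assuming a UTF-8 locale and the draft's "python-escape" handler; the handler name comes from the PEP and may change before 3.1):

    import sys

    fs_enc = sys.getfilesystemencoding()
    raw = b'caf\xe9.txt'                        # latin-1 bytes, broken as UTF-8
    name = raw.decode(fs_enc, 'python-escape')  # the bad byte becomes U+DCE9
    assert name.encode(fs_enc, 'python-escape') == raw  # lossless round trip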

On Thu, Apr 30, 2009, Cameron Simpson wrote:
Assuming people agree that this is an accurate summary, it should be incorporated into the PEP. -- Aahz (aahz@pythoncraft.com) <*> http://www.pythoncraft.com/ "If you think it's expensive to hire a professional to do the job, wait until you hire an amateur." --Red Adair

On approximately 4/29/2009 7:50 PM, came the following characters from the keyboard of Aahz:
I'll agree that, once other misconceptions were explained away, the remaining issues are those Cameron summarized. Thanks for the summary! Point two could be modified because I've changed my opinion; I like the invariant Cameron first (I think) explicitly stated about the PEP as it stands, and that I just reworded in another message: the strings that are altered by the PEP in either direction are in the subset of strings that contain fake (from a strict Unicode viewpoint) characters. I still think an encoding that uses mostly real characters that have assigned glyphs would be better than the encoding in the PEP; but I would now suggest that the escape character be a fake character. I'll note here that while the PEP encoding causes illegal bytes to be translated to one fake character, a 3-byte sequence that looks like the UTF-8 encoding of one of the fake characters would also be translated to a sequence of 3 fake characters. This is 512 combinations that must be translated, and understood by the user (or at least by the programmer). The "escape sequence" approach requires changing only 257 combinations, and each altered combination would result in exactly 2 characters. Hence, this seems simpler to understand, and to manually encode and decode for debugging purposes. -- Glenn -- http://nevcal.com/ =========================== A protocol is complete when there is nothing left to remove. -- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking

How do I get a printable unicode version of these path strings if they contain non-unicode data?
Define "printable". One way would be to use a regular expression, replacing all codes in a certain range with a question mark.
That would have been an alternative solution, and the one that 2.x uses for listdir. People didn't like it. Regards, Martin

On 30 Apr 2009, at 05:52, Martin v. Löwis wrote:
What I mean by printable is that the string must be valid unicode that I can print to a UTF-8 console or place as text in a UTF-8 web page. I think your PEP gives me a string that will not encode to the valid UTF-8 that the outside-of-Python world expects. Did I get this point wrong?
In our application we are running fedora with the assumption that the filenames are UTF-8. When Windows systems FTP files to our system the files are in CP-1251(?) and not valid UTF-8. What we have to do is detect these non-UTF-8 filenames and get the users to rename them. An algorithm that says: if it's a string, no problem; if it's bytes, deal with the exceptions - that seems simple. How do I do this detection with the PEP proposal? Do I end up using the byte interface and doing the utf-8 decode myself? Barry

You are right. However, if your *only* requirement is that it should be printable, then this is fairly underspecified. One way to get a printable string would be this function:

def printable_string(unprintable):
    return ""

This will always return a printable version of the input string...
That would be a bug in your FTP server, no? If you want all file names to be UTF-8, then your FTP server should arrange for that.
No, you should encode using the "strict" error handler, with the locale encoding. If the file name encodes successfully, it's correct, otherwise, it's broken. Regards, Martin
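The detection Martin describes, as a sketch (is_valid_name is an invented name; success with the "strict" handler means the decoded name was genuine, failure means it contains the PEP's escape characters):

    import sys

    def is_valid_name(name):
        try:
            name.encode(sys.getfilesystemencoding(), 'strict')
            return True
        except UnicodeEncodeError:
            return False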

On 30 Apr 2009, at 21:06, Martin v. Löwis wrote:
Ha ha! Indeed this works, but I would have to try to turn enough of the string into a reasonable hint at the name of the file so that the user has some chance of knowing which file is being reported.
Not a bug, it's the lack of a feature. We use ProFTPd, which has just implemented what is required. I forget the exact details - they are at work - but when the ftp client asks for the FEAT of the ftp server, the server can say to use UTF-8. Supporting that in the server was apparently non-trivial.
O.k. I understand. Barry

Barry Scott wrote:
What do you do currently? The PEP just offers a way of reading all filenames as Unicode, if that's what you want. So what if the strings can't be encoded to normal UTF-8! The filenames aren't valid UTF-8 anyway! :-)
participants (31)
- "Martin v. Löwis"
- Aahz
- Adrian
- Antoine Pitrou
- Baptiste Carvello
- Barry Scott
- Benjamin Peterson
- Cameron Simpson
- Dirkjan Ochtman
- Glenn Linderman
- glyph@divmod.com
- James Y Knight
- Lino Mastrodomenico
- M.-A. Lemburg
- Michael Foord
- MRAB
- Ned Deily
- Nick Coghlan
- Paul Moore
- Piet van Oostrum
- R. David Murray
- Robert Collins
- Ronald Oussoren
- Simon Cross
- Stephen J. Turnbull
- Steven D'Aprano
- Terry Reedy
- Thomas Breuel
- Toshio Kuratomi
- Walter Dörwald
- Zooko O'Whielacronx