I thought PEP-383 was a fairly neat approach, but after thinking about it, I now think that it is wrong.

PEP-383 attempts to represent non-UTF-8 byte sequences in Unicode strings in a reversible way. But how do those non-UTF-8 byte sequences get into those path names in the first place? Most likely because an encoding other than UTF-8 was used to write the file system, but you're now trying to interpret its path names as UTF-8. Quietly escaping a bad UTF-8 encoding with private Unicode characters is unlikely to be the right thing, since using the wrong encoding likely means that other characters are decoded incorrectly as well. As a result, the path name may fail in string comparisons and pattern matching, and will look wrong to the user in print statements and dialog boxes.

Therefore, when Python encounters path names on a file system that are not consistent with the (assumed) encoding for that file system, Python should raise an error.

If you really don't care what the string looks like and you just want an encoding that round-trips without loss, you can probably just set your encoding to one of the 8-bit encodings, like ISO 8859-15. Decoding arbitrary byte sequences to Unicode strings as ISO 8859-15 is no less correct than decoding them as the proposed "utf-8b". In fact, the most likely source of non-UTF-8 sequences is ISO 8859 encodings.

As for what the byte-oriented interfaces should do, they are simply platform dependent. On UNIX, they should do the obvious thing. On Windows, they can either hook up to the low-level byte-oriented system calls that the systems supply, or Windows could fake it and have the byte-oriented interfaces always use UTF-8 and reject non-UTF-8 sequences as illegal (there are already many illegal byte sequences anyway).

Tom
PEP-383 attempts to represent non-UTF-8 byte sequences in Unicode strings in a reversible way.
That isn't really true; it is not, inherently, about UTF-8. Instead, it tries to represent non-filesystem-encoding byte sequences in Unicode strings in a reversible way.
Quietly escaping a bad UTF-8 encoding with private Unicode characters is unlikely to be the right thing
And indeed, the PEP stopped using PUA characters.
Therefore, when Python encounters path names on a file system that are not consistent with the (assumed) encoding for that file system, Python should raise an error.
This is what happens currently, and users are quite unhappy about it.
If you really don't care what the string looks like and you just want an encoding that round-trips without loss, you can probably just set your encoding to one of the 8 bit encodings, like ISO 8859-15. Decoding arbitrary byte sequences to unicode strings as ISO 8859-15 is no less correct than decoding them as the proposed "utf-8b". In fact, the most likely source of non-UTF-8 sequences is ISO 8859 encodings.
Yes, users can do that (to a degree), but they are still unhappy about it. The approach actually fails for command line arguments.
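(As an illustration, not part of the original exchange: Tom's claim that an 8-bit encoding round-trips without loss is easy to check in any Python 3, since ISO 8859-15 assigns a character to every byte value.)

```python
# Every byte value maps to a character under ISO 8859-15, so a
# decode/encode round trip preserves arbitrary byte sequences.
data = bytes(range(256))
text = data.decode('iso8859-15')
assert text.encode('iso8859-15') == data
assert len(text) == 256  # one character per byte
```

The catch, as noted in the thread, is that the decoded text is gibberish whenever the bytes were not actually ISO 8859-15.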
As for what the byte-oriented interfaces should do, they are simply platform dependent. On UNIX, they should do the obvious thing. On Windows, they can either hook up to the low-level byte-oriented system calls that the systems supply, or Windows could fake it and have the byte-oriented interfaces use UTF-8 encodings always and reject non-UTF-8 sequences as illegal (there are already many illegal byte sequences anyway).
As is, these interfaces are incomplete - they don't support command line arguments, or environment variables. If you want to complete them, you should write a PEP. Regards, Martin
Therefore, when Python encounters path names on a file system that are not consistent with the (assumed) encoding for that file system, Python should raise an error.
This is what happens currently, and users are quite unhappy about it.
We need to keep "users" and "programmers" distinct here. Programmers may find it inconvenient that they have to spend time figuring out and dealing with platform-dependent file system encoding issues and errors. But internationalization and Unicode are hard; that's just a fact of life.

End users, however, are going to be quite unhappy if they get a string of gibberish for a file name because you decided to interpret some non-Unicode string as UTF-8-with-extra-bytes. Or some Python program might copy files from an ISO8859-15 encoded file system to a UTF-8 encoded file system, and instead of getting an error when the encodings are set incorrectly, Python would quietly create ISO8859-15 encoded file names, making the target file system inconsistent. There is a lot of potential for major problems for end users with your proposals.

In both cases, what should happen is that the end user gets an error, submits a bug, and the programmer figures out how to deal with the encoding issues correctly.
Yes, users can do that (to a degree), but they are still unhappy about it. The approach actually fails for command line arguments
As it should: if I give an ISO8859-15 encoded command line argument to a Python program that expects a UTF-8 encoding, the Python program should tell me that there is something wrong when it notices that. Quietly continuing is the wrong thing to do.

If we follow your approach, that ISO8859-15 string will get turned into an escaped Unicode string inside Python. If I understand your proposal correctly, if it's an output file name and gets passed to Python's open function, Python will then decode that string and end up with an ISO8859-15 byte sequence, which it will write to disk literally, even if the encoding for the system is UTF-8. That's the wrong thing to do.

As is, these interfaces are incomplete - they don't support command line arguments, or environment variables. If you want to complete them, you should write a PEP.
There's no point in scratching when there's no itch. Tom PS:
Quietly escaping a bad UTF-8 encoding with private Unicode characters is unlikely to be the right thing
And indeed, the PEP stopped using PUA characters.
Let me rephrase this: "quietly escaping a bad UTF-8 encoding is unlikely to be the right thing"; it doesn't matter how you do it.
On Tue, Apr 28, 2009 at 09:30:01AM +0200, Thomas Breuel wrote:
Programmers may find it inconvenient that they have to spend time figuring out and deal with platform-dependent file system encoding issues and errors. But internationalization and unicode are hard, that's just a fact of life.
As long as it's hard, there will be no internationalization. A fact of life, damn it. Programmers are lazy, and have many problems to solve.
end user gets an error, submits a bug, and the programmer figures out how to deal with the encoding issues correctly.
And the programmer answers "The program expects a correct environment, good filenames, etc." and closes the issue with the resolution "User error, will not fix".

I am not arguing for or against the PEP in question. Python certainly has to have a way to make portable i18n less hard, or else the number of portable internationalized programs will be about zero. What the way should be - I don't know.

Oleg. -- Oleg Broytmann http://phd.pp.ru/ phd@phd.pp.ru Programmers don't die, they just GOSUB without RETURN.
As long as it's hard, there will be no internationalization. A fact of life, damn it. Programmers are lazy, and have many problems to solve.
PEP 383 doesn't make it any easier; it just turns one set of problems into another. Actually, it makes it worse, since problems now show up far from their source, and since it can lead to security problems and/or data loss.
And the programmer answers "The program is expected a correct environment, good filenames, etc." and closes the issue with the resolution "User error, will not fix".
The problem may well be with the program using the wrong encodings or incorrectly ignoring encoding information. Furthermore, even if it is user error, the program needs to validate its inputs and put up a meaningful error message, not mangle the disk. To detect such program bugs, it's important that when Python detects an incorrect encoding that it doesn't quietly continue with an incorrect string. Furthermore, if you don't provide clear error messages, it often takes a significant amount of time for each issue to determine that it is user error.
I am not arguing for or against the PEP in question. Python certainly has to have a way to make portable i18n less hard, or else the number of portable internationalized programs will be about zero. What the way should be - I don't know.
Returning an error for an incorrect encoding doesn't make internationalization harder, it makes it easier because it makes debugging easier. Tom
On Tue, Apr 28, 2009 at 10:37:45AM +0200, Thomas Breuel wrote:
Returning an error for an incorrect encoding doesn't make internationalization harder, it makes it easier because it makes debugging easier.
What is a "correct encoding"? I have an FTP server to which clients with different local encodings are connecting. FTP protocol doesn't have a notion of encoding so filenames on the filesystem are in koi8-r, cp1251 and utf-8 encodings - all in one directory! What should os.listdir() return for that directory? What is a correct encoding for that directory?! If any program starts to raise errors Python becomes completely unusable for me! But is there anything I can debug here? Oleg. -- Oleg Broytmann http://phd.pp.ru/ phd@phd.pp.ru Programmers don't die, they just GOSUB without RETURN.
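(An aside illustrating Oleg's point: the same byte sequence can decode to three different, equally plausible strings under the three encodings he names, so no single "correct encoding" exists for such a directory. The sample name below is hypothetical, chosen so that it happens to be decodable under all three.)

```python
# 'привет' encoded as UTF-8; the same bytes are also decodable
# (to different text) as koi8-r and cp1251.
raw = b'\xd0\xbf\xd1\x80\xd0\xb8\xd0\xb2\xd0\xb5\xd1\x82'
candidates = {enc: raw.decode(enc) for enc in ('utf-8', 'koi8-r', 'cp1251')}
assert candidates['utf-8'] == 'привет'
assert len(set(candidates.values())) == 3  # three distinct readings
```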
On Tue, Apr 28, 2009 at 11:00, Oleg Broytmann <phd@phd.pp.ru> wrote:
On Tue, Apr 28, 2009 at 10:37:45AM +0200, Thomas Breuel wrote:
Returning an error for an incorrect encoding doesn't make internationalization harder, it makes it easier because it makes debugging easier.
What is a "correct encoding"?
I have an FTP server to which clients with different local encodings are connecting. FTP protocol doesn't have a notion of encoding so filenames on the filesystem are in koi8-r, cp1251 and utf-8 encodings - all in one directory! What should os.listdir() return for that directory? What is a correct encoding for that directory?!
I don't know what it should do (ftplib needs to worry about that). I do know what it shouldn't do, however: it should not return a utf-8b string which, when used to create a file, will create a file reproducing the byte sequence of the remote machine; that's wrong.

If any program starts to raise errors Python becomes completely unusable for me! But is there anything I can debug here?
If we follow PEP 383, you will get lots of errors anyway because those strings, when encoded in utf-8b, will result in an error when you try to write them on a Windows file system or any other system that doesn't allow the byte sequences that the utf-8b encodes. Tom
On Tue, Apr 28, 2009 at 11:32:26AM +0200, Thomas Breuel wrote:
On Tue, Apr 28, 2009 at 11:00, Oleg Broytmann <phd@phd.pp.ru> wrote:
I have an FTP server to which clients with different local encodings are connecting. FTP protocol doesn't have a notion of encoding so filenames on the filesystem are in koi8-r, cp1251 and utf-8 encodings - all in one directory! What should os.listdir() return for that directory? What is a correct encoding for that directory?!
I don't know what it should do (ftplib needs to worry about that).
There is no ftplib there. The FTP server is ProFTPd, with ftp clients of all sorts - one, e.g., an ftp client built into an automatic web camera. I use Python programs to process files after they have been uploaded. The programs access the FTP directory as a part of the local filesystem.
I do know what it shouldn't do, however: it sould not return a utf-8b string which, when used to create a file, will create a file reproducing the byte sequence of the remote machine; that's wrong.
That is certainly wrong. But at least the approach allows Python programs to list all files in a directory - currently, AFAIU, os.listdir() silently skips undecodable filenames. And after a program gets all the files it can process them further - it can clean up filenames (base64-encode them, e.g.), but at least it can do something, where currently it cannot.

PS. It seems I started to argue for the PEP. Well, well...

Oleg. -- Oleg Broytmann http://phd.pp.ru/ phd@phd.pp.ru Programmers don't die, they just GOSUB without RETURN.
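(The cleanup Oleg sketches can be written against the PEP's error handler, which later shipped in Python 3 under the name 'surrogateescape'. The file name bytes below are hypothetical koi8-r bytes, not from any real directory.)

```python
import base64

# Hypothetical koi8-r file name bytes that are not valid UTF-8.
raw_name = b'\xd0\xd2\xc9\xd7\xc5\xd4'

# What a PEP 383 os.listdir() would hand back: each undecodable
# byte becomes a lone low surrogate, reversibly.
listed = raw_name.decode('utf-8', 'surrogateescape')

# The original bytes are fully recoverable, so the program can
# clean up, e.g. by renaming to an ASCII-safe base64 name.
recovered = listed.encode('utf-8', 'surrogateescape')
assert recovered == raw_name
safe_name = base64.urlsafe_b64encode(recovered).decode('ascii')
```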
2009/4/28 Thomas Breuel <tmbdev@gmail.com>:
If we follow PEP 383, you will get lots of errors anyway because those strings, when encoded in utf-8b, will result in an error when you try to write them on a Windows file system or any other system that doesn't allow the byte sequences that the utf-8b encodes.
I'm not sure if when you say "write them on a Windows FS" you mean from within Windows itself or a filesystem mounted on another OS, so I'll cover both cases.

Let's suppose that I use Python 2.x or something else to create a file with name b'\xff'. My (Linux) system has a sane configuration and the filesystem encoding is UTF-8, so it's an invalid name but the kernel will blindly accept it anyway. With this PEP, Python 3.1 listdir() will convert b'\xff' to the string '\udcff'.

Now if this string somehow ends up in a Python 3.1 program running on Windows and it tries to create a file with this name, it will work (no exception will be raised). The Windows GUI will display the standard "invalid character" symbol (an empty box) when listing this file, but this seems reasonable since the original file was displayed as "?" by the Linux console and with another invalid character symbol by the GNOME file manager.

OTOH if I write the same file on a Windows filesystem mounted on another OS, there will be in place an automatic translation (probably done by the OS kernel) from the user-visible filesystem encoding (see e.g. the "iocharset" or "utf8" mount options for vfat on Linux) to UTF-16. Which means that the write will fail with something like:

IOError: [Errno 22] invalid filename: b'/media/windows_disk/\xff'

(The "problem" is that a vfat filesystem mounted with the "utf8" option on Linux will only accept byte sequences that are valid UTF-8, or at least reasonably similar: e.g. b'\xed\xb3\xbf' is accepted.) Again this seems reasonable since it already happens in Python 2 and with pretty much any other software, including GNU cp.

I don't see how Martin can do better than this. Well, ok, I guess he could break into my house and rename the original file to something sane...

-- Lino Mastrodomenico
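(Lino's b'\xff' walk-through matches what the PEP's handler, named 'surrogateescape' in the Python 3 that eventually shipped, actually does. Note the asymmetry: the string round-trips only through the lenient handler, while strict UTF-8 rejects it.)

```python
name_bytes = b'\xff'  # not valid UTF-8

# PEP 383 decoding: the bad byte becomes a lone low surrogate.
name = name_bytes.decode('utf-8', 'surrogateescape')
assert name == '\udcff'

# Reversible with the same handler...
assert name.encode('utf-8', 'surrogateescape') == b'\xff'

# ...but strict UTF-8 refuses to encode the lone surrogate.
try:
    name.encode('utf-8')
except UnicodeEncodeError:
    pass  # expected: "surrogates not allowed"
```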
Thomas Breuel writes:
PEP 383 doesn't make it any easier; it just turns one set of problems into another.
That's false. There is an interesting class of problems of the form "get a list of names from the OS and allow the user to select from it, and retrieve corresponding content." People are *very* often able to decode complete gibberish, as long as it's the only gibberish in a list. Ditto partial gibberish. In that case, PEP 383 allows the content retrieval operation to complete. There are probably other problems that this PEP solves.
Actually, it makes it worse,
Again, it gives you different problems, which may be better and may be worse according to the user's requirements. Currently, you often get an exception, and running the program again is no help. The user must clean up the list to make progress. This may or may not be within the user's capacity (eg, read-only media).
since any problems that show up now show up far from the source of the problem, and since it can lead to security problems and/or data loss.
Yes. This is a point I have been at pains to argue elsewhere in this thread. However, it is "mission creep": Martin didn't volunteer to write a PEP for it, he volunteered to write a PEP to solve the "roundtrip the value of os.listdir()" problem. And he succeeded, up to some minor details.
The problem may well be with the program using the wrong encodings or incorrectly ignoring encoding information. Furthermore, even if it is user error, the program needs to validate its inputs and put up a meaningful error message, not mangle the disk. To detect such program bugs, it's important that when Python detects an incorrect encoding that it doesn't quietly continue with an incorrect string.
I agree. Guido, however, responded that "Practicality beats purity" to a similar point in the PEP 263 discussion. Be aware that you're fighting an uphill battle here.
However, it is "mission creep": Martin didn't volunteer to write a PEP for it, he volunteered to write a PEP to solve the "roundtrip the value of os.listdir()" problem. And he succeeded, up to some minor details.
Yes, it solves that problem. But that doesn't come without cost. Most importantly, now Python writes illegal UTF-8 strings even if the user chose a UTF-8 encoding. That means that illegal UTF-8 encodings can propagate anywhere, without warning. Furthermore, I don't believe that PEP 383 works consistently on Windows, and it causes programs to behave differently in unintuitive ways on Windows and Linux. I'll suggest an alternative in a separate message. Tom
Martin v. Löwis wrote:
Furthermore, I don't believe that PEP 383 works consistently on Windows,
What makes you say that? PEP 383 will have no effect on Windows, compared to the status quo, whatsoever.
You could argue that if Windows is actually returning UTF-16 with half surrogates, they should be altered to conform to what UTF-8 would have returned.
MRAB wrote:
Martin v. Löwis wrote:
Furthermore, I don't believe that PEP 383 works consistently on Windows,
What makes you say that? PEP 383 will have no effect on Windows, compared to the status quo, whatsoever.
You could argue that if Windows is actually returning UTF-16 with half surrogates, they should be altered to conform to what UTF-8 would have returned.
Perhaps - but this is not what the PEP specifies (and intentionally so). Regards, Martin
On Tue, Apr 28, 2009 at 20:45, "Martin v. Löwis" <martin@v.loewis.de> wrote:
Furthermore, I don't believe that PEP 383 works consistently on Windows,
What makes you say that? PEP 383 will have no effect on Windows, compared to the status quo, whatsoever.
That's what you believe, but it's not clear to me that that follows from your proposal. Your proposal says that utf-8b would be used for file systems, but then you also say that it might be used for command line arguments and environment variables. So, which specific APIs will it be used with on Windows and on POSIX systems? Or will utf-8b simply not be available on Windows at all? What happens if I create a Python version of tar, utf-8b strings slip in there, and I try to use them on Windows?

You also assume that all Windows file system functions strictly conform to UTF-16 in practice (not just on paper). Have you verified that? It certainly isn't true across all versions of Windows (since NT originally used UCS-2). What's the situation on Windows CE?

Another question on Linux: what happens when I decode a file system path with utf-8b and then pass the resulting unicode string to Gnome? To Qt? To windows.forms? To Java? To a unicode regular expression library? To wprintf? AFAIK, the behavior of most libraries is undefined for the kinds of unicode strings you construct, and it may be undefined in a bad way (crash, buffer overflow, whatever).

Tom
Your proposal says that utf-8b would be used for file systems, but then you also say that it might be used for command line arguments and environment variables. So, which specific APIs will it be used with on Windows and on POSIX systems?
On Windows, the Wide APIs are already used throughout the code base, e.g. SetEnvironmentVariableW/_wenviron. If you need to find out the specific API for a specific functionality, please read the source code.
Or will utf-8b simply not be available on Windows at all?
It will be available, but it won't be used automatically for anything.
What happens if I create a Python version of tar, utf-8b strings slip in there, and I try to use them on Windows?
No need to create it - the tarfile module is already there. By "in there", do you mean on the file system, or in the tarfile?
You also assume that all Windows file system functions strictly conform to UTF-16 in practice (not just on paper). Have you verified that?
No, I don't assume that. I assume that all functions are strictly available in a Wide character version, and have verified that they are.
What's the situation on Windows CE?
I can't see how this question is relevant to the PEP. The PEP says this: # On Windows, Python uses the wide character APIs to access # character-oriented APIs, allowing direct conversion of the # environmental data to Python str objects. This is what it already does, and this is what it will continue to do.
Another question on Linux: what happens when I decode a file system path with utf-8b and then pass the resulting unicode string to Gnome? To Qt?
You probably get mojibake, or an error - I didn't try.
To windows.forms? To Java?
How do you do that, on Linux?
To a unicode regular expression library?
You mean, SRE? SRE will match the code points as individual characters, class Cs. You should have been able to find out that for yourself.
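(Martin's claim about SRE is straightforward to check; a sketch using the 'surrogateescape' handler that later shipped in Python 3.)

```python
import re
import unicodedata

ch = b'\xff'.decode('utf-8', 'surrogateescape')  # '\udcff'

# The lone surrogate carries Unicode general category Cs...
assert unicodedata.category(ch) == 'Cs'

# ...and SRE treats it as one ordinary, non-word code point.
assert re.match(r'.', ch) is not None
assert re.match(r'\w', ch) is None
```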
To wprintf?
Depends on the wprintf implementation.
AFAIK, the behavior of most libraries is undefined for the kinds of unicode strings you construct, and it may be undefined in a bad way (crash, buffer overflow, whatever).
Indeed so. This is intentional. If you can crash Python that way, nothing gets worse by this PEP - you can then *already* crash Python in that way. Regards, Martin
On Windows, the Wide APIs are already used throughout the code base, e.g. SetEnvironmentVariableW/_wenviron. If you need to find out the specific API for a specific functionality, please read the source code. [...]
No, I don't assume that. I assume that all functions are strictly available in a Wide character version, and have verified that they are.
The wide APIs use UTF-16. UTF-16 suffers from the same problem as UTF-8: not all sequences of 16-bit words are valid UTF-16 sequences. In particular, sequences containing unpaired surrogates are not well-formed according to the Unicode standard. Therefore, the existence of a wide character API function does not guarantee that the wide character strings it returns can be converted into valid Unicode strings. And, in fact, Windows Vista happily creates files with malformed UTF-16 encodings, and os.listdir() happily returns them.
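(Tom's point about UTF-16 well-formedness can be illustrated in Python itself: a str object may contain an unpaired surrogate, but the strict UTF-16 codec refuses to serialize it. A sketch of current CPython behavior.)

```python
lone = '\ud800'  # an unpaired high surrogate: legal inside a str...

raised = False
try:
    lone.encode('utf-16')  # ...but not representable as well-formed UTF-16
except UnicodeEncodeError:
    raised = True
assert raised
```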
If you can crash Python that way, nothing gets worse by this PEP - you can then *already* crash Python in that way.
Yes, but AFAIK, Python does not currently have functions that, as part of correct usage and normal operation, are intended to generate malformed unicode strings. Under your proposal, passing the output from a correctly implemented file system or other OS function to a correctly written library using unicode strings may crash Python. In order to avoid that, every library that's built into Python would have to be checked and updated to deal with both the Unicode standard and your extension to it. Tom
Thomas Breuel <tmbdev <at> gmail.com> writes:
And, in fact, Windows Vista happily creates files with malformed UTF-16 encodings, and os.listdir() happily returns them.

The PEP won't change that, so what's the problem exactly?
Under your proposal, passing the output from a correctly implemented file system or other OS function to a correctly written library using unicode strings may crash Python.
That's a very dishonest formulation. It cannot crash Python; it can only crash hypothetical third-party programs or libraries with deficient error checking and unreasonable assumptions about input data. (and, of course, you haven't even proven those programs or libraries exist) Antoine.
It cannot crash Python; it can only crash hypothetical third-party programs or libraries with deficient error checking and unreasonable assumptions about input data.
The error checking isn't necessarily deficient. For example, a safe and legitimate thing to do is for third party libraries to throw a C++ exception, raise a Python exception, or delete the half surrogate. Any of those would break one of the use cases people have been talking about, namely being able to present the output from os.listdir() to the user, say in a file selector, and then access that file.

(and, of course, you haven't even proven those programs or libraries exist)
PEP 383 is a proposal that suggests changing Python such that malformed Unicode strings become a required part of Python and such that Python writes illegal UTF-8 encodings to UTF-8 encoded file systems. Those are big changes, and it's legitimate to ask that PEP 383 address the implications of that choice before it's made. Tom
Thomas Breuel <tmbdev <at> gmail.com> writes:
The error checking isn't necessarily deficient. For example, a safe and legitimate thing to do is for third party libraries to throw a C++ exception, raise a Python exception, or delete the half surrogate.

Do you have any concrete examples of this behaviour? When e.g. Nautilus shows some illegal UTF-8 filenames in an UTF-8 locale, it replaces the offending bytes with placeholders rather than crash in your face.
PEP 383 is a proposal that suggests changing Python such that malformed unicode strings become a required part of Python and such that Pyhon writes illegal UTF-8 encodings to UTF-8 encoded file systems.
That's again a misleading statement. It only writes an "illegal encoding" if it received one from the filesystem in the first place. A clean filesystem will only receive clean filenames.
Those are big changes, and it's legitimate to ask that PEP 383 address the implications of that choice before it's made.
No, it's legitimate to ask that /you/ back up your arguments with concrete facts. It's difficult to demonstrate the non-existence of a problem. On the other hand, you can easily demonstrate that it exists, if it really does. By the way, most of those libraries under Unix would take a char * as input, so they wouldn't deal with an "illegal unicode string", they would deal with the original byte string. Regards Antoine.
The wide APIs use UTF-16. UTF-16 suffers from the same problem as UTF-8: not all sequences of 16-bit words are valid UTF-16 sequences. In particular, sequences containing unpaired surrogates are not well-formed according to the Unicode standard. Therefore, the existence of a wide character API function does not guarantee that the wide character strings it returns can be converted into valid Unicode strings. And, in fact, Windows Vista happily creates files with malformed UTF-16 encodings, and os.listdir() happily returns them.
Whatever. What does that have to do with PEP 383? Your claim was that PEP 383 may have unfortunate effects on Windows, and I'm telling you that it won't, because the behavior of Python on Windows won't change at all. So whatever the problem - it's there already, and the PEP is not going to change it.

I personally don't see a problem here - *of course* os.listdir will report invalid utf-16 encodings, if that's what is stored on disk. It doesn't matter whether the file names are valid wrt. some specification. What matters is that you can access all the files.

Regards, Martin
On Wed, Apr 29, 2009 at 07:45, "Martin v. Löwis" <martin@v.loewis.de> wrote:
Your claim was that PEP 383 may have unfortunate effects on Windows,
No, I simply think that PEP 383 is not sufficiently specified to be able to tell.
and I'm telling you that it won't, because the behavior of Python on Windows won't change at all.
A justification for your proposal is that there are differences between Python on UNIX and Windows that you would like to reduce. But depending on where you introduce utf-8b coding on UNIX, you may also have to introduce it on Windows in order to keep the platforms consistent.

So whatever the problem - it's there already, and the PEP is not going to change it.
OK, so you are saying that under PEP 383, utf-8b wouldn't be used anywhere on Windows by default. That's not clear from your proposal.

It's also not clear from your proposal where utf-8b will get used on UNIX systems. Some of the places that have been suggested are: open, os.listdir, sys.argv, os.getenv. There are other potential ones, like print, write, and os.system. And what about text file and string conversions: will utf-8b become the default, or optional, or unavailable?

Each of those choices potentially has significant implications. I'm just asking what those choices are so that one can then talk about the implications and see whether this proposal is a good one or whether other alternatives are better.

Tom
On approximately 4/29/2009 12:17 AM, came the following characters from the keyboard of Martin v. Löwis:
OK, so you are saying that under PEP 383, utf-8b wouldn't be used anywhere on Windows by default. That's not clear from your proposal.
You didn't read it carefully enough. The first three paragraphs of the "Specification" section make that clear.
Sorry, rereading those paragraphs even with this declaration in mind does not make that clear. It is not enough to have a solution that works; it is necessary to communicate that solution clearly enough that people understand it. By the huge amount of feedback you have received, it is clear that either the solution doesn't work, or that it wasn't communicated clearly.

The following comments are an attempt to help you make the PEP clear, based on your above declaration that UTF-8b wouldn't be used on Windows. I may still be unclear about what you mean, but if you can accept these enhancements to the PEP, then maybe we are approaching a common understanding; if not, you should be aware that the PEP still needs clarification.

In the first paragraph, you should make it clear that Python 3.0 does not use the Windows bytes interfaces, if it doesn't. "Python uses *only* the wide character APIs..." would suffice. As stated, it seems like Python *does* use the wide character APIs, but leaves open the possibility that it might use byte APIs also. A short description of what happens on Windows when Python code uses bytes APIs would also be helpful.

In the second paragraph, it speaks of "currently" but then speaks of using the half-surrogates. I don't believe that happens "currently". You did change tense, but that paragraph is quite confusing, currently, because of the tense change. You should describe there the action that is currently taken by Python for non-decodable bytes, and then in the next paragraph talk about what the PEP changes.

The 4th paragraph is now confusing too... would it not be the decode error handler that returns the byte strings, in addition to the Unicode strings?
The 5th paragraph has apparently confused some people into thinking this PEP only applies to locales using UTF-8 encodings; you should have an "else clause" to clear that up, pointing out that the reverse encoding of half-surrogates by other encodings already produces errors, and that UTF-8 is a special case, not the only case.

The code added to the discussion has mismatched (), making me wonder if it is complete. There is a reasonable possibility that only the final ) is missing.

-- Glenn -- http://nevcal.com/ =========================== A protocol is complete when there is nothing left to remove. -- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking
In the first paragraph, you should make it clear that Python 3.0 does not use the Windows bytes interfaces, if it doesn't. "Python uses *only* the wide character APIs..." would suffice.
That's not quite exact. It uses both ANSI and Wide APIs - depending on whether you pass bytes as input or strings. Please see the Python source code to find out how this works, and what that means.
As stated, it seems like Python *does* use the wide character APIs, but leaves open the possibility that it might use byte APIs also. A short description of what happens on Windows when Python code uses bytes APIs would also be helpful.
I'm at a loss how to make the text more clear than it already is. I'm really not good at writing long essays, with a lot of explanatory-but-non-normative text. I also think that explanations do not belong in the section titled "Specification", nor does a full description of the status quo belong in the PEP at all. The reader should consult the current Python source code if in doubt what the status quo is.
In the second paragraph, it speaks of "currently" but then speaks of using the half-surrogates. I don't believe that happens "currently". You did change tense, but the tense change makes that paragraph quite confusing. You should describe there the action that is currently taken by Python for non-decodable bytes, and then in the next paragraph talk about what the PEP changes.
Thanks, fixed.
The 4th paragraph is now confusing too... would it not be the decode error handler that returns the byte strings, in addition to the Unicode strings?
No, why do you think so? That's intended as stated.
The 5th paragraph has apparently confused some people into thinking this PEP only applies to locales using UTF-8 encodings; you should have an "else clause" to clear that up, pointing out that the reverse encoding of half-surrogates by other encodings already produces errors, and that UTF-8 is a special case, not the only case.
I have fixed that by extending the third paragraph.
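Glenn's "else clause" point can be illustrated with a short sketch (using the `surrogateescape` error handler as the mechanism eventually shipped in Python 3.1; this thread calls it "utf-8b"). A lone half surrogate can only arise from the escape handler, never from ordinary strict decoding, and strict re-encoding of one fails under every codec, so the escape scheme stays unambiguous in non-UTF-8 locales too:

```python
# "caf\udce9" is the hypothetical result of escaping the undecodable
# byte 0xE9 into the lone half surrogate U+DCE9.
escaped = "caf\udce9"

# Every strict codec rejects a lone half surrogate, so no ordinary
# encoding of a valid string can collide with an escaped one.
for codec in ("utf-8", "latin-1", "iso8859-15", "ascii"):
    try:
        escaped.encode(codec)  # default "strict" error handler
    except UnicodeEncodeError:
        pass  # expected for each codec
    else:
        raise AssertionError(f"{codec} unexpectedly encoded a surrogate")

print("all strict codecs rejected the half surrogate")
```

This is only a sketch of the behaviour under discussion, not text from the PEP itself.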
The code added to the discussion has mismatched (), making me wonder if it is complete. There is a reasonable possibility that only the final ) is missing.
Indeed; this is now also fixed. Regards, Martin
On approximately 4/29/2009 1:06 PM, came the following characters from the keyboard of Martin v. Löwis:
Thanks, fixed.
Thanks for your fixes. They are helpful.
I'm at a loss how to make the text more clear than it already is. I'm really not good at writing long essays with a lot of explanatory-but-non-normative text. I also think that explanations do not belong in the section titled Specification, nor does a full description of the status quo belong in the PEP at all. The reader should consult the current Python source code if in doubt about what the status quo is.
The status quo is what justifies the existence of the PEP. If the status quo were perfect, there would be no need for the PEP. The status quo should be described in the Rationale. Some of it is; the rest of it should be.

If there is a need for this PEP for POSIX, but not Windows, the reason why should be given (paragraph 2 of the Rationale seems to try to describe that, but doesn't go far enough), and also the reason that cross-platform code can install this PEP's error handler on both platforms, yet it won't affect bytes interfaces on Windows. These are two omissions that have both caused large amounts of discussion.

Attempting to understand the Python source code is a good thing, but there is a lot to understand, and few will achieve a full understanding.
The 4th paragraph is now confusing too... would it not be the decode error handler that returns the byte strings, in addition to the Unicode strings?
No, why do you think so? That's intended as stated.
Here, a use case, or several, in the PEP could help clarify why it would be the encode error handler that returns both the byte string and the Unicode string, and why the decode error handler would not need to.

It seems that if the decode handler preserved the bytes from the OS, and made them available as well as the decoded Unicode, that could be interesting to an application that wants to manipulate the file. And since the encode handler is already given the Unicode, it is not clear why it should also return it. I guess if there is an error during the encode process (can there be?), then having both the bytes and the Unicode for comparison could be useful for error reporting. But I shouldn't have to guess: the PEP should explain how these things are useful. The discussion section could be extended with use cases for both the encode and decode cases.

-- Glenn -- http://nevcal.com/
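A small sketch may make the round trip under discussion concrete (this uses the `surrogateescape` error handler as the mechanism eventually landed in Python 3.1; the byte string here is a made-up example, not from the PEP). The decode handler maps each undecodable byte to a half surrogate, and the encode handler maps it back, so neither handler needs to hand the application both representations for the round trip to be lossless:

```python
# Hypothetical non-UTF-8 file name bytes: "café" encoded as ISO 8859-15.
raw = b"caf\xe9"

# Decode step: the error handler maps the undecodable byte 0xE9 to the
# lone half surrogate U+DCE9 instead of raising UnicodeDecodeError.
name = raw.decode("utf-8", "surrogateescape")
assert name == "caf\udce9"

# Encode step: the same handler turns U+DCE9 back into the byte 0xE9,
# so the original byte sequence is recovered exactly.
assert name.encode("utf-8", "surrogateescape") == raw
```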
On Wed, Apr 29, 2009, "Martin v. Löwis" wrote:
I'm at a loss how to make the text more clear than it already is. I'm really not good at writing long essays with a lot of explanatory-but-non-normative text. I also think that explanations do not belong in the section titled Specification, nor does a full description of the status quo belong in the PEP at all. The reader should consult the current Python source code if in doubt about what the status quo is.
Perhaps not a full description of the status quo, but the PEP definitely needs a good summary -- remember that PEPs are not just for the time that they are written, but also for the future. While telling people to "read the source, Luke" makes some sense at a specific point in time, I don't think that requiring a trawl through code history is fair. And, yes, PEP-writing is painful. -- Aahz (aahz@pythoncraft.com) <*> http://www.pythoncraft.com/ "If you think it's expensive to hire a professional to do the job, wait until you hire an amateur." --Red Adair
Perhaps not a full description of the status quo, but the PEP definitely needs a good summary
I completely agree, and believe that the PEP *does* have a good summary - it has both an abstract, and a rationale, and both say exactly what I want them to say. If people want them to say different things, they have to tell me what specifically they want it to say (perhaps even with specific formulations). If they can't communicate their requests to me, I can't comply. Regards, Martin
On Tue, 28 Apr 2009 at 09:30, Thomas Breuel wrote:
Therefore, when Python encounters path names on a file system that are not consistent with the (assumed) encoding for that file system, Python should raise an error.
This is what happens currently, and users are quite unhappy about it.
We need to keep "users" and "programmers" distinct here.
Programmers may find it inconvenient that they have to spend time figuring out and dealing with platform-dependent file system encoding issues and errors. But internationalization and unicode are hard; that's just a fact of life.
And most programmers won't do it, because most programmers write for an English speaking audience and have no clue about unicode issues. That is probably slowly changing, but it is still true, I think.
End users, however, are going to be quite unhappy if they get a string of gibberish for a file name because you decided to interpret some non-Unicode string as UTF-8-with-extra-bytes.
No, end users expect the gibberish, because they get it all the time (at least on Unix) when dealing with international filenames. They expect to be able to manipulate such files _despite_ the gibberish. (I speak here as an end user who does this!!)
Or some Python program might copy files from an ISO8859-15 encoded file system to a UTF-8 encoded file system, and instead of getting an error when the encodings are set incorrectly, Python would quietly create ISO8859-15 encoded file names, making the target file system inconsistent.
As will almost all unix programs, and the unix OS itself. On Unix, you can't make the file system inconsistent by doing this, because filenames are just byte strings with no NULLs. How _does_ Windows handle this? Would a Windows program complain, or would it happily record the gibberish? I suspect the latter, but I don't use Windows so I don't know.
There is a lot of potential for major problems for end users with your proposals. In both cases, what should happen is that the end user gets an error, submits a bug, and the programmer figures out how to deal with the encoding issues correctly.
What would actually happen is that the user would abandon the program that didn't work for one (not written in Python) that did. If the programmer was lucky they'd get a bug report, which they wouldn't be able to do anything about since Python wouldn't be providing the tools to let them fix it (ie: there are currently no bytes interfaces for environ or the command line in python3).
Yes, users can do that (to a degree), but they are still unhappy about it. The approach actually fails for command line arguments
As it should: if I give an ISO8859-15 encoded command line argument to a Python program that expects a UTF-8 encoding, the Python program should tell me that there is something wrong when it notices that. Quietly continuing is the wrong thing to do.
Imagine you are on a unix system, and you have gotten from somewhere a file whose name is encoded in something other than UTF-8 (I have a number of those on my system). Now imagine that I want to run a python program against that file, passing the name in on the command line. I type the program name, the first few (non-mangled) characters, and hit tab for completion, and my shell automagically puts the escaped bytes onto the command line. Or perhaps I cut and paste from an 'ls' listing into a quoted string on the command line.

Python is now getting the mangled filename passed in on the command line, and if the python program can't manipulate that file like any other file on my disk I am going to be mightily pissed. This is the _reality_ of current unix systems, like it or not. The same apparently applies to Windows, though in that case the mangled names may be fewer and you tend to pick them from a GUI interface rather than do cut-and-paste or tab completion.
If we follow your approach, that ISO8859-15 string will get turned into an escaped unicode string inside Python. If I understand your proposal correctly, if it's a output file name and gets passed to Python's open function, Python will then decode that string and end up with an ISO8859-15 byte sequence, which it will write to disk literally, even if the encoding for the system is UTF-8. That's the wrong thing to do.
Right. Like I said, that's what most (almost all) Unix/Linux programs _do_. Now, in some future world where everyone (including Windows) acts like we are hearing OS/X does and rejects the garbled encoding _at the OS level_, then we'd be able to trust the file system encoding (FSDO trust) and there would be no need for this PEP or any similar solution. --David
If we follow your approach, that ISO8859-15 string will get turned into an escaped unicode string inside Python. If I understand your proposal correctly, if it's a output file name and gets passed to Python's open function, Python will then decode that string and end up with an ISO8859-15 byte sequence, which it will write to disk literally, even if the encoding for the system is UTF-8. That's the wrong thing to do.
I don't think anything can, or should be, done about that. If you had byte-oriented interfaces (as you do in 2.x), exactly the same thing will happen: the name of the file will be the very same byte sequence as the one passed on the command line. Most Unix users here agree that this is the right thing to happen. Regards, Martin
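The pass-through behaviour Martin describes can be sketched as follows (assuming the `surrogateescape` handler of Python 3.1+; the file name is an invented example). The bytes received on the command line survive the decode/encode round trip unchanged, so the name that reaches the filesystem is the very same byte sequence that was passed in, exactly as with the 2.x byte-oriented interfaces — while, as Thomas notes, the escaped string is not the correctly decoded name:

```python
# ISO 8859-15 bytes arriving on the command line, e.g. via shell tab
# completion (hypothetical name meaning "rapport_année"):
argv_bytes = b"rapport_ann\xe9e"

# Python decodes argv with the escape handler...
as_text = argv_bytes.decode("utf-8", "surrogateescape")

# ...and open()/os re-encode with the same handler, so the identical
# bytes reach the filesystem, as most Unix programs would write them.
assert as_text.encode("utf-8", "surrogateescape") == argv_bytes

# The escaped string differs from the correct ISO 8859-15 decoding, so
# string comparisons and display are still "gibberish", as debated above.
assert as_text != argv_bytes.decode("iso8859-15")
```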
participants (10)
- "Martin v. Löwis"
- Aahz
- Antoine Pitrou
- Glenn Linderman
- Lino Mastrodomenico
- MRAB
- Oleg Broytmann
- R. David Murray
- Stephen J. Turnbull
- Thomas Breuel