Re: [Python-Dev] [Python-3000] New proposition for Python3 bytes filename issue
On Monday 29 September 2008 18:45:28, Georg Brandl wrote:
If I had to choose, I'd still argue for the modified UTF-8 as filesystem encoding (if it were UTF-8 otherwise), despite possible surprises when a such-encoded filename escapes from Python.
If I understand this solution correctly, the idea is to change the default file system encoding, right? E.g. if your filesystem is UTF-8, use ISO-8859-1 to make sure that the conversion to Unicode will never fail. Let's try with an ugly directory on my UTF-8 file system:

    $ find .
    ./têste
    ./ô
    ./a?b
    ./dossié
    ./dossié/abc
    ./dir?name
    ./dir?name/xyz

Python3 using encoding=ISO-8859-1:

    >>> import os; os.listdir(b'.')
    [b't\xc3\xaaste', b'\xc3\xb4', b'a\xffb', b'dossi\xc3\xa9', b'dir\xffname']
    >>> files=os.listdir('.'); files
    ['têste', 'ô', 'aÿb', 'dossié', 'dirÿname']
    >>> open(files[0]).close()
    >>> os.listdir(files[-1])
    ['xyz']

Ok, I have unicode filenames and I'm able to open a file and list a directory. The problem now is to display the filenames correctly. For me "unicode" sounds like "text (characters) encoded in the correct charset". In this case, unicode is just a storage for *bytes* in a custom charset. How can we mix <custom unicode (bytes encoded in ISO-8859-1)> with <real unicode>? E.g. os.path.join('dossié', "fichié"): the first argument is encoded in ISO-8859-1 whereas the second argument is encoded in Unicode. It's something like this:

    str(b'dossi\xc3\xa9', 'ISO-8859-1') + '/' + 'fichi\xe9'

Whereas the correct (unicode) result should be:

    'dossié/fichié'
    as bytes in ISO-8859-1: b'dossi\xe9/fichi\xe9'
    as bytes in UTF-8: b'dossi\xc3\xa9/fichi\xc3\xa9'

Change the default file system encoding to store bytes in Unicode is like introducing a new Python type: <fake Unicode for filename hacks>.

-- Victor Stinner aka haypo http://www.haypocalc.com/blog/
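The mixing problem raised in this message can be reproduced in a few lines. This is only a minimal sketch, assuming a UTF-8 file system whose names are decoded as ISO-8859-1; all variable names are illustrative and not part of any proposal in the thread.

```python
# Bytes as stored on a UTF-8 file system for the directory name 'dossié'.
raw_dir = 'dossié'.encode('utf-8')

# Decoding with ISO-8859-1 never fails, but the result is "fake" text:
# the two UTF-8 bytes of 'é' become two separate characters.
fake_dir = raw_dir.decode('iso-8859-1')

# A name typed by the user is real Unicode text.
real_name = 'fichié'

# Concatenation succeeds silently and mixes both kinds of string.
mixed = fake_dir + '/' + real_name

# Encoding back with ISO-8859-1 restores the directory bytes, but the
# user-supplied part is now Latin-1 encoded, not UTF-8.
mixed_bytes = mixed.encode('iso-8859-1')

try:
    mixed_bytes.decode('utf-8')
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False

# What a UTF-8 file system would expect for 'dossié/fichié'.
expected = 'dossié/fichié'.encode('utf-8')
```

No exception is raised at the point where the two kinds of string are joined; the mismatch only shows up later, when the bytes reach the file system or a UTF-8 decoder.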
Change the default file system encoding to store bytes in Unicode is like introducing a new Python type: <fake Unicode for filename hacks>.
Exactly. Seems like the best solution to me, despite your polemics. Regards, Martin
On 2008-09-30 08:00, Martin v. Löwis wrote:
Change the default file system encoding to store bytes in Unicode is like introducing a new Python type: <fake Unicode for filename hacks>.
Exactly. Seems like the best solution to me, despite your polemics.
Not a bad idea... have os.listdir() return Unicode subclasses that work like file handles, ie. they have an extra buffer that holds the original bytes value received from the underlying C API.

Passing these handles to open() would then do the right thing by using whatever os.listdir() got back from the file system to open the file, while still providing a sane way to display the filename, e.g. using question marks for the invalid characters.

The only problem with this approach is concatenation of such handles to form pathnames, but then perhaps those concatenations could just work on the bytes value as well (I don't know of any OS that uses non-ASCII path separators).

-- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Sep 30 2008)
Python/Zope Consulting and Support ... http://www.egenix.com/ mxODBC.Zope.Database.Adapter ... http://zope.egenix.com/ mxODBC, mxDateTime, mxTextTools ... http://python.egenix.com/
:::: Try mxODBC.Zope.DA for Windows,Linux,Solaris,MacOSX for free ! :::: eGenix.com Software, Skills and Services GmbH Pastor-Loeh-Str.48 D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg Registered at Amtsgericht Duesseldorf: HRB 46611
On Tue, Sep 30, 2008 at 3:31 AM, M.-A. Lemburg <mal@egenix.com> wrote:
On 2008-09-30 08:00, Martin v. Löwis wrote:
Change the default file system encoding to store bytes in Unicode is like introducing a new Python type: <fake Unicode for filename hacks>.
Exactly. Seems like the best solution to me, despite your polemics.
Not a bad idea... have os.listdir() return Unicode subclasses that work like file handles, ie. they have an extra buffer that holds the original bytes value received from the underlying C API.
Passing these handles to open() would then do the right thing by using whatever os.listdir() got back from the file system to open the file, while still providing a sane way to display the filename, e.g. using question marks for the invalid characters.
The only problem with this approach is concatenation of such handles to form pathnames, but then perhaps those concatenations could just work on the bytes value as well (I don't know of any OS that uses non-ASCII path separators).
While this seems to work superficially I expect an infinite number of problems caused by code that doesn't understand this subclass. You are hinting at this in your last paragraph. -- --Guido van Rossum (home page: http://www.python.org/~guido/)
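The concern about code that does not understand the subclass can be illustrated with a small sketch. The class below is hypothetical (the name `FileName` and the `raw` attribute are assumptions made for this example), and it uses the `'replace'` error handler, which marks invalid bytes with U+FFFD rather than a literal question mark.

```python
class FileName(str):
    """Hypothetical str subclass that remembers the original bytes."""

    def __new__(cls, raw: bytes, encoding: str = 'utf-8'):
        # Display form: undecodable bytes become replacement characters.
        self = super().__new__(cls, raw.decode(encoding, 'replace'))
        # Extra buffer holding exactly what the OS returned.
        self.raw = raw
        return self


name = FileName(b'a\xffb')

# Code that knows about the subclass can get the exact bytes back.
aware = name.raw

# Code that does not know about it treats the object as a plain str:
# ordinary string operations return plain str objects without 'raw'.
joined = 'dir/' + name
has_raw = hasattr(joined, 'raw')
upper = name.upper()
```

After the concatenation, the original bytes are gone and only the lossy display form remains, which is one concrete instance of the problem described above.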
On 2008-09-30 16:05, Guido van Rossum wrote:
On Tue, Sep 30, 2008 at 3:31 AM, M.-A. Lemburg <mal@egenix.com> wrote:
On 2008-09-30 08:00, Martin v. Löwis wrote:
Change the default file system encoding to store bytes in Unicode is like introducing a new Python type: <fake Unicode for filename hacks>.
Exactly. Seems like the best solution to me, despite your polemics.
Not a bad idea... have os.listdir() return Unicode subclasses that work like file handles, ie. they have an extra buffer that holds the original bytes value received from the underlying C API.
Passing these handles to open() would then do the right thing by using whatever os.listdir() got back from the file system to open the file, while still providing a sane way to display the filename, e.g. using question marks for the invalid characters.
The only problem with this approach is concatenation of such handles to form pathnames, but then perhaps those concatenations could just work on the bytes value as well (I don't know of any OS that uses non-ASCII path separators).
While this seems to work superficially I expect an infinite number of problems caused by code that doesn't understand this subclass. You are hinting at this in your last paragraph.
Well, to some extent Unicode objects themselves already implement such a strategy: the default encoded bytes object basically provides the low-level interfacing value. But I agree, the approach is not foolproof.

In the end, I think it's better not to be clever and just return the filenames that cannot be decoded as bytes objects in os.listdir(). Passing those to open() will then open the files as expected, in most other cases the application will have to provide explicit conversions in whatever way best fits the application.

Also note that os.listdir() isn't the only source of filenames. You often read them from a file, a database, some socket, etc, so letting the application decide what to do is not asking too much, IMHO.

-- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Sep 30 2008)
On Tue, Sep 30, 2008 at 8:20 AM, M.-A. Lemburg <mal@egenix.com> wrote:
In the end, I think it's better not to be clever and just return the filenames that cannot be decoded as bytes objects in os.listdir().
Unfortunately that's going to break most code that is using os.listdir(), so it's hardly an improved experience.
Passing those to open() will then open the files as expected, in most other cases the application will have to provide explicit conversions in whatever way best fits the application.
In most cases the app will try to concatenate a pathname given as a string and then it will fail.
Also note that os.listdir() isn't the only source of filenames. You often read them from a file, a database, some socket, etc, so letting the application decide what to do is not asking too much, IMHO.
In all those cases, the code that reads them is responsible for picking an encoding or relying on a default encoding, and the resulting filenames are always expressed as text, not bytes. I don't think it's the same at all. -- --Guido van Rossum (home page: http://www.python.org/~guido/)
On 2008-09-30 18:46, Guido van Rossum wrote:
On Tue, Sep 30, 2008 at 8:20 AM, M.-A. Lemburg <mal@egenix.com> wrote:
In the end, I think it's better not to be clever and just return the filenames that cannot be decoded as bytes objects in os.listdir().
Unfortunately that's going to break most code that is using os.listdir(), so it's hardly an improved experience.
Right, but this also signals a problem to the application and the application is in the best position to determine a proper work-around.
Passing those to open() will then open the files as expected, in most other cases the application will have to provide explicit conversions in whatever way best fits the application.
In most cases the app will try to concatenate a pathname given as a string and then it will fail.
True, and that's the right thing to do in those cases. The application will have to deal with the problem, e.g. convert the path to bytes and retry the joining, or convert the bytes string to Latin-1 and then convert the result back to bytes (using Latin-1) for passing it to open() (which will of course only work if there are no non-Latin-1 characters in the path dir), or apply a different filename encoding based on the path and then retry to convert the bytes filename into Unicode, or ask the user what to do, etc. There are many possibilities to solve the problem, apply a work-around, or inform the user of ways to correct it.
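The failure and the first work-around mentioned above (convert the path to bytes and retry the join) might look roughly like this. It is a sketch only: `join_listing` is a hypothetical helper, and it assumes the directory name can be encoded with the file system encoding reported by `sys.getfilesystemencoding()`.

```python
import os
import sys


def join_listing(directory: str, entry):
    """Join a text directory with an entry that may be str or bytes.

    Hypothetical helper: when the entry is bytes (an undecodable name),
    the whole path is built as bytes so nothing is lost.
    """
    if isinstance(entry, bytes):
        encoding = sys.getfilesystemencoding()
        return os.path.join(directory.encode(encoding), entry)
    return os.path.join(directory, entry)


# Mixing str and bytes directly is rejected by os.path.join().
try:
    os.path.join('dir', b'a\xffb')
    mixing_failed = False
except TypeError:
    mixing_failed = True

text_path = join_listing('dir', 'abc')        # stays text
bytes_path = join_listing('dir', b'a\xffb')   # falls back to bytes
```

The caller still has to cope with getting either type back, which is the burden on applications that the rest of this thread argues about.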
Also note that os.listdir() isn't the only source of filenames. You often read them from a file, a database, some socket, etc, so letting the application decide what to do is not asking too much, IMHO.
In all those cases, the code that reads them is responsible for picking an encoding or relying on a default encoding, and the resulting filenames are always expressed as text, not bytes. I don't think it's the same at all.
What I was trying to say is that you run into the same problem in other places as well. Trying to have os.listdir() implement some strategy is not going to solve the problem at large. -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Sep 30 2008)
M.-A. Lemburg wrote:
In the end, I think it's better not to be clever and just return the filenames that cannot be decoded as bytes objects in os.listdir().
But since it's a rare occurrence, most applications are just going to ignore the issue, and then fail unexpectedly one day on some unsuspecting user that doesn't have the inclination to go diving into the code to fix it. -- Greg
On Tuesday 30 September 2008, M.-A. Lemburg wrote:
On 2008-09-30 08:00, Martin v. Löwis wrote:
Change the default file system encoding to store bytes in Unicode is like introducing a new Python type: <fake Unicode for filename hacks>.
Exactly. Seems like the best solution to me, despite your polemics.
Not a bad idea... have os.listdir() return Unicode subclasses that work like file handles, ie. they have an extra buffer that holds the original bytes value received from the underlying C API.
Why does it have to be a Unicode subclass? In my eyes, a Unicode object promises a few things, in particular that it contains a Unicode string. If it now suddenly contains bytes without any further meaning, that would be bad.

What I wonder is what the requirements on path handling are. I'll try to list the ones I can see:

1. A path received from the system should be preserved, so it can be given to the system later on. IOW, the internal representation should not lose any information compared to the one used by the OS.
2. Typical operations like joining two path segments or moving to the parent dir should be defined.
3. There must be a way to display the path to the user. IOW, there should be a way to turn the path into a string that the user can recognise, according to some encoding. Note that this is not always possible, so this can fail.
4. There must be a way to receive a path from the user. That means that there must be a way from a user-entered string to a path. Note that this, too, isn't always possible and can fail.
5. The conversion between a string and a path should be configurable, defaults retrieved from the system. This is so that most operations will just work and do the thing that the user expects.
6. There should be a way to modify the path data itself. This of course requires knowledge about the internals but gives full power to the programmer.

For requirement 3, I would say a lossy conversion to a string would be enough, i.e. try to convert the path to a Unicode string and use a question mark or some escaping to mark parts that can't be decoded. It will allow users to recognise the decodeable parts of the path with hopefully just a few characters left without decoding.

For requirement 4, a failure to encode a string to a path must result in a loud failure, i.e. an exception. This is because the user entered a path that we can't use, any guessing what the user might have wanted is futile.

Are there any points to add?
Uli -- Sator Laser GmbH, Managing Director: Thorsten Föcking, Amtsgericht Hamburg HR B62 932
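The six requirements listed in this message can be sketched as a small class. Everything here is hypothetical: the name `RawPath`, its methods, the POSIX-style `/` separator and the `'utf-8'` default are assumptions made for illustration (requirement 5 would take the default from the system instead).

```python
class RawPath:
    """Hypothetical path object: stores OS bytes, converts to text on demand."""

    def __init__(self, raw: bytes):
        # Requirements 1 and 6: the exact bytes are kept and can be modified.
        self.raw = raw

    @classmethod
    def from_user(cls, text: str, encoding: str = 'utf-8') -> 'RawPath':
        # Requirement 4: strict encoding, so a failure is a loud exception.
        return cls(text.encode(encoding, 'strict'))

    def display(self, encoding: str = 'utf-8') -> str:
        # Requirement 3: lossy conversion, undecodable bytes are marked.
        return self.raw.decode(encoding, 'replace')

    def join(self, other: 'RawPath') -> 'RawPath':
        # Requirement 2: joining works on bytes (POSIX separator assumed).
        return RawPath(self.raw + b'/' + other.raw)

    def parent(self) -> 'RawPath':
        # Requirement 2: move to the parent directory.
        head, _, _ = self.raw.rpartition(b'/')
        return RawPath(head)


path = RawPath(b'dir\xffname').join(RawPath.from_user('xyz'))
shown = path.display()

try:
    RawPath.from_user('dossi\u00e9', encoding='ascii')
    user_input_rejected = False
except UnicodeEncodeError:
    user_input_rejected = True
```

Because the object is not a string, code that expects `str` fails immediately instead of silently working on a lossy display form.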
On 2008-10-01 09:54, Ulrich Eckhardt wrote:
On Tuesday 30 September 2008, M.-A. Lemburg wrote:
On 2008-09-30 08:00, Martin v. Löwis wrote:
Change the default file system encoding to store bytes in Unicode is like introducing a new Python type: <fake Unicode for filename hacks>.
Exactly. Seems like the best solution to me, despite your polemics.
Not a bad idea... have os.listdir() return Unicode subclasses that work like file handles, ie. they have an extra buffer that holds the original bytes value received from the underlying C API.
Why does it have to be a Unicode subclass? In my eyes, a Unicode object promises a few things, in particular that it contains a Unicode string. If it now suddenly contains bytes without any further meaning, that would be bad.
Please read my entire email. I was proposing to store the underlying non-decodeable byte string value in such a subclass. The Unicode value of the object would then be that underlying value decoded as e.g. Latin-1 in order to be able to work on it as text. Path operations would have to be made aware of such subclasses and operate on the underlying bytes value. However, like Guido mentioned, this only works if all components are indeed aware of such subclasses... and that's likely to fail for code outside the stdlib. -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Oct 01 2008)
M.-A. Lemburg <mal@egenix.com> wrote:
On 2008-10-01 09:54, Ulrich Eckhardt wrote:
On Tuesday 30 September 2008, M.-A. Lemburg wrote:
On 2008-09-30 08:00, Martin v. Löwis wrote:
Change the default file system encoding to store bytes in Unicode is like introducing a new Python type: <fake Unicode for filename hacks>.
Exactly. Seems like the best solution to me, despite your polemics.
Not a bad idea... have os.listdir() return Unicode subclasses that work like file handles, ie. they have an extra buffer that holds the original bytes value received from the underlying C API.
Why does it have to be a Unicode subclass? In my eyes, a Unicode object promises a few things, in particular that it contains a Unicode string. If it now suddenly contains bytes without any further meaning, that would be bad.
Please read my entire email. I was proposing to store the underlying non-decodeable byte string value in such a subclass. The Unicode value of the object would then be that underlying value decoded as e.g. Latin-1 in order to be able to work on it as text.
I'm actually sort of liking this idea. A Pathname class, for convenience a subtype of String, but containing the underlying binary representation used by the OS. Even non-unicode pathnames could be represented. Bill
On 03:54 pm, janssen@parc.com wrote:
I'm actually sort of liking this idea. A Pathname class, for convenience a subtype of String, but containing the underlying binary representation used by the OS. Even non-unicode pathnames could be represented.
On the one hand, I agree with you - except for the part where it's a subtype of String, that doesn't work. In case I haven't mentioned it enough times already: http://twistedmatrix.com/documents/8.1.0/api/twisted.python.filepath.FilePat... On the other hand, we've all been on this merry-go-round before: http://www.python.org/dev/peps/pep-0355/ Note especially the rejection notice: "Subclassing from str is a particularly bad idea". Again, one day I'd really like to add one of these to Python. Now is not the time.
glyph@divmod.com wrote:
I'm actually sort of liking this idea. A Pathname class, for convenience a subtype of String, but containing the underlying binary representation used by the OS. Even non-unicode pathnames could be represented.
On the one hand, I agree with you - except for the part where it's a subtype of String, that doesn't work. In case I haven't mentioned it enough times already:
http://twistedmatrix.com/documents/8.1.0/api/twisted.python.filepath.FilePat...
On the other hand, we've all been on this merry-go-round before:
http://www.python.org/dev/peps/pep-0355/
Note especially the rejection notice: "Subclassing from str is a particularly bad idea".
Yes, the only real justification for it is to not break existing code (otherwise, calling str() is not that much of an ordeal).
On the other hand, we've all been on this merry-go-round before:
The very existence of os.path seems a good argument that something like this is useful. Perhaps PEP 355 just went too far. Bill
Bill Janssen wrote:
Perhaps PEP 355 just went too far.
That was certainly one of the major objections to it. A filesystem path object which didn't try to combine a half-dozen different modules into methods on a single object, but instead focused on solving a few specific problems with using raw strings as file paths would have a far greater chance of acceptance. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia --------------------------------------------------------------- http://www.boredomandlaziness.org
On Mon, Sep 29, 2008 at 11:00 PM, "Martin v. Löwis" <martin@v.loewis.de> wrote:
Change the default file system encoding to store bytes in Unicode is like introducing a new Python type: <fake Unicode for filename hacks>.
Exactly. Seems like the best solution to me, despite your polemics.
Martin, I don't understand why you are in favor of storing raw bytes encoded as Latin-1 in Unicode string objects, which clearly gives rise to mojibake. In the past you have always been staunchly opposed to API changes or practices that could lead to mojibake (and you had me quite convinced). -- --Guido van Rossum (home page: http://www.python.org/~guido/)
On Tuesday 30 September 2008 15:53:09, Guido van Rossum wrote:
On Mon, Sep 29, 2008 at 11:00 PM, "Martin v. Löwis" <martin@v.loewis.de> wrote:
Change the default file system encoding to store bytes in Unicode is like introducing a new Python type: <fake Unicode for filename hacks>.
Exactly. Seems like the best solution to me, despite your polemics.
Martin, I don't understand why you are in favor of storing raw bytes encoded as Latin-1 in Unicode string objects, which clearly gives rise to mojibake. In the past you have always been staunchly opposed to API changes or practices that could lead to mojibake (and you had me quite convinced).
If I understood correctly, the goal of Python3 is the clear *separation* of bytes and characters. Storing bytes in Unicode is practical because it doesn't require changing the existing code, but it doesn't fix the problem: it just moves the problems, which will be raised later. I didn't get an answer to my question: what is the result <bytes (fake characters) stored in unicode> + <real unicode>? I guess that the result is <mixed "bytes" and characters in unicode> instead of raising an error (invalid types). So again: why introducing a new type instead of reusing existing Python types? -- Victor Stinner aka haypo http://www.haypocalc.com/blog/
I didn't get an answer to my question: what is the result <bytes (fake characters) stored in unicode> + <real unicode>? I guess that the result is <mixed "bytes" and characters in unicode> instead of raising an error (invalid types). So again: why introducing a new type instead of reusing existing Python types?
I didn't mean to introduce a new data type in the strict sense - merely to pass through undecodable bytes through the regular Unicode type. So the result of adding them is a regular Unicode string. Regards, Martin
Guido van Rossum wrote:
On Mon, Sep 29, 2008 at 11:00 PM, "Martin v. Löwis" <martin@v.loewis.de> wrote:
Change the default file system encoding to store bytes in Unicode is like introducing a new Python type: <fake Unicode for filename hacks>. Exactly. Seems like the best solution to me, despite your polemics.
Martin, I don't understand why you are in favor of storing raw bytes encoded as Latin-1 in Unicode string objects, which clearly gives rise to mojibake. In the past you have always been staunchly opposed to API changes or practices that could lead to mojibake (and you had me quite convinced).
True. I try to weigh the need for simplicity in the API against the need to support all cases. So I see two solutions:

a) support bytes as file names. Supports all cases, but complicates the API very much, by pervasively bringing bytes into the status of a character data type. IMO, this must be prevented at all costs.

b) make character (Unicode) strings the only string type. Does not immediately support all cases, so some hacks are needed. However, even with the hacks, it preserves the simplicity of the API; the hacks then should ideally be limited to the applications that need it. On this side, I see the following approaches:
   1. try to automatically embed non-representable characters into the Unicode strings, e.g. by using PUA characters. Reduces the amount of moji-bake, but produces a lot of difficult issues.
   2. let applications that desire so access all file names in a uniform manner, at the cost of producing tons of moji-bake

In this case, I think moji-bake is unavoidable: it is just a plain flaw in the POSIX implementations (not the API or specification) that you can run into file names where you can't come up with the right rendering. Even for solution a), the resulting data cannot be displayed "correctly" in all cases.

Currently, I favor b2, but haven't given up on b1, and they don't exclude each other. b2 is simple to implement, and delegates the choice between legible file names and universal access to all files to the application. Given the way Unix works, this is the most sensible choice, IMO: by default, Python should try to make file names legible, but stuff like backup applications should be implementable also - and they don't need legible file names.

I think option a) will haunt us forever. People will ask for more and more features in the bytes type, eventually asking "give us Python 2.x strings back". It already starts: see #3982, where Benjamin asks to have .format added to bytes (for a reason unrelated to file names).

Regards, Martin
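Approach b1 (embedding undecodable bytes as PUA characters) could be sketched with a custom codec error handler. This is a hypothetical illustration, not the scheme proposed in the thread: the handler name `pua_escape_sketch`, the offset `PUA_BASE` and both helper functions are assumptions chosen for this example.

```python
import codecs

# Hypothetical choice: map byte 0xNN to the Private Use Area code point U+F7NN.
PUA_BASE = 0xF700


def _pua_escape(err):
    """Decode-error handler: replace each undecodable byte by a PUA character."""
    if not isinstance(err, UnicodeDecodeError):
        raise err
    bad = err.object[err.start:err.end]
    return ''.join(chr(PUA_BASE + b) for b in bad), err.end


codecs.register_error('pua_escape_sketch', _pua_escape)


def decode_name(raw: bytes) -> str:
    """Decode a UTF-8 file name, smuggling invalid bytes through as PUA."""
    return raw.decode('utf-8', 'pua_escape_sketch')


def encode_name(name: str) -> bytes:
    """Reverse decode_name(): PUA characters in the range become raw bytes."""
    out = bytearray()
    for ch in name:
        cp = ord(ch)
        if PUA_BASE <= cp < PUA_BASE + 0x100:
            out.append(cp - PUA_BASE)
        else:
            out += ch.encode('utf-8')
    return bytes(out)


round_trip = encode_name(decode_name(b'a\xffb'))

# One of the "difficult issues": a genuine U+F7FF character in a file name
# is indistinguishable from an escaped 0xFF byte and no longer round-trips.
ambiguous = encode_name('\uf7ff')
genuine = '\uf7ff'.encode('utf-8')
```

The sketch round-trips invalid bytes, but the last lines show that legitimately encoded PUA characters collide with the escapes, so the mapping is not reversible for every file name.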
On Tue, Sep 30, 2008 at 1:04 PM, "Martin v. Löwis" <martin@v.loewis.de> wrote:
Guido van Rossum wrote:
On Mon, Sep 29, 2008 at 11:00 PM, "Martin v. Löwis" <martin@v.loewis.de> wrote:
Change the default file system encoding to store bytes in Unicode is like introducing a new Python type: <fake Unicode for filename hacks>. Exactly. Seems like the best solution to me, despite your polemics.
Martin, I don't understand why you are in favor of storing raw bytes encoded as Latin-1 in Unicode string objects, which clearly gives rise to mojibake. In the past you have always been staunchly opposed to API changes or practices that could lead to mojibake (and you had me quite convinced).
True. I try to weigh the need for simplicity in the API against the need to support all cases. So I see two solutions:
a) support bytes as file names. Supports all cases, but complicates the API very much, by pervasively bringing bytes into the status of a character data type. IMO, this must be prevented at all costs.
That's a matter of opinion. I would also like to point out that it is in fact already supported by the system calls. io.open() doesn't, but that's a wrapper around _fileio._FileIO which does support bytes. All other syscalls already do the right thing (even readlink()!) except os.listdir(), which returns a mixture of bytes and str values (which is horrible) and os.getcwd() which needs a bytes equivalent. Victor's patch addresses all these issues. Victor's patch also tries to fix glob.py, fnmatch.py, and posixpath.py. That is more debatable, because this might be the start of a never-ending project. OTOH we have precedents, e.g. the re module similarly supports both bytes and unicode (and makes an effort to avoid mixing them).
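The precedent set by the re module is easy to check. The sketch below uses only standard `re` calls; the sample patterns and variable names are illustrative.

```python
import re

# A str pattern works on str, and '.' matches one character.
text_match = re.match(r'dossi.', 'dossié')

# A bytes pattern works on bytes, and '.' matches one byte.
bytes_match = re.match(br'dossi.', 'dossié'.encode('utf-8'))

# Mixing the two is refused instead of being converted implicitly.
try:
    re.match(br'dossi.', 'dossié')
    mixing_allowed = True
except TypeError:
    mixing_allowed = False
```

The bytes match stops in the middle of the two-byte UTF-8 sequence for 'é', a reminder that bytes operations know nothing about characters.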
b) make character (Unicode) strings the only string type. Does not immediately support all cases, so some hacks are needed. However, even with the hacks, it preserves the simplicity of the API; the hacks then should ideally be limited to the applications that need it. On this side, I see the following approaches:
   1. try to automatically embed non-representable characters into the Unicode strings, e.g. by using PUA characters. Reduces the amount of moji-bake, but produces a lot of difficult issues.
   2. let applications that desire so access all file names in a uniform manner, at the cost of producing tons of moji-bake
In this case, I think moji-bake is unavoidable: it is just a plain flaw in the POSIX implementations (not the API or specification) that you can run into file names where you can't come up with the right rendering. Even for solution a), the resulting data cannot be displayed "correctly" in all cases.
But I still like the ultimate solution to displaying names for (a) better: if it's not decodable, display it as the repr() of a bytes object. (Which happens to be its str() as well.)
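A display rule along these lines could be as small as the following sketch; `display_name` is a hypothetical helper, not an existing function.

```python
def display_name(name) -> str:
    """Return something printable for a file name of either type.

    Hypothetical helper: text names are shown as-is, undecodable names
    (kept as bytes) are shown as the repr() of the bytes object.
    """
    if isinstance(name, bytes):
        return repr(name)
    return name


labels = [display_name(n) for n in ['têste', b'a\xffb']]
```

The bytes form is ugly but unambiguous: it cannot be mistaken for a real text name, and it shows exactly which bytes are on disk.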
Currently, I favor b2, but haven't given up on b1, and they don't exclude each other. b2 is simple to implement, and delegates the choice between legible file names and universal access to all files to the application. Given the way Unix works, this is the most sensible choice, IMO: by default, Python should try to make file names legible, but stuff like backup applications should be implementable also - and they don't need legible file names.
I don't believe that an application-wide choice is safe. For example the tempfile module manipulates filenames (at least for NamedTemporaryFile) and I think it would be wrong if it were affected by such a global setting. (E.g. the user could pass a suffix argument containing Unicode characters outside Latin-1.)
I think option a) will haunt us forever. People will ask for more and more features in the bytes type, eventually asking "give us Python 2.x strings back". It already starts: see #3982, where Benjamin asks to have .format added to bytes (for a reason unrelated to file names).
I'm not so worried about feature requests for the bytes type unrelated to filesystems; we can either grant them or not, and I am actually in many cases in favor of granting them -- just like we support bytes in the re module as I already mentioned above. The bytes and str types have intentionally similar APIs, because they have similar structure, and even somewhat similar semantics (b'ABC' and 'ABC' have related meanings even if there are subtle differences). I am also encouraged by Glyph's support for (a). He has a lot of practical experience. -- --Guido van Rossum (home page: http://www.python.org/~guido/)
On 30 Sep, 09:22 pm, guido@python.org wrote:
On Tue, Sep 30, 2008 at 1:04 PM, "Martin v. Löwis" <martin@v.loewis.de> wrote:
Guido van Rossum wrote:
On Mon, Sep 29, 2008 at 11:00 PM, "Martin v. Löwis" <martin@v.loewis.de> wrote:
Martin, I don't understand why you are in favor of storing raw bytes encoded as Latin-1 in Unicode string objects, which clearly gives rise to mojibake.
This is my word of the day, by the way. Reading this whole thread was _totally_ worth it to learn about "mojibake". Obviously I'm familiar with the phenomenon but somehow I'd never heard this awesome term before.
I am also encouraged by Glyph's support for (a). He has a lot of practical experience.
Thanks for the vote of confidence. I hope for all our sakes that you're not over-valuing that experience ;-).

For what it's worth, I can see MvL's point in that I think there is some danger in generating confusion by adding _too many_ string-like functions to the bytes type. I don't want my suggestion to contribute to the confusion between bytes and text.

However, Martin, I can promise you that I will _never_ ask for any convenience functions related to bytes as a result of this decision. I want bytes to come back from filesystem APIs because I intend to have a wrapper layer which knows two things about the file: the bytes (which are needed to talk to POSIX filesystem APIs) and the characters (which are computed from those bytes, can be safely renormalized, displayed to users, etc). On Windows this filesystem wrapper will necessarily behave differently, and will not use bytes for anything. Any formatting beyond joining path segments together and possibly splitting extensions off will be done on character strings, not byte strings.

The proposal of using U+0000 seems like it would have been almost the same from such a wrapper's perspective, except (A) people using the filesystem APIs without the benefit of such a wrapper would have been even more screwed, and (B) there are a few nasty corner-cases when dealing with surrogate (i.e. invalid, in UTF-8) code points which I'm not quite sure what it would have done with.

Guido already mentioned "libraries" as a hypothetical issue, but here's a real-world problem that results from putting NULLs into filenames. Consider this program:

    import gtk
    w = gtk.Window()
    b = gtk.Button(u"\u0000/hello/world")
    w.add(b)
    w.show_all()
    gtk.main()

which emits this message:

    TypeError: OGtkButton.__init__() argument 1 must be string without null bytes or None, not unicode

SQLite has a similar problem with NULLs, and I'm definitely sticking paths in there, too.
Eventually I'd like to propose such a path type for inclusion in the stdlib, but that will have to wait for issues like <http://twistedmatrix.com/trac/ticket/2366> to be resolved.
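To make the wrapper idea above concrete, here is a minimal sketch of an object that keeps the raw bytes for talking to the filesystem and derives characters from them for display only. The class name `FileName`, its attributes, and the choice of `errors="replace"` are assumptions made for illustration; this is not Twisted's FilePath and not a proposed stdlib API.

```python
import os


class FileName:
    """Hypothetical wrapper: raw bytes for the OS, characters for people."""

    def __init__(self, raw: bytes, encoding: str = "utf-8"):
        self.raw = raw            # exact bytes to hand back to POSIX APIs
        self.encoding = encoding  # assumed display encoding

    @property
    def text(self) -> str:
        # Lossy by design: undecodable bytes become U+FFFD, so this value
        # is for display only and must never be re-encoded to reach the file.
        return self.raw.decode(self.encoding, errors="replace")

    def child(self, name: bytes) -> "FileName":
        # Joining path segments is done on bytes; no decoding is involved.
        return FileName(os.path.join(self.raw, name), self.encoding)


entry = FileName(b"dir\xffname")
print(entry.text)               # safe to show to a user
print(entry.child(b"xyz").raw)  # still usable with the filesystem
```

The design point is that the two views never get mixed: anything passed to the operating system comes from `raw`, and anything shown to a user comes from `text`.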
On Tue, Sep 30, 2008 at 8:06 PM, <glyph@divmod.com> wrote:
The proposal of using U+0000 seems like it would have been almost the same from such a wrapper's perspective, except (A) people using the filesystem APIs without the benefit of such a wrapper would have been even more screwed, and (B) there are a few nasty corner-cases when dealing with surrogate (i.e. invalid, in UTF-8) code points which I'm not quite sure what it would have done with.
Surrogates in UTF-8 *should* be treated as errors, but current Python is far too lax. That actually leads to another problem: improving validation will change what gets escaped and what doesn't. http://bugs.python.org/issue3297 http://bugs.python.org/issue3672 -- Adam Olsen, aka Rhamphoryncus
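As a small illustration of what "treated as errors" means, the sketch below checks whether a byte string decodes under the strict UTF-8 codec. It reflects the behaviour of current CPython 3 releases, not necessarily the interpreter being discussed in 2008; the helper name `decodes_as_strict_utf8` is made up for this example.

```python
def decodes_as_strict_utf8(data: bytes) -> bool:
    """Return True if data is accepted by the strict UTF-8 decoder."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# U+D800 (a lone surrogate) written out as a three-byte UTF-8-style sequence.
lone_surrogate = b"\xed\xa0\x80"

print(decodes_as_strict_utf8(lone_surrogate))    # rejected by the strict codec
print(decodes_as_strict_utf8(b"dossi\xc3\xa9"))  # ordinary UTF-8 is accepted

# The lax behaviour has to be requested explicitly with an error handler.
print(repr(lone_surrogate.decode("utf-8", "surrogatepass")))
```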
On Sep 30, 2008, at 10:06 PM, glyph@divmod.com wrote:
However, Martin, I can promise you that I will _never_ ask for any convenience functions related to bytes as a result of this decision. I want bytes to come back from filesystem APIs because I intend to have a wrapper layer which knows two things about the file: the bytes (which are needed to talk to POSIX filesystem APIs) and the characters (which are computed from those bytes, can be safely renormalized, displayed to users, etc). On Windows this filesystem wrapper will necessarily behave differently, and will not use bytes for anything. Any formatting beyond joining path segments together and possibly splitting extensions off will be done on character strings, not byte strings.
Can you clarify what proposal you are supporting for Python:

1) Two sets of APIs, one returning unicode strings, and one returning bytestrings. (subpoints: what does the unicode-returning API do when it cannot decode the bytestring into unicode? raise exception, pretend argument/envvar/file didn't exist/?)

or

2) All APIs return bytestrings only. Converting to unicode is considered lossy, and would have to be done by applications for display purposes only.

I really don't understand the reasoning for (1). It seems to me that most software (probably including all of the Python stdlib) would continue to use the unicode string API. Switching all of the Python stdlib to use the bytestring APIs instead would certainly be a large undertaking, and would have all sorts of ripple-on API changes (e.g. __file__). So I can only imagine that if you're proposing (1), you're doing so without the intention of suggesting that Python be converted to use it. And so, of course, that doesn't really fix things (such as getcwd failing if your cwd is a path that is undecodable in the current locale, or, well, currently, Python refusing to even start).

If you're proposing (2), it's at least as large an undertaking as (1) + converting Python to use the optional bytestring APIs. But at least it avoids exposing an API that people ought not use, and does make it obvious what still needs to be fixed: the unfixed code simply won't run at all.
The proposal of using U+0000 seems like it would have been almost the same from such a wrapper's perspective, except (A) people using the filesystem APIs without the benefit of such a wrapper would have been even more screwed
I'm not sure what your "more screwed" is comparing against: current py3k behavior? (aka: decoding to Unicode in locale's specified encoding)? I don't see how you can really be more screwed than that: not only can't you send your filename to display in a Gtk+ button, you can't access it at all, even staying within python.
and (B) there are a few nasty corner-cases when dealing with surrogate (i.e. invalid, in UTF-8) code points which I'm not quite sure what it would have done with.
The lone-surrogate-pair proposal was a totally different proposal than the U+0000 one. James
On 03:32 am, foom@fuhm.net wrote:
On Sep 30, 2008, at 10:06 PM, glyph@divmod.com wrote:
Can you clarify what proposal you are supporting for Python:
Sure. Neither of your descriptions is terribly accurate, but I'll try to explain.
1) Two sets of APIs, one returning unicode strings, and one returning bytestrings. (subpoints: what does the unicode-returning API do when it cannot decode the bytestring into unicode? raise exception, pretend argument/envvar/file didn't exist/?)
The only API discussed so far which would actually provide two variants is 'getcwd', which would have a 'getcwdb' that gives back bytes instead. Pretty much every other API takes some kind of input. listdir(bytes) would give back bytes, while listdir(text) would give back text. listdir(text) would skip undecodable filenames. Similarly for all the other APIs in os and os.path that take pathnames for input.
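The "type in, same type out" rule described above can be sketched with calls that exist in the `os` module of released Python 3 (`os.listdir`, `os.getcwd`, `os.getcwdb`). The helper name `list_both_ways` is invented for this example. Whether and how undecodable names appear in the text variant is exactly what this thread is debating, so the sketch sticks to a name that decodes cleanly.

```python
import os
import tempfile


def list_both_ways(directory: str):
    """List one directory twice: once as text, once as bytes."""
    as_text = sorted(os.listdir(directory))                # str in, str out
    as_bytes = sorted(os.listdir(os.fsencode(directory)))  # bytes in, bytes out
    return as_text, as_bytes


with tempfile.TemporaryDirectory() as tmp:
    open(os.path.join(tmp, "abc"), "w").close()
    text_names, byte_names = list_both_ways(tmp)
    print(text_names, byte_names)

# getcwd takes no argument, so it needs a separately named bytes variant.
print(type(os.getcwd()), type(os.getcwdb()))
```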
2) All APIs return bytestrings only. Converting to unicode is considered lossy, and would have to be done by applications for display purposes only.
This is a bad way to do things, because on Windows, filenames *really are* unicode. Converting to bytes is what's lossy. (See previous discussion of active codepages and CreateFileA/CreateFileW.)
I really don't understand the reasoning for (1).
The reasoning is that a lot of software doesn't care if it's wrong for edge cases, and it's really hard to come up with something that's correct with respect to all of those edge cases (absurdly difficult, if you need to stay in the straitjacket of string / bytes types, as well as provide a useful library interface - which is why we're having this discussion). But, it should be _possible_ to write software that's correct in the face of those edge cases.

And - let's not forget this - the worlds of POSIX and Windows really are different and really do require subtly different inputs. Python can try to paper over this like Java does and make it impossible to write certain classes of application, or it can just provide an ugly, slightly inconsistent API that exposes the ugly, slightly inconsistent reality. Modulo the issues you've raised which I don't think the proposal totally covers yet (abspath with a non-decodable cwd), I think it strikes a nice balance: allow people to live in the delusion of unicode-on-POSIX and have software that mostly works, most of the time, or allow them to face the unpleasantness and spend the effort to get something really solid.

I think the _right_ answer to all of this is to (A) make FilePath work completely correctly for every totally insane edge case ever, and (B) include it in the stdlib. One day I think we'll do that. But nobody has the time or energy to do even the first part of that *right now*, before 3.0 is released, so I'm just looking for something which it will be possible to build FilePath, or something like it, on top of, without too badly breaking the applications of other people who rely on the os module directly.
It seems to me that most software (probably including all of the Python stdlib) would continue to use the unicode string API.
That's true. And that software wouldn't handle these edge cases completely correctly. As Guido put it, "it's a quality of implementation issue".
Switching all of the Python stdlib to use the bytestring APIs instead would certainly be a large undertaking, and would have all sorts of ripple-on API changes (e.g. __file__).
I am not quite sure what to do about __file__. My preference would probably be to use a unicode filename for consistency so it can always be displayed, but provide a second attribute (__open_file__?) that would be sometimes unicode, sometimes bytes, which would be guaranteed to work with open(). I suspect that most software which interacts with __file__ on a deep level would be of the variety which would deal with the edge cases. But where the Python stdlib wants a pathname it should be accepting either bytes or unicode, as all of the os.path functions want. This does kind of suck, but the alternatives are to encode crazy extra information in unicode path names that cannot be exchanged with other programs (or with users: NULL is potentially the worst bogus character from a UI perspective), or revert to bytes for everything (which is a non-solution, cf. Windows above).
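For reference, this is roughly what "accepting either bytes or unicode" looks like with `os.path` in released Python 3: each function works on one type or the other, and mixing the two in one call is refused. The helper name `mixing_is_rejected` is only for this sketch, and the sample names are borrowed from earlier in the thread.

```python
import os

# Text in, text out.
text_path = os.path.join("dossié", "fichié")

# Bytes in, bytes out (here the same names, UTF-8 encoded).
bytes_path = os.path.join(b"dossi\xc3\xa9", b"fichi\xc3\xa9")


def mixing_is_rejected() -> bool:
    """Return True if os.path.join refuses a str/bytes mixture."""
    try:
        os.path.join("dossié", b"fichi\xc3\xa9")
    except TypeError:
        return True
    return False


print(type(text_path).__name__, type(bytes_path).__name__)
print(mixing_is_rejected())
print(os.path.splitext(b"archive.tar"))  # extension splitting also works on bytes
```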
So I can only imagine that if you're proposing (1), you're doing so without the intention of suggesting that Python be converted to use it.
Maybe updating the stdlib to be correct in the face of such changes is hard, but it doesn't seem intractable. It looks like there are only about 100 calls in the stdlib to getcwd and abspath combined, and I suspect many of them are for purely aesthetic purposes and could just be eliminated, and many of them are redefinitions of the functions and don't need any changes. All the other path manipulation functions would continue to work as-is, although some of them might skip undecodable files.
And so, of course, that doesn't really fix things (such as getcwd failing if your cwd is a path that is undecodable in the current locale, or, well, currently, Python refusing to even start).
The proposal as I understand it so far doesn't address this specifically, so I'll try to. os.getcwd, os.path.abspath, and os.path.realpath (when called with unicode) will probably need to do something gross if they're called on a non-decodable directory. One thing that comes to mind is to create a temporary symbolic link and return u'/tmp/python-$YOURUID-undecodable/$GUID/something'. I hope someone else has a better idea, especially since that sort of defeats the purpose of realpath. On the other hand, even this strawman answer is correct for pretty much any sane purpose, and if you _really_ care, you need to learn that you have to use and ask for bytes, on POSIX, to deal with such corner cases.
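The strawman above could look something like the following on POSIX. This is only a sketch of the idea as worded in the message: the function name `decodable_alias`, the directory prefix, and the use of `uuid` for the link name are all assumptions, and nothing like this is part of any actual proposal or of the stdlib.

```python
import os
import tempfile
import uuid


def decodable_alias(target: bytes) -> str:
    """Create a symlink with a plain-text name pointing at a bytes path.

    The returned text path can be displayed and passed around as str,
    while the real (possibly undecodable) location stays in `target`.
    """
    # A private holding directory with an ASCII-only prefix (assumed naming).
    holder = tempfile.mkdtemp(prefix="python-undecodable-")
    link = os.path.join(holder, uuid.uuid4().hex)
    # Both arguments are passed as bytes so no decoding of target is needed.
    os.symlink(target, os.fsencode(link))
    return link
```

As the message itself notes, resolving such an alias leads straight back to the undecodable name, so this only helps code that treats the path as an opaque handle. The sketch also leaves cleanup of the holding directory to the caller.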
If you're proposing (2), (...)
Luckily I'm not.
The proposal of using U+0000 seems like it would have been almost the same from such a wrapper's perspective, except (A) people using the filesystem APIs without the benefit of such a wrapper would have been even more screwed
I'm not sure what your "more screwed" is comparing against: current py3k behavior? (aka: decoding to Unicode in locale's specified encoding)? I don't see how you can really be more screwed than that: not only can't you send your filename to display in a Gtk+ button, you can't access it at all, even staying within python.
You're screwed if you're trying to access files in a portable way without worrying at all about encodings. There are files you won't be able to access, there are conditions you won't be able to deal with. Sorry, but POSIX sucks and that's life. You're _more_ screwed if you're trying to access those files in a portable way without worrying about encodings, and the API you're using is giving you back invalid, magic path names, with NULLs rather than being slightly lossy and dropping filenames you (obviously, by virtue of the way you requested those filenames) won't be able to deal with. So I was talking here about the default behavior in the case of a naive program that wants to pretend all paths are unicode.
and (B) there are a few nasty corner-cases when dealing with surrogate (i.e. invalid, in UTF-8) code points which I'm not quite sure what it would have done with.
The lone-surrogate-pair proposal was a totally different proposal than the U+0000 one.
I wasn't referring to the lone-surrogate-pair encoding trick, I was referring to the fact that some people are going to want to treat surrogate pairs as encoding errors (i.e. include the NULL byte) and some will want to treat them as valid. If you want them to be valid you have to normalize away the surrogates in order to talk to other software, but you can't do that because then you'll get different bytes when you re-encode them. There's probably a way around that but it would be subtle and controversial no matter how you did it.
glyph@divmod.com wrote:
The reasoning is that a lot of software doesn't care if it's wrong for edge cases, and it's really hard to come up with something that's correct with respect to all of those edge cases (absurdly difficult, if you need to stay in the straitjacket of string / bytes types, as well as provide a useful library interface - which is why we're having this discussion). But, it should be _possible_ to write software that's correct in the face of those edge cases.
I just wanted to highlight this as something to keep in mind during this discussion: we want to keep the easy things easy and make the difficult things possible. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia --------------------------------------------------------------- http://www.boredomandlaziness.org
On Wednesday 01 October 2008 04:06:25, glyph@divmod.com wrote:
b = gtk.Button(u"\u0000/hello/world")
which emits this message: TypeError: OGtkButton.__init__() argument 1 must be string without null bytes or None, not unicode
SQLite has a similar problem with NULLs, and I'm definitely sticking paths in there, too.
I think that you can say "all C libraries". Would it be possible to convert the encoded string to bytes just before calling Gtk? (job done by some Python internals, not as an explicit conversion)

I don't know if it would help the discussion, but Java uses its own modified UTF-8 encoding:
* NUL byte is encoded as 0xc0 0x80 instead of 0x00
* Java doesn't support unicode > 0xFFFF (bouuuuh!)

http://java.sun.com/javase/6/docs/api/java/io/DataInput.html#modified-utf-8

-- Victor Stinner aka haypo http://www.haypocalc.com/blog/
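For readers unfamiliar with the Java encoding mentioned above, here is a rough Python sketch of an encoder following the rules in the linked DataInput documentation: NUL becomes the two bytes 0xC0 0x80, and code points above 0xFFFF are written as a surrogate pair with each half encoded as three bytes. The function name `encode_modified_utf8` is made up for this example, and it is not something Python itself provides.

```python
def encode_modified_utf8(text: str) -> bytes:
    """Encode text roughly the way Java's modified UTF-8 does (sketch)."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp == 0:
            # NUL never appears as a raw 0x00 byte in the output.
            out += b"\xc0\x80"
        elif cp > 0xFFFF:
            # Split into a UTF-16 surrogate pair, then encode each half
            # as its own three-byte sequence.
            cp -= 0x10000
            high = 0xD800 + (cp >> 10)
            low = 0xDC00 + (cp & 0x3FF)
            out += chr(high).encode("utf-8", "surrogatepass")
            out += chr(low).encode("utf-8", "surrogatepass")
        else:
            out += ch.encode("utf-8", "surrogatepass")
    return bytes(out)


encoded = encode_modified_utf8("\u0000/hello/world")
print(encoded)             # starts with the two-byte NUL replacement
print(b"\x00" in encoded)  # no embedded NUL byte left for C libraries to reject
```

The result is deliberately not valid standard UTF-8, which is the same "escapes from Python" concern raised at the top of the thread.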
SQLite has a similar problem with NULLs, and I'm definitely sticking paths in there, too.
I think that you can say "all C libraries".
Just for the sake of nit-picking: the socket library, and the regular POSIX stream IO library (as well as C standard "unformatted" IO) deal just fine with embedded NULL characters.
* Java doesn't support unicode > 0xFFFF (bouuuuh!)
I don't think that is true anymore. Regards, Martin
participants (11)
- "Martin v. Löwis"
- Adam Olsen
- Bill Janssen
- glyph@divmod.com
- Greg Ewing
- Guido van Rossum
- James Y Knight
- M.-A. Lemburg
- Nick Coghlan
- Ulrich Eckhardt
- Victor Stinner