
I don't know an awful lot about unicode, so rather than clog up the already lengthy threads on the 3k list, I figured I'd just toss this idea out over here.

As I understand it, there is a fairly major issue regarding malformed unicode data being passed to Python, particularly on startup and for filenames. This has led to much discussion and ultimately the decision (?) to mirror a variety of OS functions to work with both bytes and unicode. Obviously this puts us on a slippery slope back toward 2.x, where unicode wasn't "core".

My thought is this: when passed invalid unicode, keep it invalid. This is largely similar to the UTF-8b ideas that were being tossed around, but a tad different. The idea would be to preserve invalid byte sequences by mapping them into the private use area of the unicode spec, but to be explicit about this conversion to the program. In particular, I'm suggesting the addition of the following (I'll use "surrogate" to refer to the invalid bytes in a unicode string):

1) Encoding 'raw'. Force all bytes to be converted to surrogate values. Decoding from raw converts the bytes back, and gives an error on valid unicode characters(!). This would enable applications to effectively interface with the system using bytes (by setting the default encoding or the like), without requiring any API changes to actually support the bytes type.

2) Error handler 'force' (or whatever). For decoding, when an invalid byte is encountered, replace it with a surrogate. For encoding, write out the invalid byte.

2a) Decoding invalid unicode, or encoding a string containing surrogates, raises a UnicodeError (unless the 'force' handler is specified or the encoding is 'raw').

3) String method 'valid'. valid() would return False if the string contains at least one surrogate and True otherwise. This would let programs check whether a string is correct and handle it if not.
This would be of particular value when reading boot information like sys.argv, which would use the 'force' error handler in order to prevent boot failure.

How the invalid bytes would be stored internally is certainly a matter of hot debate on the 3k list. As I mentioned before, I am not intimately familiar with unicode, so I don't have much to suggest. If I had to implement it myself right now, I'd probably use a piece of the private use area as an escape (much like '\\' does).

Finally, there seems to be much concern about internal invalid unicode wreaking havoc when passed to external programs/libraries. I have to say that I don't really see the problem, because whenever Python writes unicode, oughtn't it go through "encode"? In that case you'd either get an error or would be explicitly allowing invalid strings (via 'raw' or 'force'). And besides, if Python has to deal with bad unicode, these libraries should have to too ;).

Even more finally, let me apologize in advance if I missed something on another list or this is otherwise too redundant.
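The proposed 'force' handler and valid() check can be roughly sketched with today's codecs machinery. Everything here is a hypothetical illustration, not an existing API: the PUA base, the handler name, and the valid() helper are all choices made for this sketch. (For comparison, the mechanism Python eventually standardized, PEP 383's 'surrogateescape' handler, smuggles undecodable bytes as lone surrogates U+DC80..U+DCFF rather than PUA code points.)

```python
import codecs

PUA_BASE = 0xF700  # arbitrary Private Use Area base; an assumption of this sketch

def force_errors(exc):
    """Decode-side 'force' handler: map each undecodable byte to a
    PUA code point (PUA_BASE + byte value) instead of raising."""
    if isinstance(exc, UnicodeDecodeError):
        bad = exc.object[exc.start:exc.end]
        return ''.join(chr(PUA_BASE + b) for b in bad), exc.end
    raise exc

codecs.register_error('force', force_errors)

def valid(s):
    """Sketch of the proposed str method: True iff the string
    carries no smuggled bytes."""
    return not any(PUA_BASE <= ord(c) <= PUA_BASE + 0xFF for c in s)

text = b'caf\xe9'.decode('utf-8', errors='force')  # \xe9 is invalid UTF-8 here
assert text == 'caf\uf7e9'
assert not valid(text)

# The standardized cousin (PEP 383) uses lone surrogates instead
# and round-trips back to the original bytes on encode:
assert b'caf\xe9'.decode('utf-8', errors='surrogateescape') == 'caf\udce9'
```

Note that, unlike surrogateescape, the PUA scheme above cannot round-trip a string that legitimately contained those PUA code points, which is exactly the collision problem raised downthread.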

Dillon Collins writes:
FWIW, this has been suggested several times. There are two problems with it.

The first is collisions with other private space users. Unlikely, but it will (eventually) happen. When it does, it will very likely result in data corruption, because those systems will assume that these are valid private sequences, not reencoded pollution. One way to mitigate this would be a runtime configuration option for where to start the private encoding space. It still won't avoid collisions completely, because some applications don't know or care what is valid and therefore might pass you anything. But mostly it should win, because people who are assigning semantics to private space characters will need to know what characters they're using, and the others will rarely be able to detect corruption anyway.

The second problem is that internal data will leak to other libraries. There is no reason to suppose that those libraries will see reencoded forms, because the whole point of using the C interface is to work directly on the Python data structures. At that point, you do have corruption, because the original invalid data has been converted to valid Unicode. You write "And besides, if python has to deal with bad unicode, these libraries should have to too ;)." Which is precisely right. The bug in your idea is that they never will! Your proposal robs them of the chance to do it in their own way, by buffering everything through Python's cleanup process.

AFAICS, there are two sane paths. First, accept (and document!) that you will pass corruption to other parts of the system, and munge bad octet sequences into some kind of valid Unicode (e.g., U+FFFD REPLACEMENT CHARACTER, or a PUA encoding of raw bytes). Second, signal an error on encountering an invalid octet sequence, and leave it up to the user program to handle it.

On Thursday 09 October 2008, Stephen J. Turnbull wrote:
I certainly do agree that assuming PUA codes will never be used is foolish. As I suggested later on, you could use a PUA code as a sort of backslash escape to preserve both the valid PUA code and the invalid data.
Yes and no... While C programs generally work on Python's internal data structures, they shouldn't (and basically don't) do so through direct access to the PyObject struct. Instead, they use the various macros/functions provided. With my proposal, unicode strings would carry a valid flag, and one could easily modify PyUnicode_AS_UNICODE to return NULL (and set a UnicodeError) if the string is invalid, and add a PyUnicode_AS_RAWUNICODE that wouldn't. Or you could simply document that libraries need to call a PyUnicode_ISVALID to determine whether or not the string contains invalid codes.
What makes this problem nasty all around is that your proposal has the same bug: by not allowing invalid unicode internally, the only way to let programs handle the (possible) problem is to always accept bytes, which would put us most of the way back to a 2.x world. At least with my proposal, libraries can opt to deal with the bad, albeit slightly sanitized, unicode if they want to.
Well, the bulk of my proposal was to allow the program to choose which one of those (3!) options they want. I fail to see the benefit of forcing their hands, especially since the API already supports this through the use of both codecs and error handlers. It just seems like a more elegant solution to me.
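The point about the API already supporting per-caller choice can be shown in a few lines with the existing codec error handlers; the byte string is just an example of invalid UTF-8:

```python
data = b'abc\xff'  # \xff can never appear in valid UTF-8

try:
    data.decode('utf-8')                  # errors='strict' is the default
except UnicodeDecodeError:
    pass                                  # this caller opted for hard failure

# Other callers can pick a different policy per call:
assert data.decode('utf-8', errors='replace') == 'abc\ufffd'
assert data.decode('utf-8', errors='ignore') == 'abc'
```

The disagreement in the thread is not whether these policies exist, but which one (if any) the core should apply by default at boundaries like sys.argv.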

Dillon Collins wrote:
On Thursday 09 October 2008, Stephen J. Turnbull wrote:
Would it make any sense to have a Filename subclass or a BadFilename subclass or more generally a PUAcode subclass for any unicode generated by the core that uses the PUA? In either of the latter cases, any app using the PUA would/could know not to mix PUAcode instances into their own unicode. And leakage without re-encoding into bytes could be inhibited more easily. tjr

Terry Reedy writes:
Would it make any sense to have a Filename subclass
Sure ... but as glyph has been explaining, that should really be generalized to a representation of filesystem paths, and that is an unsolved problem at the present time.
or a BadFilename subclass or more generally a PUAcode subclass for any unicode generated by the core that uses the PUA?
IMO, this doesn't work, because either they act like strings when you access them naively, and you end up with corrupt Unicode loose in the wider system, or they throw exceptions unless they're first placated with appropriate rituals -- but those exceptions and rituals are what we wanted to avoid handling in the first place!

As I see it, this is not a technical problem; it's a social problem. It's not that we have no good ways to handle Unicode exceptions: we have several. It's not that we have no perfect and universally acceptable way to handle them: as usual, that's way too much to ask. The problem we face is that there are several good ways to handle the decoding exceptions, and different users/applications will *strongly* prefer different ones. In particular, if we provide one and make it the default, almost all programmers will do the easy thing, so most code will not be prepared for applications that do want a different handler. Code that expects to receive uncorrupted Unicode will have to do extra checking, and so on.

I think the best thing to do would be to improve the exception handling in codecs and library functions like os.listdir() -- IMO the fact that one exception can cost you an entire listing is a bug in os.listdir().
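The "one exception costs you an entire listing" failure mode can already be worked around with existing APIs, by listing in bytes and decoding each name individually. robust_listdir and the choice of the 'replace' handler are illustrative, not something proposed in the thread:

```python
import os

def robust_listdir(path='.'):
    """List a directory without letting one undecodable name abort
    the whole listing: request bytes names, decode entry by entry."""
    names = []
    for raw in os.listdir(os.fsencode(path)):  # bytes in, bytes names out
        try:
            names.append(raw.decode('utf-8'))
        except UnicodeDecodeError:
            # Keep the entry, marked rather than dropped or fatal.
            names.append(raw.decode('utf-8', errors='replace'))
    return names
```

A caller wanting a different policy (skip the entry, keep the raw bytes, raise) only has to change the except clause, which is Turnbull's point about handlers being a per-application choice.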

Dillon Collins writes:
It just seems like a more elegant solution to me.
Like most problems rooted in POSIX (more fairly, in implementation dependencies), this is not one amenable to elegant solutions. The data is conceptually a human-readable string, and therefore should be representable in Unicode. In practice, it normally is, but there are no guarantees. IMO, in this kind of situation it is best to raise the exception as early as possible, to preserve the context in which it occurred. I have no objection to providing a library of handlers to implement the strategies you propose, only to making any of them a Python core default.

participants (3)
- Dillon Collins
- Stephen J. Turnbull
- Terry Reedy