With issue 3672 resolved, it is now unnecessary to introduce a utf-8b codec, since the utf-8 codec will properly report errors for all byte sequences invalid in UTF-8, including lone surrogates. Therefore, utf-8b can be implemented solely through the error handler. Glenn Linderman suggested that the name "python-escape" is not very descriptive, so I've changed the name to "utf8b". I've updated the PEP accordingly. Regards, Martin
2009/5/3 "Martin v. Löwis"
With issue 3672 resolved, it is now unnecessary to introduce a utf-8b codec, since the utf-8 codec will properly report errors for all byte sequences invalid in UTF-8, including lone surrogates. Therefore, utf-8b can be implemented solely through the error handler.
That's even nicer. One minor detail though, in the sentence: "non-decodable bytes >128 will be represented as lone half surrogate" ">" should be ">=". -- Lino Mastrodomenico
Martin v. Löwis
Glenn Linderman suggested that the name "python-escape" is not very descriptive, so I've changed the name to "utf8b".
If the error handler is supposed to be used for codecs other than utf-8, perhaps it should be renamed to something more generic, e.g. "surrogate-escape"? Also, if utf8-b is not provided as a codec, will there be an easy way for user code to use the same encoding as the IO layer does (e.g. os.fsdecode/os.fsencode)?
On Sun, May 3, 2009 at 08:43, Antoine Pitrou
Also, if utf8-b is not provided as a codec, will there be an easy way for user code to use the same encoding as the IO layer does? (e.g. os.fsdecode/os.fsencode)?
I like the idea of fsencode/fsdecode functions, but we need to be careful deciding what they accept and produce on Windows. I'd expect them to be identity functions, but then the difference in platform behavior suggests perhaps they should be in os.path. Unicode to Unicode on Windows would further mean fsencode wouldn't be useful for sending filenames over sockets, and "utf8" will be prone to exceptions on the very names we're trying to support right now. Is there an advantage to not providing the "utf8b" behavior as a registered codec? -- Michael Urman
If the error handler is supposed to be used for codecs other than utf-8, perhaps it should be renamed to something more generic, e.g. "surrogate-escape"?
Perhaps. However, utf-8b doesn't really have to do anything with utf-8 - it's an algorithm based on 16-bit or 32-bit code points.
Also, if utf8-b is not provided as a codec, will there be an easy way for user code to use the same encoding as the IO layer does?
s.encode(os.getfilesystemencoding(), "utf8b") will do just that (in fact, that's exactly what the IO layer does). Regards, Martin
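For readers who want to try the round trip Martin describes, here is a minimal sketch. It assumes a current Python 3, where the handler discussed in this thread as "utf8b" is available under the name "surrogateescape" (the name it eventually shipped with). The sample bytes are made up, and UTF-8 is used in place of os.getfilesystemencoding() so the result does not depend on the locale.

```python
# A file name as raw bytes; 0xE9 on its own is not valid UTF-8.
raw = b"caf\xe9.txt"

# Decoding maps the undecodable byte 0xE9 to the lone surrogate U+DCE9.
name = raw.decode("utf-8", "surrogateescape")
print(ascii(name))

# Encoding with the same handler restores the original bytes exactly.
restored = name.encode("utf-8", "surrogateescape")
print(restored == raw)
```

In the same way, passing os.getfilesystemencoding() as the codec name should reproduce what the IO layer does on a given system.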
On Sun, May 3, 2009 at 10:39 AM, "Martin v. Löwis"
If the error handler is supposed to be used for codecs other than utf-8, perhaps it should be renamed to something more generic, e.g. "surrogate-escape"?
Perhaps. However, utf-8b doesn't really have to do anything with utf-8 - it's an algorithm based on 16-bit or 32-bit code points.
To me that lack of relationship with utf8 suggests that it should not be called utf8b... But I don't have any good suggestions.
Also, if utf8-b is not provided as a codec, will there be an easy way for user code to use the same encoding as the IO layer does?
s.encode(os.getfilesystemencoding(), "utf8b") will do just that (in fact, that's exactly what the IO layer does).
Regards, Martin
If the error handler is supposed to be used for codecs other than utf-8, perhaps it should be renamed to something more generic, e.g. "surrogate-escape"?
Perhaps. However, utf-8b doesn't really have to do anything with utf-8 - it's an algorithm based on 16-bit or 32-bit code points.
To me that lack of relationship with utf8 suggests that it should not be called utf8b
Perhaps. However, giving it that name was Markus Kuhn's choice - and while it may be confusing, it's (IMO) useful to be consistent with this background. Regards, Martin
On Sun, May 3, 2009 at 1:27 PM, "Martin v. Löwis"
If the error handler is supposed to be used for codecs other than utf-8, perhaps it should be renamed to something more generic, e.g. "surrogate-escape"?
Perhaps. However, utf-8b doesn't really have to do anything with utf-8 - it's an algorithm based on 16-bit or 32-bit code points.
To me that lack of relationship with utf8 suggests that it should not be called utf8b
Perhaps. However, giving it that name was Markus Kuhn's choice - and while it may be confusing, it's (IMO) useful to be consistent with this background.
Regards, Martin
Ah, right. My original searches for utf8b didn't turn up much but searching on his name turns some up. Good choice of name then. http://mail.nl.linux.org/linux-utf8/2000-07/msg00040.html http://bsittler.livejournal.com/10381.html http://hyperreal.org/~est/utf-8b/ -gps
On 2009-05-03 19:39, Martin v. Löwis wrote:
If the error handler is supposed to be used for codecs other than utf-8, perhaps it should be renamed to something more generic, e.g. "surrogate-escape"?
Perhaps. However, utf-8b doesn't really have to do anything with utf-8 - it's an algorithm based on 16-bit or 32-bit code points.
If the error handler doesn't have anything to do with UTF-8, then why do you use "utf8" in the name? Please use a more descriptive name for the handler which does not cause confusion with an existing codec. -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, May 05 2009)
M.-A. Lemburg wrote:
On 2009-05-03 19:39, Martin v. Löwis wrote:
If the error handler is supposed to be used for codecs other than utf-8, perhaps it should be renamed to something more generic, e.g. "surrogate-escape"?
Perhaps. However, utf-8b doesn't really have to do anything with utf-8 - it's an algorithm based on 16-bit or 32-bit code points.
If the error handler doesn't have anything to do with UTF-8, then why do you use "utf8" in the name?
Please use a more descriptive name for the handler which does not cause confusion with an existing codec.
Having already been confused, I agree.
M.-A. Lemburg writes:
On 2009-05-03 19:39, Martin v. Löwis wrote:
If the error handler is supposed to be used for codecs other than utf-8, perhaps it should be renamed to something more generic, e.g. "surrogate-escape"?
Perhaps. However, utf-8b doesn't really have to do anything with utf-8 - it's an algorithm based on 16-bit or 32-bit code points.
I don't understand this phrasing. The algorithm is only applicable to ASCII-compatible octet streams. It results in code points by a simple displacement of octet -> octet + 0xDC00. It cannot be used on (say) UTF-32 to deal with embedded surrogates. Certainly, the computation requires (at least) 16 bit numbers, but the input must be restricted to a stream of 8-bit code points, while the output is 16- or 32-bit code points.
Please use a more descriptive name [than "utf-8b"] for the handler which does not cause confusion with an existing codec.
But please don't use "surrogate-escape" or (as in the current PEP) "python-escape"; it's not an escaping (quotation) mechanism. "surrogate-replace", "surrogate-substitute", or "surrogate-translate" would be better names.
"Martin v. Löwis" writes:
I've updated the PEP accordingly.
I have three substantive comments. First, although consequences for Python 3 byte interfaces (ie, "none") are explicitly stated, as far as I can see this PEP could apply to Python 2 as well. I don't think it's intended that way. Either way, I think you should clarify that point.

Second, I suggest "surrogate-replace" as the name of the error handler rather than "utf8b". (Elsewhere I've suggested others, but I think this is the best of the bunch.)

Third, it is not clear to me why non-decodable ASCII should be an error. There are plenty of low surrogates for the purpose. Is there another technical reason? Stupid or not, Shift-JIS- and Big5-encoded file systems are quite common in Asia still (including non-rewritable media). I think surrogate-replacement of ASCII should at least be an option.

I don't think "people shouldn't be using non-ASCII-compatible encodings for locale encodings" is a sufficient rationale for a hard error here. I mean, of course they *should* be using UTF-8. Maybe Python 3.1 should just go ahead and error on any other encoding on POSIX platforms? <wink>

I have a number of nitpicking comments and technical clarifications on the PEP. Rationale is in footnotes. There were also a few typos I noticed.

1. There is no such thing as a "half-surrogate" in Unicode. "Lone surrogate" is clear enough. Or for somewhat fancier English, "isolated surrogate" or "non-syntactic surrogate". To emphasize that Python codecs will only produce them in contexts where a Unicode character or high surrogate (for UTF-16 Python) is syntactically required, "isolated low surrogate" or "isolated trailing surrogate" might be good.[1]

2. The specification should state, and the discussion emphasize, that strings which were produced by surrogate replacement *must not* be used in data interchange with systems that do not specifically accept such strings, and that this is the responsibility of the application.[2]

Rather than saying that "dealing with such conflicts is out of scope of this PEP", I would say

"""Dealing with such conflicts is the responsibility of the application. Since this PEP's mechanism produces valid Unicode where possible, and produces *invalid* code points only via the error handler, one strategy is for the application to validate all other sources of strings as Unicode conforming. There may be other useful application-specific strategies, as well."""

3. In the discussion, the transition from the example of alternative use of 'python-escape' to discussion of the error handler interface extension is a bit abrupt. I suggest rewriting as:

"""The extension to the encode error handler interface proposed by this PEP is necessary to implement the 'utf8b' error handler, because there are required byte sequences which cannot be generated from replacement Unicode. However, the encode error handler interface presently requires replacement Unicode to be provided in lieu of the non-encodable Unicode from the source string. Then it promptly encodes that replacement Unicode. In some error handlers, such as the 'utf8b' proposed here, it is also simpler and more efficient for the error handler to provide a pre-encoded replacement byte string, rather than forcing it to calculating Unicode from which the encoder would create the desired bytes."""

Typos (line references are to pep-0383.txt svn r72332):

l. 86: "Byte-orientied" -> "Byte-oriented"
l. 98, 118, 124, 127, 132, 136: "python-escape" -> "utf8b"
l. 130: "provide" -> "provided"
l. 134: "calculating" -> "calculate"

Footnotes:

[1] Unicode 5.0 uses the terms "high-half" and "low-half" at least once, in section 16.6, but the context is such that I take it to refer to "half of the surrogate area". Section 3.8 doesn't use these, instead noting that "leading" and "trailing" are sometimes used instead of "high" and "low". Better to avoid the word "half" in PEP 383, I think.

[2] Since this error handler is going to be the default for POSIX I/O, of course people are going to mostly ignore that restriction. The point is, passing such strings to systems that don't expect them is a bug, and the PEP should make it clear that it's the app's bug, not the other system's. On the other hand, using those strings in a context of consenting adults (and I do mean double-opt-in here) is perfectly acceptable. I'm specifically thinking of use in the Tahoe protocol discussed by Zooko O'Whielacronx; it may not be usable there for backward compatibility reasons, but "Unicode conformance" is not an issue in principle.

This does imply that programs that take advantage of the error handler specified in this PEP are on their own if they accept data from any sources that are not known to be Unicode-conforming. OTOH, as far as I can see if other sources are known to be Unicode conformant, it's reasonably (but not perfectly) safe to combine them with strings from this PEP (and of course use either 'utf8b' or 'strict', as appropriate, when passing data out of Python).
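The interface extension discussed in item 3 (an encode error handler that returns pre-encoded bytes instead of replacement Unicode) can be sketched as follows. This is only an illustration, assuming a current Python 3 in which codecs.register_error accepts handlers that return bytes for encoding errors; the handler name "demo-byte-escape" and the function byte_escape are hypothetical and not part of the PEP.

```python
import codecs


def byte_escape(exc):
    """Hypothetical encode error handler that returns raw bytes."""
    if not isinstance(exc, UnicodeEncodeError):
        raise exc
    out = bytearray()
    for ch in exc.object[exc.start:exc.end]:
        cp = ord(ch)
        if 0xDC80 <= cp <= 0xDCFF:
            # Undo the displacement: U+DCxx becomes the single byte 0xxx.
            out.append(cp - 0xDC00)
        else:
            # Anything else is a genuine encoding error.
            raise exc
    # Return pre-encoded bytes plus the position to resume encoding at.
    return bytes(out), exc.end


codecs.register_error("demo-byte-escape", byte_escape)

encoded = "caf\udce9".encode("utf-8", "demo-byte-escape")
print(encoded)
```

Note that the byte 0xE9 produced here could not be obtained by returning replacement Unicode and letting the UTF-8 encoder encode it, which is the motivation the suggested text gives for the extension.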
On Tue, May 5, 2009 at 8:57 AM, Stephen J. Turnbull
2. The specification should state, and the discussion emphasize, that strings which were produced by surrogate replacement *must not* be used in data interchange with systems that do not specifically accept such strings, and that this is the responsibility of the application.[2]
That sounds like a useful statement to make. How would an application make sure that they were producing only valid unicode?

How about adding an option to os.listdir() named "errors" with default value 'utf8b' (or 'surrogate-replace', or whatever the name is)? Then applications which need to produce only valid unicode strings could pass errors=strict, errors=ignore, or errors=replace? (If anyone really wants behavior like Python 3.0 then we could perhaps also add a new one just for os.listdir() named errors=skipfilename.)

My most recent plan for Tahoe, as of the letter that I sent last night, is to emulate the behavior of Nautilus and GNU ls by using the 'replace' error handler and (emulating Nautilus) to append " (invalid encoding)" to the end of the string. (screenshot: http://zooko.com/Nautilus_vs_invalid_encoding.png )

So if I could ask os.listdir to return filenames with U+FFFD in place of undecodable characters, then I could subsequently do something like:

    for f in os.listdir(d, errors='replace'):
        if u"\ufffd" in f:
            f += " (invalid encoding)"

(On top of that I would have to check for collisions, but that's out of scope.)

Regards, Zooko
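The "errors" argument to os.listdir() is only a proposal in this message; os.listdir() does not take such a parameter. A rough sketch of the same display behaviour without it, assuming a current Python 3 in which os.listdir() returns names with undecodable bytes mapped to U+DC80..U+DCFF, might look like this. The helpers display_name and list_for_display are hypothetical names introduced for illustration.

```python
import os

REPLACEMENT = "\ufffd"


def display_name(name: str) -> str:
    """Return a version of name that is safe to show to a user."""
    # Lone surrogates U+DC80..U+DCFF mark bytes that failed to decode.
    cleaned = "".join(
        REPLACEMENT if 0xDC80 <= ord(ch) <= 0xDCFF else ch for ch in name
    )
    if REPLACEMENT in cleaned:
        cleaned += " (invalid encoding)"
    return cleaned


def list_for_display(directory: str):
    """Yield (real_name, shown_name) pairs for the entries of directory."""
    for real in os.listdir(directory):
        yield real, display_name(real)


print(display_name("caf\udce9"))
```

Keeping the real name next to the shown name means the file can still be opened, since only the displayed copy has been altered.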
Stephen J. Turnbull wrote:
"Martin v. Löwis" writes:
I've updated the PEP accordingly.
I have three substantive comments. First, although consequences for Python 3 byte interfaces (ie, "none") are explicitly stated, as far as I can see this PEP could apply to Python 2 as well. I don't think it's intended that way. Either way, I think you should clarify that point.
Second, I suggest "surrogate-replace" as the name of the error handler rather than "utf8b". (Elsewhere I've suggested others, but I think this is the best of the bunch.)
+1
Third, it is not clear to me why non-decodable ASCII should be an error. There are plenty of low surrogates for the purpose. Is there another technical reason? Stupid or not, Shift-JIS- and Big5-encoded file systems are quite common in Asia still (including non-rewritable media). I think surrogate-replacement of ASCII should at least be an option.
I don't think "people shouldn't be using non-ASCII-compatible encodings for locale encodings" is a sufficient rationale for a hard error here. I mean, of course they *should* be using UTF-8. Maybe Python 3.1 should just go ahead and error on any other encoding on POSIX platforms? <wink>
I don't see why the error handler couldn't in principle be used with encodings other than UTF-8, although in that case all of the low surrogates should be open to use.
I have a number of nitpicking comments and technical clarifications on the PEP. Rationale is in footnotes. There were also a few typos I noticed.
1. There is no such thing as a "half-surrogate" in Unicode. "Lone surrogate" is clear enough. Or for somewhat fancier English, "isolated surrogate" or "non-syntactic surrogate". To emphasize that Python codecs will only produce them in contexts where a Unicode character or high surrogate (for UTF-16 Python) is syntactically required, "isolated low surrogate" or "isolated trailing surrogate" might be good.[1]
2. The specification should state, and the discussion emphasize, that strings which were produced by surrogate replacement *must not* be used in data interchange with systems that do not specifically accept such strings, and that this is the responsibility of the application.[2]
Rather than saying that "dealing with such conflicts is out of scope of this PEP", I would say
"""Dealing with such conflicts is the responsibility of the application. Since this PEP's mechanism produces valid Unicode where possible, and produces *invalid* code points only via the error handler, one strategy is for the application to validate all other sources of strings as Unicode conforming. There may be other useful application-specific strategies, as well."""
3. In the discussion, the transition from the example of alternative use of 'python-escape' to discussion of the error handler interface extension is a bit abrupt. I suggest rewriting as:
"""The extension to the encode error handler interface proposed by this PEP is necessary to implement the 'utf8b' error handler, because there are required byte sequences which cannot be generated from replacement Unicode. However, the encode error handler interface presently requires replacement Unicode to be provided in lieu of the non-encodable Unicode from the source string. Then it promptly encodes that replacement Unicode. In some error handlers, such as the 'utf8b' proposed here, it is also simpler and more efficient for the error handler to provide a pre-encoded replacement byte string, rather than forcing it to calculating Unicode from which the encoder would create the desired bytes."""
Typos (line references are to pep-0383.txt svn r72332):
l. 86: "Byte-orientied" -> "Byte-oriented"
l. 98, 118, 124, 127, 132, 136: "python-escape" -> "utf8b"
l. 130: "provide" -> "provided"
l. 134: "calculating" -> "calculate"
Footnotes: [1] Unicode 5.0 uses the terms "high-half" and "low-half" at least once, in section 16.6, but the context is such that I take it to refer to "half of the surrogate area". Section 3.8 doesn't use these, instead noting that "leading" and "trailing" are sometimes used instead of "high" and "low". Better to avoid the word "half" in PEP 383, I think.
"Leading" and "trailing" simply state the order, not the set ("high" or "low"), so are not good terms to use.
[2] Since this error handler is going to be the default for POSIX I/O, of course people are going to mostly ignore that restriction. The point is, passing such strings to systems that don't expect them is a bug, and the PEP should make it clear that it's the app's bug, not the other system's. On the other hand, using those strings in a context of consenting adults (and I do mean double-opt-in here) is perfectly acceptable. I'm specifically thinking of use in the Tahoe protocol discussed by Zooko O'Whielacronx; it may not be usable there for backward compatibility reasons, but "Unicode conformance" is not an issue in principle.
This does imply that programs that take advantage of the error handler specified in this PEP are on their own if they accept data from any sources that are not known to be Unicode-conforming. OTOH, as far as I can see if other sources are known to be Unicode conformant, it's reasonably (but not perfectly) safe to combine them with strings from this PEP (and of course use either 'utf8b' or 'strict', as appropriate, when passing data out of Python).
Should there be a function or method to check for conformance and lone surrogates?
Zooko O'Whielacronx writes:
How would an application make sure that they were producing only valid unicode?
That's very difficult. There are a couple of sources that I can think of, in Python: C modules, chr(), \u literals, and now codecs with the 'utf8b' error handler. There may be others. You'd need to review your own code for all of them very carefully, and you'd have to validate all strings returned by non-validating APIs (which is all of them in Python now, although many of them can probably be trusted, such as codecs not using the 'utf8b' error handler).
How about add an option to os.listdir() named "errors" with default value 'utf8b'
Seems reasonable to me, but Martin's probably thought more carefully about it. I don't think it's applicable to your use case, though, because you want to be able to *access* those files as well as display the names to the users, right? You won't be able to access those files if you receive the names already munged by the error handler.
MRAB writes:
I don't think "people shouldn't be using non-ASCII-compatible encodings for locale encodings" is a sufficient rationale for a hard error here. I mean, of course they *should* be using UTF-8. Maybe Python 3.1 should just go ahead and error on any other encoding on POSIX platforms? <wink>
I don't see why the error handler couldn't in principle be used with encodings other than UTF-8, although in that case all of the low surrogates should be open to use.
I should have been more clear here, I guess. The error handler *can*, and in the PEP *will be* by default, used with all "sane" locale encodings on POSIX.

It occurs to me that the PEP maybe should say that it is an error to have your POSIX locale set to UTF-16 or something like that.

What "sane" means in this context is

1. ASCII NUL is the bytearray terminator, and can't be used as a byte in a file name. This rules out UTF-16, UTF-32, and widechar EUC encodings, as well as some very rare ones.

2. An ASCII character always translates to the Unicode character with the same code (ie, "to itself"). It is not a part of other sequences (control sequences, or a trailing byte). This rules out EBCDIC, ISO-2022-*, Shift JIS, and Big5, among the encodings I'm familiar with. EBCDIC because only by accident will an EBCDIC character map to the same ASCII character with the same code. The ISO-2022-* encodings are out because ASCII characters are used in escape sequences. Shift JIS and Big5 because in those encodings, a high-bit-set octet signals the start of a multibyte sequence, and some of the trailing bytes may be in the ASCII range.

What's left? Well, UTF-8, all of the ISO-8859 sets, several national standards (such as the KOI8 family for Cyrillic), IBM and Microsoft "code pages", and the "packed" EUC encodings used for Japanese, Chinese, and Korean. These all have the character that ASCII is ASCII, and all non-ASCII characters are encoded using only high-bit-set octets. In fact, in practice, on Unix these are invariably what you encounter.

So what's the problem? Backward compatibility for Microsoft OSes, which not only used to use MBCS national character sets, but "cleverly" packed more characters into the encoding by using ASCII as trailing bytes. Ie, the aforementioned "insane" Shift JIS (which is mandated by the leading Japanese cellphone service provider even today) and Big5 (the leading encoding for Chinese until very recently). These are very commonly found on archival media, and even on USB keys and so on which tend to be FAT-formatted. This doesn't prevent usage of the Unicode APIs, but up to Windows 2000 most Japanese vendors' OEM version of Windows used FAT format and Shift JIS as the file system encoding, and I know of Japanese offices where Windows 98 systems were in use as recently as early 2007. It's the removable media which are the problem, because on Windows you just use the Unicode APIs. But they're not available on Unix, so you need the byte-oriented APIs.

Is this a real problem? I don't know, I don't do Windows, I don't do computing with my cellphone, and I don't need to get Japanese (that might be mixed with Russian ones!!) filenames off of ancient media or CIFS fileshares using Shift JIS. I guess it's possible that cellphones do everything *except* add filenames to directories in Shift JIS, but the filenames are in UTF-16.

OTOH, it seems to me that an *optional* extension to handling error on ASCII is technically feasible and would be nearly trivial to add to the PEP. The biggest cost would be adding the error argument to various functions (as Zooko requested) so that surrogate-replace-extended could be specified if needed.
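The two "sanity" criteria above can be illustrated with a few encodings that ship with Python. This is only a sketch; the sample characters are chosen for illustration, and has_ascii_byte is a hypothetical helper, not anything defined by the PEP.

```python
def has_ascii_byte(data: bytes) -> bool:
    """Return True if any byte of data lies in the ASCII range."""
    return any(b < 0x80 for b in data)


# Criterion 1: UTF-16 puts NUL bytes inside ordinary ASCII text.
utf16 = "a".encode("utf-16-le")

# Criterion 2: in Shift JIS, KATAKANA LETTER SO (U+30BD) ends in 0x5C,
# which is the ASCII backslash.
sjis = "\u30bd".encode("shift_jis")

# "Sane" encodings use only high-bit-set bytes for non-ASCII characters.
utf8 = "\u00e9".encode("utf-8")
latin1 = "\u00e9".encode("latin-1")

print(utf16, sjis, utf8, latin1)
print(has_ascii_byte(sjis), has_ascii_byte(utf8), has_ascii_byte(latin1))
```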
Footnotes: [1] Unicode 5.0 uses the terms "high-half" and "low-half" at least once, in section 16.6, but the context is such that I take it to refer to "half of the surrogate area". Section 3.8 doesn't use these, instead noting that "leading" and "trailing" are sometimes used instead of "high" and "low". Better to avoid the word "half" in PEP 383, I think.
"Leading" and "trailing" simply state the order, not the set ("high" or "low"), so are not good terms to use.
But it's the order that's important. If you've just finished reading a character, and encounter a trailing surrogate, then it was produced by the 'utf8b' error handler; nothing else in a Python codec can do that. If you've just finished reading a character, are in a UTF-16 Python, and encounter a leading surrogate, then you immediately gobble the following code, which must be a trailing surrogate, and combine them to produce a character. The remaining case is that you encounter a valid character. Anything else is an error, and (assuming no bugs), no Python codec will produce anything else.
This does imply that programs that take advantage of the error handler specified in this PEP are on their own if they accept data from any sources that are not known to be Unicode-conforming. OTOH, as far as I can see if other sources are known to be Unicode conformant, it's reasonably (but not perfectly) safe to combine them with strings from this PEP (and of course use either 'utf8b' or 'strict', as appropriate, when passing data out of Python).
Should there be a function or method to check for conformance and lone surrogates?
string.encode('utf-8', errors='strict') will do for now.
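Wrapped in a small helper, that check might look like the sketch below. The function name has_lone_surrogates is hypothetical, and the check only detects surrogate code points (which strict UTF-8 encoding rejects in Python 3); it does not test any other aspect of Unicode conformance.

```python
def has_lone_surrogates(text: str) -> bool:
    """Return True if text cannot be encoded as strict UTF-8."""
    try:
        text.encode("utf-8", errors="strict")
    except UnicodeEncodeError:
        return True
    return False


print(has_lone_surrogates("caf\udce9"))   # string carrying an escaped byte
print(has_lone_surrogates("caf\u00e9"))   # ordinary text
```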
Stephen J. Turnbull wrote:
MRAB writes:
I don't think "people shouldn't be using non-ASCII-compatible encodings for locale encodings" is a sufficient rationale for a hard error here. I mean, of course they *should* be using UTF-8. Maybe Python 3.1 should just go ahead and error on any other encoding on POSIX platforms? <wink>
I don't see why the error handler couldn't in principle be used with encodings other than UTF-8, although in that case all of the low surrogates should be open to use.
I should have been more clear here, I guess. The error handler *can*, and in the PEP *will be* by default, used with all "sane" locale encodings on POSIX.
It occurs to me that the PEP maybe should say that it is an error to have your POSIX locale set to UTF-16 or something like that.
What "sane" means in this context is
1. ASCII NUL is the bytearray terminator, and can't be used as a byte in a file name. This rules out UTF-16, UTF-32, and widechar EUC encodings, as well as some very rare ones.
[snip] It might be slightly OT, but sometimes strict UTF-8 encoding is violated by encoding U+0000 using 2 bytes (0xC0 0x80) so that 0x00 can be used as a terminator. I think I read that Microsoft sometimes does this.
MRAB writes:
[snip] It might be slightly OT, but sometimes strict UTF-8 encoding is violated by encoding U+0000 using 2 bytes (0xC0 0x80) so that 0x00 can be used as a terminator. I think I read that Microsoft sometimes does this.
Nice hack! as long as you don't let it escape. But if 'strict' errors on this, then PEP 383 'utf8b' will do the right thing, I think.
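This can be checked in a current Python 3, assuming the handler is used under its eventual name "surrogateescape" rather than "utf8b". The sample byte string is made up for illustration.

```python
# "a", then the overlong two-byte form of U+0000, then "b".
overlong_nul = b"a\xc0\x80b"

# Strict UTF-8 decoding rejects the overlong sequence.
try:
    overlong_nul.decode("utf-8")
    strict_rejects = False
except UnicodeDecodeError:
    strict_rejects = True

# With the error handler, each offending byte becomes a lone surrogate,
# and encoding with the same handler restores the original bytes.
escaped = overlong_nul.decode("utf-8", "surrogateescape")
roundtrip = escaped.encode("utf-8", "surrogateescape")

print(strict_rejects, ascii(escaped), roundtrip == overlong_nul)
```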
2009/5/5 Stephen J. Turnbull
Third, it is not clear to me why non-decodable ASCII should be an error.
The PEP originally allowed the conversion to U+DCxx of bytes below 128 that cannot be decoded by the encoding used, but this creates potential security problems. See: http://mail.python.org/pipermail/python-dev/2009-April/089102.html -- Lino Mastrodomenico
Perhaps. However, utf-8b doesn't really have to do anything with utf-8 - it's an algorithm based on 16-bit or 32-bit code points.
I don't understand this phrasing. The algorithm is only applicable to ASCII-compatible octet streams. It results in code points by a simple displacement of octet -> octet + 0xDC00. It cannot be used on (say) UTF-32 to deal with embedded surrogates.
Certainly, the computation requires (at least) 16 bit numbers, but the input must be restricted to a stream of 8-bit code points, while the output is 16- or 32-bit code points.
Right - the algorithm maps between bytes and 16/32-bit code units. It works, in particular, for UTF-8, and was originally proposed to apply to UTF-8 - but it can work in any other place that converts bytes to 16/32-bit code units as well. Regards, Martin
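The displacement discussed above (octet -> octet + 0xDC00, for octets of 128 and above) is small enough to write out by hand. The sketch below uses hypothetical helper names and compares the result with the built-in handler, assuming a current Python 3 where it is called "surrogateescape". The ASCII codec is used for the comparison to show that the mapping is not tied to UTF-8.

```python
def escape_byte(b: int) -> str:
    """Map an undecodable byte (0x80..0xFF) to a lone low surrogate."""
    if not 0x80 <= b <= 0xFF:
        raise ValueError("only bytes >= 0x80 are escaped")
    return chr(0xDC00 + b)


def unescape_char(ch: str) -> int:
    """Map a lone surrogate U+DC80..U+DCFF back to the original byte."""
    cp = ord(ch)
    if not 0xDC80 <= cp <= 0xDCFF:
        raise ValueError("not an escaped byte")
    return cp - 0xDC00


# The built-in handler produces the same code point for byte 0xE9.
builtin = bytes([0xE9]).decode("ascii", "surrogateescape")
print(ascii(escape_byte(0xE9)), ascii(builtin))
print(hex(unescape_char(builtin)))
```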
I have three substantive comments. First, although consequences for Python 3 byte interfaces (ie, "none") are explicitly stated, as far as I can see this PEP could apply to Python 2 as well. I don't think it's intended that way. Either way, I think you should clarify that point.
Done: the Python-Version header already clarifies that point.
Second, I suggest "surrogate-replace" as the name of the error handler rather than "utf8b".
I think this is bike-shedding.
Third, it is not clear to me why non-decodable ASCII should be an error. There are plenty of low surrogates for the purpose. Is there another technical reason? Stupid or not, Shift-JIS- and Big5-encoded file systems are quite common in Asia still (including non-rewritable media). I think surrogate-replacement of ASCII should at least be an option.
It's a security risk. If U+DCXX would map to \xXX, then somebody could embed U+DC2E U+DC2E U+DC2F into a character string; even if this gets sanitized, nobody would expect that this will actually access ../
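The restriction can be observed in the handler as it eventually shipped. This sketch assumes a current Python 3 and the name "surrogateescape"; the string below is the example from the paragraph above.

```python
# U+DC2E U+DC2E U+DC2F would spell "../" if ASCII bytes were un-escaped.
sneaky = "\udc2e\udc2e\udc2f"

# Only U+DC80..U+DCFF are mapped back to bytes, so this is an error.
try:
    sneaky.encode("utf-8", "surrogateescape")
    rejected = False
except UnicodeEncodeError:
    rejected = True

print(rejected)
```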
1. There is no such thing as a "half-surrogate" in Unicode. "Lone surrogate" is clear enough. Or for somewhat fancier English, "isolated surrogate" or "non-syntactic surrogate". To emphasize that Python codecs will only produce them in contexts where a Unicode character or high surrogate (for UTF-16 Python) is syntactically required, "isolated low surrogate" or "isolated trailing surrogate" might be good.[1]
Fixed. I removed the word "half" everywhere. It really doesn't mean anything to me (it could have been called sunnygate instead, making no difference). I tried to understand "surrogate", and it was explained to me that "surrogate" is something that stands for something - but then I would argue that the two subsequent codes form a surrogate - they stand for something else. The individual surrogate code (in Unicode terminology) doesn't stand for anything. So don't you agree that it is the Unicode terminology that is in error, not the PEP?
2. The specification should state, and the discussion emphasize, that strings which were produced by surrogate replacement *must not* be used in data interchange with systems that do not specifically accept such strings, and that this is the responsibility of the application.[2]
No. The specification puts no requirements on applications whatsoever. So if you propose to use MUST NOT in the RFC 2119 sense, I strongly disagree. Applications that desire mojibake are free to produce it; we are consenting adults; and all that.
3. In the discussion, the transition from the example of alternative use of 'python-escape' to discussion of the error handler interface extension is a bit abrupt. I suggest rewriting as:
"""The extension to the encode error handler interface proposed by this PEP is necessary to implement the 'utf8b' error handler, because there are required byte sequences which cannot be generated from replacement Unicode. However, the encode error handler interface presently requires replacement Unicode to be provided in lieu of the non-encodable Unicode from the source string. Then it promptly encodes that replacement Unicode. In some error handlers, such as the 'utf8b' proposed here, it is also simpler and more efficient for the error handler to provide a pre-encoded replacement byte string, rather than forcing it to calculating Unicode from which the encoder would create the desired bytes."""
Unfortunately, I failed to understand where you want this text to go. What paragraphs should I remove, or (if none), after which paragraph should I insert this text? Regards, Martin
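For readers following the interface discussion, the extension described in the suggested paragraph can be sketched with codecs.register_error. This is only an illustration under stated assumptions, not the PEP's implementation: the handler name 'demo-escape' and the function demo_escape are invented for this example, and it assumes a Python 3 interpreter in which an encode error handler is allowed to return bytes (the extension under discussion).

```python
import codecs


def demo_escape(exc):
    """Hypothetical handler sketching the behaviour discussed in the thread."""
    if isinstance(exc, UnicodeDecodeError):
        # Decoding: turn each undecodable byte >= 128 into lone surrogate U+DCxx.
        bad = exc.object[exc.start:exc.end]
        chars = []
        for byte in bad:
            if byte < 128:
                # Bytes in the ASCII range are refused, as the PEP text quoted
                # in this thread says.
                raise exc
            chars.append(chr(0xDC00 + byte))
        return "".join(chars), exc.end
    if isinstance(exc, UnicodeEncodeError):
        # Encoding: hand back pre-encoded bytes instead of replacement Unicode.
        out = bytearray()
        for ch in exc.object[exc.start:exc.end]:
            cp = ord(ch)
            if not 0xDC80 <= cp <= 0xDCFF:
                raise exc
            out.append(cp - 0xDC00)
        return bytes(out), exc.end
    raise exc


codecs.register_error("demo-escape", demo_escape)

raw = b"\xff\x85a"
text = raw.decode("utf-8", "demo-escape")
round_tripped = text.encode("utf-8", "demo-escape")
```

The encode branch is the point of the sketch: the bytes 0xFF and 0x85 cannot be produced by encoding any replacement string to UTF-8, so the handler has to supply them directly.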
It occurs to me that the PEP maybe should say that it is an error to have your POSIX locale set to UTF-16 or something like that.
No. It is *impossible* to have UTF-16 as the locale character set, not an error. Your statement is like saying "it is an error to breathe in the vacuum". In any case, the discussion says # Encodings that are not compatible with ASCII are not supported by # this specification; bytes in the ASCII range that fail to decode # will cause an exception. It is widely agreed that such encodings # should not be used as locale charsets. Regards, Martin
Martin v. Löwis wrote:
I have three substantive comments. First, although consequences for Python 3 byte interfaces (ie, "none") are explicitly stated, as far as I can see this PEP could apply to Python 2 as well. I don't think it's intended that way. Either way, I think you should clarify that point.
Done: the Python-Version header already clarifies that point.
Second, I suggest "surrogate-replace" as the name of the error handler rather than "utf8b".
I think this is bike-shedding.
The name "utf8b" suggested in the PEP is not in line with the codec design and causes confusion with an existing codec of a similar name. Error handlers and codecs are two different things, so the namespaces need to be clearly separate. Please change the name of the error handler to a different name that does not resemble or cause confusion with a codec name and fits the scheme of error handler names we already have in place in Python for replacing error handlers, i.e. "XYZreplace". Thanks, -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, May 06 2009)
Python/Zope Consulting and Support ... http://www.egenix.com/ mxODBC.Zope.Database.Adapter ... http://zope.egenix.com/ mxODBC, mxDateTime, mxTextTools ... http://python.egenix.com/
2009-06-29: EuroPython 2009, Birmingham, UK 53 days to go ::: Try our new mxODBC.Connect Python Database Interface for free ! :::: eGenix.com Software, Skills and Services GmbH Pastor-Loeh-Str.48 D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg Registered at Amtsgericht Duesseldorf: HRB 46611 http://www.egenix.com/company/contact/
"Martin v. Löwis" writes:
It occurs to me that the PEP maybe should say that it is an error to have your POSIX locale set to UTF-16 or something like that.
No. It is *impossible* to have UTF-16 as the locale character set, not an error. Your statement is like saying "it is an error to breathe in the vacuum".
I realize this is not useful, so maybe you don't need to mention it. However, it certainly is possible to set LANG with an absurd, or merely dangerous, encoding.
In any case, the discussion says
# Encodings that are not compatible with ASCII are not supported by # this specification; bytes in the ASCII range that fail to decode # will cause an exception. It is widely agreed that such encodings # should not be used as locale charsets.
Which is your excuse for not supporting Shift JIS fully. It doesn't stop people from setting LC_ALL=ja_JP.shift_jis, or using Shift JIS as the default encoding for certain media.
Lino Mastrodomenico writes:
2009/5/5 Stephen J. Turnbull:
Third, it is not clear to me why non-decodable ASCII should be an error.
The PEP originally allowed the conversion to U+DCxx of bytes below 128 that cannot be decoded by the encoding used, but this creates potential security problems.
See: http://mail.python.org/pipermail/python-dev/2009-April/089102.html
Yeah, yeah, this is the same old same old from PEP 3131. Anything that handles the various attacks based on ASCII-alike characters should at least rule out invalid Unicode, too! And where is this U+DC2F supposed to be coming from, anyway? The user's *local* environment or the user's *local* filesystem! Codecs not using 'utf8b' can't produce it, so the only other cases are chr() and \u literals in the *local* process, or an already broken module in your code. I really can't imagine that any sane programmer these days would be using 'utf8b' on bytes received from the Internet! Of course I can't prove that there's no vector for an exploit here (in fact, I'm sure there is one with sufficiently careless handling of input), but I think "consenting adults" covers the Shift JIS use case. Make it an option, but it should be explicitly part of the PEP.
"Martin v. Löwis" writes:
Done: the Python-Version header already clarifies that point.
Ah, OK. I wish my day job required reading more PEPs so I'd be more familiar with these formalities. :-)
Second, I suggest "surrogate-replace" as the name of the error handler rather than "utf8b".
I think this is bike-shedding.
I don't personally care (I already was aware of UTF-8B), but there are plenty of others who do. I think that's a good name to make Marc-Andre and Terry happier. You have to fix the existing uses of the obsolete "python-escape", anyway.
It's a security risk. If U+DCXX would map to \xXX, then somebody could embed U+DC2E U+DC2E U+DC2F into a character string; even if this gets sanitized, nobody would expect that this will actually access ../
The odds that anybody will actually take notice of U+002E U+002E U+002F in a string are sufficiently small that any number of exploits have already been based on it. I agree that there is some additional risk from this if people make the check for "../" before they prepend "\udc2e\udc2e\udc2f", but I think that risk is very small compared to the pain of having an error handler whose raison d'etre is to not raise exceptions go ahead and raise them anyway. See also my reply to Lino Mastrodomenico. Again, an option is good enough for my purposes as long as interfaces for os.listdir() and the like support setting the error handler (cf. Zooko's proposal), but I think the option should be available.
I tried to understand "surrogate", and it was explained to me that "surrogate" is something that stands for something - but then I would argue that the two subsequent codes form a surrogate - they stand for something else. The individual surrogate code (in Unicode terminology) doesn't stand for anything. So don't you agree that it is the Unicode terminology that is in error, not the PEP?
Plausibly so. Keep making comments like that and nobody will ever let you off the hook for being a non-native speaker! However, "surrogate" in English is typically used in situations that are too complex to be covered by simply "substitution." I've always read "surrogate" as "alternative form of encoding", and "surrogate code point" as "code point in that alternative form of encoding". Where it's an alternative to code-point-is-scalar-value. I think probably the authors of the terminology just made the best of a bad situation, I can't think of a better single word for this.
No. The specification puts no requirements on applications whatsoever. So if you propose to use MUST NOT in the RFC 2119 sense, I strongly disagree.
I do propose that. But you're writing the PEP, so this battle will have to be deferred. Eventually Python will have to take a stand on Unicode conformance, but it's not urgent yet.
3. In the discussion, the transition from the example of alternative use of 'python-escape' to discussion of the error handler interface extension is a bit abrupt. I suggest rewriting as:
"""The extension to the encode error handler interface proposed by this PEP is necessary to implement the 'utf8b' error handler, because there are required byte sequences which cannot be generated from replacement Unicode. However, the encode error handler interface presently requires replacement Unicode to be provided in lieu of the non-encodable Unicode from the source string. Then it promptly encodes that replacement Unicode. In some error handlers, such as the 'utf8b' proposed here, it is also simpler and more efficient for the error handler to provide a pre-encoded replacement byte string, rather than forcing it to calculating Unicode from which the encoder would create the desired bytes."""
Unfortunately, I failed to understand where you want this text to go. What paragraphs should I remove, or (if none), after which paragraph should I insert this text?
Sorry! I suggest substituting the paragraph above for the paragraph which begins "The encode error handler interface presently requires..." at line 129. I think I forgot to do this before: "I hereby dedicate all text I suggest for inclusion in the PEP to the public domain."
The name "utf8b" suggested in the PEP is not in line with the codec design
Where is that design documented, and how exactly does the name violate the design (chapter and verse, please)?
Error handlers and codecs are two different things, so the namespaces need to be clearly separate.
They *are* separate namespaces; that's guaranteed by the implementation. Regards, Martin
Stephen J. Turnbull wrote:
"Martin v. Löwis" writes:
It occurs to me that the PEP maybe should say that it is an error to have your POSIX locale set to UTF-16 or something like that.
No. It is *impossible* to have UTF-16 as the locale character set, not an error. Your statement is like saying "it is an error to breathe in the vacuum".
I realize this is not useful, so maybe you don't need to mention it. However, it certainly is possible to set LANG with an absurd, or merely dangerous, encoding.
How so? The C library will filter it out.
In any case, the discussion says
# Encodings that are not compatible with ASCII are not supported by # this specification; bytes in the ASCII range that fail to decode # will cause an exception. It is widely agreed that such encodings # should not be used as locale charsets.
Which is your excuse for not supporting Shift JIS fully. It doesn't stop people from setting LC_ALL=ja_JP.shift_jis,
Well, it *does* stop them from doing so if their systems don't support the locale setting. In any case, if they do this, PEP 383 will not support them.
or using Shift JIS as the default encoding for certain media.
I fail to see how this could ever matter. If, by "media", you mean things like removable disks, and the file name encoding used on them, it's fairly irrelevant for the PEP, since Python won't start using Shift JIS as its file system encoding just because that's the encoding used on the disk. Regards, Martin
Second, I suggest "surrogate-replace" as the name of the error handler rather than "utf8b".
I think this is bike-shedding.
I don't personally care (I already was aware of UTF-8B), but there are plenty of others who do.
I think it is a fairly bad name, because it is easy to confuse it with the "surrogates" error handler (unless you suggest to rename that also).
You have to fix the existing uses of the obsolete "python-escape", anyway.
Indeed - but only in the PEP. In the implementation, it's already utf8b throughout. Now it is also in the PEP; thanks for pointing that out.
It's a security risk. If U+DCXX would map to \xXX, then somebody could embed U+DC2E U+DC2E U+DC2F into a character string; even if this gets sanitized, nobody would expect that this will actually access ../
The odds that anybody will actually take notice of U+002E U+002E U+002F in a string are sufficiently small that any number of exploits have already been based on it. I agree that there is some additional risk from this if people make the check for "../" before they prepend "\udc2e\udc2e\udc2f", but I think that risk is very small compared to the pain of having an error handler whose raison d'etre is to not raise exceptions go ahead and raise them anyway.
The problem is that functions like normpath will recognize ../, and that applications rely on them for file name sanitation. If they could be tricked into writing outside of their target folders, this would be a huge security risk. OTOH, I don't care about breaking applications on misconfigured systems. People using SJIS as their locale encodings have bigger problems than Python raising exceptions.
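The concern can be made concrete with a small sketch. It assumes a deliberately unsafe handler, here called 'unsafe-escape-demo' (a made-up name, as is the function unsafe_escape), that maps every U+DC00..U+DCFF code point back to a byte, including the ASCII range. This is the behaviour the PEP rejects, shown only to illustrate the risk; posixpath is used so the result does not depend on the platform.

```python
import codecs
import posixpath


def unsafe_escape(exc):
    """Hypothetical encode handler: maps ALL of U+DC00..U+DCFF to bytes."""
    if not isinstance(exc, UnicodeEncodeError):
        raise exc
    out = bytearray()
    for ch in exc.object[exc.start:exc.end]:
        cp = ord(ch)
        if not 0xDC00 <= cp <= 0xDCFF:
            raise exc
        out.append(cp - 0xDC00)  # U+DC2E -> 0x2E ('.'), U+DC2F -> 0x2F ('/')
    return bytes(out), exc.end


codecs.register_error("unsafe-escape-demo", unsafe_escape)

name = "\udc2e\udc2e\udc2fetc"

# String-level sanitation sees one odd-looking component and nothing to fix.
normalized = posixpath.normpath(name)
has_dotdot = ".." in name.split("/")

# The bytes that would reach the operating system tell a different story.
raw = name.encode("utf-8", "unsafe-escape-demo")
```

Here normalized equals name and has_dotdot is false, while raw is the byte string for "../etc". Restricting the handler to bytes of 128 and above closes this particular hole.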
See also my reply to Lino Mastrodomenico.
URL?
But you're writing the PEP, so this battle will have to be deferred. Eventually Python will have to take a stand on Unicode conformance, but it's not urgent yet.
I think it's always applications that are conforming or not, rather than libraries. Libraries should allow to write conforming applications. They may refuse to write certain non-conforming applications (although users then replace the library with one that does allow them to do what they want). Libraries can never enforce that applications conform to some standard.
Sorry! I suggest substituting the paragraph above for the paragraph which begins "The encode error handler interface presently requires..." at line 129.
Ah, ok. This was Glenn Linderman's text before - now it's yours :-)
I think I forgot to do this before: "I hereby dedicate all text I suggest for inclusion in the PEP to the public domain."
:-) Martin
Yeah, yeah, this is the same old same old from PEP 3131. Anything that handles the various attacks based on ASCII-alike characters should at least rule out invalid Unicode, too!
And where is this U+DC2F supposed to be coming from, anyway? The user's *local* environment or the user's *local* filesystem!
Why is that not a threat? Suppose you have a setuid application, and you pass some string on the command line that decodes to /../. Then the setuid application will be tricked into modifying files it didn't mean to modify. Likewise, it might come from a relational database. Use a relational database that supports unicode code units, or lone surrogates through utf-8, and fill in some bogus data. Then have the Python application (running as root) read it.
Of course I can't prove that there's no vector for an exploit here (in fact, I'm sure there is one with sufficiently careless handling of input), but I think "consenting adults" covers the Shift JIS use case. Make it an option, but it should be explicitly part of the PEP.
Nothing is lost at the moment. If users complain, we can still think of ways to enhance the experience. In any case, Python 3.1b1 may get released today, so it's way too late for new features in the PEP. They can wait for Python 3.2. Regards, Martin
Martin v. Löwis
I don't personally care (I already was aware of UTF-8B), but there are plenty of others who do.
I think it is a fairly bad name, because it is easy to confuse it with the "surrogates" error handler (unless you suggest to rename that also).
I didn't bother to say it at the time, but I think "surrogates" is a pretty bad name. It should be more indicative of what it does, e.g. "surrogates-pass", or "surrogates-accept".
It's a security risk. If U+DCXX would map to \xXX, then somebody could embed U+DC2E U+DC2E U+DC2F into a character string; even if this gets sanitized, nobody would expect that this will actually access ../
Agreed this is an annoying security breach. The whole point of the PEP is that application developers do not have to care about filename encoding issues, which is defeated if they have to check for strange (illegal) combinations of characters. By the way, what are the ASCII characters that are not supported by Shift-JIS? Not many I suppose? (if I read the Wikipedia entry correctly, it's only the backslash and the tilde). Regards Antoine.
"Martin v. Löwis" writes:
I fail to see how this could ever matter. If, by "media", you mean things like removable disks, and the file name encoding used on them, it's fairly irrelevant for the PEP, since Python won't start using Shift JIS as its file system encoding just because that's the encoding used on the disk.
I'm sorry for the lack of clarity of my posts, but somehow you're completely missing the point. The point is precisely that Python *won't* use Shift JIS as the file system encoding (if it did there would be no problem with reading Shift JIS), but the people who created the media *did*. Now, with Python's file system encoding == UTF-8 or any packed EUC, and more than a handful of Shift JIS or Big5 characters in file names, one is *almost certain* to encounter ASCII as the second byte of a multibyte sequence. PEP 383 can't handle this, but it is sure to be the most common use case for PEP 383 in East Asia.
Martin v. Löwis wrote:
The name "utf8b" suggested in the PEP is not in line with the codec design
Where is that design documented, and how exactly does the name violate the design (chapter and verse, please)?
Martin, I designed the whole Python codec machinery, so even if this is not explicitly written down somewhere, you can take my word for it. I don't want users to be confused by such an error handler name, so please change it!
Here's a list of the currently available error handlers (taken from codecs.py):
The .encode()/.decode() methods may use different error handling schemes by providing the errors argument. These string values are predefined:
'strict' - raise a ValueError error (or a subclass)
'ignore' - ignore the character and continue with the next
'replace' - replace with a suitable replacement character; Python will use the official U+FFFD REPLACEMENT CHARACTER for the builtin Unicode codecs on decoding and '?' on encoding.
'xmlcharrefreplace' - Replace with the appropriate XML character reference (only for encoding).
'backslashreplace' - Replace with backslashed escape sequences (only for encoding).
The set of allowed values can be extended via register_error.
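As a quick illustration of the predefined handlers listed above, the following sketch encodes one non-ASCII string to ASCII with each of them. The sample string and variable names are chosen for this example only.

```python
text = "caf\u00e9"  # 'é' cannot be encoded to ASCII

# Each handler deals with the unencodable character differently.
results = {
    name: text.encode("ascii", name)
    for name in ("ignore", "replace", "xmlcharrefreplace", "backslashreplace")
}

# 'strict' raises a subclass of ValueError instead of returning bytes.
try:
    text.encode("ascii", "strict")
    strict_error = None
except ValueError as exc:
    strict_error = type(exc).__name__

# On decoding, 'replace' substitutes U+FFFD for the undecodable byte.
decoded = b"caf\xe9".decode("utf-8", "replace")
```

Three of the five names end in "replace", which is the naming scheme being argued for in this message.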
Error handlers and codecs are two different things, so the namespaces need to be clearly separate.
They *are* separate namespaces; that's guaranteed by the implementation.
In the implementation, yes, but not in the head of a typical user: the 'utf8b' looks more like a codec name than an error handler name. I want to avoid any such confusion with Python codecs and don't understand why you are making a problem out of this. -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, May 06 2009)
M.-A. Lemburg wrote:
Martin v. Löwis wrote:
The name "utf8b" suggested in the PEP is not in line with the codec design
Where is that design documented, and how exactly does the name violate the design (chapter and verse, please)?
Martin, I designed the whole Python codec machinery, so even if this is not explicitly written down somewhere, you can take my word for it.
I don't want users to be confused by such an error handler name, so please change it !
Here's a list of the currently available error handlers (taken from codecs.py):
The .encode()/.decode() methods may use different error handling schemes by providing the errors argument. These string values are predefined:
'strict' - raise a ValueError error (or a subclass)
'ignore' - ignore the character and continue with the next
'replace' - replace with a suitable replacement character; Python will use the official U+FFFD REPLACEMENT CHARACTER for the builtin Unicode codecs on decoding and '?' on encoding.
'xmlcharrefreplace' - Replace with the appropriate XML character reference (only for encoding).
'backslashreplace' - Replace with backslashed escape sequences (only for encoding).
The set of allowed values can be extended via register_error.
Error handlers and codecs are two different things, so the namespaces need to be clearly separate.
They *are* separate namespaces; that's guaranteed by the implementation.
In the implementation, yes, but not in the head of a typical user: the 'utf8b' looks more like a codec name than an error handler name.
Judging by the existing names, I think that 'surrogate' would be reasonable. It already contains the meaning of substitute, it's not too long, and the codes which act as replacements are already called surrogates.
I want to avoid any such confusion with Python codecs and don't understand why you are making a problem out of this.
MRAB
Judging by the existing names, I think that 'surrogate' would be reasonable. It already contains the meaning of substitute,
Only if you are a native English-speaker I suppose... For me it's just a technical term denoting a certain class of unicode code points (I'm not sure of the latter terminology ;-)). Regards Antoine.
2009/5/6 Antoine Pitrou
By the way, what are the ASCII characters that are not supported by Shift-JIS? Not many I suppose? (if I read the Wikipedia entry correctly, it's only the backslash and the tilde).
The biggest problem with Shift-JIS is that a perfectly valid unicode character above 127 can be encoded to a byte sequence that includes bytes in range(128). E.g. the character 掛 (a.k.a. '\u639b') when encoded with Shift-JIS becomes the two-byte sequence b'\x8a|'. Notice that the second byte is 124, which on POSIX is usually interpreted as the pipe character and can have security implications. It's a known problem with Shift-JIS and was fixed in UTF-8. -- Lino Mastrodomenico
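The example above can be checked directly with Python's 'shift_jis' codec. The variable names are made up for this sketch; the UTF-8 comparison is added only to show the contrast mentioned at the end of the message.

```python
ch = "\u639b"  # 掛

sjis = ch.encode("shift_jis")
utf8 = ch.encode("utf-8")

# Bytes of the multibyte sequence that fall in the ASCII range.
sjis_ascii_bytes = [b for b in sjis if b < 128]
utf8_ascii_bytes = [b for b in utf8 if b < 128]
```

With Shift-JIS the trailing byte is 0x7C, the ASCII pipe character. In UTF-8 every byte of a multibyte sequence is 128 or above, so no such overlap occurs.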
On Wed, May 6, 2009 at 09:31, "Martin v. Löwis"
They *are* separate namespaces; that's guaranteed by the implementation.
Yes. But utf8b *sounds like* an encoding. When it isn't. I sure thought it was when it was first mentioned. I agree that it would be better to find another name. 'utf8-binary-replace'? Is it only usable with utf8 as an encoding? -- Lennart Regebro: Python, Zope, Plone, Grok http://regebro.wordpress.com/ +33 661 58 14 64
"Martin v. Löwis" writes:
Yeah, yeah, this is the same old same old from PEP 3131. Anything that handles the various attacks based on ASCII-alike characters should at least rule out invalid Unicode, too!
And where is this U+DC2F supposed to be coming from, anyway? The user's *local* environment or the user's *local* filesystem!
Why is that not a threat? Suppose you have a setuid application, and you pass some string on the command line that decodes to /../. Then the setuid application will be tricked into modifying files it didn't mean to modify.
Of course this is a threat, assuming that the application takes no precautions. But first, it should be stopped by any of several standard precautions. For example, applying os.path.realpath (come to think of it, PEP 383 should say something about realpath, shouldn't it?) and os.path.normpath (PEP 383 should definitely say something about this function; maybe PEP 3131 should, too) before checking access restrictions. If you're not running your paths through those, you're already vulnerable to symlink attacks, and maybe other forms of spoofing. Second, it's a threat already enabled by your restricted version of PEP 383. Access control applies to subdirectories as well as to parent directories. Since you can insert arbitrary non-ASCII bytes into the path using the current definition of 'utf8b', name-based access restrictions can be bypassed in exactly the same way for any directory whose name is not 100.00% ASCII, and the setuid application will be tricked into modifying files it didn't mean to modify. Also, on Mac OS X, system directories, including directories containing system libraries, frameworks, and executables, may be accessible via locale-specific names (I don't have a Japanese-localized Mac at hand to check, but I'm pretty sure in my old Mac the Japanese names appeared in ls in Terminal.app, which means it may be possible to access system directories containing libraries, frameworks, and executables this way). Those can be spoofed in exactly the same way.
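The "standard precautions" mentioned above might look roughly like the following. This is a minimal sketch, not something taken from the PEP or from this thread: the helper name is_inside is hypothetical, and it assumes os.path.commonpath is available (Python 3.4 or later).

```python
import os


def is_inside(base_dir, user_path):
    """Return True if user_path, resolved against base_dir, stays inside it.

    Both paths are resolved with os.path.realpath first, so '..' components
    and symlinks are collapsed before the containment check is made.
    """
    base = os.path.realpath(base_dir)
    target = os.path.realpath(os.path.join(base, user_path))
    return os.path.commonpath([base, target]) == base
```

The check is made on the resolved path rather than on the string the user supplied, which is the ordering argued for above: normalize first, then test access restrictions.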
Nothing is lost at the moment.
Nothing is lost compared to 'strict', true, but under the PEP as it is a large fraction of Shift JIS and Big5 filenames cannot be read under ASCII-compatible file system encodings using 'utf8b'. Yet it is those users who are placed at risk by PEP 383.
In any case, Python 3.1b1 may get released today, so it's way too late for new features in the PEP. They can wait for Python 3.2.
You have convinced me that the PEP should wait as well. In its current form it is incomplete and dangerous.
Stephen J. Turnbull
Nothing is lost compared to 'strict', true, but under the PEP as it is a large fraction of Shift JIS and Big5 filenames cannot be read under ASCII-compatible file system encodings using 'utf8b'.
You should really be more specific. I'm not sure about others, but I don't understand what filenames you are talking about.
On May 6, 2009, at 7:33 AM, Stephen J. Turnbull wrote:
You have convinced me that the PEP should wait as well.
In its current form it is incomplete and dangerous.
+1 on delaying PEP 383 I think PEP 383 is a good idea in principle, but I'm still struggling to understand it myself, and it seems to offer new hazards for the unwary programmer. On the other hand, maybe the wary programmers are waiting for Python 3.2 anyway <wink>. On the gripping hand, if PEP 383 is released in Python 3.1, will that obligate python-dev to support it indefinitely, at least in backwards-compatibility mode? I'm not thinking of API compatibility as much as data compatibility -- someone used Python 3.1 to write down some filenames, and now a few years later they are trying to use the latest and greatest Python release to read those filenames... Regards, Zooko
On Wed, 6 May 2009 at 13:40, Antoine Pitrou wrote:
Stephen J. Turnbull writes:
Nothing is lost compared to 'strict', true, but under the PEP as it is a large fraction of Shift JIS and Big5 filenames cannot be read under ASCII-compatible file system encodings using 'utf8b'.
You should really be more specific. I'm not sure about others, but I don't understand what filenames you are talking about.
Seems to me that the best thing to do would be to file a bug report with test cases that demonstrate the problems when run against the current py3k trunk. Especially the security issues you cite (which I don't understand). --David
On May 6, 2009, at 5:39 AM, Stephen J. Turnbull wrote:
Now, with Python's file system encoding == UTF-8 or any packed EUC, and more than a handful of Shift JIS or Big5 characters in file names, one is *almost certain* to encounter ASCII as the second byte of a multibyte sequence. PEP 383 can't handle this
Hm, I haven't tried the implementation, but I thought that what would happen is: '\x85a'.decode('utf-8', 'utf8b/surrogate-replace/whateveritscalled') -> u'\uDC85a'
If that indeed doesn't happen, that's certainly a defect and should be remedied.
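The expectation above can be tried in a released Python 3, with one caveat: the handler this thread calls 'utf8b' is spelled 'surrogateescape' there. That name is not from this thread and is used here only so the sketch runs; the variable names are likewise made up.

```python
# The case described above: a non-decodable byte followed by an ASCII byte.
raw = b"\x85a"
text = raw.decode("utf-8", "surrogateescape")
back = text.encode("utf-8", "surrogateescape")

# The Shift-JIS file name from earlier in the thread, read as if it were UTF-8.
sjis_name = "\u639b".encode("shift_jis")
sjis_text = sjis_name.decode("utf-8", "surrogateescape")
sjis_back = sjis_text.encode("utf-8", "surrogateescape")
```

In both cases only the byte of 128 or above is escaped to a lone surrogate, the ASCII byte decodes as itself, and encoding with the same handler restores the original bytes.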
, but it is sure to be the most common use case for PEP 383 in East Asia.
Yes. James
Zooko Wilcox-O'Hearn
I'm not thinking of API compatibility as much as data compatibility -- someone used Python 3.1 to write down some filenames, and now a few years later they are trying to use the latest and greatest Python release to read those filenames...
Well, if the filenames are generated by Python (as opposed to read from an existing directory on disk), they should be regular unicode objects without any lone surrogates, so I don't see the compatibility problem. Regards Antoine.
On approximately 5/6/2009 6:33 AM, came the following characters from the keyboard of Stephen J. Turnbull:
"Martin v. Löwis" writes:
In any case, Python 3.1b1 may get released today, so it's way too late for new features in the PEP. They can wait for Python 3.2.
You have convinced me that the PEP should wait as well.
In its current form it is incomplete and dangerous.
I see nothing in this thread that suggests that the PEP is dangerous in its current form. While I (still) think that more readable transcodings could have been used, and while I had difficulty fully understanding the PEP at first, now that I think I do understand the PEP, and it has been somewhat clarified and amended, I cannot see how it could be dangerous. A specific case of danger should be included with such a statement. Regarding incomplete, I agree it won't brush my teeth for me, but I think it does solve the problem it sets out to solve. -- Glenn -- http://nevcal.com/ =========================== A protocol is complete when there is nothing left to remove. -- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking
On approximately 5/6/2009 3:08 AM, came the following characters from the keyboard of MRAB:
M.-A. Lemburg wrote:
Martin v. Löwis wrote:
Judging by the existing names, I think that 'surrogate' would be reasonable. It already contains the meaning of substitute, it's not too long, and the codes which act as replacements are already called surrogates.
I want to avoid any such confusion with Python codecs and don't understand why you are making a problem out of this.
+1 for "surrogate" as the name for the error handler. -- Glenn -- http://nevcal.com/
On approximately 5/6/2009 12:53 AM, came the following characters from the keyboard of Martin v. Löwis:
Sorry! I suggest substituting the paragraph above for the paragraph which begins "The encode error handler interface presentlyrequires..." at line 129.
Ah, ok. This was Glenn Linderman's text before - now it's yours :-)
Which is fine by me. Stephen's is more explanatory than mine, but says the same thing. -- Glenn -- http://nevcal.com/
Glenn Linderman wrote:
On approximately 5/6/2009 3:08 AM, came the following characters from the keyboard of MRAB:
M.-A. Lemburg wrote:
Martin v. Löwis wrote:
Judging by the existing names, I think that 'surrogate' would be reasonable. It already contains the meaning of substitute, it's not too long, and the codes which act as replacements are already called surrogates.
I want to avoid any such confusion with Python codecs and don't understand why you are making a problem out of this.
+1 for "surrogate" as the name for the error handler.
+1 from me also
On May 6, 2009, at 10:54 AM, Antoine Pitrou wrote:
Zooko Wilcox-O'Hearn writes:
I'm not thinking of API compatibility as much as data compatibility -- someone used Python 3.1 to write down some filenames, and now a few years later they are trying to use the latest and greatest Python release to read those filenames...
Well, if the filenames are generated by Python (as opposed to read from an existing directory on disk), they should be regular unicode objects without any lone surrogates, so I don't see the compatibility problem.
I meant that the application reads filenames from an existing directory on disk, saves those filenames, and then later, using a future version of Python, wants to read them and use them. I'm not saying that I know this would be a problem. I'm saying that I personally can't tell whether it would be a problem or not, and the extensive discussions so far have not convinced me that there is anyone who both understands PEP 383 and considers this use case. Many people who apparently understand encoding issues well have said something to the effect that there is no problem, but those people haven't yet managed to get through my thick skull how I would use PEP 383 safely for this sort of use case -- the one where data generated by os.listdir() travels forward in time or the one where that data travels sideways to other systems, including Windows or other systems that validate incoming unicode. That's why I am a bit uncomfortable about PEP 383 being quickly implemented and deployed in Python 3.1. By the way, much of the detailed discussion about what Tahoe requires and how that may or may not benefit from PEP 383 has now moved to the tahoe-dev mailing list: http://allmydata.org/cgi-bin/mailman/listinfo/tahoe-dev . Regards, Zooko
On approximately 5/6/2009 12:18 PM, came the following characters from the keyboard of Zooko Wilcox-O'Hearn:
On May 6, 2009, at 10:54 AM, Antoine Pitrou wrote:
Zooko Wilcox-O'Hearn
writes: I'm not thinking of API compatibility as much as data compatibility -- someone used Python 3.1 to write down some filenames, and now a few years later they are trying to use the latest and greatest Python release to read those filenames...
Well, if the filenames are generated by Python (as opposed to read from an existing directory on disk), they should be regular unicode objects without any lone surrogates, so I don't see the compatibility problem.
I meant that the application reads filenames from an existing directory on disk, saves those filenames, and then later, using a future version of Python, wants to read them and use them.
Regarding future versions of Python. In the worst case, even if Python's default behavior changes, the transcoding done by PEP 383 can be done in other software too... it is a straightforward, fully specified, 1-to-1, reversible transcoding process, affecting and generating only invalid byte encodings on one side, and invalid Unicode sequences on the other. So if Python's default behavior should change, the transcoding implemented by PEP 383 could be easily reimplemented to enable a future version of a Python application to manipulate the transcoded, saved, filenames. By easily, I mean that I could code it in a couple hours, max.
I'm not saying that I know this would be a problem. I'm saying that I personally can't tell whether it would be a problem or not, and the extensive discussions so far have not convinced me that there is anyone who both understands PEP 383 and considers this use case.
Does the above help?
Many people who apparently understand encoding issues well have said something to the effect that there is no problem, but those people haven't yet managed to get through my thick skull how I would use PEP 383 safely for this sort of use case -- the one where data generated by os.listdir() travels forward in time or the one where that data travels sideways to other systems, including Windows or other systems that validate incoming unicode.
Regarding data traveling sideways, some comments: 1) PEP 383's effect could be recoded in other languages as easily as it is in Python (or the C in which Python is implemented). So that could be a solution. 2) You mention "Windows" and "other systems that validate incoming unicode" in the same phrase, as if you think that "Windows" qualifies as an "other systems that validate incoming unicode", but it does not (at least not universally).
That's why I am a bit uncomfortable about PEP 383 being quickly implemented and deployed in Python 3.1.
Does the above help?
By the way, much of the detailed discussion about what Tahoe requires and how that may or may not benefit from PEP 383 has now moved to the tahoe-dev mailing list: http://allmydata.org/cgi-bin/mailman/listinfo/tahoe-dev .
I have no background with Tahoe, nor particular interest, although it sounds like a useful project... so I won't be joining that list. I have no idea if there is an installed base of existing Tahoe file systems, my suggestions below assume that there is not, and that you are presently inventing them. Therefore, I provide no migration path, although I could invent one, but it would take longer to describe. However, since I'm responding here, and have read what you have posted here, it seems like the following could be true.
Assumptions from your emails:
A) Tahoe wants to provide a UTF-8 file name system
B) Tahoe wants to interface to POSIX systems that use (and do not validate) byte interfaces.
C) Tahoe wants to interface to non-POSIX systems that use 16-bit file name interfaces, with no validation.
D) Tahoe wants to interface to non-POSIX systems that use 16-bit file name interfaces, with validation.
Uncertainties: I'm not clear on what your goals are for Tahoe filenames. There seem to be 2 possibilities:
1) you want to reject attempts to use non-validating Unicode, be it from a 16-bit interface, or a bytes interface.
2) you don't want to reject non-validating Unicode, but you want to convert it to valid Unicode for (D) systems.
3) Orthogonally, you might want to store only Valid Unicode in the names, or you might not care, if you can meet the other goals.
Truisms: If you want to support (D), and (2), then you must transform names at some point, using some scheme, because not all names supplied by (B) systems will be acceptable to (D) systems. You can choose to do this transformation when a (B) system provides an invalid (per Unicode) name, or you can choose to do the transformation when a (D) system accesses a file with an invalid (per Unicode) name.
If the (B) and (D) systems talk to each other outside of Tahoe, they will have to do similar transformations, or, if they both access the same Tahoe system, they will have to do the identical transformation, to be sure that they can access the same file. All transcoding schemes have the possibility of data puns between non-transcoded names and transcoded names. In order to successfully and properly manipulate a name, you must know whether or not it has been transcoded, and how. PEP 383 limits its transcoding to names that are invalid (per Unicode). Names that cannot be properly decoded to Unicode are decoded to invalid Unicode. Names that are invalid Unicode are encoded to invalid byte sequences (per the encoding scheme specified). For PEP 383 and Python, transcoded names can be distinguished by checking for the existence of lone surrogates in the str form of the filename, or by attempting to do a strict decoding of the bytes form of the filename, depending on what you have (generally, the former). For PEP 383 and Python, the names will round trip from the POSIX bytes interfaces to the program, and back to POSIX bytes interfaces, as long as only Python wrappers of system functions are used, and the filesystem encoding is not changed between calls (or is restored). Passing them to 3rd party libraries or other systems requires extra work, if there is a desire to manipulate files with names that are not decodeable to Unicode by the standard decoding algorithm for that encoding. -- Glenn -- http://nevcal.com/ =========================== A protocol is complete when there is nothing left to remove. -- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking
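The two checks described above (looking for lone surrogates in the str form, or attempting a strict decode of the bytes form) can be sketched roughly as follows. This is only an illustration: the helper names has_escaped_bytes and is_strictly_decodable are hypothetical, the sample filename is made up, and the handler is spelled "surrogateescape", the name under which it later shipped in CPython, rather than the "utf8b" used in this thread.

```python
def has_escaped_bytes(name: str) -> bool:
    """Return True if the str contains code points U+DC80..U+DCFF,
    the lone surrogates PEP 383 uses to smuggle undecodable bytes."""
    return any(0xDC80 <= ord(ch) <= 0xDCFF for ch in name)


def is_strictly_decodable(raw: bytes, encoding: str = "utf-8") -> bool:
    """Return True if the bytes decode cleanly without any error handler."""
    try:
        raw.decode(encoding, "strict")
    except UnicodeDecodeError:
        return False
    return True


# Hypothetical filename: Latin-1 bytes that are not valid UTF-8.
raw_name = b"caf\xe9.txt"
name = raw_name.decode("utf-8", "surrogateescape")

print(has_escaped_bytes(name))          # the str form carries an escape
print(is_strictly_decodable(raw_name))  # the bytes form fails strict decoding
```

Either check tells a program that the name needs the same handler on the way back out before it can be handed to a bytes interface again.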
I'm sorry for the lack of clarity of my posts, but somehow you're completely missing the point. The point is precisely that Python *won't* use Shift JIS as the file system encoding (if it did there would be no problem with reading Shift JIS), but the people who created the media *did*.
Now, with Python's file system encoding == UTF-8 or any packed EUC, and more than a handful of Shift JIS or Big5 characters in file names, one is *almost certain* to encounter ASCII as the second byte of a multibyte sequence. PEP 383 can't handle this
Not true. PEP 383 handles this very example just fine, with no problems that I can see. Can you propose a specific example that you think might cause problems? By "specific", I mean: what file names (exact bytes, please), what locale charset, what API calls. Regards, Martin
The name "utf8b" suggested in the PEP is not in line with the codec design
Where is that design documented, and how exactly does the name violate the design (chapter and verse, please)?
Martin, I designed the whole Python codec machinery
Not true. PEP 293 was written and designed by Walter Dörwald.
so even if this is not explicitly written down somewhere, you can take my word for it.
If the design was specified in writing somewhere, I would probably challenge it as obsolete. If it isn't described anywhere, I'll have to ignore it.
I want to avoid any such confusion with Python codecs and don't understand why you are making a problem out of this.
Because utf8b (or, perhaps "UTF-8b") is the official name for this algorithm: http://hyperreal.org/~est/utf-8b/ Regards, Martin
Terry Reedy wrote:
Glenn Linderman wrote:
On approximately 5/6/2009 3:08 AM, came the following characters from the keyboard of MRAB:
M.-A. Lemburg wrote:
Martin v. Löwis wrote:
Judging by the existing names, I think that 'surrogate' would be reasonable. It already contains the meaning of substitute, it's not too long, and the codes which act as replacements are already called surrogates.
I want to avoid any such confusion with Python codecs and don't understand why you are making a problem out of this.
+1 for "surrogate" as the name for the error handler.
+1 from me also
Despite there being also an error handler called "surrogates". Are you serious? Regards, Martin
Martin v. Löwis
Despite there being also an error handler called "surrogates".
People, perhaps we could end all the bikeshedding and call one of those handlers "surrogates-pass" and the other "surrogates-escape", which sounds quite faithful to what they actually /do/? Regards Antoine.
But first, it should be stopped by any of several standard precautions. For example, applying os.path.realpath (come to think of it, PEP 383 should say something about realpath, shouldn't it?)
Why do you think so? I think the existing documentation of realpath is correct and complete.
and os.path.normpath (PEP 383 should definitely say something about this function
Precisely what?
maybe PEP 3131 should, too)
How can this be of relevance?
Nothing is lost at the moment.
Nothing is lost compared to 'strict', true, but under the PEP as it is a large fraction of Shift JIS and Big5 filenames cannot be read under ASCII-compatible file system encodings using 'utf8b'. Yet it is those users who are placed at risk by PEP 383.
I think this statement is incorrect. Those filenames *can* be read just fine. Regards, Martin
Antoine Pitrou wrote:
Martin v. Löwis
writes: Despite there being also an error handler called "surrogates".
People, perhaps we could end all the bikeshedding and call one of those handlers "surrogates-pass" and the other "surrogates-escape", which sounds quite faithful to what they actually /do/?
The problem with these bike-shedding discussions is that you cannot stop them with a proposal. People will counter-propose. I would be willing to accept a ruling from someone who a) is a native speaker of English, and b) has demonstrated to fully understand what these do, and c) has understood why I insist on calling it utf8b. Regards, Martin
Martin v. Löwis wrote:
+1 for "surrogate" as the name for the error handler.
+1 from me also
Despite there being also an error handler called "surrogates".
Given that additional information which MAL apparently omitted, I would revise.
Are you serious?
Are you? ;-? You are the one naming a codec-agnostic error handler (if I understand correctly, and correct me if I do not) after a particular codec, and denying that that could cause confusion. See other message. Terry Jan Reedy
2009/5/6 Antoine Pitrou
Martin v. Löwis
writes: Despite there being also an error handler called "surrogates".
People, perhaps we could end all the bikeshedding and call one of those handlers "surrogates-pass" and the other "surrogates-escape", which sounds quite faithful to what they actually /do/?
We could also stop the bikeshedding by sticking with the name utf8b. Martin's comment that it is the official name for this algorithm seems compelling to me (even if it is confusing because of its similarity with utf-8). Paul.
Martin v. Löwis wrote:
Because utf8b (or, perhaps "UTF-8b") is the official name for this algorithm: http://hyperreal.org/~est/utf-8b/
Thank you for the link. It starts: "This directory contains a C implementation of a UTF-8b codec. A Python codec based on it is provided as well." 'UTF-8b' consists, obviously, of 'UTF-8' plus 'b', with the 'b' signifying a variation of or addition to UTF-8. The 'b', and only the 'b', refers to the innovative error-handler that was added to the existing 'UTF-8' codec/algorithm. The name of the combined whole is not the name of the part. If you were incorporating the Python-wrapped utf-8b *codec* as a codec, which is what I once thought *because you used that name*, then calling it 'utf-8b' would be fine. But you apparently instead proposed and implemented an *error-handler*, which seems to me to be something else, and which will not be specific to utf-8 but usable with any codec. Hence some of us think it should have a different name. I gather that you lifted the error-handler part of the algorithm and propose to use it with *any* ascii-respecting codec. I could claim that the 'official name' of that part is 'b', but I think we can find a better name. Terry Jan Reedy
Martin v. Löwis wrote:
Antoine Pitrou wrote:
Martin v. Löwis
writes: Despite there being also an error handler called "surrogates". People, perhaps we could end all the bikeshedding and call one of those handlers "surrogates-pass" and the other "surrogates-escape", which sounds quite faithful to what they actually /do/?
The problem with these bike-shedding discussions is that you cannot stop them with a proposal. People will counter-propose.
I would be willing to accept a ruling from someone who a) is a native speaker of English, and b) has demonstrated to fully understand what these do, and c) has understood why I insist on calling it utf8b.
I qualify with a). I believe I understand c) but, as explained in my other post, I do not think your reason applies. In fact, I think concern for naming rights might suggest that you *not* reuse the name for something different. I would have to learn more about the existing 'surrogates' handler to judge Antoine's suggestion 'surrogates-pass'. 'Surrogates-escape' is pretty good for the new handler since, to my understanding, it 'escapes' 'bad bytes' by prefixing them with bits that push them to the surrogates plane. I have been supportive of the idea and, as well as I understood them, the particulars of your proposal, from the beginning. Reusing the name of a codec as the name of an error-handler confused me and I believe it will confuse others, even though, but also because, the error handler was extracted and generalized from the codec. Terry Jan Reedy
Are you serious?
Are you? ;-? You are the one naming a codec-agnostic error handler (if I understand correctly, and correct me if I do not) after a particular codec, and denying that that could cause confusion. See other message.
I can only repeat what I said before: I call it utf8b because that's the established name for the algorithm it implements. That algorithm was originally designed with UTF-8 in mind (and only meant to be applied for UTF-8), however, it remains the same algorithm even though PEP 383 widens its application. Regards, Martin
Antoine Pitrou wrote:
Martin v. Löwis
writes: Despite there being also an error handler called "surrogates".
People, perhaps we could end all the bikeshedding and call one of those handlers "surrogates-pass" and the other "surrogates-escape", which sounds quite faithful to what they actually /do/?
After having read about the existing error handler called "surrogates" and having thought about it, I've decided that calling one just "surrogates" isn't very helpful to the user; it has something to do with surrogates, but what? So +1 for Antoine's suggestion from me.
I qualify with a). I believe I understand c) but, as explained in my other post, I do not think your reason applies. In fact, I think concern for naming rights might suggest that you *not* reuse the name for something different. I would have to learn more about the existing 'surrogates' handler to judge Antoine's suggestion 'surrogates-pass'. 'Surrogates-escape' is pretty good for the new handler since, to my understanding, it 'escapes' 'bad bytes' by prefixing them with bits that push them to the surrogates plane.
See issue 3672. In essence, in python 2.5:
    py> u"\ud800".encode("utf-8")
    '\xed\xa0\x80'
    py> '\xed\xa0\x80'.decode("utf-8")
    u'\ud800'
In 3.1,
    py> "\ud800".encode("utf-8")
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    UnicodeEncodeError: 'utf-8' codec can't encode character '\ud800' in position 0: surrogates not allowed
    py> "\ud800".encode("utf-8","surrogates")
    b'\xed\xa0\x80'
    py> b'\xed\xa0\x80'.decode("utf-8")
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    UnicodeDecodeError: 'utf8' codec can't decode bytes in position 0-2: illegal encoding
    py> b'\xed\xa0\x80'.decode("utf-8","surrogates")
    '\ud800'
Regards, Martin
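For readers trying this on a current interpreter, the same behaviour can be reproduced as a small script. One assumption to flag: the handler called "surrogates" in this thread is spelled "surrogatepass" below, which is the name it was eventually released under in CPython; everything else mirrors the session above.

```python
lone = "\ud800"  # a lone high surrogate, not a valid Unicode scalar value

# Strict UTF-8 encoding rejects lone surrogates (the issue 3672 behaviour).
try:
    lone.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# With the pass-through handler the old three-byte form is produced
# and accepted again on decoding.
encoded = lone.encode("utf-8", "surrogatepass")
decoded = encoded.decode("utf-8", "surrogatepass")

print(strict_ok, encoded, decoded == lone)
```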
Martin v. Löwis
py> b'\xed\xa0\x80'.decode("utf-8","surrogates") '\ud800'
The point is, "surrogates" does not mean anything intuitive for an /error handler/. You seem to be the only one who finds this name explicit enough, perhaps because you chose it. Most other handlers' names have verbs in them ("ignore", "replace", "xmlcharrefreplace", etc.). Regards Antoine.
On Wed, May 6, 2009 at 15:42, "Martin v. Löwis"
Despite there being also an error handler called "surrogates".
Not that I have to be, but I'm not sold on the previous UTF-8 codec behavior becoming an error handler of the name "surrogates" for two reasons (I do respect the obvious PBP argument for the implementation, and have no better name - "lenient"?). First, unless there's a way to stack error handlers, there's no way to access the old behavior combined with the "replace" handler. Second, errors="surrogates" reads like surrogates should be an error, not an additionally allowed pattern. Neither of these are deal breakers or hard to learn, but they are non-obvious. I think the utf8b behavior makes a lot more sense with the name "surrogates", through the mnemonic that errors become surrogates. The stacking argument also applies to the new utf8b behavior on encode (only, as it handles all errors on decode). This may be a YAGNI, but for a non-UTF-8 encode, it may be useful to allow "xmlcharrefreplace" handling for unavailable non-surrogate-escaped characters. But without stacking that's unmaintainable, as we clearly don't want ${codec}b for all current codecs. I'd be perfectly happy with utf8b or UTF-8b, as either a codec or an error handler (do we want both? YAGNI?). So what if it smells a little inaccurate as a handler when used with codecs other than UTF-8, no big deal. I could also see something like errors="roundtrip" which explains the intention of the handler rather than the algorithm, but is awkward on encode when it encounters unavailable Unicode characters. -- Michael Urman
Martin v. Löwis wrote:
The name "utf8b" suggested in the PEP is not in line with the codec design
Where is that design documented, and how exactly does the name violate the design (chapter and verse, please)?
Martin, I designed the whole Python codec machinery
Not true. PEP 293 was written and designed by Walter Dörwald.
Walter added the generic error handler callback mechanism and we both worked on their design. I designed and wrote the codec implementation back in 2000, which included the whole idea of having codec error handlers in the first place. The original implementation only allowed per-codec error handlers. Walter extended this to build general-purpose handlers that could be used by many codecs. His original motivation was to be able to do XML character reference escaping. If you don't believe me, go look this up in the repository, the mailing list archives and the trackers.
so even if this is not explicitly written down somewhere, you can take my word for it.
If the design was specified in writing somewhere, I would probably challenge it as obsolete. If it isn't described anywhere, I'll have to ignore it.
Ah, lovely attitude.
I want to avoid any such confusion with Python codecs and don't understand why you are making a problem out of this.
Because utf8b (or, perhaps "UTF-8b") is the official name for this algorithm:
That's a codec implementing the escaping idea proposed by Markus Kuhn, not an official reference. AFAIK, the term "UTF-8B" originated from a "UTF-8 + binary" codec written for iconv: http://mail.nl.linux.org/linux-utf8/2006-04/msg00002.html If it were the official name of an escape algorithm, as you are suggesting, the inventor Markus Kuhn would probably have chosen it, but he hasn't... the only reference to it is an email where it is described as option D for ways of dealing with malformed UTF-8 data in a decoder: http://mail.nl.linux.org/linux-utf8/2000-07/msg00040.html Note that this escape method is not applicable for data that you decode from UTF-8 and then e.g. encode as Latin-1. It only works as general purpose method if you are decoding and encoding using the same codec, since it is specifically designed to assure round-trip safety. Martin, please stop being silly and just change the name. Or drop the idea of using an error handler altogether and just let people use the utf-8b codec you referenced above to solve their problems wherever and if needed. Thanks, -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, May 07 2009)
Python/Zope Consulting and Support ... http://www.egenix.com/ mxODBC.Zope.Database.Adapter ... http://zope.egenix.com/ mxODBC, mxDateTime, mxTextTools ... http://python.egenix.com/
2009-06-29: EuroPython 2009, Birmingham, UK 52 days to go ::: Try our new mxODBC.Connect Python Database Interface for free ! :::: eGenix.com Software, Skills and Services GmbH Pastor-Loeh-Str.48 D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg Registered at Amtsgericht Duesseldorf: HRB 46611 http://www.egenix.com/company/contact/
"Martin v. Löwis" writes:
Now, with Python's file system encoding == UTF-8 or any packed EUC, and more than a handful of Shift JIS or Big5 characters in file names, one is *almost certain* to encounter ASCII as the second byte of a multibyte sequence. PEP 383 can't handle this
Ah, I see. Of course, the algorithm not only has to handle the ASCII octet which is erroneous because it can't be a trailing byte, but *also the leading byte that signalled to expect a trailing byte >127*. So the algorithm backs up to the character boundary (which is well-defined for all the "sane" encodings), encodes the high byte(s) in the character with lone surrogates, and encodes the ASCII as itself (promoted to a Unicode code point). Sorry, you're right, I was just confused. I withdraw the objection as completely mistaken, and apologize for not thinking more carefully in the first place.
Martin v. Löwis wrote:
Are you serious? Are you? ;-? You are the one naming a codec-agnostic error handler (if I understand correctly, and correct me if I do not) after a particular codec, and denying that that could cause confusion. See other message.
I can only repeat what I said before: I call it
What, specifically, is 'it'?
utf8b because that's the established name for the algorithm
Which algorithm?
it implements.
Again, what is 'it'? As *I* read the sentence above, it is not true. I went to the site you referred to as the source of your reasoning and specifically http://hyperreal.org/~est/utf-8b/releases/utf-8b-20060413043934/utf_8b.c The algorithm called utf-8b *IS* utf-8 with the addition or replacement (of an error return) of essentially one line in each direction:
    # encode
    if 0xDC00 <= codepoint <= 0xDCFF: byte = codepoint - 0xDC00 # encode
Note: for security concerns, you are increasing the lower limit to 0xDC80. The comment at the top of the utf_8b.c, suggests that that is what it should be and should have been in the file, with the other half of that surrogate area an error along with the other surrogate area.
    # decode
    if (0x80 <= byte <= 0xFF) and utf-8-invalid(byte): codepoint = byte + 0xDC00 # decode
That algorithm was originally designed with UTF-8 in mind (and only meant to be applied for UTF-8), however, it remains the same algorithm even though PEP 383 widens its application.
The error handler designed with utf-8 in mind has no name in the encode direction and is called "utf_8b_decoder_invalid_bytes" in the decode direction. By your reasoning, *that* should be its name in Python. The encoding error handler would then be named analogously "utf_8b_encoder_invalid_codepoints". Even these, to me, would be better than confusingly giving them the same name as the codec. Terry Jan Reedy
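The "one line in each direction" mapping described above can be written out as two small standalone functions. This is a sketch only: escape_byte and unescape_char are hypothetical names, and the lower limit is taken as 0xDC80 as mentioned in the note about security concerns, so bytes below 0x80 are never escaped.

```python
LOW, HIGH = 0xDC80, 0xDCFF  # lone surrogates reserved for escaped bytes


def escape_byte(byte: int) -> str:
    """Decode direction: map an undecodable byte 0x80..0xFF to U+DC80..U+DCFF."""
    if not 0x80 <= byte <= 0xFF:
        raise ValueError("only bytes 0x80..0xFF may be escaped")
    return chr(0xDC00 + byte)


def unescape_char(ch: str) -> int:
    """Encode direction: map U+DC80..U+DCFF back to the original byte."""
    codepoint = ord(ch)
    if not LOW <= codepoint <= HIGH:
        raise ValueError("not an escaped byte")
    return codepoint - 0xDC00


# Every escapable byte survives the round trip.
round_trip_ok = all(unescape_char(escape_byte(b)) == b for b in range(0x80, 0x100))
print(round_trip_ok)
```

Nothing in these two functions refers to UTF-8, which is the reason given in this thread for naming the handler after what it does rather than after one codec.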
On approximately 5/6/2009 6:06 PM, came the following characters from the keyboard of M.-A. Lemburg:
Martin, please stop being silly and just change the name.
Yes, please. If indeed Marc-Andre invented the codec business as he claims, he would be an appropriate person to give a fiat name to the error handler.
Or drop the idea of using an error handler altogether and just let people use the utf-8b codec you referenced above to solve their problems wherever and if needed.
The design as an error handler is clever in leveraging the same error handler for multiple codecs, which cannot be done by using utf-8b alone, if I understand correctly. -- Glenn
Michael Urman wrote:
On Wed, May 6, 2009 at 15:42, "Martin v. Löwis"
wrote: Despite there being also an error handler called "surrogates".
Not that I have to be, but I'm not sold on the previous UTF-8 codec behavior becoming an error handler of the name "surrogates" for two reasons (I do respect the obvious PBP argument for the implementation, and have no better name - "lenient"?).
PBP?
First, unless there's a way to stack error handlers, there's no way to access the old behavior combined with the "replace" handler.
Well, there is a way to stack error handlers, although it's not pretty:
    import codecs
    _surrogates = codecs.lookup_error("surrogates")
    _replace = codecs.lookup_error("replace")
    def surrogates_then_replace(exc):
        # Try the first handler; fall back to "replace" if it refuses.
        try:
            return _surrogates(exc)
        except UnicodeError:
            return _replace(exc)
    codecs.register_error("surrogates_then_replace", surrogates_then_replace)
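The same stacking trick applies to the PEP 383 handler on encode, which is the case raised in the stacking argument. The sketch below assumes the handler name "surrogateescape", under which it later shipped in CPython; the registered name "surrogateescape_then_replace" is made up for this illustration.

```python
import codecs

_escape = codecs.lookup_error("surrogateescape")
_replace = codecs.lookup_error("replace")


def surrogateescape_then_replace(exc):
    """Undo escaped bytes where possible; otherwise fall back to '?'."""
    try:
        return _escape(exc)
    except UnicodeError:
        # The first handler re-raises the error for anything that is
        # not an escaped byte, so hand the same error to "replace".
        return _replace(exc)


codecs.register_error("surrogateescape_then_replace", surrogateescape_then_replace)

# U+DC80 is an escaped byte 0x80; U+00E9 simply does not exist in ASCII.
result = "\udc80 \xe9".encode("ascii", "surrogateescape_then_replace")
print(result)
```

The two problem characters are separated by a space here so that the codec reports them as two separate errors, one for each handler.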
The stacking argument also applies to the new utf8b behavior on encode (only, as it handles all errors on decode). This may be a YAGNI
Indeed - in particular, as, in the primary application of this error handler (i.e. file IO operations), there is no way of specifying an additional error handler anyway. Regards, Martin
The error handler designed with utf-8 in mind has no name in the encode direction and is called "utf_8b_decoder_invalid_bytes" in the decode direction. By your reasoning, *that* should be its name in Python. The encoding error handler would then be named analogously "utf_8b_encoder_invalid_codepoints". Even these, to me, would be better than confusingly giving them the same name as the codec.
So are you proposing that I should rename the PEP 383 handler to "utf_8b_encoder_invalid_codepoints"? Regards, Martin
On approximately 5/6/2009 10:53 PM, came the following characters from the keyboard of Martin v. Löwis:
The error handler designed with utf-8 in mind has no name in the encode direction and is called "utf_8b_decoder_invalid_bytes" in the decode direction. By your reasoning, *that* should be its name in Python. The encoding error handler would then be named analogously "utf_8b_encoder_invalid_codepoints". Even these, to me, would be better than confusingly giving them the same name as the codec.
So are you proposing that I should rename the PEP 383 handler to "utf_8b_encoder_invalid_codepoints"?
No, he's saying that your algorithm for choosing the PEP 383 handler should have come up with that name, rather than utf8b. But since PEP 383 applies to other codecs besides UTF-8, it should have a different name. And one that is less cumbersome than "utf_8b_encoder_invalid_codepoints" -- Glenn
By the way, what are the ASCII characters that are not supported by Shift-JIS? Not many I suppose? (if I read the Wikipedia entry correctly, it's only the backslash and the tilde).
The problem with this encoding is that bytes below 128 appear as second bytes of a two-byte encoding:
    py> "\x81@".decode("shift-jis")
    u'\u3000'
    py> "\x81A".decode("shift-jis")
    u'\u3001'
So on decoding, it may be the second byte (i.e. the ASCII byte) that causes a problem:
    py> "\x81/".decode("shift-jis")
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    UnicodeDecodeError: 'shift_jis' codec can't decode bytes in position 0-1: illegal multibyte sequence
For the shift-jis codec, that's actually not a problem, though:
    py> b"\x81/".decode("shift-jis","utf8b")
    '\udc81/'
so the utf8b error handler will escape the first of the two bytes, and then pass the second byte to the codec again, which then decodes as ASCII. Regards, Martin
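The Shift JIS case can be checked as a script on a current interpreter. The one substitution to be aware of: the handler is spelled "surrogateescape" below, the name it was later released under, instead of "utf8b"; the byte string is the one from the session above.

```python
raw = b"\x81/"  # lead byte 0x81 followed by an ASCII byte that cannot trail it

# Strict decoding fails on this sequence.
try:
    raw.decode("shift-jis")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# The handler escapes only the lead byte; "/" is then decoded normally.
text = raw.decode("shift-jis", "surrogateescape")

# Encoding with the same codec and handler restores the original bytes.
round_trip = text.encode("shift-jis", "surrogateescape")

print(strict_failed, ascii(text), round_trip == raw)
```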
So are you proposing that I should rename the PEP 383 handler to "utf_8b_encoder_invalid_codepoints"?
No, he's saying that your algorithm for choosing the PEP 383 handler should have come up with that name, rather than utf8b. But since PEP 383 applies to other codecs besides UTF-8, it should have a different name. And one that is less cumbersome than "utf_8b_encoder_invalid_codepoints"
I'm still at a loss what name to give it, though. I understand that I have to rename both error handlers, but I'm uncertain what I should rename them to. So proposals that rename only one of them aren't that helpful. It would be helpful if people would indicate support for Antoine's proposal. Regards, Martin
On approximately 5/6/2009 11:16 PM, came the following characters from the keyboard of Martin v. Löwis:
So are you proposing that I should rename the PEP 383 handler to "utf_8b_encoder_invalid_codepoints"?
No, he's saying that your algorithm for choosing the PEP 383 handler should have come up with that name, rather than utf8b. But since PEP 383 applies to other codecs besides UTF-8, it should have a different name. And one that is less cumbersome than "utf_8b_encoder_invalid_codepoints"
I'm still at a loss what name to give it, though. I understand that I have to rename both error handlers, but I'm uncertain what I should rename them to. So proposals that rename only one of them aren't that helpful. It would be helpful if people would indicate support for Antoine's proposal.
Wouldn't renaming the existing "surrogates" handler be an incompatible change, and thus inappropriate? I assume that is the second handler you are referring to? "bytes-as-lone-surrogates" That would be very descriptive of the decode case for PEP 383, but very long. One problem with the word "surrogates" is that anything you add to it makes it too long. "bytes-ls" This is short, but meaningless as is -- however, adding the understanding via documentation that "ls" means "lone surrogates" would make it meaningful, and mnemonic. -- Glenn
Antoine Pitrou wrote:
Martin v. Löwis
writes: py> b'\xed\xa0\x80'.decode("utf-8","surrogates") '\ud800'
The point is, "surrogates" does not mean anything intuitive for an /error handler/. You seem to be the only one who finds this name explicit enough, perhaps because you chose it. Most other handlers' names have verbs in them ("ignore", "replace", "xmlcharrefreplace", etc.).
Correct. The purpose of an error handler name is to indicate to the user what it does, hence the use of verbs. Walter started with "xmlcharrefreplace", ie. no space names, so "surrogatereplace" would be the logically correct name for the "replace with lone surrogates" scheme invented by Markus Kuhn. The error handler for undoing this operation (ie. when converting a Unicode string to some other encoding) should probably use the same name based on symmetry and the fact that the escaping scheme is meant to be used for enabling round-trip safety. BTW: It would also be appropriate to reference Markus Kuhn in the PEP as the inventor of the escaping scheme. Even if only to give the reader an idea of how that scheme works and why (the PEP on python.org currently doesn't explain this). It should also explain that the scheme is meant to assure round-trip safety and doesn't necessarily work when using transcoding, ie. reading using one encoding, writing using another. Thanks, -- Marc-Andre Lemburg
Python/Zope Consulting and Support ... http://www.egenix.com/ mxODBC.Zope.Database.Adapter ... http://zope.egenix.com/ mxODBC, mxDateTime, mxTextTools ... http://python.egenix.com/
2009-06-29: EuroPython 2009, Birmingham, UK 52 days to go ::: Try our new mxODBC.Connect Python Database Interface for free ! :::: eGenix.com Software, Skills and Services GmbH Pastor-Loeh-Str.48 D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg Registered at Amtsgericht Duesseldorf: HRB 46611 http://www.egenix.com/company/contact/
M.-A. Lemburg wrote:
Antoine Pitrou wrote:
Martin v. Löwis writes:
py> b'\xed\xa0\x80'.decode("utf-8","surrogates")
'\ud800'
The point is, "surrogates" does not mean anything intuitive for an /error handler/. You seem to be the only one who finds this name explicit enough, perhaps because you chose it. Most other handlers' names have verbs in them ("ignore", "replace", "xmlcharrefreplace", etc.).
Correct. The purpose of an error handler name is to indicate to the user what it does, hence the use of verbs.
Walter started with "xmlcharrefreplace", ie. no space names, so "surrogatereplace" would be the logically correct name for the "replace with lone surrogates" scheme invented by Markus Kuhn.
"surrogatepass" (for the "don't complain about lone half surrogates" handler) and "surrogatereplace" sound OK to me. However the other "...replace" handlers are destructive (i.e. when such a "...replace" handler is used for encoding, decoding will not produce the original unicode string). The purpose of the PEP 383 error handler however is to be roundtrip safe, so maybe we should choose a slightly different name? How about "surrogateescape"?
The error handler for undoing this operation (ie. when converting a Unicode string to some other encoding) should probably use the same name based on symmetry and the fact that the escaping scheme is meant to be used for enabling round-trip safety.
We have only one error handler registry, but we *can* have one error handler for both directions (encoding and decoding) as the error handler can simply check whether it got passed a UnicodeEncodeError or UnicodeDecodeError object.
BTW: It would also be appropriate to reference Markus Kuhn in the PEP as the inventor of the escaping scheme.
Even if only to give the reader an idea of how that scheme works and why (the PEP on python.org currently doesn't explain this).
It should also explain that the scheme is meant to assure round-trip safety and doesn't necessarily work when using transcoding, ie. reading using one encoding, writing using another.
Servus, Walter
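Walter's point that a single registered handler can serve both directions can be sketched as follows. This is only an illustration, assuming current Python 3 semantics for `codecs.register_error`; the handler name "demo-escape" and the function `demo_escape` are hypothetical and are not the PEP 383 implementation.

```python
import codecs


def demo_escape(exc):
    """Hypothetical handler that inspects the exception type to pick a direction."""
    if isinstance(exc, UnicodeDecodeError):
        bad = exc.object[exc.start:exc.end]
        if any(b < 128 for b in bad):
            raise exc  # only bytes >= 128 are escaped
        # Decoding: map each undecodable byte to the lone surrogate U+DC00 + byte.
        return "".join(chr(0xDC00 + b) for b in bad), exc.end
    if isinstance(exc, UnicodeEncodeError):
        out = bytearray()
        for ch in exc.object[exc.start:exc.end]:
            code = ord(ch)
            if not 0xDC80 <= code <= 0xDCFF:
                raise exc  # not an escaped byte, so give up
            # Encoding: turn the lone surrogate back into the original byte.
            out.append(code - 0xDC00)
        return bytes(out), exc.end
    raise exc


# One registry entry covers both encoding and decoding.
codecs.register_error("demo-escape", demo_escape)

raw = b"abc\xff"
text = raw.decode("utf-8", "demo-escape")
back = text.encode("utf-8", "demo-escape")
```

Here `text` holds a lone surrogate in place of the undecodable byte, and `back` restores the original bytes.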
Martin v. Löwis wrote:
Wouldn't renaming the existing "surrogates" handler be an incompatible change, and thus inappropriate?
No - it's new in Python 3.1.
So what do you think about Antoine's proposal?
+1 Although it looks like it would be without the '-' for consistency with existing error handlers.
On Thu, May 7, 2009 at 00:43, "Martin v. Löwis"
Michael Urman wrote:
On Wed, May 6, 2009 at 15:42, "Martin v. Löwis"
wrote: Despite there being also an error handler called "surrogates".
Not that I have to be, but I'm not sold on the previous UTF-8 codec behavior becoming an error handler of the name "surrogates" for two reasons (I do respect the obvious PBP argument for the implementation, and have no better name - "lenient"?).
PBP?
Practicality beats purity.
From a purity standpoint, the legacy invalid utf-8 seems more like an encoding than an error handler to me. From a practicality standpoint, it's presumably much more convenient to implement it on top of the new valid UTF-8 codec's behavior. And then any error handler needs a name.
Well, there is a way to stack error handlers, although it's not pretty: [...] codecs.register_error("surrogates_then_replace", surrogates_then_replace)
That mitigates my arguments significantly, although I'd rather see something like errors=('surrogates', 'replace') chain the handlers without additional registrations. But that's a different PEP or arbitrary change. :)
The stacking argument also applies to the new utf8b behavior on encode (only, as it handles all errors on decode). This may be a YAGNI
Indeed - in particular, as, in the primary application of this error handler (i.e. file IO operations), there is no way of specifying an additional error handler anyway.
Would it be useful to allow setting this somewhere? It'd be analogous to setfsencoding, perhaps a setfsencodingerrors. It's not hard to imagine an application working on Windows where all Unicode characters are valid, and constructing backup filenames by adding some arbitrary character, or receiving them from a user who doesn't understand encodings. When this application is taken to a non-Unicode filesystem, without the ability to say "I really want a valid filename: so replace", that could get messy. But it may still be a YAGNI, or a "don't do that." -- Michael Urman
On Thu, May 7, 2009 at 01:16, "Martin v. Löwis"
I'm still at a loss what name to give it, though. I understand that I have to rename both error handlers, but I'm uncertain what I should rename them to. So proposals that rename only one of them aren't that helpful. It would be helpful if people would indicate support for Antoine's proposal.
Part of the problem is they both allow byte sequences to decode to invalid Unicode strings, and in particular they both affect the same byte subsequences, and that brought us to the crossroads where we wanted to name both of them "surrogates". So I'll offer a few more colors, and try to get out of the way of choosing between them or the other proposed ones. :) I haven't come up with anything I like better than errors="lenient" for the old utf8 behavior handler; would errors="nonvalidating" be correct? It still seems to me that a new codec, perhaps "utf8-lenient", reads better. For the utf8b error handler, I could see any of errors="roundtrip", errors="roundtripreplace", errors="tosurrogate", errors="surrogatereplace", errors="surrogateescape", errors="binaryreplace", errors="binaryescape". This includes Antoine's proposal (sans hyphen). -- Michael Urman
Michael Urman wrote:
[...]
Well, there is a way to stack error handlers, although it's not pretty: [...] codecs.register_error("surrogates_then_replace", surrogates_then_replace)
That mitigates my arguments significantly, although I'd rather see something like errors=('surrogates', 'replace') chain the handlers without additional registrations. But that's a different PEP or arbitrary change. :)
The first version of PEP 293 changed the errors argument to be a string or callable. This would have simplified handler stacking somewhat (because you don't have to register or lookup handlers) but it had the disadvantage that many "char *" arguments in the C API would have had to be changed to "PyObject *". Changing the errors argument to a list of strings would have the same problem. Servus, Walter
Walter Dörwald wrote:
Michael Urman wrote:
[...]
Well, there is a way to stack error handlers, although it's not pretty: [...] codecs.register_error("surrogates_then_replace", surrogates_then_replace) That mitigates my arguments significantly, although I'd rather see something like errors=('surrogates', 'replace') chain the handlers without additional registrations. But that's a different PEP or arbitrary change. :)
The first version of PEP 293 changed the errors argument to be a string or callable. This would have simplified handler stacking somewhat (because you don't have to register or lookup handlers) but it had the disadvantage that many "char *" arguments in the C API would have had to be changed to "PyObject *". Changing the errors argument to a list of strings would have the same problem.
A comma-separated or space-separated string, eg 'surrogates replace' or 'surrogates,replace'? It could be treated as handler stacking internally.
Well, there is a way to stack error handlers, although it's not pretty: [...] codecs.register_error("surrogates_then_replace", surrogates_then_replace)
That mitigates my arguments significantly, although I'd rather see something like errors=('surrogates', 'replace') chain the handlers without additional registrations. But that's a different PEP or arbitrary change. :)
I think you can provide something like errors=combine_errors('surrogates', 'replace') as a library function, and it doesn't have to be part of the standard library.
The stacking argument also applies to the new utf8b behavior on encode (only, as it handles all errors on decode). This may be a YAGNI Indeed - in particular, as, in the primary application of this error handler (i.e. file IO operations), there is no way of specifying an additional error handler anyway.
Would it be useful to allow setting this somewhere?
I'm deliberately not proposing this as part of the PEP. First, it has enough features already, and is approved as-is; plus YAGNI. Regards, Martin
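The `combine_errors` idea mentioned above could look roughly like this as a user-level helper. This is a sketch under assumptions: `combine_errors` is a hypothetical function (not part of the standard library), and it uses the handler names "surrogateescape" and "replace" as they exist in current Python rather than the provisional "surrogates" name used in the thread.

```python
import codecs


def combine_errors(*names):
    """Hypothetical helper: register a handler that tries each named handler in turn.

    Returns the name under which the combined handler was registered.
    """
    handlers = [codecs.lookup_error(name) for name in names]
    combined_name = "+".join(names)

    def combined(exc):
        # Try every handler but the last; fall through when one gives up.
        for handler in handlers[:-1]:
            try:
                return handler(exc)
            except UnicodeError:
                continue
        # The last handler's result (or exception) is final.
        return handlers[-1](exc)

    codecs.register_error(combined_name, combined)
    return combined_name


errors = combine_errors("surrogateescape", "replace")

# U+DCFF is an escaped byte and is restored by the first handler;
# U+20AC cannot be encoded in ASCII and falls through to "replace".
data = "a\udcffb\u20ac".encode("ascii", errors)
```

Note that a codec may report a run of adjacent unencodable characters as a single error, in which case the whole run is handed to one handler; the example keeps the two problem characters apart for that reason.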
I haven't come up with anything I like better than errors="lenient" for the old utf8 behavior handler; would errors="nonvalidating" be correct?
I think either is fairly unspecific.
For the utf8b error handler, I could see any of errors="roundtrip", errors="roundtripreplace", errors="tosurrogate", errors="surrogatereplace", errors="surrogateescape", errors="binaryreplace", errors="binaryescape". This includes Antoine's proposal (sans hyphen).
Giving multiple choices does not exactly make this proposal readily implementable :-) Regards, Martin
The error handler for undoing this operation (ie. when converting a Unicode string to some other encoding) should probably use the same name based on symmetry and the fact that the escaping scheme is meant to be used for enabling round-trip safety.
Could you please familiarize yourself with the implementation before commenting further? Thanks, Martin
Walter Dörwald writes:
"surrogatepass" (for the "don't complain about lone half surrogates" handler) and "surrogatereplace" sound OK to me. However the other "...replace" handlers are destructive (i.e. when such a "...replace" handler is used for encoding, decoding will not produce the original unicode string).
That doesn't bother me in the slightest. "Replace" does not connote "destructive" or "non-destructive" to me; it connotes "substitution". The fact that other error handlers happen to be destructive doesn't affect that at all for me. YMMV.
The purpose of the PEP 383 error handler however is to be roundtrip safe, so maybe we should choose a slightly different name? How about "surrogateescape"?
To me, "escape" has a strong connotation of a multicharacter representation of a single character, and that's not true here. How about "surrogatetranslate"? I still prefer "surrogatereplace", as it's slightly easier for me to type.
Martin v. Löwis wrote:
So are you proposing that I should rename the PEP 383 handler to "utf_8b_encoder_invalid_codepoints"?
No, he's saying that your algorithm for choosing the PEP 383 handler should have come up with that name, rather than utf8b. But since PEP 383 applies to other codecs besides UTF-8, it should have a different name. And one that is less cumbersome than "utf_8b_encoder_invalid_codepoints"
Correct. Thank you Glenn.
I'm still at a loss what name to give it, though. I understand that I have to rename both error handlers, but I'm uncertain what I should rename them to. So proposals that rename only one of them aren't that helpful. It would be helpful if people would indicate support for Antoine's proposal.
Given your explanation of what the new 'surrogates' handler does (pass rather than reject erroneous surrogates), I think 'surrogates_pass' is fine. Thus, I consider that and 'surrogates_excape' the best proposal so far and suggest that you make this pair the current status quo to be argued against and improved ... or not. tjr
Given your explanation of what the new 'surrogates' handler does (pass rather than reject erroneous surrogates), I think 'surrogates_pass' is fine. Thus, I considoer that and 'surrogates_excape' the best proposal the best so far and suggest that you make this pair the current status quo to be argued against and improved ... or not.
That's exactly what I want to avoid: more bike-shedding. If this is now changed, it cannot possibly be argued against and improved - it would be final, end of discussion (please!!!). So I'm happy to make it "surrogatepass" and "surrogateescape" as proposed by Walter. I'm sure you didn't really mean the spelling of "excape" to be taken literally - whether or not you meant the plural and the underscore literally, I cannot tell. Stephen Turnbull approved singular, so that's good enough for me. Regards, Martin
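For readers following along, the behaviour behind the two names settled on here can be illustrated with a short sketch. It assumes a current Python 3 interpreter in which both "surrogateescape" and "surrogatepass" are registered; the sample byte strings are made up for illustration.

```python
# Bytes that are valid Latin-1 but not valid UTF-8 (0xE9 is a stray lead byte).
raw = b"caf\xe9.txt"

# "surrogateescape": undecodable bytes become lone surrogates U+DC80..U+DCFF ...
name = raw.decode("utf-8", "surrogateescape")

# ... and encoding with the same handler restores the original bytes.
roundtrip = name.encode("utf-8", "surrogateescape")

# Without the handler, the strict codec refuses the lone surrogate.
try:
    name.encode("utf-8")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# "surrogatepass": lone surrogates are passed through as UTF-8-style bytes
# instead of being rejected, in both directions.
passed = "\ud800".encode("utf-8", "surrogatepass")
decoded = b"\xed\xa0\x80".decode("utf-8", "surrogatepass")
```

The round trip only holds when the same encoding and handler are used in both directions, which matches the caveat about transcoding raised earlier in the thread.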
On Thu, May 7, 2009 at 12:39 PM, "Martin v. Löwis"
Given your explanation of what the new 'surrogates' handler does (pass rather than reject erroneous surrogates), I think 'surrogates_pass' is fine. Thus, I considoer that and 'surrogates_excape' the best proposal the best so far and suggest that you make this pair the current status quo to be argued against and improved ... or not.
That's exactly what I want to avoid: more bike-shedding. If this is now changed, it cannot possibly be argued against and improved - it would be final, end of discussion (please!!!).
So I'm happy to make it "surrogatepass" and "surrogateescape" as proposed by Walter. I'm sure you didn't really mean the spelling of "excape" to be taken literally - whether or not you meant the plural and the underscore literally, I cannot tell. Stephen Turnbull approved singular, so that's good enough for me.
singular is good. +1 on these names.
Martin v. Löwis wrote:
Given your explanation of what the new 'surrogates' handler does (pass rather than reject erroneous surrogates), I think 'surrogates_pass' is fine. Thus, I considoer that and 'surrogates_excape' the best proposal the best so far and suggest that you make this pair the current status quo to be argued against and improved ... or not.
That's exactly what I want to avoid: more bike-shedding. If this is now changed, it cannot possibly be argued against and improved - it would be final, end of discussion (please!!!).
So I'm happy to make it "surrogatepass" and "surrogateescape" as proposed by Walter. I'm sure you didn't really mean the spelling of "excape" to be taken literally - whether or not you meant the plural and the underscore literally, I cannot tell. Stephen Turnbull approved singular, so that's good enough for me.
Those minor tweaks for consistency with existing names are what I meant by 'improve' (with good arguments) and I approve of them also. +1 on stopping here.
Terry Reedy wrote:
Martin v. Löwis wrote:
Given your explanation of what the new 'surrogates' handler does (pass rather than reject erroneous surrogates), I think 'surrogates_pass' is fine. Thus, I considoer that and 'surrogates_excape' the best proposal the best so far and suggest that you make this pair the current status quo to be argued against and improved ... or not.
That's exactly what I want to avoid: more bike-shedding. If this is now changed, it cannot possibly be argued against and improved - it would be final, end of discussion (please!!!).
So I'm happy to make it "surrogatepass" and "surrogateescape" as proposed by Walter. I'm sure you didn't really mean the spelling of "excape" to be taken literally - whether or not you meant the plural and the underscore literally, I cannot tell. Stephen Turnbull approved singular, so that's good enough for me.
Those minor tweaks for consistency with existing names are what I meant by 'improve' (with good arguments) and I approve of them also. +1 on stopping here.
We argue because we care. :-)
Martin v. Löwis wrote:
The error handler for undoing this operation (ie. when converting a Unicode string to some other encoding) should probably use the same name based on symmetry and the fact that the escaping scheme is meant to be used for enabling round-trip safety.
Could you please familiarize yourself with the implementation before commenting further?
I did and it already uses the same (wrong) name for both encoding and decoding handlers which is good. The reason for my above comment was that the thread mentions two different names for the handler depending on the direction, e.g. "surrogatereplace" and "surrogatepass". I guess that "surrogatepass" was just an attempt to find a new name for the "surrogates" error handler (which also doesn't match the naming scheme) and that got me confused. I'd use "allowlonesurrogates" as name for the "surrogates" error handler and "lonesurrogatereplace" for the "utf8b" one. Thanks, -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, May 08 2009)
On approximately 5/7/2009 3:27 PM, came the following characters from the keyboard of MRAB:
Terry Reedy wrote:
Martin v. Löwis wrote:
So I'm happy to make it "surrogatepass" and "surrogateescape" as
These seem adequate. It is not what I would choose or suggest, but it is adequate, and it is unlikely you can delight everyone with your choice of names, or even someone else's choice of names. These at least have a logical justification for their meaning, and can be documented reasonably. -- Glenn
Stephen J. Turnbull wrote:
Walter Dörwald writes:
"surrogatepass" (for the "don't complain about lone half surrogates" handler) and "surrogatereplace" sound OK to me. However the other "...replace" handlers are destructive (i.e. when such a "...replace" handler is used for encoding, decoding will not produce the original unicode string).
That doesn't bother me in the slightest. "Replace" does not connote "destructive" or "non-destructive" to me; it connotes "substitution". The fact that other error handlers happen to be destructive doesn't affect that at all for me. YMMV.
The purpose of the PEP 383 error handler however is to be roundtrip safe, so maybe we should choose a slightly different name? How about "surrogateescape"?
To me, "escape" has a strong connotation of a multicharacter representation of a single character, and that's not true here.
How about "surrogatetranslate"? I still prefer "surrogatereplace", as it's slightly easier for me to type.
I like "surrogatetranslate" better than "surrogateescape" better than "surrogatereplace". But I'll stop bikeshedding now and let Martin decide. Servus, Walter
participants (18)
- "Martin v. Löwis"
- Antoine Pitrou
- Glenn Linderman
- Gregory P. Smith
- James Y Knight
- Lennart Regebro
- Lino Mastrodomenico
- M.-A. Lemburg
- Michael Urman
- MRAB
- Paul Moore
- R. David Murray
- Stephen J. Turnbull
- Stephen J. Turnbull
- Terry Reedy
- Walter Dörwald
- Zooko O'Whielacronx
- Zooko Wilcox-O'Hearn