[Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

Tue Apr 28 21:07:54 CEST 2009

On approximately 4/28/2009 10:53 AM, came the following characters from 
the keyboard of James Y Knight:
> 
> On Apr 28, 2009, at 2:50 AM, Martin v. Löwis wrote:
> 
>> James Y Knight wrote:
>>> Hopefully it can be assumed that your locale encoding really is a
>>> non-overlapping superset of ASCII, as is required by POSIX...
>>
>> Can you please point to the part of the POSIX spec that says that
>> such overlapping is forbidden?
> 
> I can't find it...I would've thought it would be on this page:
> http://opengroup.org/onlinepubs/007908775/xbd/charset.html
> but it's not (at least, not obviously). That does say (effectively) that 
> all encodings must be supersets of ASCII and use the same codepoints, 
> though.
> 
> However, ISO-2022 being inappropriate for LC_CTYPE usage is the entire 
> reason why EUC-JP was created, so I'm pretty sure that it is in fact 
> inappropriate, and I cannot find any evidence of it ever being used on 
> any system.

It would seem from the definition of ISO-2022 that what it calls "escape 
sequences" is in your POSIX spec called "locking-shift encoding". 
Therefore, the second bullet item under the "Character Encoding" heading 
prohibits use of ISO-2022, for whatever uses that document defines 
(which, since you referenced it, I assume means locales, and possibly 
file system encodings, but I'm not familiar with the structure of all 
the POSIX standards documents).

A locking-shift encoding (where the state of the character is determined 
by a shift code that may affect more than the single character following 
it) cannot be defined with the current character set description file 
format. Use of a locking-shift encoding with any of the standard 
utilities in the XCU specification or with any of the functions in the 
XSH specification that do not specifically mention the effects of 
state-dependent encoding is implementation-dependent.

>  From http://en.wikipedia.org/wiki/EUC-JP:
> "To get the EUC form of an ISO-2022 character, the most significant bit 
> of each 7-bit byte of the original ISO 2022 codes is set (by adding 128 
> to each of these original 7-bit codes); this allows software to easily 
> distinguish whether a particular byte in a character string belongs to 
> the ISO-646 code or the ISO-2022 (EUC) code."
> 
> Also:
> http://www.cl.cam.ac.uk/~mgk25/ucs/iso2022-wc.html
> 
> 
>>> I'm a bit scared at the prospect that U+DCAF could turn into "/", that
>>> just screams security vulnerability to me.  So I'd like to propose that
>>> only 0x80-0xFF <-> U+DC80-U+DCFF should ever be allowed to be
>>> encoded/decoded via the error handler.
>>
>> It would be actually U+DC2f that would turn into /.
> 
> Yes, I meant to say DC2F, sorry for the confusion.
> 
>> I'm happy to exclude that range from the mapping if POSIX really
>> requires an encoding not to be overlapping with ASCII.
> 
> I think it has to be excluded from mapping in order to not introduce 
> security issues.
> 
> However...
> 
> There's also SHIFT-JIS to worry about...which apparently some people 
> actually want to use as their default encoding, despite it being broken 
> to do so. RedHat apparently refuses to provide it as a locale charset 
> (due to its brokenness), and it's also not available by default on my 
> Debian system. People do unfortunately seem to actually use it in real 
> life.
> 
> https://bugzilla.redhat.com/show_bug.cgi?id=136290
> 
> So, I'd like to propose this:
> The "python-escape" error handler when given a non-decodable byte from 
> 0x80 to 0xFF will produce values of U+DC80 to U+DCFF. When given a 
> non-decodable byte from 0x00 to 0x7F, it will be converted to 
> U+0000-U+007F. On the encoding side, values from U+DC80 to U+DCFF are 
> encoded into 0x80 to 0xFF, and all other characters are treated in 
> whatever way the encoding would normally treat them.
> 
> This proposal obviously works for all non-overlapping ASCII supersets, 
> where 0x00 to 0x7F always decode to U+00 to U+7F. But it also works for 
> Shift-JIS and other similar ASCII-supersets with overlaps in trailing 
> bytes of a multibyte sequence. So, a sequence like 
> "\x81\xFD".decode("shift-jis", "python-escape") will turn into 
> u"\uDC81\u00fd". Which will then properly encode back into "\x81\xFD".
> 
> The character sets this *doesn't* work for are: ebcdic code pages 
> (obviously completely unsuitable for a locale encoding on unix), 

Why is that obvious?  The only thing I saw that could exclude EBCDIC 
would be the requirement that the codes be positive in a char, but on a 
system where the C compiler treats char as unsigned, EBCDIC would qualify.

Of course, the use of EBCDIC would also restrict the other possible code 
pages to those derived from EBCDIC (rather than the bulk of code pages 
that are derived from ASCII), due to:

If the encoded values associated with each member of the portable 
character set are not invariant across all locales supported by the 
implementation, the results achieved by an application accessing those 
locales are unspecified.

> iso2022-* (covered above), and shift-jisx0213 (because it has replaced \ 
> with yen, and - with overline).
> 
> If it's desirable to work with shift_jisx0213, a modification of the 
> proposal can be made: Change the second sentence to: "When given a 
> non-decodable byte from 0x00 to 0x7F, that byte must be the second or 
> later byte in a multibyte sequence. In such a case, the error handler 
> will produce the encoding of that byte if it was standing alone (thus in 
> most encodings, \x00-\x7f turn into U+00-U+7F)."
> 
> It sounds from https://bugzilla.novell.com/show_bug.cgi?id=162501 like 
> some people do actually use shift_jisx0213, unfortunately.

-- 
Glenn -- http://nevcal.com/
===========================
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking