[Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

MRAB google at mrabarnett.plus.com
Tue Apr 28 20:55:09 CEST 2009


James Y Knight wrote:
> 
> On Apr 28, 2009, at 2:50 AM, Martin v. Löwis wrote:
> 
>> James Y Knight wrote:
>>> Hopefully it can be assumed that your locale encoding really is a
>>> non-overlapping superset of ASCII, as is required by POSIX...
>>
>> Can you please point to the part of the POSIX spec that says that
>> such overlapping is forbidden?
> 
> I can't find it...I would've thought it would be on this page:
> http://opengroup.org/onlinepubs/007908775/xbd/charset.html
> but it's not (at least, not obviously). That does say (effectively) that 
> all encodings must be supersets of ASCII and use the same codepoints, 
> though.
> 
> However, ISO-2022 being inappropriate for LC_CTYPE usage is the entire 
> reason why EUC-JP was created, so I'm pretty sure that it is in fact 
> inappropriate, and I cannot find any evidence of it ever being used on 
> any system.
> 
> From http://en.wikipedia.org/wiki/EUC-JP:
> "To get the EUC form of an ISO-2022 character, the most significant bit 
> of each 7-bit byte of the original ISO 2022 codes is set (by adding 128 
> to each of these original 7-bit codes); this allows software to easily 
> distinguish whether a particular byte in a character string belongs to 
> the ISO-646 code or the ISO-2022 (EUC) code."
> 
> Also:
> http://www.cl.cam.ac.uk/~mgk25/ucs/iso2022-wc.html
> 
> 
>>> I'm a bit scared at the prospect that U+DCAF could turn into "/", that
>>> just screams security vulnerability to me.  So I'd like to propose that
>>> only 0x80-0xFF <-> U+DC80-U+DCFF should ever be allowed to be
>>> encoded/decoded via the error handler.
>>
>> It would actually be U+DC2F that would turn into "/".
> 
> Yes, I meant to say DC2F, sorry for the confusion.
> 
>> I'm happy to exclude that range from the mapping if POSIX really
>> requires an encoding not to be overlapping with ASCII.
> 
> I think it has to be excluded from mapping in order to not introduce 
> security issues.
> 
> However...
> 
> There's also SHIFT-JIS to worry about...which apparently some people 
> actually want to use as their default encoding, despite it being broken 
> to do so. Red Hat apparently refuses to provide it as a locale charset 
> (due to its brokenness), and it's also not available by default on my 
> Debian system. People do unfortunately seem to actually use it in real 
> life.
> 
> https://bugzilla.redhat.com/show_bug.cgi?id=136290
> 
> So, I'd like to propose this:
> The "python-escape" error handler when given a non-decodable byte from 
> 0x80 to 0xFF will produce values of U+DC80 to U+DCFF. When given a 
> non-decodable byte from 0x00 to 0x7F, it will be converted to 
> U+0000-U+007F. On the encoding side, values from U+DC80 to U+DCFF are 
> encoded into 0x80 to 0xFF, and all other characters are treated in 
> whatever way the encoding would normally treat them.
> 
> This proposal obviously works for all non-overlapping ASCII supersets, 
> where 0x00 to 0x7F always decode to U+00 to U+7F. But it also works for 
> Shift-JIS and other similar ASCII-supersets with overlaps in trailing 
> bytes of a multibyte sequence. So, a sequence like 
> "\x81\xFD".decode("shift-jis", "python-escape") will turn into 
> u"\uDC81\u00fd". Which will then properly encode back into "\x81\xFD".
> 
> The character sets this *doesn't* work for are: EBCDIC code pages 
> (obviously completely unsuitable for a locale encoding on Unix), 
> iso2022-* (covered above), and shift-jisx0213 (because it has replaced \ 
> with yen, and ~ with overline).
> 
> If it's desirable to work with shift_jisx0213, a modification of the 
> proposal can be made: Change the second sentence to: "When given a 
> non-decodable byte from 0x00 to 0x7F, that byte must be the second or 
> later byte in a multibyte sequence. In such a case, the error handler 
> will produce the encoding of that byte if it was standing alone (thus in 
> most encodings, \x00-\x7f turn into U+00-U+7F)."
> 
> It sounds from https://bugzilla.novell.com/show_bug.cgi?id=162501 like 
> some people do actually use shift_jisx0213, unfortunately.
> 
I've been thinking of "python-escape" only in terms of UTF-8, the only
encoding mentioned in the PEP. In UTF-8, bytes 0x00 to 0x7F are
decodable.
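
A minimal sketch of the UTF-8 case, assuming a Python 3 interpreter in
which the handler is available under the name "surrogateescape" (this
thread calls it "python-escape"). The byte string and variable names are
made up for illustration:

```python
# 0xE9 is a UTF-8 lead byte, but "/" is not a continuation byte,
# so 0xE9 is undecodable in this position.
raw = b"caf\xe9/file"

# The undecodable byte 0xE9 becomes the half surrogate U+DCE9.
text = raw.decode("utf-8", "surrogateescape")

# Encoding with the same handler restores the original bytes.
roundtrip = text.encode("utf-8", "surrogateescape")

# U+DC2F is outside U+DC80-U+DCFF, so it is refused rather than
# being turned into "/".
try:
    "\udc2f".encode("utf-8", "surrogateescape")
    low_range_rejected = False
except UnicodeEncodeError:
    low_range_rejected = True
```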

But if you're talking about using it with other encodings, e.g.
shift-jisx0213, then I'd suggest the following:

1. Bytes 0x00 to 0xFF which can't normally be decoded are decoded to
half surrogates U+DC00 to U+DCFF.

2. Bytes which would have decoded to half surrogates U+DC00 to U+DCFF
are treated as though they are undecodable bytes.

3. Half surrogates U+DC00 to U+DCFF which can be produced by decoding
are encoded to bytes 0x00 to 0xFF.

4. Codepoints, including half surrogates U+DC00 to U+DCFF, which can't
be produced by decoding raise an exception.

I think I've covered all the possibilities. :-)
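
The four rules above could be sketched roughly as follows in Python 3.
This is only an illustration, not the PEP's implementation: the handler
name "escape-all-bytes" and the helpers escape_decode() and
escape_encode() are hypothetical. Rule 2 is left to the codec itself
(for example, the UTF-8 decoder already rejects encoded surrogates), and
rule 4 is approximated by checking that the encoded bytes decode back to
the same string.

```python
import codecs


def _escape_all_bytes(exc):
    """Hypothetical error handler mapping bytes 0x00-0xFF <-> U+DC00-U+DCFF."""
    if isinstance(exc, UnicodeDecodeError):
        # Rule 1: every undecodable byte becomes U+DC00 + byte value.
        bad = exc.object[exc.start:exc.end]
        return "".join(chr(0xDC00 + b) for b in bad), exc.end
    if isinstance(exc, UnicodeEncodeError):
        # Rule 3: half surrogates U+DC00-U+DCFF go back to single bytes.
        out = bytearray()
        for ch in exc.object[exc.start:exc.end]:
            cp = ord(ch)
            if not 0xDC00 <= cp <= 0xDCFF:
                raise exc  # Rule 4: anything else stays an error.
            out.append(cp - 0xDC00)
        return bytes(out), exc.end
    raise exc


codecs.register_error("escape-all-bytes", _escape_all_bytes)


def escape_decode(data, encoding):
    return data.decode(encoding, "escape-all-bytes")


def escape_encode(text, encoding):
    data = text.encode(encoding, "escape-all-bytes")
    # Rule 4: reject strings that decoding could never have produced,
    # e.g. U+DC2F for UTF-8, where the byte 0x2F always decodes to "/".
    if escape_decode(data, encoding) != text:
        raise UnicodeEncodeError(
            encoding, text, 0, len(text),
            "string could not have been produced by decoding")
    return data
```

With this sketch, escape_decode(b"\xff/\x81", "utf-8") gives
"\udcff/\udc81", which escape_encode() turns back into the original
bytes, while escape_encode("\udc2f", "utf-8") raises instead of
producing "/".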

