[Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

James Y Knight foom at fuhm.net
Tue Apr 28 19:53:42 CEST 2009

On Apr 28, 2009, at 2:50 AM, Martin v. Löwis wrote:

> James Y Knight wrote:
>> Hopefully it can be assumed that your locale encoding really is a
>> non-overlapping superset of ASCII, as is required by POSIX...
> Can you please point to the part of the POSIX spec that says that
> such overlapping is forbidden?

I can't find it...I would've thought it would be on this page:
but it's not (at least, not obviously). That does say (effectively)  
that all encodings must be supersets of ASCII and use the same  
codepoints, though.

However, ISO-2022 being inappropriate for LC_CTYPE usage is the entire  
reason why EUC-JP was created, so I'm pretty sure that it is in fact  
inappropriate, and I cannot find any evidence of it ever being used on  
any system.

From http://en.wikipedia.org/wiki/EUC-JP:
"To get the EUC form of an ISO-2022 character, the most significant  
bit of each 7-bit byte of the original ISO 2022 codes is set (by  
adding 128 to each of these original 7-bit codes); this allows  
software to easily distinguish whether a particular byte in a  
character string belongs to the ISO-646 code or the ISO-2022 (EUC)  
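
The bit manipulation described in that quote can be sketched in a few lines of Python 3. This is only an illustration of the "add 128 to each 7-bit code" step for a two-byte character; it does not handle ISO-2022 escape sequences, and the helper name iso2022_to_euc is made up for this example.

```python
def iso2022_to_euc(seven_bit: bytes) -> bytes:
    """Set the most significant bit of each 7-bit byte (hypothetical helper)."""
    if any(b > 0x7F for b in seven_bit):
        raise ValueError("expected 7-bit input")
    return bytes(b | 0x80 for b in seven_bit)


# 0x24 0x22 is the JIS X 0208 code for HIRAGANA LETTER A (U+3042), i.e. the
# two bytes ISO-2022-JP transmits between its escape sequences.
euc = iso2022_to_euc(b"\x24\x22")
print(euc.hex())                    # a4a2
print(ascii(euc.decode("euc_jp")))  # '\u3042'

# Every byte of the EUC form is >= 0x80, so none can be mistaken for "/".
print(all(b >= 0x80 for b in euc))  # True
```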


>> I'm a bit scared at the prospect that U+DCAF could turn into "/", that
>> just screams security vulnerability to me.  So I'd like to propose that
>> only 0x80-0xFF <-> U+DC80-U+DCFF should ever be allowed to be
>> encoded/decoded via the error handler.
> It would be actually U+DC2f that would turn into /.

Yes, I meant to say DC2F, sorry for the confusion.

> I'm happy to exclude that range from the mapping if POSIX really
> requires an encoding not to be overlapping with ASCII.

I think it has to be excluded from mapping in order to not introduce  
security issues.
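
To make the concern concrete, here is a rough Python 3 sketch. The function encode_escaped and its lowest parameter are hypothetical and only model the encoding side for ASCII text plus escape code points; they are not the handler's actual implementation.

```python
def encode_escaped(name: str, lowest: int) -> bytes:
    """Encode ASCII text, turning escape code points lowest..U+DCFF into bytes.

    lowest=0xDC00 models an unrestricted mapping, lowest=0xDC80 the
    restricted one. Hypothetical helper for illustration only.
    """
    out = bytearray()
    for ch in name:
        cp = ord(ch)
        if lowest <= cp <= 0xDCFF:
            out.append(cp - 0xDC00)
        elif cp < 0x80:
            out.append(cp)
        else:
            raise ValueError(f"refusing to encode U+{cp:04X}")
    return bytes(out)


name = "etc\udc2fpasswd"
print("/" in name)                   # False: a string-level check sees no separator
print(encode_escaped(name, 0xDC00))  # b'etc/passwd'

try:
    encode_escaped(name, 0xDC80)
except ValueError as err:
    print(err)                       # refusing to encode U+DC2F
```

With the unrestricted mapping, a name that passed a "contains no slash" check as a string becomes a path with a separator once it is encoded; with the restricted mapping the same name is rejected.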


There's also Shift-JIS to worry about, which apparently some people
actually want to use as their default encoding, despite it being
broken to do so. Red Hat apparently refuses to provide it as a locale
charset (due to its brokenness), and it's also not available by
default on my Debian system. Unfortunately, people do seem to use it
in real life.


So, I'd like to propose this:
The "python-escape" error handler when given a non-decodable byte from
0x80 to 0xFF will produce values of U+DC80 to U+DCFF. When given a
non-decodable byte from 0x00 to 0x7F, it will be converted to
U+0000-U+007F. On the encoding side, values from U+DC80 to U+DCFF are
encoded into 0x80 to 0xFF, and all other characters are treated in
whatever way the encoding would normally treat them.
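
As a rough sketch of the rule just stated, the following Python 3 code registers a codec error handler through codecs.register_error. The handler name "escape-sketch" and the function escape_sketch are invented for this illustration, and the code is an approximation of the proposal, not the actual "python-escape" implementation.

```python
import codecs


def escape_sketch(exc):
    """Error handler approximating the proposed rule (illustrative only)."""
    if isinstance(exc, UnicodeDecodeError):
        bad = exc.object[exc.start:exc.end]
        # 0x80-0xFF -> U+DC80-U+DCFF; 0x00-0x7F -> U+0000-U+007F
        text = "".join(chr(0xDC00 + b) if b >= 0x80 else chr(b) for b in bad)
        return text, exc.end
    if isinstance(exc, UnicodeEncodeError):
        bad = exc.object[exc.start:exc.end]
        # Only U+DC80-U+DCFF are turned back into raw bytes.
        if all(0xDC80 <= ord(ch) <= 0xDCFF for ch in bad):
            return bytes(ord(ch) - 0xDC00 for ch in bad), exc.end
    # Anything else is handled the way the encoding normally would: an error.
    raise exc


codecs.register_error("escape-sketch", escape_sketch)

raw = b"caf\xe9/menu"  # 0xE9 is not valid UTF-8 in this position
text = raw.decode("utf-8", "escape-sketch")
print(ascii(text))                                 # 'caf\udce9/menu'
print(text.encode("utf-8", "escape-sketch") == raw)  # True
```

The encoding branch deliberately rejects U+DC00-U+DC7F, so a code point such as U+DC2F raises UnicodeEncodeError instead of becoming "/".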

This proposal obviously works for all non-overlapping ASCII supersets,
where 0x00 to 0x7F always decode to U+0000 to U+007F. But it also works
for Shift-JIS and other similar ASCII supersets with overlaps in
trailing bytes of a multibyte sequence. So, a sequence like
"\x81\xFD".decode("shift-jis", "python-escape") will turn into
u"\uDC81\uDCFD" (both bytes are undecodable and in the 0x80-0xFF
range), which will then properly encode back into "\x81\xFD".

The character sets this *doesn't* work for are: EBCDIC code pages
(obviously completely unsuitable for a locale encoding on Unix),
iso2022-* (covered above), and shift_jisx0213 (because it has replaced
\ with yen, and ~ with overline).

If it's desirable to work with shift_jisx0213, a modification of the
proposal can be made: Change the second sentence to: "When given a
non-decodable byte from 0x00 to 0x7F, that byte must be the second or
later byte in a multibyte sequence. In such a case, the error handler
will produce the encoding of that byte if it was standing alone (thus
in most encodings, \x00-\x7f turn into U+00-U+7F)."
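
A minimal Python 3 sketch of that modified rule follows. The helper escape_variant is hypothetical; it takes the undecodable bytes and the encoding name directly rather than being wired into a codec, and it does not check that a low byte really was a trailing byte.

```python
def escape_variant(bad: bytes, encoding: str) -> str:
    """Map undecodable bytes per the modified rule (hypothetical helper)."""
    out = []
    for b in bad:
        if b >= 0x80:
            # High bytes still go to U+DC80-U+DCFF.
            out.append(chr(0xDC00 + b))
        else:
            # Low bytes decode as if they stood alone in this encoding.
            out.append(bytes([b]).decode(encoding))
    return "".join(out)


print(ascii(escape_variant(b"\x81\x2f", "ascii")))  # '\udc81/'
```

In an encoding where 0x5C stands for the yen sign, the standalone decode would give the yen character rather than a backslash, which is the point of the modification.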

It sounds from https://bugzilla.novell.com/show_bug.cgi?id=162501 like  
some people do actually use shift_jisx0213, unfortunately.

