[Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
v+python at g.nevcal.com
Thu Apr 30 09:58:16 CEST 2009
On approximately 4/29/2009 7:50 PM, came the following characters from
the keyboard of Aahz:
> On Thu, Apr 30, 2009, Cameron Simpson wrote:
>> The lengthy discussion mostly revolves around:
>> - Glenn points out that strings that came _not_ from listdir, and that are
>> _not_ well-formed unicode (== "have bare surrogates in them") but that
>> were intended for use as filenames will conflict with the PEP's scheme -
>> programs must know that these strings came from outside and must be
>> translated into the PEP's funny-encoding before use in the os.*
>> functions. Previous to the PEP they would get used directly and
>> encode differently after the PEP, thus producing different POSIX
>> filenames. Breakage.
>> - Glenn would like the encoding to use Unicode scalar values only,
>> using a rare-in-filenames character.
>> That would avoid the issue with "outside" strings that contain
>> surrogates. To my mind it just moves the punning from rare illegal
>> strings to merely uncommon but legal characters.
>> - Some parties think it would be better to not return strings from
>> os.listdir but a subclass of string (or at least a duck-type of
>> string) that knows where it came from and is also handily
>> recognisable as not-really-a-string for purposes of deciding
>> whether it is PEP-funny-encoded by direct inspection.
> Assuming people agree that this is an accurate summary, it should be
> incorporated into the PEP.
I'll agree that, once other misconceptions were explained away, the
remaining issues are those Cameron summarized. Thanks for the summary!
Point two could be modified because I've changed my opinion. I like the
invariant that Cameron first (I think) explicitly stated about the PEP as
it stands, and that I just reworded in another message: the strings that
the PEP alters, in either direction, all fall in the subset of strings
that contain fake (from a strict Unicode viewpoint) characters.
I still think an encoding that uses mostly real characters that have
assigned glyphs would be better than the encoding in the PEP; but would
now suggest that an escape character be a fake character.
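
To make the invariant concrete, here is a minimal Python sketch. It
assumes the PEP's scheme behaves like the "surrogateescape" error handler
that later shipped in Python 3.1, and that the file system encoding is
UTF-8; the helper names to_str, to_bytes and has_fake_chars are made up
for illustration and appear nowhere in the PEP.

```python
# Assumption: the PEP's scheme is modelled by the "surrogateescape"
# error handler, with UTF-8 standing in for the file system encoding.

def to_str(raw: bytes) -> str:
    """Bytes from the system to str, escaping undecodable bytes."""
    return raw.decode("utf-8", "surrogateescape")

def to_bytes(text: str) -> bytes:
    """str back to bytes, turning escaped characters into raw bytes."""
    return text.encode("utf-8", "surrogateescape")

def has_fake_chars(text: str) -> bool:
    """True if text holds a lone surrogate in U+DC80..U+DCFF."""
    return any(0xDC80 <= ord(ch) <= 0xDCFF for ch in text)

# A well-formed string is untouched in both directions.
well_formed = "caf\u00e9.txt"
assert to_bytes(well_formed) == well_formed.encode("utf-8")
assert to_str(well_formed.encode("utf-8")) == well_formed

# A name that is not valid UTF-8 (Latin-1 bytes here) is altered,
# and the result contains a fake character.
undecodable = b"caf\xe9.txt"
decoded = to_str(undecodable)
assert has_fake_chars(decoded)
assert to_bytes(decoded) == undecodable  # the round trip is lossless
```

Only strings for which has_fake_chars is true are treated differently
from a plain strict UTF-8 conversion, which is the property I like.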
I'll note here that while the PEP encoding causes illegal bytes to be
translated to one fake character, the 3-byte sequence that looks like
the range of fake characters would also be translated to a sequence of 3
fake characters. This is 512 combinations that must be translated, and
understood by the user (or at least by the programmer). The "escape
sequence" approach requires changing only 257 combinations, and each
altered combination would result in exactly 2 characters. Hence, this
seems simpler to understand, and to manually encode and decode for
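
The one-character versus three-character behaviour of the PEP encoding
can be seen in a short Python sketch. As an assumption, it again uses the
"surrogateescape" error handler from Python 3.1 and later as a stand-in
for the PEP's scheme, with UTF-8 as the file system encoding; pep_decode
and pep_encode are hypothetical helper names. It does not attempt to
model the escape-sequence alternative.

```python
# Assumption: "surrogateescape" over UTF-8 stands in for the PEP's scheme.

def pep_decode(raw: bytes) -> str:
    return raw.decode("utf-8", "surrogateescape")

def pep_encode(text: str) -> bytes:
    return text.encode("utf-8", "surrogateescape")

# One illegal byte becomes exactly one fake character.
assert pep_decode(b"\x80") == "\udc80"

# The 3-byte sequence a lenient UTF-8 encoder would emit for U+DC80.
looks_like_fake = "\udc80".encode("utf-8", "surrogatepass")
assert looks_like_fake == b"\xed\xb2\x80"

# The decoder rejects that sequence, so each byte is escaped separately
# and the result is three fake characters rather than one.
decoded = pep_decode(looks_like_fake)
assert decoded == "\udced\udcb2\udc80"
assert len(decoded) == 3

# Encoding restores the original three bytes.
assert pep_encode(decoded) == looks_like_fake
```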
Glenn -- http://nevcal.com/
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking