[Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

Glenn Linderman v+python at g.nevcal.com
Thu Apr 30 09:29:36 CEST 2009


On approximately 4/29/2009 8:46 PM, came the following characters from 
the keyboard of Terry Reedy:
> Glenn Linderman wrote:
>> On approximately 4/29/2009 1:28 PM, came the following characters from 
> 
>>> So where is the ambiguity here?
>>
>> None.  But not everyone can read all the Python source code to try to 
>> understand it; they expect the documentation to help them avoid that. 
>> Because the documentation is lacking in this area, it makes your 
>> concisely stated PEP rather hard to understand.
> 
> If you think a section of the doc is grossly inadequate, and there is no 
> existing issue on the tracker, feel free to add one.
> 
>> Thanks for clarifying the Windows behavior, here.  A little more 
>> clarification in the PEP could have avoided lots of discussion.  It 
>> would seem that a PEP, proposed to modify a poorly documented (and 
>> therefore likely poorly understood) area, should be educational about 
>> the status quo, as well as presenting the suggested change.
> 
> Where the PEP proposes to change, it should start with the status quo. 
> But Martin's somewhat reasonable position is that since he is not 
> proposing to change behavior on Windows, it is not his responsibility to 
> document what he is not proposing to change more adequately.  This 
> means, of course, that any observed change on Windows would then be a 
> bug, or at least a break of the promise.  On the other hand, I can see 
> that this is enough related to what he is proposing to change that 
> better doc would help.


Yes; the very fact that the PEP discusses Windows, speaks about 
cross-platform code, and doesn't explicitly state that no Windows 
functionality will change is confusing.

An example of how to initialize things within a sample cross-platform 
application might help, especially if that initialization only happens 
if the platform is POSIX, or is commented to the effect that it has no 
effect on Windows, but makes POSIX happy.  Or maybe it is all buried 
within the initialization of Python itself, and is not exposed to the 
application at all.  I still haven't figured that out, but was not (and 
am still not) as concerned about that as ensuring that the overall 
algorithms are functional and useful and user-friendly.  Showing it 
might have been helpful in making it clear that no Windows functionality 
would change, however.

A statement that additional features are being added to allow 
cross-platform programs to deal with non-decodable bytes obtained from 
POSIX APIs using the same code that already works on Windows would have 
made things much clearer.  The present Abstract does, in fact, talk only 
about POSIX, but later statements about Windows muddy the water.
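
To make concrete the kind of example I mean, here is a minimal sketch of 
the round trip that such code relies on.  It assumes the PEP's error 
handler is available under the name "surrogateescape" (the spelling used 
in released Python 3; the draft under discussion calls it python-escape). 
The helper names to_text and to_bytes are mine, purely for illustration, 
and the sketch hard-codes UTF-8 rather than asking for the real file 
system encoding.

```python
# Sketch only: round-trip a non-decodable POSIX file name through str.
# Assumption: the PEP's handler is spelled "surrogateescape".

def to_text(raw: bytes, encoding: str = "utf-8") -> str:
    """Decode bytes; each undecodable byte becomes one lone surrogate."""
    return raw.decode(encoding, errors="surrogateescape")


def to_bytes(text: str, encoding: str = "utf-8") -> bytes:
    """Encode str; each escaped surrogate turns back into its byte."""
    return text.encode(encoding, errors="surrogateescape")


raw_name = b"caf\xe9.txt"        # Latin-1 bytes, not valid UTF-8
name = to_text(raw_name)

assert name == "caf\udce9.txt"   # byte 0xE9 became the "fake" U+DCE9
assert to_bytes(name) == raw_name
```

Nothing in the sketch is platform-specific, which is the point: the 
application handles str everywhere, and only the POSIX side ever 
produces the escaped characters.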

Rationale paragraph 3 explicitly talks about cross-platform programs 
needing to work one way on Windows and another way on POSIX to deal with 
all the cases.  It calls that a proposal, which I guess it is for 
command line and environment, but it is already implemented in both 
bytes and str forms for file names... so that further muddies the water.

It is, of course, easier to point out deficiencies in a document than to 
write a better document; however, it is incumbent upon the PEP author to 
write a PEP that is good enough to get approved, and that means making 
it understandable enough that people are in favor... or to respond to 
the plethora of comments until people are in favor.  I'm not sure which 
one is more time-consuming.

I've reached the point, based on PEP and comment responses, where I now 
believe that the PEP is a solution to the problem it is trying to solve, 
and doesn't create ambiguities in the naming.  I don't believe it is the 
best solution.

The basic problem is the overuse of fake characters... normalizing them 
for display results in large data loss -- many characters would be 
translated to the same replacement characters.
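
A small sketch of that loss, assuming the handler is spelled 
"surrogateescape" as in released Python 3, and assuming a display 
routine (the hypothetical for_display below) that maps every lone 
surrogate to U+FFFD.  A real program might normalize differently, but 
any many-to-one mapping has the same effect.

```python
# Sketch only: two distinct names become indistinguishable once the
# escaped bytes are normalized for display.

def for_display(name: str) -> str:
    """Replace each lone surrogate (U+D800..U+DFFF) with U+FFFD."""
    return "".join(
        "\ufffd" if 0xD800 <= ord(ch) <= 0xDFFF else ch for ch in name
    )


first = b"report-\xe9.txt".decode("utf-8", "surrogateescape")
second = b"report-\xe8.txt".decode("utf-8", "surrogateescape")

assert first != second                             # distinct as str
assert for_display(first) == for_display(second)   # identical on screen
```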

Solutions exist that would allow the use of fewer different fake 
characters in the strings, while still having a fake character as the 
escape character, to preserve the invariant that all the strings 
manipulated by python-escape from the PEP were, and become, strings 
containing fake characters (from a strict Unicode perspective), which is 
a nice invariant*.  There even exist solutions that would use only one 
fake character (repeatedly if necessary), and all other characters 
generated would be displayable characters.  This would ease the burden 
on the program in displaying the strings, and also on the user that 
might view the resulting mojibake in trying to differentiate one such 
string from another.  Those are outlined in various emails in this 
thread, although some include my misconception that strings obtained via 
Unicode-enabled OS APIs would also need to be encoded and altered.  If 
there is any interest in using a more readable encoding, I'd be glad to 
rework them to remove those misconceptions.

* It would be nice to point out that invariant in the PEP, also.
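
For illustration only, here is one way a single-escape-character scheme 
could look.  This is a sketch under my own assumptions, not necessarily 
any of the schemes from the earlier emails: the escape character U+DC00, 
the handler name "one-escape-sketch", and the helpers decode_name and 
encode_name are all hypothetical.  Each undecodable byte becomes the 
escape character followed by two ordinary hex digits, so the only fake 
character that ever appears is the escape itself.

```python
import codecs

ESC = "\udc00"  # hypothetical choice of the single fake escape character


def _escape_bad_bytes(err):
    """Decode error handler: emit ESC plus two hex digits per bad byte."""
    if not isinstance(err, UnicodeDecodeError):
        raise err
    bad = err.object[err.start:err.end]
    return "".join(ESC + format(b, "02x") for b in bad), err.end


codecs.register_error("one-escape-sketch", _escape_bad_bytes)


def decode_name(raw, encoding="utf-8"):
    return raw.decode(encoding, errors="one-escape-sketch")


def encode_name(text, encoding="utf-8"):
    """Inverse of decode_name; assumes text was produced by decode_name."""
    out = bytearray()
    i = 0
    while i < len(text):
        if text[i] == ESC:
            # The two characters after ESC are the hex value of the byte.
            out.append(int(text[i + 1:i + 3], 16))
            i += 3
        else:
            j = text.find(ESC, i)
            if j == -1:
                j = len(text)
            out += text[i:j].encode(encoding)
            i = j
    return bytes(out)


name = decode_name(b"caf\xe9.txt")
assert name == "caf\udc00e9.txt"       # one fake char, then readable "e9"
assert encode_name(name) == b"caf\xe9.txt"
```

The sketch leans on the fact that a strict UTF-8 decode never yields a 
lone surrogate, so the escape character cannot collide with decoded 
text.  It makes no attempt to handle strings from other sources that 
already contain U+DC00.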


-- 
Glenn -- http://nevcal.com/
===========================
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking

