[Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
Glenn Linderman
v+python at g.nevcal.com
Thu Apr 30 09:29:36 CEST 2009
On approximately 4/29/2009 8:46 PM, came the following characters from
the keyboard of Terry Reedy:
> Glenn Linderman wrote:
>> On approximately 4/29/2009 1:28 PM, came the following characters from
>
>>> So where is the ambiguity here?
>>
>> None. But not everyone can read all the Python source code to try to
>> understand it; they expect the documentation to help them avoid that.
>> Because the documentation is lacking in this area, it makes your
>> concisely stated PEP rather hard to understand.
>
> If you think a section of the doc is grossly inadequate, and there is no
> existing issue on the tracker, feel free to add one.
>
>> Thanks for clarifying the Windows behavior, here. A little more
>> clarification in the PEP could have avoided lots of discussion. It
>> would seem that a PEP, proposed to modify a poorly documented (and
>> therefore likely poorly understood) area, should be educational about
>> the status quo, as well as presenting the suggested change.
>
> Where the PEP proposes to change, it should start with the status quo.
> But Martin's somewhat reasonable position is that since he is not
> proposing to change behavior on Windows, it is not his responsibility to
> document what he is not proposing to change more adequately. This
> means, of course, that any observed change on Windows would then be a
> bug, or at least a break of the promise. On the other hand, I can see
> that this is enough related to what he is proposing to change that
> better doc would help.

Yes; the very fact that the PEP discusses Windows, speaks about
cross-platform code, and doesn't explicitly state that no Windows
functionality will change is confusing.

An example of how to initialize things within a sample cross-platform
application might help, especially if that initialization only happens
if the platform is POSIX, or is commented to the effect that it has no
effect on Windows, but makes POSIX happy. Or maybe it is all buried
within the initialization of Python itself, and is not exposed to the
application at all. I still haven't figured that out, but was not (and
am still not) as concerned about that as ensuring that the overall
algorithms are functional and useful and user-friendly. Showing it
might have been helpful in making it clear that no Windows functionality
would change, however.

A statement that additional features are being added to allow
cross-platform programs to deal with non-decodable bytes obtained from
POSIX APIs using the same code that already works on Windows would have
made things much clearer. The present Abstract does, in fact, talk only
about POSIX, but later statements about Windows muddy the water.
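
For readers who want the POSIX-side behavior spelled out, here is a minimal
sketch of the round trip. It assumes the handler name "surrogateescape",
which is what released Python 3 calls the handler this draft refers to as
python-escape; the sample bytes are made up for illustration.

```python
# Sketch only: "surrogateescape" is the handler name in released Python 3;
# the PEP draft under discussion calls it python-escape.
raw = b"caf\xff.txt"  # 0xFF can never appear in valid UTF-8

# Decoding maps the undecodable byte 0xFF to the lone surrogate U+DCFF.
name = raw.decode("utf-8", "surrogateescape")
assert name == "caf\udcff.txt"

# Encoding with the same handler restores the original bytes exactly.
restored = name.encode("utf-8", "surrogateescape")
assert restored == raw
```

Nothing in this sketch is Windows-specific; it only shows what a program
would see for bytes that arrive from a byte-oriented API.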

Rationale paragraph 3 explicitly talks about cross-platform programs
needing to work one way on Windows and another way on POSIX to deal with
all the cases. It calls that a proposal, which I guess it is for
command line and environment, but it is already implemented in both
bytes and str forms for file names... so that further muddies the water.

It is, of course, easier to point out deficiencies in a document than to
write a better document; however, it is incumbent upon the PEP author to
write a PEP that is good enough to get approved, and that means making
it understandable enough that people are in favor... or to respond to
the plethora of comments until people are in favor. I'm not sure which
one is more time-consuming.

I've reached the point, based on PEP and comment responses, where I now
believe that the PEP is a solution to the problem it is trying to solve,
and doesn't create ambiguities in the naming. I don't believe it is the
best solution.

The basic problem is the overuse of fake characters... normalizing them
for display results in large data loss -- many characters would be
translated to the same replacement characters.
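
To make that data loss concrete, here is a small sketch. It assumes the
lone-surrogate representation and the handler name "surrogateescape" from
released Python 3; for_display is a hypothetical helper, not anything the
PEP defines.

```python
def for_display(s: str) -> str:
    # Hypothetical display normalization: anything that cannot be encoded
    # (here, the lone surrogates) is replaced by "?".
    return s.encode("utf-8", "replace").decode("utf-8")

# Two different file names, differing only in one undecodable byte.
a = b"file\x80".decode("utf-8", "surrogateescape")
b = b"file\x81".decode("utf-8", "surrogateescape")

assert a != b                            # distinct as Python strings
assert for_display(a) == for_display(b)  # identical once normalized
assert for_display(a) == "file?"
```

After normalization the user can no longer tell the two names apart.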

Solutions exist that would allow the use of fewer different fake
characters in the strings, while still having a fake character as the
escape character, to preserve the invariant that all the strings
manipulated by python-escape from the PEP were, and become, strings
containing fake characters (from a strict Unicode perspective), which is
a nice invariant*. There even exist solutions that would use only one
fake character (repeatedly if necessary), and all other characters
generated would be displayable characters. This would ease the burden
on the program in displaying the strings, and also on the user that
might view the resulting mojibake in trying to differentiate one such
string from another. Those are outlined in various emails in this
thread, although some include my misconception that strings obtained via
Unicode-enabled OS APIs would also need to be encoded and altered. If
there is any interest in using a more readable encoding, I'd be glad to
rework them to remove those misconceptions.
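
As a rough illustration of the single-fake-character idea, and not a
restatement of any specific proposal from this thread, the sketch below
escapes each undecodable byte as one escape character followed by two
displayable hex digits. The choice of U+DC00 as the escape, the handler
name "one-escape-sketch", and the function names are all assumptions made
up for this example.

```python
import codecs

ESC = "\udc00"  # assumed escape: the only "fake" character ever emitted

def _escape_bad_bytes(err):
    # Decode error handler: each undecodable byte becomes ESC + 2 hex digits.
    if not isinstance(err, UnicodeDecodeError):
        raise err
    bad = err.object[err.start:err.end]
    return "".join(ESC + format(byte, "02x") for byte in bad), err.end

codecs.register_error("one-escape-sketch", _escape_bad_bytes)

def decode_name(raw: bytes) -> str:
    return raw.decode("utf-8", "one-escape-sketch")

def encode_name(name: str) -> bytes:
    # Inverse: ESC + 2 hex digits goes back to the single original byte.
    out = bytearray()
    i = 0
    while i < len(name):
        if name[i] == ESC:
            out.append(int(name[i + 1:i + 3], 16))
            i += 3
        else:
            out += name[i].encode("utf-8")
            i += 1
    return bytes(out)

raw = b"file\x80\xff.txt"
shown = decode_name(raw)
assert shown == "file" + ESC + "80" + ESC + "ff.txt"
assert encode_name(shown) == raw
```

Because strict UTF-8 decoding never yields a lone surrogate on its own, ESC
can only come from the handler, so the mapping stays reversible in this
sketch; the cost is that escaped names get longer.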

* It would be nice to point out that invariant in the PEP, also.
--
Glenn -- http://nevcal.com/
===========================
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking