On 25.08.2011 11:39, Stephen J. Turnbull wrote:
> "Martin v. Löwis" writes:
> > No, that's explicitly *not* what C6 says. Instead, it says that a process that treats s1 and s2 differently shall not assume that others will do the same, i.e. that it is ok to treat them the same even though they have different code points. Treating them differently is also conforming.
> Then what requirement does C6 impose, in your opinion?
In IETF terminology, it's a weak SHOULD requirement: unless there are reasons not to, canonically equivalent strings should be treated the same. It's a weak requirement because reasons not to treat them the same are widespread.
> - Ideally, an implementation would *always* interpret two canonical-equivalent sequences *identically*. There are practical circumstances under which implementations may reasonably distinguish them. (Emphasis mine.)
Ok, so let me put emphasis on *ideally*. They acknowledge that for practical reasons, the equivalent strings may need to be distinguished.
> The examples given are things like "inspecting memory representation structure" (which properly speaking is really outside of Unicode conformance) and "ignoring collation behavior of combining sequences outside the repertoire of a specified language." That sounds like "Special cases aren't special enough to break the rules. Although practicality beats purity." to me. Treating things differently is an exceptional case that requires sufficient justification.
And the common justification is efficiency, along with the desire to support the representation of unnormalized strings (else there would be an efficient implementation).
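The distinction being debated can be shown in a few lines; this is just an illustrative sketch using the stdlib unicodedata module, not code from the thread:

```python
import unicodedata

# "é" as one precomposed code point (NFC) vs. "e" + combining acute (NFD).
nfc = "\u00e9"
nfd = "e\u0301"

# str comparison is code-point-wise, so the canonically equivalent
# sequences compare unequal -- the efficient behavior discussed here.
assert nfc != nfd

# Normalizing both sides to the same form makes them compare equal.
assert unicodedata.normalize("NFC", nfd) == nfc
assert unicodedata.normalize("NFD", nfc) == nfd
```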
> If our process is working with an external process (the OS's file system driver) whose definition includes the statement that "File names are sequences of Unicode characters", then C6 says our process must compare canonically equivalent sequences that it takes to be file names as the same, whether or not they are in the same normalized form, or normalized at all, because we can't assume the file system will treat them as different.
It may well happen that this requirement is met in a plain Python application. If the file system and GUI libraries always return NFD strings, then the Python process *will* compare equivalent sequences correctly (since it won't ever get any other representations).
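The same holds if the application normalizes at its boundaries itself; a minimal sketch, where `from_external` is a hypothetical helper name, not an API from the thread:

```python
import unicodedata

def from_external(s: str, form: str = "NFD") -> str:
    """Hypothetical boundary helper: normalize every externally
    supplied string (file names, GUI input, ...) on the way in."""
    return unicodedata.normalize(form, s)

# Two canonically equivalent spellings of "café" arriving from outside:
a = from_external("caf\u00e9")   # precomposed é
b = from_external("cafe\u0301")  # e + combining acute
assert a == b  # equal once everything is in one normal form
```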
> *Users* will certainly take the viewpoint that two strings that display the same on their monitor should identify the same file when they use them as file names.
Yes, but that's the operating system's choice first of all. Some operating systems do allow file names in a single directory that are equivalent yet use different code points. Python then needs to support this operating system, despite the permission of the Unicode standard to ignore the difference.
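Concretely, the two spellings encode to distinct byte sequences, which is why a file system that compares names byte-for-byte (as POSIX file systems typically do) can hold both in one directory; a sketch, noting that actual behavior is file-system dependent (HFS+ on Mac OS X, for instance, normalizes names):

```python
import os

# Two canonically equivalent names built from different code points.
nfc_name = "caf\u00e9"    # ... U+00E9
nfd_name = "cafe\u0301"   # ... U+0065 U+0301

# os.fsencode applies the file system encoding (typically UTF-8);
# the encoded names differ, so a byte-comparing file system sees
# two distinct entries.
assert os.fsencode(nfc_name) != os.fsencode(nfd_name)
```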
> I'm simply saying that the current implementation of strings, as improved by PEP 393, can not be said to be conforming.
I continue to disagree. The Unicode standard deliberately allows Python's behavior as conforming.
> I would like to see something much more conformant done as a separate library (the Python Components for Unicode, say), intended to support users who need character-based behavior, Unicode-ly correct collation, etc., more than efficiency.
Wrt. normalization, I think all that's needed is already there. Applications just need to normalize all strings to a normal form of their liking, and be done. That's easier than using a separate library throughout the code base (let alone using yet another string type).

Regards,
Martin