[Python-Dev] PEP 393 Summer of Code Project

"Martin v. Löwis" martin at v.loewis.de
Thu Aug 25 11:57:53 CEST 2011


On 25.08.2011 11:39, Stephen J. Turnbull wrote:
> "Martin v. Löwis" writes:
> 
>  > No, that's explicitly *not* what C6 says. Instead, it says that a
>  > process that treats s1 and s2 differently shall not assume that others
>  > will do the same, i.e. that it is ok to treat them the same even though
>  > they have different code points. Treating them differently is also
>  > conforming.
> 
> Then what requirement does C6 impose, in your opinion? 

In IETF terminology, it's a weak SHOULD requirement. Unless there are
reasons not to, canonically equivalent strings should be treated
identically. It's a weak requirement because the reasons not to treat
them as equivalent are widespread.
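
For illustration, a minimal sketch (standard library only) of two
canonically equivalent strings that a Python process treats
differently unless it normalizes them first:

    import unicodedata

    # Two canonically equivalent spellings of "é": precomposed U+00E9
    # versus "e" followed by the combining acute accent U+0301.
    s1 = "\u00e9"
    s2 = "e\u0301"

    print(s1 == s2)                               # False: code points differ
    print(unicodedata.normalize("NFC", s1) ==
          unicodedata.normalize("NFC", s2))       # True: equivalent after NFC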

> - Ideally, an implementation would *always* interpret two
>   canonical-equivalent sequences *identically*.  There are practical
>   circumstances under which implementations may reasonably distinguish
>   them.  (Emphasis mine.)

Ok, so let me put emphasis on *ideally*. They acknowledge that for
practical reasons, the equivalent strings may need to be
distinguished.

> The examples given are things like "inspecting memory representation
> structure" (which properly speaking is really outside of Unicode
> conformance) and "ignoring collation behavior of combining sequences
> outside the repertoire of a specified language."  That sounds like
> "Special cases aren't special enough to break the rules. Although
> practicality beats purity." to me.  Treating things differently is an
> exceptional case, that requires sufficient justification.

And the common justification is efficiency, along with the desire
to support the representation of unnormalized strings (if unnormalized
strings did not need to be representable, normalizing everything on
creation would already give an efficient implementation).

> If our process is working with an external process (the OS's file
> system driver) whose definition includes the statement that "File
> names are sequences of Unicode characters", then C6 says our process
> must compare canonically equivalent sequences that it takes to be file
> names as the same, whether or not they are in the same normalized
> form, or normalized at all, because we can't assume the file system
> will treat them as different.

It may well happen that this requirement is met in a plain Python
application. If the file system and GUI libraries always return
NFD strings, then the Python process *will* compare equivalent
sequences correctly (since it won't ever get any other
representations).
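
For file names that come from mixed sources, an application can also
make the comparison robust itself; a sketch, with a hypothetical
helper and NFD chosen arbitrarily:

    import unicodedata

    def same_file_name(a, b, form="NFD"):
        # Hypothetical helper: compare two file names by canonical
        # equivalence, normalizing both to the same form first. If the
        # platform already hands us a single form (e.g. NFD everywhere),
        # plain == would suffice.
        return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)

    print(same_file_name("caf\u00e9.txt", "cafe\u0301.txt"))   # True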

> *Users* will certainly take the viewpoint that two strings that
> display the same on their monitor should identify the same file when
> they use them as file names.

Yes, but that's the operating system's choice first of all.
Some operating systems do allow file names in a single directory
that are canonically equivalent yet use different code points. Python
then needs to support such operating systems, despite the Unicode
standard's permission to ignore the difference.
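
As an illustration of what such a file system can hand to Python, a
sketch (hypothetical directory path, standard library only) that
groups directory entries by canonical equivalence; any group with more
than one member holds names that differ only in their code point
sequence:

    import os
    import unicodedata
    from collections import defaultdict

    def equivalent_name_groups(directory):
        # Group directory entries that are canonically equivalent but
        # stored with different code point sequences.
        groups = defaultdict(list)
        for name in os.listdir(directory):
            groups[unicodedata.normalize("NFC", name)].append(name)
        return {key: names for key, names in groups.items() if len(names) > 1}

    # e.g. equivalent_name_groups("/some/directory")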

> I'm simply saying that the current
> implementation of strings, as improved by PEP 393, can not be said to
> be conforming.

I continue to disagree. The Unicode standard deliberately allows
Python's behavior as conforming.

> I would like to see something much more conformant done as a separate
> library (the Python Components for Unicode, say), intended to support
> users who need character-based behavior, Unicode-ly correct collation,
> etc., more than efficiency.

Wrt. normalization, I think all that's needed is already there.
Applications just need to normalize all strings to a normal form of
their liking, and be done. That's easier than using a separate library
throughout the code base (let alone using yet another string type).
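
Concretely, that boils down to a small wrapper at every input
boundary; a sketch, with the choice of NFC being arbitrary:

    import unicodedata

    NORMAL_FORM = "NFC"    # whatever form the application standardizes on

    def from_outside(text):
        # Normalize every string as it enters the application, so that
        # all internal comparisons can use plain ==.
        return unicodedata.normalize(NORMAL_FORM, text)

    name = from_outside("cafe\u0301")    # decomposed input
    assert name == "caf\u00e9"           # compares equal internally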

Regards,
Martin

