On 25.08.2011 11:39, Stephen J. Turnbull wrote:
> "Martin v. Löwis" writes:
> > No, that's explicitly *not* what C6 says. Instead, it says that a process that treats s1 and s2 differently shall not assume that others will do the same, i.e. that it is ok to treat them the same even though they have different code points. Treating them differently is also conforming.
> Then what requirement does C6 impose, in your opinion?
In IETF terminology, it's a weak SHOULD requirement: unless there are reasons not to, canonically equivalent strings should be treated the same. It's a weak requirement because reasons not to treat them the same are widespread.
> - Ideally, an implementation would *always* interpret two canonical-equivalent sequences *identically*. There are practical circumstances under which implementations may reasonably distinguish them. (Emphasis mine.)
Ok, so let me put emphasis on *ideally*. They acknowledge that for practical reasons, the equivalent strings may need to be distinguished.
> The examples given are things like "inspecting memory representation structure" (which properly speaking is really outside of Unicode conformance) and "ignoring collation behavior of combining sequences outside the repertoire of a specified language." That sounds like "Special cases aren't special enough to break the rules. Although practicality beats purity." to me. Treating things differently is an exceptional case that requires sufficient justification.
And the common justification is efficiency, along with the desire to support the representation of unnormalized strings (else there would be an efficient implementation).
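The distinction being debated can be shown in a few lines; this is just an illustrative sketch using the stdlib unicodedata module, not code from the thread:

```python
import unicodedata

# "é" as one precomposed code point (NFC) vs. "e" + combining acute (NFD).
nfc = "\u00e9"
nfd = "e\u0301"

# str comparison is code-point-wise, so the canonically equivalent
# sequences compare unequal -- the efficient behavior discussed here.
assert nfc != nfd

# Normalizing both sides to the same form makes them compare equal.
assert unicodedata.normalize("NFC", nfd) == nfc
assert unicodedata.normalize("NFD", nfc) == nfd
```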
> If our process is working with an external process (the OS's file system driver) whose definition includes the statement that "File names are sequences of Unicode characters", then C6 says our process must compare canonically equivalent sequences that it takes to be file names as the same, whether or not they are in the same normalized form, or normalized at all, because we can't assume the file system will treat them as different.
It may well happen that this requirement is met in a plain Python application. If the file system and GUI libraries always return NFD strings, then the Python process *will* compare equivalent sequences correctly (since it won't ever get any other representations).
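The same holds if the application normalizes at its boundaries itself; a minimal sketch, where `from_external` is a hypothetical helper name, not an API from the thread:

```python
import unicodedata

def from_external(s: str, form: str = "NFD") -> str:
    """Hypothetical boundary helper: normalize every externally
    supplied string (file names, GUI input, ...) on the way in."""
    return unicodedata.normalize(form, s)

# Two canonically equivalent spellings of "café" arriving from outside:
a = from_external("caf\u00e9")   # precomposed é
b = from_external("cafe\u0301")  # e + combining acute
assert a == b  # equal once everything is in one normal form
```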
> *Users* will certainly take the viewpoint that two strings that display the same on their monitor should identify the same file when they use them as file names.
Yes, but that's the operating system's choice first of all. Some operating systems do allow file names in a single directory that are equivalent yet use different code points. Python then needs to support this operating system, despite the permission of the Unicode standard to ignore the difference.
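Concretely, the two spellings encode to distinct byte sequences, which is why a file system that compares names byte-for-byte (as POSIX file systems typically do) can hold both in one directory; a sketch, noting that actual behavior is file-system dependent (HFS+ on Mac OS X, for instance, normalizes names):

```python
import os

# Two canonically equivalent names built from different code points.
nfc_name = "caf\u00e9"    # ... U+00E9
nfd_name = "cafe\u0301"   # ... U+0065 U+0301

# os.fsencode applies the file system encoding (typically UTF-8);
# the encoded names differ, so a byte-comparing file system sees
# two distinct entries.
assert os.fsencode(nfc_name) != os.fsencode(nfd_name)
```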
> I'm simply saying that the current implementation of strings, as improved by PEP 393, can not be said to be conforming.
I continue to disagree. The Unicode standard deliberately allows Python's behavior as conforming.
> I would like to see something much more conformant done as a separate library (the Python Components for Unicode, say), intended to support users who need character-based behavior, Unicode-ly correct collation, etc., more than efficiency.
Wrt. normalization, I think all that's needed is already there. Applications just need to normalize all strings to a normal form of their liking, and be done. That's easier than using a separate library throughout the code base (let alone using yet another string type).

Regards,
Martin