[Python-3000] PEP 3131 - the details
James Y Knight
foom at fuhm.net
Thu May 17 20:03:54 CEST 2007
I mentioned this in another thread as an aside in the middle of the
email, but I thought I'd put it out here at the top:
We should decide whether formatting characters in identifiers should
be ignored and, if so, which Unicode property defines the set of
characters to ignore.
I notice that the excerpt from the C# standard says:
> * 4 Any formatting-characters are removed.
I don't know exactly what they mean by that, but my guess is the
characters in the Cf (format) general category.
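
For illustration, here is a minimal Python sketch of what "removing
formatting characters" could look like if that guess is right. The
helper name strip_cf is hypothetical, and equating
"formatting-characters" with General_Category Cf is an assumption, not
something the C# excerpt confirms.

```python
import unicodedata

def strip_cf(name):
    """Return name with every General_Category=Cf character removed.

    Assumption: "formatting characters" means exactly the Cf category.
    """
    return "".join(ch for ch in name if unicodedata.category(ch) != "Cf")

# U+200D ZERO WIDTH JOINER and U+00AD SOFT HYPHEN are both in Cf,
# so both spellings collapse to the same identifier.
assert strip_cf("foo\u200dbar") == "foobar"
assert strip_cf("foo\u00adbar") == "foobar"
```

Under this policy, two source spellings that differ only in Cf
characters would name the same identifier.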
However, UAX #31 says:
> 2.2 Layout and Format Control Characters
>
> Certain Unicode characters are used to control joining behavior,
> bidirectional ordering control, and alternative formats for
> display. These have the General_Category value of Cf. Unlike space
> characters or other delimiters, they do not indicate word, line, or
> other unit boundaries.
>
> While it is possible to ignore these characters in determining
> identifiers, the recommendation is to not ignore them and to not
> permit them in identifiers except in special cases. This is because
> of the possibility for confusion between two visually identical
> strings; see [UTR36]. Some possible exceptions are the ZWJ and ZWNJ
> in certain contexts, such as between certain characters in Indic
> words.
The attack vector (two visually identical identifiers) doesn't seem
particularly relevant to source code, so following C# and ignoring Cf
characters might be a good idea. However, Unicode 4.0.1 and earlier
recommended ignoring formatting characters in identifiers (Ch. 5 of
the book), which may be where C# got its rule, whereas UAX #31 now
recommends against ignoring them.
So, maybe it's better to keep the status quo, and not allow Cf
characters, unless someone comes up with a particular need for doing
so. Hm, I think I've convinced myself of that now. :)
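
The status quo of not allowing Cf characters can be sketched as a
simple check. This is only an illustration: find_cf and
check_identifier are hypothetical names, not part of any existing
tokenizer, and the check looks at Cf characters only, not at the rest
of the identifier rules.

```python
import unicodedata

def find_cf(name):
    """Return (index, character name) for each Cf character in name."""
    return [(i, unicodedata.name(ch, "<unnamed>"))
            for i, ch in enumerate(name)
            if unicodedata.category(ch) == "Cf"]

def check_identifier(name):
    """Raise ValueError if name contains a format control character."""
    hits = find_cf(name)
    if hits:
        raise ValueError("format control characters not allowed: %r" % (hits,))
    return name

# The two spellings render alike but are different strings, which is
# the confusion UAX #31 warns about.
plain = "ab"
with_zwnj = "a\u200cb"   # U+200C ZERO WIDTH NON-JOINER
assert plain != with_zwnj

check_identifier(plain)          # accepted
try:
    check_identifier(with_zwnj)
except ValueError:
    pass                         # rejected rather than silently merged
```

Rejecting, rather than stripping, means a stray Cf character is
reported as an error instead of being silently dropped.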
James