[Python-3000] PEP 3131 - the details

James Y Knight foom at fuhm.net
Thu May 17 20:03:54 CEST 2007


I mentioned this in another thread as an aside in the middle of the  
email, but I thought I'd put it out here at the top:

We should consider whether formatting characters in identifiers  
should be ignored and, if so, which Unicode property should define  
that set of characters.

I notice that the excerpt from the C# standard says:
>     * 4 Any formatting-characters are removed.

I don't know exactly what they mean by that, but I'm going to guess  
they mean characters in the Cf (format) general category.

However, UAX #31 says:
> 2.2 Layout and Format Control Characters
>
> Certain Unicode characters are used to control joining behavior,  
> bidirectional ordering control, and alternative formats for  
> display. These have the General_Category value of Cf. Unlike space  
> characters or other delimiters, they do not indicate word, line, or  
> other unit boundaries.
>
> While it is possible to ignore these characters in determining  
> identifiers, the recommendation is to not ignore them and to not  
> permit them in identifiers except in special cases. This is because  
> of the possibility for confusion between two visually identical  
> strings; see [UTR36]. Some possible exceptions are the ZWJ and ZWNJ  
> in certain contexts, such as between certain characters in Indic  
> words.
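
To make the "visually identical strings" concern concrete, here is a
minimal Python sketch. It assumes the standard unicodedata module is an
acceptable way to look up General_Category; the helper name
format_chars is hypothetical and is not part of any proposal.

```python
import unicodedata

def format_chars(s):
    """Return the characters of s whose General_Category is Cf.

    Hypothetical helper, for illustration only.
    """
    return [c for c in s if unicodedata.category(c) == "Cf"]

plain = "name"
hidden = "na\u200dme"  # contains U+200D ZERO WIDTH JOINER, a Cf character

# The two strings usually render identically but compare unequal.
print(plain == hidden)  # False

# List which format characters are hiding in the second string.
print([unicodedata.name(c) for c in format_chars(hidden)])
```

If Cf characters were permitted and not ignored, these two spellings
would be distinct identifiers despite looking the same on screen.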

It doesn't seem to me that the attack vector described there is  
particularly relevant to source code, so perhaps going along with C#  
and ignoring Cf characters in the source code might be a good idea.  
But I do notice that Unicode 4.0.1 and earlier recommended ignoring  
formatting characters in identifiers (Ch 5 of the book), so that  
might simply be where C# got it from.

So, maybe it's better to keep the status quo, and not allow Cf  
characters, unless someone comes up with a particular need for doing  
so. Hm, I think I've convinced myself of that now. :)
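
For comparison, the two policies under discussion could be sketched
roughly as follows. This is only an illustration under the assumption
that "formatting characters" means General_Category Cf; the function
names strip_cf and reject_cf are hypothetical, and neither is meant as
the actual tokenizer behaviour.

```python
import unicodedata

def strip_cf(name):
    """C#-style policy (as I read it): drop Cf characters silently."""
    return "".join(c for c in name if unicodedata.category(c) != "Cf")

def reject_cf(name):
    """Status-quo policy: refuse any identifier containing a Cf character."""
    for c in name:
        if unicodedata.category(c) == "Cf":
            raise ValueError(
                "format character U+%04X not allowed in identifier" % ord(c)
            )
    return name

candidate = "na\u200dme"  # "name" with a ZERO WIDTH JOINER inside

# Ignoring: both spellings collapse to the same identifier.
print(strip_cf(candidate) == "name")  # True

# Rejecting: the hidden character is reported instead of being accepted.
try:
    reject_cf(candidate)
except ValueError as exc:
    print(exc)
```

The rejecting variant keeps the author informed that an invisible
character is present, which matches the UAX #31 recommendation quoted
above.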

James

