[Python-3000] pep 3131 again

Thu May 17 20:42:56 CEST 2007

> 2.  Python forbids these characters.  Martin, JavaScript
> treats these specially, and I think Python probably
> should, too:
> 
> The ECMAScript 3 standard for JavaScript requires the
> tokenizer to throw away all Unicode format-control characters
> (general category Cf).
> 
> ECMAScript 4 will likely tweak this (an incompatible change)
> to retain those characters only in strings and regexps.
> I like that better.

I've added this as an open issue. It would be easy to add,
but I would like to get some confirmation first that it
actually helps writers of the RTL languages (preferably
from some native speakers).

The proposed change would be that Cf characters would be
allowed *only* in and immediately around identifiers, and
in string literals and comments, i.e. the scanner would
work this way:

- perform token classification only based on individual
  ASCII letters; classify all non-ASCII letters as potential
  identifiers.
- for potential identifiers (i.e. runs of
  non-ASCII characters and ASCII letters, digits, and
  underscore), drop Cf characters, then verify identifier
  syntax.

IOW, you couldn't put the formatting characters around
whitespace, keywords, or punctuation.
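
As a rough sketch of the identifier rule above (not an actual
tokenizer patch): the names POTENTIAL_IDENT, strip_cf and
normalize_identifier are made up for illustration,
str.isidentifier() stands in for the "verify identifier syntax"
step, and rejecting a keyword that carried Cf characters is just
one reading of the restriction. String literals and comments are
not handled here.

```python
import keyword
import re
import unicodedata

# Hypothetical names for illustration; none of this is in the PEP.
# A potential identifier is a run of ASCII letters, digits,
# underscore, and any non-ASCII characters (Cf characters are
# all non-ASCII, so they end up inside such runs).
POTENTIAL_IDENT = re.compile(r"[A-Za-z0-9_\u0080-\U0010FFFF]+")


def strip_cf(text):
    """Remove every character of general category Cf."""
    return "".join(ch for ch in text
                   if unicodedata.category(ch) != "Cf")


def normalize_identifier(run):
    """Drop Cf characters from a run, then verify identifier syntax."""
    cleaned = strip_cf(run)
    # A run made only of Cf characters cleans to "" and is rejected,
    # so stray formatting characters next to whitespace or
    # punctuation are errors.
    if not cleaned.isidentifier():
        raise SyntaxError(f"invalid identifier: {run!r}")
    # Assumption: a keyword may not carry formatting characters.
    if cleaned != run and keyword.iskeyword(cleaned):
        raise SyntaxError(f"formatting character in keyword: {run!r}")
    return cleaned
```

With this, normalize_identifier("a\u200cb") gives "ab", while a
lone "\u200e" or "if\u200e" raises SyntaxError.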

An alternative implementation would be to drop formatting
characters everywhere except in string literals.
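
A minimal sketch of that alternative, assuming only simple
'...' and "..." literals with backslash escapes (the function
name drop_cf_outside_strings is hypothetical; a real scanner
would also have to deal with comments containing quote
characters, string prefixes, and so on):

```python
import unicodedata


def drop_cf_outside_strings(source):
    """Remove Cf characters except inside simple string literals."""
    out = []
    quote = None      # active quote character, None outside a string
    escaped = False   # True right after a backslash inside a string
    for ch in source:
        if quote is None:
            if unicodedata.category(ch) == "Cf":
                continue  # outside any string literal: drop it
            if ch in "'\"":
                quote = ch  # a string literal starts here
        elif escaped:
            escaped = False  # this character is escaped; keep as-is
        elif ch == "\\":
            escaped = True
        elif ch == quote:
            quote = None  # the string literal ends here
        out.append(ch)
    return "".join(out)
```

For example, the source text x<LRM> = "a<LRM>b" comes out as
x = "a<LRM>b": the mark after x is dropped, the one inside the
string is kept.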

I'll repeat that UTR#39 explicitly discourages support
for formatting characters in identifiers.

Regards,
Martin