[Python-3000] Support for PEP 3131

"Martin v. Löwis" martin at v.loewis.de
Tue May 22 07:00:52 CEST 2007


> If I do arbitrary introspection, such as
> 
>    import sys
>    for k, v in sys.modules.items():
>        if v:
>            print(dir(v))
> 
> then I will get something usable, though perhaps not easily readable.

I think this is unacceptable (at least I cannot accept it):
with reflection, I want to get the *true* variable names, not
the mangled ones. In the scenario people had discussed, with
long Japanese names for test methods, if a test fails, you
clearly want to see the Japanese name, so you can easily read
what failed.
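
To illustrate, here is a minimal sketch, assuming an interpreter that
implements PEP 3131. The class and method names are invented for this
example. Reflection hands back the identifiers exactly as written:

```python
# Hypothetical test class; the names are made up for illustration.
class 計算テスト:
    def test_足し算は正しい(self):
        assert 1 + 1 == 2

# Reflection returns the identifiers as they appear in the source.
names = [n for n in vars(計算テスト) if n.startswith("test_")]
print(names)
```

A failure report built from these names would show the Japanese text
directly, with nothing to decode.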

> The mapping is reversible, so I can work interactively with the
> arbitrary characters by setting my console/idle preferences to the
> special encoding.

That could work both ways, of course. If you want a reflective
API to give you mangled names, you could easily implement that
yourself on top of PEP 3131.
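
A rough sketch of such a layer, assuming the mangled form is "U_"
followed by four hex digits per non-ASCII character (the exact format
is my assumption here, and mangle/mangled_dir are hypothetical helper
names, not an existing API):

```python
def mangle(name):
    """Replace each non-ASCII character with an assumed U_XXXX escape."""
    return "".join(ch if ord(ch) < 128 else "U_%04X" % ord(ch) for ch in name)

def mangled_dir(obj):
    """Like dir(), but with every name passed through mangle()."""
    return [mangle(n) for n in dir(obj)]

class Namen:
    Löwis = "Martin"   # non-ASCII attribute name, legal under PEP 3131

print(mangle("Löwis"))
print("LU_00F6wis" in mangled_dir(Namen))
```

The interpreter keeps the true names; only the people who want the
ASCII view pay for the conversion.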

>> Again, I'm uncertain what the use case here would be. For "proper"
>> transliteration, users can memorize easily what the transliterated
>> name would be, and visually identify the two representations.
> 
> For two Latin-based alphabets, yes.  I'm not so sure for non-Western
> scripts.

I know that the Chinese regularly use pinyin for transliteration,
and somebody confirmed on comp.lang.python that they also use it in
programming if they can't use the Chinese characters directly.

> As you pointed out, the correct transliteration may depend on the
> natural language (instead of just the character code point), which
> means we probably can't do it automatically.

That's the problem, yes.
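
A small sketch of why the code point alone is not enough. The tables
below are deliberately tiny and hypothetical; real conventions cover
far more characters. The same "ö" comes out differently depending on
the language the author is writing in:

```python
# Hypothetical per-language tables, for illustration only.
TABLES = {
    "de": {"ö": "oe", "ä": "ae", "ü": "ue", "ß": "ss"},
    "sv": {"ö": "o", "ä": "a", "å": "a"},
}

def transliterate(name, language):
    """Transliterate name using the table for the given language."""
    table = TABLES[language]
    return "".join(table.get(ch, ch) for ch in name)

print(transliterate("Löwis", "de"))
print(transliterate("Löwis", "sv"))
```

Nothing in the source file tells the interpreter which table to pick,
and for many scripts no such table exists in the first place.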

> It also has to be a one-way transliteration; if ö -> o (or oe) then an
> o (or oe) in the result can't always be transliterated back.

The same is true for your "numeric transliteration": there is no way
to *reliably* tell whether some string is a mangled string, or
just happens to include U_ in the identifier (which it legally can
do today).
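
To make the ambiguity concrete, here is a sketch that again assumes a
"U_" plus four hex digits format (my assumption; mangle and demangle
are hypothetical helpers). An identifier that is plain ASCII today
does not survive the round trip:

```python
import re

def mangle(name):
    """Replace each non-ASCII character with an assumed U_XXXX escape."""
    return "".join(ch if ord(ch) < 128 else "U_%04X" % ord(ch) for ch in name)

def demangle(name):
    """Turn every U_XXXX sequence back into the character it encodes."""
    return re.sub(r"U_([0-9A-F]{4})", lambda m: chr(int(m.group(1), 16)), name)

# A name with a non-ASCII character round-trips fine.
print(demangle(mangle("Löwis")) == "Löwis")

# A name that merely contains U_00F6 is left alone by mangle(),
# but demangle() cannot know that and rewrites it anyway.
plain = "U_00F6"
print(demangle(mangle(plain)) == plain)
```

The second comparison is false: the demangler has no way to tell the
two cases apart.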

That's why Java and C++ use \u, so you would write L\u00F6wis
as an identifier. *This* is truly unambiguous. I claim that it
is also useless.

> (1)  They shouldn't ever need to see the numeric version unless
> they're intentionally peeking under the covers, or their site doesn't
> have the appropriate encoding installed.  One advantage of this method
> is that a single transliteration method could work for any language,
> so it probably would be installed already.

I think you are really arguing for \u escapes in identifiers here.

> (2)  Even if users did somehow see the numeric version, it wouldn't be
> that awful.  For the languages close enough to ASCII that a
> transliteration is straightforward, the number of extra characters to
> memorize is fairly small.

What about the other languages? This PEP is not just for Latin-based
scripts.

>> Then I don't understand your above proposal. I thought you were
>> proposing to replace all non-ASCII characters with some ASCII form
>> on import of the module. What do you mean by "readable internal
>> representation"?
> 
> This alternative would let an individual user say "I'm writing
> Swedish; turn my ö into an o."   The actual identifiers used by Python
> itself would be more readable, but the downside is that users would
> have to read them more often, instead of using/editing/viewing
> strictly in the untransliterated version.

That again cannot work because you don't have transliteration
algorithms for all characters, or all languages.

Regards,
Martin

