[Python-Dev] Divorcing str and unicode (no more implicitconversions).

Wed Oct 26 20:40:14 CEST 2005

"Martin v. Löwis" <martin at v.loewis.de> wrote:
> 
> Josiah Carlson wrote:
> > In this case it's not just a misreading, the characters look identical! 
> > When is an 'E' not an 'E'?  When it is an Epsilon or Ie.  Saying what
> > characters will or will not be used as identifiers, when those
> > characters are keys on a keyboard of a specific type, is pretty
> > presumptuous.
> 
> Why is that rude and disrespectful? I'm certainly respecting developers
> who want to use their scripts for identifiers, or else I would not have
> suggested that they could do so.

I never said rude, I said presumptuous.  "Going beyond what is right or
proper; excessively forward." (according to dictionary.com, the OED has
a similar definition).  I was trying to say that in stating that users
wouldn't be using keys on their keyboard in their natual language when
also using english characters, that you were assuming a bit about their
usage patterns that you perhaps shouldn't.  I certainly could also be
presumptuous in stating that users may very well mix certain languages,
but it seems to be more likely given keywords and the standard library
using the latin alphabet.

> > Indeed, they are similar, but_ different_ in my font as well.  The trick
> > is that the glyphs are not different in the case of certain greek or
> > cyrillic letters.  They don't just /look/ similar they /are identical/.
> 
> This string: "EÎ•" is the LATIN CAPITAL LETTER E, followed by the GREEK
> CAPITAL LETTER EPSILON. In the font my email composer uses, the E is
> slightly larger than the Epsilon - so there /is/ a visual difference.

My email client doesn't handle unicode, but a quick check by swapping
fonts in a word processor provides that at least on my platform, all
three are the same glyph (same size, shape, ...) for all fixed-width
fonts. If a platform distinguishes all three, then one should consider
one's platform lucky.  Not all platforms and/or preferred fonts of users
are.

> But even if there isn't: if this was a frequent problem, the name
> error could include an alternative representation (say, with Unicode
> ordinals for non-ASCII characters) which would give an easy visual
> clue.

It would offer a great cue, but I'm not sure if it is possible.  I think
that it sounds like an ugly discussion of stdout/err encodings and
exception handling machinery that I don't want to be a part of.

> I still doubt that this is a frequent problem, and I don't see any
> better grounds for claiming that it is than for claiming that it
> is not.

Whether or not it is frequent will depend on the prevalence of desire to
use those characters.  While I don't think that such uses will be as
common as using 'klass' when passing a class, I do think that it will
result in more than a few sf bug reports.  I also share Marc-Andre
Lemburg's concerns about the understandability of code written in Kanji,
Hebrew, Arabic, etc., at least for those who have not memorized the
entirety of those alphabets.

 - Josiah