On Tue, Jul 19, 2016 at 7:21 AM Steven D'Aprano <steve@pearwood.info> wrote:

On Mon, Jul 18, 2016 at 10:29:34PM -0700, Rustom Mody wrote:

> There was this question on the python list a few days ago:
> Subject: SyntaxError: Non-ASCII character
[...]
> I pointed out that the python2 error was more helpful (to my eyes) than
> python3s

And I pointed out how I thought the Python 3 error message could be
improved, but the Python 2 error message was not very good.

> Python3
>
> Traceback (most recent call last):
> File "<stdin>", line 1, in <module>
> File "/home/ariston/foo.py", line 31
> wf = wave.open(“test.wav”, “rb”)
> ^
> SyntaxError: invalid character in identifier

It would be much more helpful if the caret lined up with the offending
character. Better still, if the offending character was actually stated:

wf = wave.open(“test.wav”, “rb”)
               ^
SyntaxError: invalid character '“' in identifier
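
As a rough illustration of that kind of report (this is not how CPython's tokenizer builds its messages, and the helper name describe_first_non_ascii is made up for this sketch), the following finds the first non-ASCII character in a line, places a caret under it, and names it. It assumes each character occupies one display column, which will not hold for tabs or wide characters.

```python
import unicodedata


def describe_first_non_ascii(line):
    """Return a caret-marked report for the first non-ASCII character, or None.

    Hypothetical helper for illustration only; assumes one display column
    per character.
    """
    for col, ch in enumerate(line):
        if ord(ch) > 127:
            name = unicodedata.name(ch, "UNKNOWN")
            caret = " " * col + "^"
            return (
                f"{line}\n{caret}\n"
                f"invalid character {ch!r} (U+{ord(ch):04X} {name})"
            )
    return None


# The line from the traceback above; the curly quote is written as an escape.
report = describe_first_non_ascii("wf = wave.open(\u201ctest.wav\u201d, \u201crb\u201d)")
```

Printing report shows the source line, a caret under the first curly quote, and a third line identifying it as U+201C LEFT DOUBLE QUOTATION MARK.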

> Python2
>
> Traceback (most recent call last):
> File "<stdin>", line 1, in <module>
> File "foo.py", line 31
> SyntaxError: Non-ASCII character '\xe2' in file foo.py on line 31, but no
> encoding declared; see http://python.org/dev/peps/pep-0263/ for details

As I pointed out earlier, this is less helpful. The line itself is not
shown (although the line number is given), nor is the offending
character. (Python 2 can't show the character because it doesn't know
what it is -- it only knows the byte value, not the encoding.) But in
the person's text editor, chances are they will see what looks to them
like a perfectly reasonable character, and have no idea which one
corresponds to the byte \xe2.
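
To make the byte-versus-character point concrete, here is a small Python 3 sketch. It assumes the file was saved as UTF-8, which the original report does not actually say; the function name non_ascii_chars is invented for the example.

```python
import unicodedata

# U+201C LEFT DOUBLE QUOTATION MARK, the curly quote from the traceback.
quote = "\u201c"
encoded = quote.encode("utf-8")
# Under the UTF-8 assumption the character is three bytes, and the first
# of them is the \xe2 that the Python 2 message reports.


def non_ascii_chars(line):
    """List (column, character, Unicode name, UTF-8 bytes) per non-ASCII char.

    Hypothetical helper for illustration only.
    """
    return [
        (col, ch, unicodedata.name(ch, "UNKNOWN"), ch.encode("utf-8"))
        for col, ch in enumerate(line)
        if ord(ch) > 127
    ]


hits = non_ascii_chars("wf = wave.open(\u201ctest.wav\u201d, \u201crb\u201d)")
```

Each entry in hits pairs a column with the character, its Unicode name, and its encoded bytes, which is the mapping a user would need in order to get from "\xe2" back to something they can find in their editor.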

> IOW
> 1. The lexer is internally (evidently from the error message) so
> ASCII-oriented that any “unicode-junk” just defaults out to identifiers
> (presumably comments are dealt with earlier) and then if that lexing action
> fails it mistakenly pinpoints a wrong *identifier* rather than just an
> impermissible character like python 2

You seem to be jumping to a rather large conclusion here. Even if you
are right that the lexer considers all otherwise-unexpected characters
to be part of an identifier, why is that a problem?

It's a problem because those characters can never be part of an identifier, so reporting them as "in identifier" looks like a bug.

I agree that it is mildly misleading to say

invalid character '“' in identifier

when “ is not part of an identifier:

py> '“test'.isidentifier()
False

but I don't think you can jump from that to your conclusion that
Python's unicode support is somewhat "wrongheaded". Surely a much
simpler, less inflammatory response would be to say that this one
specific error message could be improved?

But... is it REALLY so bad? What if we wrote it like this instead:

py> result = my§function(arg)
File "<stdin>", line 1
result = my§function(arg)
^
SyntaxError: invalid character in identifier

Isn't it more reasonable to consider that "my§function" looks like it is
intended as an identifier, but it happens to have an illegal character
in it?
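
For what it's worth, the "identifier with one illegal character" reading can be checked from Python itself. This is only a sketch built on str.isidentifier(); the function name is made up, it does not reject keywords, and it makes no claim about what the tokenizer does internally.

```python
def first_invalid_identifier_char(name):
    """Return (index, char) for the first character that cannot appear at
    that position in an identifier, or None if all characters are fine.

    Hypothetical helper for illustration only.
    """
    for i, ch in enumerate(name):
        # The first character must be able to start an identifier. Later
        # characters only need to be valid continuations, so they are
        # tested behind a known-good start character.
        ok = ch.isidentifier() if i == 0 else ("a" + ch).isidentifier()
        if not ok:
            return i, ch
    return None


section_case = first_invalid_identifier_char("my\u00a7function")  # my§function
quote_case = first_invalid_identifier_char("\u201ctest")          # “test
```

In the first case the offending character sits in the middle of something that otherwise looks like a name; in the second it is the very first character, which is why "in identifier" reads oddly there.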

> combine that with
> 2. matrix mult (@) Ok to emulate perl but not to go outside ASCII

How does @ emulate Perl?

As for your second part, about not going outside of ASCII, yes, that is
official policy for Python operators, keywords and builtins.

> makes it seem (to me) python's unicode support is somewhat wrongheaded.

--
Steve
