[issue21765] Idle: make 3.x HyperParser work with non-ascii identifiers.

Sun Jul 6 23:03:01 CEST 2014

Tal Einat added the comment:

Indeed, I seem to have been misinterpreting the grammar, despite taking care and reading it several times. This strengthens my opinion that we should use str.isidentifier() rather than attempt to correctly re-implement just the parts that we need.

Attached is a patch that fixes HyperParser._eat_identifier(), as far as my testing shows (tests included).

When non-ASCII characters are encountered, this patch uses Terry's suggestion of checking for valid identifier characters using ('a' + string_part).isidentifier(). It also employs his suggestion of how to avoid executing this check at every index, by skipping 4 characters at a time.
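To illustrate the approach, here is a minimal standalone sketch of the backward scan. It is not the code from the attached patch: the function name eat_identifier and its (text, limit, pos) signature are assumptions made for this example, and the keyword check and the ASCII fast path are left out.

```python
def eat_identifier(text, limit, pos):
    """Return the length of the identifier ending at text[pos], or 0.

    Illustrative sketch only. The scan never looks before index `limit`.
    """
    i = pos

    # Prefixing 'a' turns "is this a valid identifier?" into "are all of
    # these valid identifier characters?", since the candidate slice may
    # start with a character (e.g. a digit) that is only valid after the
    # first position.
    # Step back 4 characters at a time to avoid one check per index.
    while i - 4 >= limit and ('a' + text[i - 4:pos]).isidentifier():
        i -= 4
    # At most 3 further characters can belong to the identifier, so one
    # step of 2 followed by one step of 1 covers every remaining case.
    if i - 2 >= limit and ('a' + text[i - 2:pos]).isidentifier():
        i -= 2
    if i - 1 >= limit and ('a' + text[i - 1:pos]).isidentifier():
        i -= 1

    # Without the 'a' prefix, the candidate must itself be an identifier;
    # this rejects a bad first character as well as the empty string.
    if not text[i:pos].isidentifier():
        return 0
    return pos - i


# "na\u00efve" is "naive" spelled with U+00EF in place of the "i".
length = eat_identifier("print(na\u00efve", 0, 11)  # 5
```

Note that str.isidentifier() also returns True for keywords, so a caller that needs to exclude them would have to check for that separately.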

However, even with this fix, HyperParser.get_expression() still fails with non-ASCII Unicode strings. This is because it uses PyParse, which doesn't support Unicode! For example, it apparently replaces all non-ASCII characters with 'x'. I've added (in this patch) a few tests for this, which currently fail.

FWIW, PyParse includes a comment to this effect[1]:

<quote>
The parse functions have no idea what to do with Unicode, so
replace all Unicode characters with "x".  This is "safe"
so long as the only characters germane to parsing the structure
of Python are 7-bit ASCII.  It's *necessary* because Unicode
strings don't have a .translate() method that supports
deletechars.
</quote>
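The effect described in that comment can be sketched as follows. This is an illustration of the behaviour, not the actual PyParse code, and the helper name mask_non_ascii is made up for this example.

```python
def mask_non_ascii(source):
    """Replace every non-ASCII character in source with 'x'.

    Illustrative sketch only: the string keeps its length, so indices
    still line up with the original, but the text itself is changed.
    """
    return "".join(ch if ord(ch) < 128 else "x" for ch in source)


original = "na\u00efve = 1\n"
masked = mask_non_ascii(original)

# Structure (brackets, quotes, operators) is untouched, which is why the
# quoted comment calls this "safe" for structural parsing ...
assert len(masked) == len(original)
# ... but the identifier text no longer matches the original source.
assert masked == "naxve = 1\n"
```

Under this assumption, any text read back from the masked string rather than from the original would no longer contain the real identifier, which would be consistent with the failures seen in the new tests.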

Fully resolving this issue will apparently require fixing PyParse to support Unicode properly.

.. [1]: http://hg.python.org/cpython/file/d25ae22cc992/Lib/idlelib/PyParse.py#l117

----------
keywords: +patch
Added file: http://bugs.python.org/file35876/taleinat.20140706.IDLE_HyperParser_unicode_ids.patch

_______________________________________
Python tracker <report at bugs.python.org>
<http://bugs.python.org/issue21765>
_______________________________________