My personal experience of the most common problematic substitutions by tools such as Outlook, Word & some web tools:

  1. Double Quotes \u201c & \u201d “”
  2. Single Quotes \u2018 & \u2019 ‘’
  3. The en dash \u2013 –
  4. Copyright © \xa9 and others, Registered ® \xae and trademark ™ \u2122
  5. Some fractions, e.g. ½ \xbd and ¼ \xbc
  6. Non-breaking spaces \xa0
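
As a rough illustration of undoing the substitutions listed above, here is a minimal Python sketch. The table contents and the names SUBSTITUTIONS and asciify are my own assumptions for illustration, not anything Python or these tools provide, and the list is certainly not complete.

```python
# Hypothetical table of auto-substituted characters and the ASCII text
# they most likely replaced. Extend or trim to taste.
SUBSTITUTIONS = {
    "\u201c": '"',    # left double quotation mark
    "\u201d": '"',    # right double quotation mark
    "\u2018": "'",    # left single quotation mark
    "\u2019": "'",    # right single quotation mark
    "\u2013": "-",    # en dash
    "\xa0": " ",      # non-breaking space
    "\xbd": "1/2",    # vulgar fraction one half
    "\xbc": "1/4",    # vulgar fraction one quarter
}

# str.maketrans accepts a dict whose keys are single characters and whose
# values are replacement strings of any length.
_TABLE = str.maketrans(SUBSTITUTIONS)


def asciify(text: str) -> str:
    """Return text with the listed substitutions reverted to ASCII."""
    return text.translate(_TABLE)


if __name__ == "__main__":
    print(asciify("x = \u201chello\u201d"))
    print(asciify("y = 2014 \u2013 2013"))
```

Note that this rewrites the characters everywhere, including inside string literals where a curly quote may be intentional, so it is better suited to cleaning up pasted snippets than to blanket use on a code base.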

 

From: David Mertz <mertz@gnosis.cx>
Sent: 10 May 2020 18:33
To: Steven D'Aprano <steve@pearwood.info>
Cc: python-ideas <python-ideas@python.org>
Subject: [Python-ideas] Re: Improve handling of Unicode quotes and hyphens

 

On Sun, May 10, 2020 at 4:03 AM Steven D'Aprano <steve@pearwood.info> wrote:

I think that David(?) may have a Vim or Emacs mode that allows him to
use Unicode chars as syntax?

 

I use the vim-conceal plugin: https://github.com/khzaw/vim-conceal.  I know that something similar exists for Emacs, but don't remember the name.  What this does though is not change anything about the underlying ASCII characters in the code, but rather it substitutes particular character sequences (perhaps in regex context) with other things, such as fancy Unicode characters.

 

So as typing goes, I still type e.g. the letter 'i' followed by the letter 'n' and a space, and the screen simply displays the U+2208 (∈) character.  But on disk, and for Python, it's still just 'in'.

 

On my own system, I've learned the Unicode code points for the common things like n-dashes and m-dashes that I use.  I actually don't know the vim shortcuts for other special things, although I probably should.  Still, the vim digraphs are always going to be fewer than all the Unicode code points, even if some useful ones are included (and somewhat mnemonic).  But indeed, entry of all those special characters is going to be more work than the characters directly on my keyboard, in any event.

 

>   6.  Change the error message "SyntaxError: invalid character in
>   identifier" to include which character and its Unicode value so
>   that it becomes "SyntaxError: invalid character 0x201c " in
>   identifier" -
More informative error messages are good :-)

 

I wouldn't mind messages that specifically looked for some of those common, annoying auto-substitutions.  E.g.:

 

% python ~/tmp/wrongchar.py
  File "/home/dmertz/tmp/wrongchar.py", line 1
    x = 2014 – 2013
             ^
SyntaxError: invalid character in identifier

 

The ASCII hyphen really does look a lot like the en dash that is on screen, and I think that's one of those substitutions that word processors and email clients often make.  Maybe a collection of the top 20 such common substitutions could each get a fitting message.  I dunno, something like "SyntaxError: invalid character U+2013 may be substitution of ASCII dash".
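
To make the idea concrete, here is a minimal sketch of such a check done outside the interpreter, as a small linting helper. It is not how CPython produces its error messages; the table LIKELY_ASCII and the function find_substitutions are hypothetical names, and the message wording just follows the suggestion above.

```python
import unicodedata

# Hypothetical table: suspicious character -> the ASCII character it
# probably replaced. A real list would hold the "top 20" offenders.
LIKELY_ASCII = {
    "\u2013": "-",    # en dash
    "\u201c": '"',    # left double quotation mark
    "\u201d": '"',    # right double quotation mark
    "\u2018": "'",    # left single quotation mark
    "\u2019": "'",    # right single quotation mark
    "\xa0": " ",      # non-breaking space
}


def find_substitutions(source: str):
    """Yield (line, column, message) for each suspicious character.

    Line and column numbers are 1-based.
    """
    for line_no, line in enumerate(source.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if ch in LIKELY_ASCII:
                name = unicodedata.name(ch, "UNKNOWN")
                message = (
                    f"invalid character U+{ord(ch):04X} ({name}) "
                    f"may be substitution of ASCII {LIKELY_ASCII[ch]!r}"
                )
                yield line_no, col, message


if __name__ == "__main__":
    for line_no, col, message in find_substitutions("x = 2014 \u2013 2013"):
        print(f"line {line_no}, column {col}: {message}")
```

This simple scan also flags characters inside string literals and comments, where they are legal; a stricter version would have to tokenize the source first.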

 

--

The dead increasingly dominate and strangle both the living and the
not-yet born.  Vampiric capital and undead corporate persons abuse
the lives and control the thoughts of homo faber. Ideas, once born,
become abortifacients against new conceptions.