Hi All,

 

Apologies if this has already been discussed to death.

 

Python 3 allows Unicode characters in strings and identifiers, but the quotation marks that delimit a string are only accepted in plain ASCII, i.e. the following both successfully initialise strings:

 

```

S1 = "Double Quoted" # Opened and closed with chr(34), i.e. 0x22

S2 = 'Single Quoted' # Opened and closed with chr(39), i.e. 0x27

```

But the following both result in an error – “SyntaxError: invalid character in identifier”:

 

```

S1 = “Double Quoted” # Opened with \u201c and closed with \u201d

S2 = ‘Single Quoted’ # Opened with \u2018 and closed with \u2019

```
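For anyone who wants to reproduce this without fighting their mail client, here is a minimal sketch that builds the offending source from escape sequences and hands it to the built-in `compile()`. The helper name `compiles` is mine, purely for illustration. The exact wording of the error differs between Python versions, so the sketch only checks whether a SyntaxError is raised at all.

```python
def compiles(source):
    """Return True if `source` compiles as Python, False on SyntaxError."""
    try:
        compile(source, "<pasted>", "exec")
    except SyntaxError:
        return False
    return True


# The same assignments as above, with the quote characters written as
# escapes so that no editor can "helpfully" change them.
ascii_double = 'S1 = "Double Quoted"'
curly_double = "S1 = \u201cDouble Quoted\u201d"
curly_single = "S2 = \u2018Single Quoted\u2019"

for label, source in [
    ("ASCII double quotes", ascii_double),
    ("curly double quotes", curly_double),
    ("curly single quotes", curly_single),
]:
    print(label, "->", "compiles" if compiles(source) else "SyntaxError")
```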

To the experienced eye, and depending on the font used, it is “obvious” what the problem is: the wrong quotation marks were used. The big problem, especially for beginners, is that the same keys were typed, just in the “wrong” editor, or even in the wrong editor mode or context. I have found that Outlook leaves the quotes alone if the font is FixSys or I am replying to a plain text email, but otherwise it is “helpful”. Unfortunately, especially on Windows, “wrong” editors abound and include, but are not limited to, MS-Outlook, MS-Word and some online editing environments such as Quora.

 

On top of that is the helpful substitution of an en dash for the minus sign when you press space a word later, so:

 

A = 3 – 2 # Space typed afterwards: syntax error due to \u2013

A = 3 - 2 # No space or CR typed after the last character: OK, still 0x2d
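Since the characters can look identical in many fonts, it may help to ask Python what they actually are. This is a small sketch using the standard `unicodedata` module; the helper name `describe` is just an illustrative choice of mine.

```python
import unicodedata


def describe(ch):
    """Return the code point and official Unicode name of one character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


# The three ASCII characters Python accepts, followed by the typographic
# substitutes that editors tend to insert in their place.
for ch in ['"', "'", "-", "\u201c", "\u201d", "\u2018", "\u2019", "\u2013"]:
    print(describe(ch))
```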

 

Use cases that catch people out:

  1. Sending a code snippet by email using Outlook
  2. User manuals written in MS-Word – (many people's work environment)
  3. Articles published on Quora – people expect to be able to copy and paste the code for some reason.

 

I am sure that many of us have encountered these issues or similar.

 

What can be done?

  1. Persuade Microsoft, and others, to stop being so helpful by default – good luck with that!
  2. Tell all users that they need to use a “proper” editor or IDE – This seems like adding an additional barrier to new & casual users.
  3. Better yet tell them to use a “proper” OS like …. – At the very least many of us have to use Windows at work.
  4. Start accepting en dashes as minus & Unicode quotation marks – this would be the ideal answer for pasted code but leaves a lot to iron out, such as whether we require that the quotes match and are in the typographically correct order. It is also quite a big & complex change to the Python interpreter.
  5. Normalise the input to the Python interpreter (at least for these characters and possibly a few others) so that, whether entered interactively or read from a file, S1 = “Double Quoted” becomes S1 = "Double Quoted", etc. – this should be an easier change to the interpreter but, from a purist point of view, could be said to make us as bad as the others because we are not honouring what the user entered.
  6. Change the error message “SyntaxError: invalid character in identifier” to include which character and its Unicode value, so that it becomes something like: SyntaxError: invalid character 0x201c '“' in identifier – this is almost certainly the easiest change and fits well with explicit is better than implicit, but still leaves it to the user to correct the erroneous input (which could be argued is both good and bad).
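To make option 5 a little more concrete, here is a rough sketch of what such a normalisation could look like if done as a pre-processing step on the source text. Nothing here exists in the interpreter: the table `SMART_TO_ASCII` and the function `normalise_source` are names I have made up, and the list of characters is only the handful mentioned in this message.

```python
# Hypothetical mapping from typographic characters to the ASCII
# characters that Python accepts; only the ones discussed above.
SMART_TO_ASCII = str.maketrans({
    "\u201c": '"',  # left double quotation mark
    "\u201d": '"',  # right double quotation mark
    "\u2018": "'",  # left single quotation mark
    "\u2019": "'",  # right single quotation mark
    "\u2013": "-",  # en dash
})


def normalise_source(source):
    """Blindly replace smart quotes and en dashes with ASCII equivalents."""
    return source.translate(SMART_TO_ASCII)


pasted = "S1 = \u201cDouble Quoted\u201d\nA = 3 \u2013 2\n"
cleaned = normalise_source(pasted)

# The cleaned text now compiles and runs like hand-typed code.
namespace = {}
exec(compile(cleaned, "<pasted>", "exec"), namespace)
print(namespace["S1"], namespace["A"])
```

Note that a blind substitution like this also rewrites the same characters where they legitimately appear inside string literals and comments, which is exactly the purist objection; a real implementation would presumably need to work at the tokeniser level.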

 

I would like to suggest that an incremental approach might be best: clarify the existing error message first, since that should not break anything, and then either substitute for the problem characters or process them “properly” as a later enhancement.
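As a rough illustration of the kind of information the clearer message could carry, the sketch below scans source text for the first non-ASCII character and reports its position, code point and Unicode name. The function name `describe_first_non_ascii` and the message layout are my own invention, not anything the interpreter does, and the first non-ASCII character is not necessarily the one the parser objected to (it could sit inside a perfectly valid string).

```python
import unicodedata


def describe_first_non_ascii(source):
    """Return a hint naming the first non-ASCII character, or None."""
    for lineno, line in enumerate(source.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if ord(ch) > 127:
                name = unicodedata.name(ch, "unnamed character")
                return (f"line {lineno}, column {col}: "
                        f"{ch!r} is U+{ord(ch):04X} ({name})")
    return None


source = "A = 3 \u2013 2"
try:
    compile(source, "<pasted>", "exec")
except SyntaxError as exc:
    # Show the interpreter's own message followed by the extra hint.
    print(f"SyntaxError: {exc.msg}")
    print(describe_first_non_ascii(source))
```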

 

Steve Barnes