How do I automate the removal of all non-ascii characters from my code?

Steven D'Aprano steve+comp.lang.python at pearwood.info
Tue Sep 13 10:15:39 CEST 2011


On Tue, 13 Sep 2011 05:49 pm jmfauth wrote:

> On 12 sep, 23:39, "Rhodri James" <rho... at wildebst.demon.co.uk> wrote:
> 
> 
>> Now read what Steven wrote again.  The issue is that the program contains
>> characters that are syntactically illegal.  The "engine" can be perfectly
>> correctly translating a character as a smart quote or a non breaking
>> space or an e-umlaut or whatever, but that doesn't make the character
>> legal!
>>
> 
> Yes, you are right. I did not understand it that way.
> 
> However, a small correction/precision. Illegal characters
> do not exist. One can "only" have an ill-formed encoded code
> point or an illegal encoded code point representing a
> character/glyph.

You are wrong there. There are many ASCII characters which are illegal in
Python source code, at least outside of comments and string literals, and
possibly even there.

>>> code = "x = 1 + \b 2"  # all ASCII characters
>>> print(code)
x = 1 + 2
>>> exec(code)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<string>", line 1
    x = 1 + 2
            ^
SyntaxError: invalid syntax


Now, imagine that a \b ASCII backspace character somehow gets
introduced into your source file. When you go to run the file, or import
it, you will get a SyntaxError. Changing the encoding will not help.
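
A stray control character like this is easy to miss because most editors
and terminals do not display it. One rough way to hunt for them is to scan
the source text directly. The sketch below is only an illustration: the
function name find_control_chars and the ALLOWED_CONTROLS set are my own
inventions, and it makes no attempt to tell code apart from comments or
string literals, so treat its hits as suspects rather than proven errors.

```python
# Assumption: tab, line feed, carriage return and form feed are the only
# ASCII control characters we expect to see in ordinary source text.
ALLOWED_CONTROLS = "\t\n\r\f"


def find_control_chars(source):
    """Return (line, column, repr) for each suspect ASCII control character.

    Line and column numbers are 1-based. Lines are split on "\n" only,
    so that other control characters stay inside their line and get
    reported instead of being treated as line breaks.
    """
    hits = []
    for lineno, line in enumerate(source.split("\n"), start=1):
        for col, ch in enumerate(line, start=1):
            is_control = ord(ch) < 32 or ord(ch) == 127
            if is_control and ch not in ALLOWED_CONTROLS:
                hits.append((lineno, col, repr(ch)))
    return hits


code = "x = 1 + \b 2"  # the same all-ASCII example as above
for lineno, col, shown in find_control_chars(code):
    print("line %d, column %d: %s" % (lineno, col, shown))
```

To check a file rather than a string, read it in and pass the text to the
same function. Note that nothing here depends on the encoding: the
backspace is decoded perfectly well, it just isn't legal where it sits.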



-- 
Steven



