Defining Python Source Code Encodings

Wed Jul 18 01:48:58 EDT 2001

> 3. Python's tokenizer/compiler combo will need to be updated to
>    work as follows:
>
>    1. read the file
>    2. decode it into Unicode assuming a fixed per-file encoding
>    3. tokenize the Unicode content
>    4. compile it, creating Unicode objects from the given Unicode data
>       and creating string objects from the Unicode literal data
>       by first reencoding the Unicode data into 8-bit string data
>       using the given file encoding
>
>    To make this backwards compatible, the implementation would have to
>    assume Latin-1 as the original file encoding if not given (otherwise,
>    binary data currently stored in 8-bit strings wouldn't make the
>    roundtrip).

If I understand this correctly, you would translate (my) ASCII code files into
Unicode, compile, and translate literal strings back to the ASCII form they
started as.  Can this be done without lengthening the compile time 'too
much'?
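
To make sure I have the round trip right, here is a minimal sketch of steps
2 and 4 as I read them.  The helper name roundtrip() is made up for
illustration and is not part of the proposal; it only shows why Latin-1 is a
safe default (each byte 0..255 maps to the code point of the same number)
while a stricter encoding such as UTF-8 would not be.

```python
# Illustrative sketch only; roundtrip() is a hypothetical helper,
# not part of the proposal or of Python itself.
def roundtrip(raw, encoding="latin-1"):
    text = raw.decode(encoding)    # step 2: bytes -> Unicode
    return text.encode(encoding)   # step 4: Unicode -> 8-bit string

# Latin-1 maps every byte value 0..255 to the code point of the same
# number, so arbitrary binary data survives unchanged.
all_bytes = bytes(range(256))
latin1_ok = roundtrip(all_bytes) == all_bytes

# Plain ASCII source also survives, since ASCII is a subset of Latin-1.
ascii_source = b"print('hello')\n"
ascii_ok = roundtrip(ascii_source) == ascii_source

# A stricter default such as UTF-8 rejects some byte strings outright.
try:
    roundtrip(b"\xff\xfe", "utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False
```

This says nothing about speed, only that the bytes come back unchanged; the
compile-time cost is still my open question.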

> Issues that still need to be resolved:

> - what to do with non-literal data in the source file, e.g.
>   variable names and comments:
>
>   * reencode them just as would be done for literals
>   * only allow ASCII for certain elements like variable names
>   etc.

I strongly suspect that people who do not write in any Latin-alphabet language
would much prefer to write names and comments in their native script.
Allowing this would open Python to millions who are presently excluded.
Mixed-alphabet texts are pretty common in some non-Latin-alphabet
countries.
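
As a rough sketch of what the "only allow ASCII" option would mean for such
users: the check below accepts a Latin-script name and rejects a Cyrillic
one.  The helper is_ascii_name() is hypothetical, and it leans on the
str.isidentifier() method of present-day Python 3, not on anything in the
proposal.

```python
# Hypothetical check for the "ASCII-only names" option; not from the proposal.
def is_ascii_name(name):
    # A valid identifier whose characters all fall in the 7-bit ASCII range.
    return name.isidentifier() and all(ord(ch) < 128 for ch in name)

latin_name = "name"
cyrillic_name = "имя"   # Russian for "name"

latin_allowed = is_ascii_name(latin_name)
cyrillic_allowed = is_ascii_name(cyrillic_name)
```

Under that rule the second name is refused even though it is a perfectly
ordinary word to its author.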