[Tutor] How to read the first so many Unicode characters from a file?

Eryk Sun eryksun at gmail.com
Sat Jun 26 01:24:09 EDT 2021


On 6/25/21, boB Stepp <robertvstepp at gmail.com> wrote:
>
> If I specify the encoding at the top of the program file, will that
> suffice for overcoming Windows code page issues -- being ASCII not
> UTF-8?

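One rarely needs to specify an encoding for a Python 3 source file,
since the default is UTF-8. Even if you do, the declaration only tells
the compiler how to decode the source file. It has nothing to do with
how the code executes once the file is compiled to bytecode, and it
does not change the default encoding used by open().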

To force UTF-8 as the default encoding for open(), enable UTF-8 mode
in Python 3.7+. You can enable it permanently by defining the
environment variable PYTHONUTF8=1, e.g. run `setx.exe PYTHONUTF8 1` at
the command line or the Win+R run dialog.

https://docs.python.org/3/using/cmdline.html#envvar-PYTHONUTF8
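
As a rough sketch of how this plays out in code, the snippet below
reports whether UTF-8 mode is active and which encoding open() falls
back to, and reads the first few characters of a file with an explicit
encoding so the result does not depend on the code page. The helper
names describe_default_encoding and read_first_chars are made up for
illustration, and it assumes the file is actually UTF-8.

```python
import locale
import sys


def describe_default_encoding():
    """Return (utf8_mode_enabled, encoding open() uses when none is given)."""
    # sys.flags.utf8_mode is 1 when Python runs with -X utf8 or PYTHONUTF8=1.
    return bool(sys.flags.utf8_mode), locale.getpreferredencoding(False)


def read_first_chars(path, count, encoding="utf-8"):
    """Read the first `count` characters (not bytes) from a text file."""
    # In text mode, read(n) counts decoded characters, so multi-byte
    # UTF-8 sequences are never split in the middle.
    with open(path, encoding=encoding) as f:
        return f.read(count)


if __name__ == "__main__":
    print(describe_default_encoding())
```

Passing encoding='utf-8' explicitly works whether or not UTF-8 mode is
enabled, which is handy for scripts that run on other people's
machines.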

FYI, ANSI code pages in Windows are single-byte or double-byte
encodings. They often extend 7-bit ASCII, but no Windows locale uses
just ASCII. In Western European and American locales, the ANSI code
page is 1252, which is a single-byte encoding that extends Latin-1,
which extends ASCII. Five byte values in code page 1252 are not mapped
to any character: 0x81, 0x8D, 0x8F, 0x90, 0x9D. These values can occur
as continuation bytes in UTF-8 sequences, in which case decoding as
1252 in Python raises UnicodeDecodeError instead of just returning
mojibake nonsense. Here's an example of mojibake:

    >>> '\ufeff'.encode('utf-8').decode('1252')
    'ï»¿'
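
To see both outcomes side by side, here is a small sketch that encodes
a string as UTF-8 and then tries to decode the bytes as code page
1252. The function name decode_as_1252 is just an illustrative helper,
not anything from the standard library.

```python
def decode_as_1252(text):
    """Encode text as UTF-8, then try to decode the bytes as cp1252."""
    data = text.encode("utf-8")
    try:
        # Succeeds (with mojibake) when every byte is mapped in cp1252.
        return data.decode("cp1252")
    except UnicodeDecodeError as exc:
        # Raised when a byte is one of the unmapped values, e.g. 0x81.
        return f"failed at byte 0x{data[exc.start]:02X}"


if __name__ == "__main__":
    print(decode_as_1252("\u00e9"))  # UTF-8 bytes C3 A9, both mapped
    print(decode_as_1252("\u00c1"))  # UTF-8 bytes C3 81, 0x81 unmapped
```

The first call returns two-character mojibake, while the second hits
the unmapped byte 0x81 and reports the failure.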

