[Python-ideas] Python 3 open() text files: make encoding parameter optional for cross-platform scripts

Andrew Barnert abarnert at yahoo.com
Sun Jun 9 03:10:30 CEST 2013


From: anatoly techtonik <techtonik at gmail.com>
Sent: Saturday, June 8, 2013 6:13 AM


>Without reading the subject of this letter, what is your idea about which encoding Python 3 uses for open() calls on a text file? Please write it in a reply and then scroll down.

It uses the system locale encoding (slightly complicated on Windows by the fact that two different code pages, the "ANSI" and "OEM" ones, count as the default in different contexts, but whichever one is right for text files is the one Python uses). That's why I can write a file in vi/emacs/TextEdit/Notepad/whatever and read it in a Python script (or vice versa) and everything works.
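For reference, you can see which encoding open() will pick when no encoding argument is given; Python 3 consults locale.getpreferredencoding(False) for text files:

```python
# Which encoding does open() use by default? Python 3 asks the locale.
import locale

default = locale.getpreferredencoding(False)
print(default)  # e.g. 'UTF-8' on most modern *nix, 'cp1252' on Western-European Windows
```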

>open() in Python uses the system encoding to read files by default. So, if a Python script writes a text file with some Cyrillic characters on my Russian Windows, another Python script on English Windows or Greek Windows will not be able to read it. This is just what happened.


True. But if I create a text file with "type >foo.txt" or Notepad on a Russian system, I won't be able to open it on that English or Greek system either… but at least I'll be able to open it in Python on the same Russian system. If you changed Python to ignore the locale, that would cause a new problem (the latter would no longer be true) without fixing any existing problem (the former would still not be true).

>The solution proposed is to specify the encoding explicitly. That means I have to know it. Luckily, in this case the text file is my .py, where I knew the encoding beforehand. In the real world you can never know the encoding beforehand.

This is an inherent problem that Python didn't cause, and can't solve. As long as there are text files in different encodings out there, you need to pass the encoding out-of-band. If you're behind the process that creates the files (whether it's a program you wrote, or options you set in Notepad's Save As dialog), you can just make sure to use the same encoding on every system, and you have no problem. But if you need to deal with files that others have created, that won't work.
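If you do control both the writer and the reader, the fix really is just to pin the encoding on every system instead of relying on the locale default. A minimal sketch (the path and content here are placeholders):

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "notes.txt")  # placeholder path

# Writer: always name the encoding explicitly...
with open(path, "w", encoding="utf-8") as f:
    f.write("Привет, мир\n")

# Reader: ...and name the same one when reading, on any platform.
with open(path, encoding="utf-8") as f:
    text = f.read()
```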

>So, what should Python do if it doesn't know the encoding of the text file it opens:
>1. Assume that encoding of text file is the encoding of your operating system
>2. Assume that encoding of text file is ASCII
>3. Assume that encoding of text file is UTF-8
>

>Please write in reply and then scroll down.

That order happens to be exactly my preference. #1 helps for one very common problem—files created by other programs on the same machine. #2 is generally at least safe, in that you'll get an error instead of mojibake. #3 doesn't really help anything.

>I propose three, because ASCII is a binary-compatible subset of UTF-8. Choice one is the current behaviour, and it is very bad. Troubleshooting this issue, which should be very common, requires a lot of prior knowledge about encodings and awareness of different system defaults. For cross-platform work with text files this fact implicitly requires you to always use the 'encoding' parameter for open().

I'm not sure what you mean by "cross-platform" here. Most non-Windows platforms nowadays set the locale to UTF-8 by default (and if you're using an older *nix, or deliberately chose not to use UTF-8 even though it's the default, you already know how to deal with these issues). So, it's really a Windows problem, if anything.

I can understand why Windows users are confused. While OS X and most Linux distros decided that the only way to solve this problem was to push UTF-8 as hard as possible, Windows went a different way. For everything but text files (remember that encoding is an issue for stdio, filesystems, etc.), there's a UTF-16 API, and "native" apps use it consistently. And for text files, many "native" apps will save UTF-8 with the legal-but-discouraged UTF-8 BOM, and can automatically load both UTF-8 and locale-encoded files by checking for that BOM.

Python _could_ do the same thing. By default, opening a new file for "w" could select UTF-8 and write the BOM, and opening a file for "r" (or "r+" or "a") could look for a BOM and use it to decide between UTF-8 and the locale encoding. That would improve interoperability on Windows. However, I think it would be a very bad idea. Most non-Windows programs, and even some Windows programs, do not expect the UTF-8 BOM; they will open the file with your locale charset anyway and show three garbage characters at the front. That's why you sometimes get that "ï»¿" garbage at the start of files, and why you often get errors from programs after you try to write their INI, YAML, etc. files in Notepad, Word, etc.

Switching to UTF-8 would make it harder to read and write files created by other programs on the same machine—and it still wouldn't magically make you able to read and write files created on other machines, unless you only care about files created on recent *nix platforms. The only case it would help is making it easier to read and write files created by _your program_ without worrying about the local machine. While that isn't _nothing_, I don't think it's so important that we can just dismiss dealing with files created by other programs.

After all, you're presumably using plain text files, rather than some binary format or JSON or YAML or XML or whatever, because you want users to be able to view and edit those files, right?

