[Python-ideas] Python 3 open() text files: make encoding parameter optional for cross-platform scripts

Andrew Barnert abarnert at yahoo.com
Sun Jun 9 23:52:32 CEST 2013


On Jun 9, 2013, at 12:59, Yuval Greenfield <ubershmekel at gmail.com> wrote:

> It's plagued websites, browsers, email clients, adobe photoshop and premiere, excel, word, and powerpoint. It's always been a guessing game when a friend would call for help proclaiming "all I'm getting is Chinese" which is the written gibberish euphemism used around here. Sometimes it's just the word or letter ordering that's messed up (Hebrew is an RTL language). Most Israelis have experienced and fear this phenomenon.
> 
> If I were to try and fix a problem I'd either be using notepad with its heuristics or iterating through the above options.

There's definitely a case to be made for implementing some kind of Notepad-like heuristics in Python. It would be great to be able to do this at the interactive interpreter:

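# codecs.guess is hypothetical; no such function exists in the stdlib
line = text.partition(b'\n')[0]  # text holds the undecoded bytes
for encoding in codecs.guess(text)[:10]:
    print(encoding, line.decode(encoding))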

In fact, if you wrote that and pushed it to PyPI, I'd start using it today, and maybe even start lobbying for its inclusion in the stdlib.
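
Short of real heuristics, a crude stand-in can be sketched by trying a fixed list of candidates and keeping the ones that decode without error. The name guess_encodings and the candidate list are made up for illustration, not an existing API. A successful decode only means the bytes were legal in that encoding, not that the result is readable:

```python
def guess_encodings(data, candidates=("utf-8", "utf-16", "cp1255",
                                      "iso8859_8", "cp1252")):
    """Return (encoding, decoded_first_line) for each candidate that decodes.

    Hypothetical helper: it ranks nothing and applies no heuristics,
    it only filters out encodings that raise on the first line.
    """
    line = data.partition(b"\n")[0]
    results = []
    for encoding in candidates:
        try:
            results.append((encoding, line.decode(encoding)))
        except UnicodeDecodeError:
            # These bytes are not valid in this encoding; skip it.
            pass
    return results


if __name__ == "__main__":
    sample = "\u05e9\u05dc\u05d5\u05dd\n\u05e2\u05d5\u05dc\u05dd".encode("cp1255")
    for encoding, line in guess_encodings(sample):
        print(encoding, line)
```

Several candidates will usually survive, which is exactly why a human still has to look at the output.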

But I wouldn't want open to use it, and I don't think you would either.

> Sometimes the above encodings were the platform's (windows') default encoding, but in my experience it was mainly applications or websites that chose their encoding for whatever reasons.

But open wouldn't affect those things anyway. You're dealing with urlopen or socket.makefile or a file you've opened as binary and extracted text from.

The fact that local text files benefit from assuming the default encoding, but nothing else does, is an argument for, not against, the 3.x status quo: open (in text mode) assumes the default encoding, even though nothing else does. 
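
To make the distinction concrete, here is a minimal sketch, assuming the fallback for text-mode open is whatever locale.getpreferredencoding(False) reports (the file name is arbitrary). Text-mode open picks an encoding for you, while bytes read in binary mode, like bytes from a socket, have to be decoded explicitly:

```python
import locale
import os
import tempfile

# What the locale reports as the preferred text encoding on this system.
default = locale.getpreferredencoding(False)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.txt")

    # Text mode, no encoding argument: open() chooses one for us.
    with open(path, "w") as f:
        f.write("plain text")
        used = f.encoding

    # Binary mode: we get raw bytes and must name an encoding ourselves.
    with open(path, "rb") as f:
        text = f.read().decode(default)

print(default, used, text)
```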

> E.g. Windows Internals 4th edition promoted ucs-2 as the killer encoding that all windows applications should be implemented with.

Yes, Microsoft strongly encouraged first UCS-2, then UTF-16, consistently for over 15 years, and their APIs are still all built around it. But note that the one exception they've always made is in text files.

If you want to save a file in UTF-16, you don't use the "narrow" API functions and en/decode to UTF-16; you use the wide functions. The narrow functions are for writing in the ANSI code page. That's why programs like Notepad, going back to Windows 95 and NT 3, gave you separate "Save As" options for "Text File" and "Unicode Text File", instead of a pulldown or checkbox to select an encoding. A (narrow) text file is a file in your ANSI code page, period.

For a while, Microsoft encouraged you to save files in UTF-8 with an explicit BOM, even though that's strongly discouraged by the standards. But they've never suggested just writing UTF-8 to text files, without the BOM, instead of using the ANSI code page.
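
For reference, Python already has a codec for the BOM-prefixed flavor. A small sketch of how "utf-8-sig" differs from plain "utf-8" (the sample string is arbitrary):

```python
import codecs

payload = "\u05e9\u05dc\u05d5\u05dd"

with_bom = payload.encode("utf-8-sig")   # writes the 3-byte BOM first
without_bom = payload.encode("utf-8")    # no BOM

# The only difference between the two byte strings is the leading BOM.
print(with_bom == codecs.BOM_UTF8 + without_bom)

# "utf-8-sig" accepts input with or without the BOM and strips it if present.
print(with_bom.decode("utf-8-sig") == payload)
print(without_bom.decode("utf-8-sig") == payload)

# Plain "utf-8" leaves the BOM in the text as a leading U+FEFF.
print(with_bom.decode("utf-8") == "\ufeff" + payload)
```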

I'm not saying that this was a good decision by Microsoft, or that it doesn't have bad repercussions today. Just that the least bad answer for Python on Windows is what Python 3 already does.

> Perhaps you guys are used to more os-encoding-abiding applications and value that quality.

Yes, it has helped me numerous times in the past, with ini files and log files, files generated by DOS programs and ports from Unix, etc. I've been able to take Python code that I wrote on other platforms and use it on Windows and—not every time, but more often than not—it just worked.

And in the cases where it didn't work, it's usually been because the Windows files were in UTF-16, so defaulting to UTF-8 wouldn't have helped anything. 
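
Those UTF-16 cases are at least easy to detect when the file starts with a BOM. A rough sketch, assuming you read the first bytes in binary mode; sniff_encoding is a hypothetical helper, and it ignores UTF-32 and BOM-less UTF-16 entirely:

```python
import codecs
import locale


def sniff_encoding(data):
    """Pick an encoding from a leading BOM, else fall back to the locale default.

    Hypothetical helper for illustration only. UTF-32 is not handled:
    its little-endian BOM also begins with the UTF-16 LE BOM bytes.
    """
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        # The "utf-16" codec reads the BOM to choose the byte order.
        return "utf-16"
    return locale.getpreferredencoding(False)


if __name__ == "__main__":
    raw = "log entry".encode("utf-16")
    print(sniff_encoding(raw), raw.decode(sniff_encoding(raw)))
```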

> That kind of consistency indeed would have saved me from at least some heart ache. I just wish we can get rid of these problems for good, and promoting utf-8 everywhere is one way to go about it.

I agree wholeheartedly. But the only reasonable fix that will solve the problem for text files is Microsoft doing what Apple, Red Hat, Ubuntu, Google, etc. did—ship systems where the default encoding is UTF-8 in every region. And I don't think Python working poorly with local text files on Windows would be a significant stick beating them in that direction.

