Guido van Rossum writes:
I have definitely seen BOMs written by Notepad on Windows 10.
I'm not clear on what circumstances we care whether a UTF-8 file has a UTF-8 signature. Most software doesn't care: it just reads the BOM in and spits it back out if it's there and hasn't been edited out. If people are seeing UTF-16 BOMs, those may be worth detecting, depending on how often they turn up and how much trouble they are to deal with. I'm just saying that I never see them, and I was pretty careful to say that my sample is quite restricted. However ...
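As a minimal sketch of why most software "doesn't care": Python's "utf-8-sig" codec transparently strips a leading UTF-8 BOM on read, while plain "utf-8" hands it to you as a U+FEFF character. (The file name here is just an illustration.)

```python
import codecs
import os
import tempfile

# Build a file whose contents start with the UTF-8 signature (BOM).
path = os.path.join(tempfile.mkdtemp(), "bom.txt")
with open(path, "wb") as f:
    f.write(codecs.BOM_UTF8 + "hello".encode("utf-8"))

with open(path, encoding="utf-8") as f:
    raw = f.read()        # "\ufeffhello" -- the BOM survives as a character

with open(path, encoding="utf-8-sig") as f:
    clean = f.read()      # "hello" -- the BOM is stripped on decode
```

So a tool that reads with "utf-8" and writes the text back out unmodified will preserve the signature without ever noticing it.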
Guido van Rossum writes:
Why can’t the future be that open() in text mode guesses the encoding?
The medium-term future is UTF-8 in all UIs and public APIs, except for archivists. I think we all agree on that.

There are two issues with encoding guessing. The statistically unimportant one (at least for the UTFs) is that guessing is guessing: it will get things wrong, and the people who want guessing are mostly the people who will be hurt most by wrong guesses.

Second, and a real issue for design AFAICS: if you introduce detection of other encodings into open(), the programmer may need to (1) discover the detected encoding in order to match it on output (open() does not return it), or (2) choose the correct encoding on output, which may or may not be the detected one, depending on what the next piece of software in the pipeline expects. At that point "in the face of ambiguity" really does bind, "although practicality" notwithstanding. I'm not sure that putting detection into open() solves any problems; it just pushes them into other parts of the code.

Remark: As I understand it, Naoki's proposal is about the casual coder in a monolingual environment, where either defaulting to getpreferredencoding() DTRTs or they need UTF-8 because some engineer decided "UTF-8 is the future, and in my project the future is now!" I don't think it's intended to be more general than that, but you'll have to ask him about that.

Steve
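A minimal sketch of the round-trip problem described above. The `open_guessing` helper is hypothetical (it is not a real or proposed API), and its "detection" is simulated by trying UTF-8 and then Latin-1; the point is only that the winning encoding never reaches the caller, so the output side must pick an encoding blind.

```python
import os
import tempfile

def open_guessing(path):
    """Hypothetical guessing reader: tries utf-8, then latin-1.
    Which candidate succeeded is lost inside this function."""
    for enc in ("utf-8", "latin-1"):
        try:
            with open(path, encoding=enc) as f:
                return f.read()     # `enc` is not returned to the caller
        except UnicodeDecodeError:
            continue
    raise ValueError("no candidate encoding decoded %r" % path)

path = os.path.join(tempfile.mkdtemp(), "data.txt")
with open(path, "wb") as f:
    f.write(b"caf\xe9")             # Latin-1 bytes for "café"

text = open_guessing(path)          # decoded via the latin-1 fallback

# The caller, not knowing the input was Latin-1, writes UTF-8 by choice:
with open(path, "w", encoding="utf-8") as f:
    f.write(text)

with open(path, "rb") as f:
    round_tripped = f.read()        # b"caf\xc3\xa9" -- the bytes changed
```

The decoded text is correct either way; it's the bytes on disk that silently differ, which is exactly what the next consumer in the pipeline sees.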