[Python-3000] Pre-PEP: Easy Text File Decoding

Tue Sep 12 03:16:15 CEST 2006

On 9/11/06, Marcin 'Qrczak' Kowalczyk <qrczak at knm.org.pl> wrote:
> "Paul Prescod" <paul at prescod.net> writes:
>
> >> The bizarre Windows behaviour of using different
> >> encodings for console and GUI programs doesn't
> >> bother me either. Really. I promise."
> >
> > So according to this philosophy, Windows and Mac users will probably
> > never be able to open UTF-8 documents by default even if every
> > Microsoft app generates and consumes UTF-8 by default, because
> > Microsoft and Apple will probably _never change the default locale_
> > for backwards compatibility reasons.
>
> This can be solved for file reading by making a "Windows locale"
> always consider UTF-8 BOM and switch to UTF-8 in this case.

That's fine but I don't see why we would turn that feature off for any
platform. Do you have a bunch of files hanging around starting with
zero-width non-breaking spaces?
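
A rough sketch of the read side in Python, to make the idea concrete. The
function name read_text and its fallback parameter are made up for
illustration, and treating locale.getpreferredencoding() as "the default
locale encoding" is an assumption, not a settled design:

```python
import codecs
import locale


def read_text(path, fallback=None):
    """Decode as UTF-8 if the file starts with the UTF-8 BOM, else use a fallback."""
    if fallback is None:
        # Assumption: the locale's preferred encoding stands in for the
        # platform default that would otherwise be used.
        fallback = locale.getpreferredencoding(False)
    with open(path, "rb") as f:
        raw = f.read()
    if raw.startswith(codecs.BOM_UTF8):
        # The BOM is U+FEFF encoded as the bytes EF BB BF; drop it and
        # decode the rest as UTF-8 regardless of the locale.
        return raw[len(codecs.BOM_UTF8):].decode("utf-8")
    return raw.decode(fallback)
```

The check runs on every platform, so a file that starts with the BOM is
read as UTF-8 whatever the locale claims.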

> It's still unclear what to do for writing on Windows.

UTF-8 with BOM is the Microsoft preferred format. Maybe after
experimentation we'll find that there are still apps out there that
choke on it, but we should start out trying to be compatible with
other apps on the platform.

> I have no idea what Mac does (does it typically use UTF-8 locales?
> and does it typically use a BOM in UTF-8?).

Like Windows, the Mac has backwards-compatible behaviours in some
places (TextEdit defaults to a proprietary encoding called Mac Roman)
and UTF-8 behaviours in other places (e.g. cut and paste). In some
places (on my configuration) it claims its locale is US ASCII.

TextEdit can read files that start with a BOM and uses the BOM to
auto-detect Unicode. It always saves without a BOM, which results in
the unfortunate situation that TextEdit will recognize a file's
encoding, then save it, then forget its encoding when you reopen it. :(

But again, this implies that at least on these two platforms UTF-8
w/BOM is a good default output encoding.
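
For the write side, a minimal sketch using Python's standard "utf-8-sig"
codec, which emits the BOM when encoding and strips it (if present) when
decoding. The two helper names are hypothetical, and whether this should
be the default on any platform is exactly the open question:

```python
def write_text_with_bom(path, text):
    """Write text as UTF-8 with a leading BOM."""
    # "utf-8-sig" writes EF BB BF once, before the first encoded character.
    with open(path, "w", encoding="utf-8-sig", newline="") as f:
        f.write(text)


def read_text_with_optional_bom(path):
    """Read UTF-8 text, discarding a leading BOM if there is one."""
    # Decoding with "utf-8-sig" also accepts files that have no BOM.
    with open(path, "r", encoding="utf-8-sig", newline="") as f:
        return f.read()
```

A file written this way keeps its marker, so a BOM-sniffing reader can
still recognize it the next time it is opened.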

On Unix, VIM is also set up to auto-detect UTF-8 (using the BOM or a
full decoding attempt). According to Google, XEmacs also has some
kind of UTF-8/BOM detector but I don't know the details. As for GNU
Emacs, the Emacs wiki says: "Auto-detection of UTF-8 is effectively
disabled by default in GNU Emacs 21.3 and below."
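
To illustrate the two detection strategies (BOM check versus a full
decoding attempt), here is a small Python sketch. It is not how any of
these editors actually implement detection; the function name sniff_utf8
and its return values are invented for the example:

```python
import codecs


def sniff_utf8(raw):
    """Return "bom", "valid", or None for a bytes object."""
    if raw.startswith(codecs.BOM_UTF8):
        # Cheap check: only the first three bytes are inspected.
        return "bom"
    try:
        # Full decoding attempt: every byte must form valid UTF-8.
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return None
    return "valid"
```

Note that pure ASCII data also comes back as "valid", so the full
decoding attempt cannot tell ASCII from UTF-8, and text in a legacy
8-bit encoding is usually, though not always, rejected.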

So the situation on Unix is not as clear.

 Paul Prescod