[Python-3000] Pre-PEP: Easy Text File Decoding

Tue Sep 12 02:41:59 CEST 2006

On 9/11/06, David Hopwood <david.nospam.hopwood at blueyonder.co.uk> wrote:
> > I disagree. If a non-trivial file can be decoded as a UTF-* encoding
> > it probably is that encoding.
>
> That is quite false for UTF-16, at least. It is also false for short UTF-8
> files.

True UTF-16 (as opposed to UTF-16BE/UTF-16LE) files have a BOM.
Also, you can recognize incorrect ones through misuse of surrogates
(unpaired or out-of-order surrogate code units).
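
To make those two checks concrete, here is a minimal sketch, not part of
the proposal itself. The names sniff_bom and decodes_cleanly are
hypothetical. It assumes the whole file is available as bytes, and it
relies on Python's strict decoders rejecting lone surrogates. One known
limitation: a UTF-16 LE file whose first character after the BOM is
U+0000 is indistinguishable from a UTF-32 LE BOM.

```python
import codecs

# Order matters: the UTF-32 LE BOM starts with the same two bytes as the
# UTF-16 LE BOM, so the 4-byte marks are checked first.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def sniff_bom(data):
    """Return a codec name if data starts with a known BOM, else None."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None


def decodes_cleanly(data, encoding):
    """Return True if data decodes under encoding in strict mode.

    For the UTF-16 codecs, strict decoding fails on unpaired surrogates,
    which is the "misuse of surrogates" check.
    """
    try:
        data.decode(encoding)
    except UnicodeDecodeError:
        return False
    return True
```

A caller would first ask sniff_bom for a candidate and then confirm it
with decodes_cleanly before trusting it.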

> > I don't see how it matters whether the
> > file represents Latex or an .htaccess file. XML is a special case
> > because it is specially designed to make encoding detection (not
> > guessing, but detection) easy.
>
> Many other frequently used formats also necessarily start with an ASCII
> character and do not contain NULs, which is at least sufficient to reliably
> detect UTF-16 and UTF-32.

Yes, but these are the two easiest ones.
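
For reference, the NUL-pattern check David describes could look roughly
like the sketch below. The function name sniff_by_nul_pattern is
hypothetical. It only works under his stated assumptions: the file starts
with a non-NUL ASCII character and the text contains no U+0000.

```python
def sniff_by_nul_pattern(data):
    """Guess a BOM-less UTF-16/UTF-32 flavour from the first four bytes.

    Assumes the first character is non-NUL ASCII and the text has no NULs.
    Returns a codec name, or None if the pattern does not match.
    """
    head = data[:4]
    nul = [b == 0 for b in head]  # which of the leading bytes are zero
    if nul == [True, True, True, False]:
        return "utf-32-be"  # 00 00 00 xx
    if nul == [False, True, True, True]:
        return "utf-32-le"  # xx 00 00 00
    if nul[:2] == [True, False]:
        return "utf-16-be"  # 00 xx
    if nul[:2] == [False, True]:
        return "utf-16-le"  # xx 00
    return None
```

A None result says nothing about UTF-8 versus an 8-bit encoding, which is
the harder part of the problem.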

> > But this defeats the whole purpose of the PEP which is to accelerate
> > the writing of quick and dirty text processing scripts.
>
> That doesn't justify making the behaviour of those scripts "dirtier" than
> necessary.
>
> I think that the focus should be on solving a set of well-defined problems,
> for which BOM detection can definitely help:
>
> Suppose we have a system in which some of the files are in a potentially
> non-Unicode 'system' encoding, and some are Unicode. The user of the system
> needs a reliable way of marking the Unicode files so that the encoding of
> *those* files can be distinguished.

If the user understands the problem and is willing to go to this level
of effort then they are not the target user of the feature.

> ... In addition, a provider of portable
> software or documentation needs a way to encode files for distribution that
> is independent of the system encoding, since (before run-time) they don't
> know what encoding that will be on any given system. Use and detection of
> Byte Order Marks solves both of these problems.

Sure, that's great.

> You appear to be arguing for the common use of much more ambitious heuristic
> guessing, which *cannot* be made reliable.

First, the word "guess" necessarily implies unreliability. Guido
started this whole chain of discussion when he said:

"(Auto-detection from sniffing the data is a perfectly valid answer
BTW -- I see no reason why that couldn't be one option, as long as
there's a way to disable it.)"

> ... I am not opposed to providing
> support for such guessing in the Python standard library, but only if its
> limitations are thoroughly documented, and only if it is not the default.

Those are both characteristics of the proposal that started this
thread so what are we arguing about?

Since writing the PEP, I've noticed that the strategy of trying to
decode as UTF-* and falling back to an 8-bit character set is actually
pretty common in text editors, which implies that Python's behaviour
here can closely match theirs. Matching text editor behaviour was the
key requirement Guido gave me in an off-list email for the guessing mode.

VIM: "fileencodings: This is a list of character encodings considered
when starting to edit a file.  When a file is read, Vim tries to use
the first mentioned character encoding.  If an error is detected, the
next one in the list is tried.  When an encoding is found that works,
'fileencoding' is set to it."

Reading the docs, one can infer that this feature is specifically
designed to support UTF-8 sniffing. I would guess that the default
configuration has it do UTF-8 sniffing.
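
The try-each-encoding-in-order strategy is easy to sketch in Python. This
is only an illustration under my own assumptions, not the API proposed in
the PEP: the name decode_with_fallbacks and the default candidate list
are hypothetical.

```python
def decode_with_fallbacks(data, encodings=("utf-8", "latin-1")):
    """Return (text, encoding) for the first encoding that decodes data.

    Candidates are tried in order, in the spirit of Vim's 'fileencodings'.
    Latin-1 accepts every byte sequence, so it only makes sense as the
    last entry in the list.
    """
    for encoding in encodings:
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            continue  # this candidate failed; try the next one
    raise ValueError("none of the candidate encodings decoded the data")
```

The order of the list is the whole policy: a strict multi-byte encoding
such as UTF-8 goes first because it rejects most data that is not really
in that encoding, while an 8-bit fallback never fails and so can never
tell you that the guess was wrong.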

BBEdit: "If the file contains no other cues to indicate its text
encoding, and its contents appear to be valid UTF-8, BBEdit will open
it as UTF-8 (No BOM) without recourse to the preferences option."

 Paul Prescod