[Python-3000] Pre-PEP: Easy Text File Decoding

David Hopwood david.nospam.hopwood at blueyonder.co.uk
Tue Sep 12 01:25:15 CEST 2006


Paul Prescod wrote:
> On 9/10/06, David Hopwood <david.nospam.hopwood at blueyonder.co.uk> wrote:
> 
>> ... if you think that guessing based on content is a good idea -- I
>> don't. In any case, such guessing necessarily depends on the expected file
>> format, so it should be done by the application itself, or by a library that
>> knows more about the format.
> 
> I disagree. If a non-trivial file can be decoded as a UTF-* encoding
> it probably is that encoding.

That is quite false for UTF-16, at least. It is also false for short UTF-8
files.
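The UTF-16 case is easy to demonstrate: almost any even-length byte sequence decodes "successfully" as UTF-16, because nearly every 16-bit code unit is a valid character. A contrived sketch (the example bytes are my own, not from the thread):

```python
# Plain ASCII bytes that nonetheless decode without error as UTF-16-LE,
# illustrating why "it decodes cleanly, so it must be that encoding"
# fails, especially for short files.
data = b"size"                       # four ordinary ASCII bytes
as_utf16 = data.decode("utf-16-le")  # raises no exception...
print(repr(as_utf16))                # ...but yields two CJK ideographs
```

Successful decoding here tells us nothing: the result is two characters from the CJK Unified Ideographs block, not the word "size".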

> I don't see how it matters whether the
> file represents Latex or an .htaccess file. XML is a special case
> because it is specially designed to make encoding detection (not
> guessing, but detection) easy.

Many other frequently used formats also necessarily start with an ASCII
character and do not contain NULs, which is at least sufficient to reliably
detect UTF-16 and UTF-32.
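For such formats the leading byte pattern alone distinguishes the UTF-16 and UTF-32 variants from UTF-8. A minimal sketch of the idea (the function name is hypothetical; this assumes, as stated above, that the text starts with an ASCII character and contains no NUL characters):

```python
def sniff_utf(prefix: bytes) -> str:
    """Distinguish UTF-8/16/32 for text known to start with an ASCII
    character and to contain no NULs. A hypothetical helper, not any
    stdlib API."""
    # UTF-32 must be checked before UTF-16: a UTF-32-LE ASCII character
    # (e.g. b'A\x00\x00\x00') also matches the UTF-16-LE pattern.
    if len(prefix) >= 4:
        if prefix[1:4] == b"\x00\x00\x00":
            return "utf-32-le"     # XX 00 00 00
        if prefix[0:3] == b"\x00\x00\x00":
            return "utf-32-be"     # 00 00 00 XX
    if len(prefix) >= 2:
        if prefix[1] == 0:
            return "utf-16-le"     # XX 00
        if prefix[0] == 0:
            return "utf-16-be"     # 00 XX
    return "utf-8"                 # no NUL in the first bytes
```

This is detection, not guessing, precisely because the format guarantees the leading ASCII character.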

>> If the encoding of a text stream were settable after it had been opened,
>> then it would be easy for anyone to implement whatever guessing algorithm
>> they needed, without having to write an encoding implementation or
>> include any other support for guessing in the I/O library itself.
> 
> But this defeats the whole purpose of the PEP which is to accelerate
> the writing of quick and dirty text processing scripts.

That doesn't justify making the behaviour of those scripts "dirtier" than
necessary.

I think that the focus should be on solving a set of well-defined problems,
for which BOM detection can definitely help:

Suppose we have a system in which some of the files are in a potentially
non-Unicode 'system' encoding, and some are Unicode. The user of the system
needs a reliable way of marking the Unicode files so that the encoding of
*those* files can be distinguished. In addition, a provider of portable
software or documentation needs a way to encode files for distribution that
is independent of the system encoding, since (before run-time) they don't
know what encoding that will be on any given system. Use and detection of
Byte Order Marks solves both of these problems.
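Given that convention, BOM detection is both simple and reliable. A minimal sketch using the standard library's BOM constants (the helper name and the "latin-1" fallback are my own assumptions, standing in for whatever the system encoding happens to be):

```python
import codecs

# Longest BOMs first: the UTF-32-LE BOM (FF FE 00 00) begins with the
# UTF-16-LE BOM (FF FE), so order matters here.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8,     "utf-8-sig"),   # -sig codec strips the BOM
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

def detect_bom(prefix: bytes, fallback: str = "latin-1") -> str:
    """Return the codec named by a leading BOM, else the system
    (fallback) encoding. A hypothetical helper, not a stdlib API."""
    for bom, name in _BOMS:
        if prefix.startswith(bom):
            return name
    return fallback
```

Files without a BOM fall through to the system encoding, which is exactly the marking scheme described above.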

You appear to be arguing for the common use of much more ambitious heuristic
guessing, which *cannot* be made reliable. I am not opposed to providing
support for such guessing in the Python standard library, but only if its
limitations are thoroughly documented, and only if it is not the default.

-- 
David Hopwood <david.nospam.hopwood at blueyonder.co.uk>



