[Python-3000] BOM handling

Thu Sep 14 10:04:33 CEST 2006

Antoine Pitrou wrote:
> Hi,
> 
> On Wednesday, 13 September 2006 at 16:14 -0700, Josiah Carlson wrote:
>> In any case, I believe that the above behavior is correct for the
>> context.  Why?  Because utf-8 has no endianness, its 'generic' decoding
>> spelling of 'utf-8' is analogous to all three 'utf-16', 'utf-16-be', and
>> 'utf-16-le' decoding spellings; two of which don't strip.
> 
> Your opinion is probably valid from a theoretical point of view. You are
> more knowledgeable than me.
> 
> My point was different: most programmers are not at your level (or
> Paul's level, etc.) when it comes to Unicode knowledge. Py3k's str type
> is supposed to be an abstracted textual type to make it easy to write
> unicode-friendly applications (isn't it?).
> Therefore it should hide the messy issue of superfluous BOMs, unwanted
> BOMs, etc. Telling the programmer to use a specific UTF-8 variant
> specialized in BOM-stripping will make eyes roll... "why doesn't the
> standard UTF-8 do it for me?"
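
For reference, the codec behaviour being debated in the quoted text can be reproduced with a minimal sketch in present-day Python 3. The variable names are illustrative only; the codec names 'utf-8', 'utf-8-sig', 'utf-16' and 'utf-16-le' are the standard library's own spellings.

```python
import codecs

# UTF-8 data that starts with a BOM (the "signature" some editors write).
utf8_data = codecs.BOM_UTF8 + "hello".encode("utf-8")

# The plain 'utf-8' codec keeps the BOM as a leading U+FEFF character.
plain = utf8_data.decode("utf-8")
# The 'utf-8-sig' variant strips it.
stripped = utf8_data.decode("utf-8-sig")

# UTF-16 data, little-endian, with a BOM in front.
utf16_data = codecs.BOM_UTF16_LE + "hello".encode("utf-16-le")

# The generic 'utf-16' codec consumes the BOM to determine byte order.
generic16 = utf16_data.decode("utf-16")
# The endian-specific codec treats the BOM as ordinary data and keeps it.
explicit16 = utf16_data.decode("utf-16-le")

print(repr(plain), repr(stripped), repr(generic16), repr(explicit16))
```

So of the three UTF-16 spellings only the generic one strips, which is the analogy Josiah draws for plain 'utf-8'.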

I've been reading this thread (and the ones that spawned it), and 
there's something about it that's been nagging at me for a while, which 
I am going to attempt to articulate.

The basic controversy centers around the various ways in which Python 
should attempt to deal with character encodings on various platforms, 
but my question is "for what use cases?" To my mind, trying to ask "how 
should we handle character encoding" without indicating what we want to 
use the characters *for* is a meaningless question.

From the standpoint of a programmer writing code to process file 
contents, there's really no such thing as a "text file" - there are only 
various text-based file formats. There are XML files, .ini files, email 
messages and Python source code, all of which need to be processed 
differently.

So when one asks "how do I handle text files", my response is "there 
ain't no such thing" -- and when you ask "well, ok, how do I handle 
text-based file formats", my response is "well it depends on the format".

Yes, there are some operations which can operate on textual data 
regardless of file format (e.g. grep), but these generic operations are 
so basic and uninteresting that one generally doesn't need to write 
Python code to do them. And even in the case of simple Unix utilities 
such as 'cat', *some* a priori knowledge of the file's encoded meaning 
is required - you can't just concatenate two XML files and get anything 
meaningful or valid. Running 'sort' on Python source code is unlikely to 
increase shareholder value or otherwise hold back the tide of entropy.

Any given Python program that I write is going to know *something* about 
the format of the files that it is supposed to read/write, and the most 
important consideration is knowledge of what kinds of other programs are 
going to produce or consume that file. If the file that I am working 
with conforms to a standard (so that the number of producer/consumer 
programs can be large without me having to know the specific details of 
each one) then I need to understand that standard and the constraints 
on what is legal within it.

For files with any kind of structure in them, common practice is that we 
don't treat them as streams of characters, rather we generally have some 
abstraction layer that sits on top of the character stream and allows us 
to work with the structure directly. Thus, when dealing with XML one 
generally uses something like ElementTree, and in fact manipulating XML 
files as straight text is actively discouraged.
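
As a rough illustration of that abstraction layer, here is a minimal sketch using the standard library's xml.etree.ElementTree. The two sample documents and the helper name item_texts are made up for the example. The caller hands over raw bytes and works with elements; the parser deals with the encoding declaration and with a leading UTF-8 BOM.

```python
import codecs
import xml.etree.ElementTree as ET

# Hypothetical sample documents: one declares ISO-8859-1 in its XML
# declaration, the other is UTF-8 with a BOM in front.
latin1_doc = (
    b'<?xml version="1.0" encoding="iso-8859-1"?>'
    b"<root><item>caf\xe9</item></root>"
)
bom_doc = codecs.BOM_UTF8 + "<root><item>café</item></root>".encode("utf-8")


def item_texts(raw_bytes):
    """Return the text of every <item> child of the document root."""
    root = ET.fromstring(raw_bytes)  # the parser decodes the bytes itself
    return [item.text for item in root.findall("item")]


print(item_texts(latin1_doc))
print(item_texts(bom_doc))
```

The calling code never mentions an encoding or a BOM, because those questions are answered by the file format rather than by the generic text layer.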

So my whole approach to the problem of reading and writing is to come up 
with a collection of APIs that reflect the common use patterns for the 
various popular file types. The benefit of doing this is that you don't 
waste time thinking about all of the various file operations that don't 
apply to a particular file format. For example, using the ElementTree 
interface, I don't care whether the underlying file stream supports 
seek() or not - generally one doesn't seek into the middle of an XML file, so 
there's no need to support that feature. On the other hand, if one is 
reading a bdb file, one needs to seek to the location of a record in 
order to read it - but in such a case, the result of the seek operation 
is well-defined. I don't have to spend time discussing what will happen 
if I seek into the middle of an encoded multi-byte character, because 
with a bdb file, that can't happen.
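
The record-seeking case can be sketched with a toy fixed-size record store. This is not the bdb format; the layout (a 32-bit id followed by a 16-byte, NUL-padded UTF-8 name) and the function names are assumptions made purely for illustration. The point is that the API only ever seeks to record boundaries, so a seek can never land inside an encoded multi-byte character.

```python
import io
import struct

# Hypothetical layout: little-endian uint32 id + 16-byte NUL-padded name.
RECORD = struct.Struct("<I16s")


def write_records(stream, rows):
    """Append (id, name) pairs as fixed-size records."""
    for rec_id, name in rows:
        # Names are assumed to fit in 16 encoded bytes; struct pads with NULs.
        stream.write(RECORD.pack(rec_id, name.encode("utf-8")))


def read_record(stream, index):
    """Seek to the start of record `index` and decode it."""
    stream.seek(index * RECORD.size)  # always a record boundary
    rec_id, raw_name = RECORD.unpack(stream.read(RECORD.size))
    return rec_id, raw_name.rstrip(b"\x00").decode("utf-8")


store = io.BytesIO()
write_records(store, [(1, "Zoë"), (2, "naïve"), (3, "plain")])
print(read_record(store, 1))
```

Callers ask for "record 1", never for "byte 23", so the question of what a mid-character seek should do does not arise.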

It seems to me that a lot of the conundrums that have been discussed in 
this thread have to do with hypothetical use cases - 'Well, what if I 
use operation X on a file of format Y, for which the result is 
undefined?' My answer is "Don't do that."

-- Talin