[Python-3000] BOM handling
talin at acm.org
Thu Sep 14 10:04:33 CEST 2006
Antoine Pitrou wrote:
> On Wednesday, 13 September 2006 at 16:14 -0700, Josiah Carlson wrote:
>> In any case, I believe that the above behavior is correct for the
>> context. Why? Because utf-8 has no endianness, its 'generic' decoding
spelling of 'utf-8' is analogous to all three 'utf-16', 'utf-16-be', and
>> 'utf-16-le' decoding spellings; two of which don't strip.
> Your opinion is probably valid from a theoretical point of view. You are
> more knowledgeable than me.
> My point was different: most programmers are not at your level (or
> Paul's level, etc.) when it comes to Unicode knowledge. Py3k's str type
> is supposed to be an abstracted textual type to make it easy to write
> unicode-friendly applications (isn't it?).
> Therefore it should hide the messy issue of superfluous BOMs, unwanted
> BOMs, etc. Telling the programmer to use a specific UTF-8 variant
> specialized in BOM-stripping will make eyes roll... "why doesn't the
> standard UTF-8 do it for me?"
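[For concreteness, the asymmetry Josiah describes can be shown directly with Python's codecs, 'utf-8-sig' being the BOM-stripping variant Antoine alludes to -- a minimal sketch using Python 3 spellings:]

```python
# 'utf-8-sig' is the BOM-stripping UTF-8 variant; plain 'utf-8' is not.
data = '\ufeffhello'.encode('utf-8')           # UTF-8 bytes with a leading BOM
assert data.decode('utf-8') == '\ufeffhello'   # plain 'utf-8' keeps the BOM
assert data.decode('utf-8-sig') == 'hello'     # 'utf-8-sig' strips it

# For UTF-16, the generic spelling consumes the BOM, the explicit-endian
# spellings do not -- the behavior 'utf-8' is being compared against.
be = '\ufeffhello'.encode('utf-16-be')
assert be.decode('utf-16') == 'hello'          # generic 'utf-16' consumes the BOM
assert be.decode('utf-16-be') == '\ufeffhello' # 'utf-16-be' does not strip
```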
I've been reading this thread (and the ones that spawned it), and
there's something about it that's been nagging at me for a while, which
I am going to attempt to articulate.
The basic controversy centers around the various ways in which Python
should attempt to deal with character encodings on various platforms,
but my question is "for what use cases?" To my mind, trying to ask "how
should we handle character encoding" without indicating what we want to
use the characters *for* is a meaningless question.
From the standpoint of a programmer writing code to process file
contents, there's really no such thing as a "text file" - there are only
various text-based file formats. There are XML files, .ini files, email
messages and Python source code, all of which need to be processed
differently.
So when one asks "how do I handle text files", my response is "there
ain't no such thing" -- and when you ask "well, ok, how do I handle
text-based file formats", my response is "well it depends on the format".
Yes, there are some operations which can operate on textual data
regardless of file format (i.e. grep), but these generic operations are
so basic and uninteresting that one generally doesn't need to write
Python code to do them. And even in the case of simple unix utilities such
as 'cat', *some* a priori knowledge of the file's encoded meaning is
required - you can't just concatenate two XML files and get anything
meaningful or valid. Running 'sort' on Python source code is unlikely to
increase shareholder value or otherwise hold back the tide of entropy.
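[A quick sketch of the 'cat on XML' point: naively concatenating two well-formed XML documents produces something no XML parser will accept.]

```python
import xml.etree.ElementTree as ET

a = '<root><x/></root>'
b = '<root><y/></root>'

ok = False
try:
    ET.fromstring(a + b)     # two document elements: not well-formed XML
except ET.ParseError:        # the parser rejects the junk after the first root
    ok = True
assert ok
```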
Any given Python program that I write is going to know *something* about
the format of the files that it is supposed to read/write, and the most
important consideration is knowledge of what kinds of other programs are
going to produce or consume that file. If the file that I am working
with conforms to a standard (so that the number of producer/consumer
programs can be large without me having to know the specific details of
each one) then I need to understand that standard and constraints of
what is legal within it.
For files with any kind of structure in them, common practice is that we
don't treat them as streams of characters, rather we generally have some
abstraction layer that sits on top of the character stream and allows us
to work with the structure directly. Thus, when dealing with XML one
generally uses something like ElementTree, and in fact manipulating XML
files as straight text is actively discouraged.
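[As an illustration of that abstraction layer: with ElementTree you address the document's structure directly and never manipulate the underlying character stream yourself.]

```python
import xml.etree.ElementTree as ET

# Structured access to the document; no raw text handling, no encoding
# or seek() concerns at this level of the API.
root = ET.fromstring('<doc><item>one</item><item>two</item></doc>')
items = [e.text for e in root.findall('item')]
assert items == ['one', 'two']
```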
So my whole approach to the problem of reading and writing is to come up
with a collection of APIs that reflect the common use patterns for the
various popular file types. The benefit of doing this is that you don't
waste time thinking about all of the various file operations that don't
apply to a particular file format. For example, using the ElementTree
interface, I don't care whether the underlying file stream supports
seek() or not - generally one doesn't seek into the middle of an XML file,
so
there's no need to support that feature. On the other hand, if one is
reading a bdb file, one needs to seek to the location of a record in
order to read it - but in such a case, the result of the seek operation
is well-defined. I don't have to spend time discussing what will happen
if I seek into the middle of an encoded multi-byte character, because
with a bdb file, that can't happen.
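[The fixed-record case can be sketched as follows -- a hypothetical record layout, not the actual bdb format; the point is only that every seek offset lands on a record boundary, never inside a multi-byte character:]

```python
import io
import struct

# Hypothetical fixed-size record: a 4-byte int key plus 8 name bytes.
RECORD = struct.Struct('<i8s')

buf = io.BytesIO()
for i, name in enumerate([b'alpha', b'beta', b'gamma']):
    buf.write(RECORD.pack(i, name.ljust(8, b'\0')))

# Record boundaries are known in advance, so the seek is well-defined.
buf.seek(2 * RECORD.size)
key, name = RECORD.unpack(buf.read(RECORD.size))
assert (key, name.rstrip(b'\0')) == (2, b'gamma')
```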
It seems to me that a lot of the conundrums that have been discussed in
this thread have to do with hypothetical use cases - 'Well, what if I
use operation X on a file of format Y, for which the result is
undefined?' My answer is "Don't do that."