[Python-3000] BOM handling
p.f.moore at gmail.com
Thu Sep 14 14:50:49 CEST 2006
On 9/14/06, Talin <talin at acm.org> wrote:
> I've been reading this thread (and the ones that spawned it), and
> there's something about it that's been nagging at me for a while, which
> I am going to attempt to articulate.
> Any given Python program that I write is going to know *something* about
> the format of the files that it is supposed to read/write, and the most
> important consideration is knowledge of what kinds of other programs are
> going to produce or consume that file. If the file that I am working
> with conforms to a standard (so that the number of producer/consumer
> programs can be large without me having to know the specific details of
> each one) then I need to understand that standard and constraints of
> what is legal within it.
There *is* still an issue, which is that Python needs to supply tools
to cater for naive users writing naive programs to parse/produce
ad-hoc text-based file formats. For example, someone sent me this file
of data, and I want to parse it and convert it into some other format
(load it into a database, generate XML, whatever). In my experience,
in these cases:
1. Nobody tells me the character encoding used.
2. 99.9% of the data is ASCII - so there's very little basis for guessing.
3. The whole process isn't an exact science - I *expect* to have to do
a bit of manual tidying up.
Or it's all ASCII and it *really* doesn't matter.
Those are the bulk of my use cases. For them, I'd be happy with the
"system code page" (even though Windows has two, one for console and
one for GUI, that wouldn't bother me if it was visible to me). I
wouldn't mind UTF-8, or latin-1, or anything much. It's only that 0.1%
of cases where I expect to need to check and possibly intervene, so no
On the other hand, getting an error *would* bother me. In Python 2.x,
I get no error because I don't convert to Unicode. In Python 3.x, I
fear that I might, because someone expects me to care about that 0.1%.
And no, it's not good enough for me to be able to set a global option
- that's boilerplate I'd rather do without.
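
A minimal sketch of the kind of forgiving reader this use case calls for. It assumes the bytes/str split and the `bytes.decode()` API as they eventually shipped in Python 3, which were not settled at the time of this message. The name `read_text_leniently` and its `encodings` parameter are hypothetical, purely for illustration. The idea is to try a strict encoding first (here `utf-8-sig`, which also swallows a leading UTF-8 BOM) and fall back to latin-1, which accepts any byte, so mostly-ASCII data never triggers a decoding error.

```python
def read_text_leniently(path, encodings=("utf-8-sig", "latin-1")):
    """Return (text, encoding_used), trying each encoding in order.

    Hypothetical helper: latin-1 maps every byte value to a code point,
    so keeping it last means UnicodeDecodeError is never raised.
    """
    with open(path, "rb") as f:
        raw = f.read()
    for enc in encodings:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue
    # Only reached if the caller passed a list without a catch-all encoding.
    return raw.decode("ascii", errors="replace"), "ascii"
```

The returned encoding name is there so that the rare non-ASCII case stays visible: the caller can check which encoding was actually used and tidy up by hand if the fallback was needed.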