[Python-3000] BOM handling

Thu Sep 14 15:13:19 CEST 2006

Talin wrote:
>> My point was different: most programmers are not at your level (or
>> Paul's level, etc.) when it comes to Unicode knowledge. Py3k's str type
>> is supposed to be an abstracted textual type to make it easy to write
>> unicode-friendly applications (isn't it?).
> 
> The basic controversy centers around the various ways in which Python 
> should attempt to deal with character encodings on various platforms, 
> but my question is "for what use cases?" To my mind, trying to ask "how 
> should we handle character encoding" without indicating what we want to 
> use the characters *for* is a meaningless question.

Contrary to all expectations, this thread has helped me in my day job 
already.  I'm about to start writing a program (in Python, natch) which 
will take a set of files, and perform simple token substitution on them, 
replacing tokens of the form %STUFF.format% with the value of the STUFF 
token looked up in another (XML, thus Unicode by the time it gets to me) 
file.
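
A minimal sketch of the substitution step described above, assuming the
tokens are word characters in the shape %NAME.format% and that the values
have already been read from the XML file into a dict. The names TOKEN_RE
and substitute are made up for illustration, and the ".format" part is
matched but otherwise ignored here, since its meaning is not spelled out.

```python
import re

# Assumed token shape: %NAME.format%, with NAME and format made of word characters.
TOKEN_RE = re.compile(r"%(\w+)\.(\w+)%")


def substitute(text, values):
    """Replace each %NAME.format% token with values[NAME].

    Tokens whose NAME is not in `values` are left untouched.
    """
    def repl(match):
        name = match.group(1)
        # Fall back to the original token text when the name is unknown.
        return values.get(name, match.group(0))

    return TOKEN_RE.sub(repl, text)
```

This works on already-decoded text, so the encoding question is still
open at this point: the caller has to decode the input and encode the
output itself.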

The files I'll be substituting in will be in various encodings, and I'll 
be creating new files which must have the same encoding.  Sadly, I don't 
know what all the encodings are.  (The Windows Resource Compiler takes 
in .rc files, but I can't find any suggestion of what encoding those 
use.  Anyone here know?)
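
When the encoding is unknown, about the only thing a program can check
cheaply is a leading BOM. The sketch below uses the BOM constants from
the standard codecs module; split_bom is a hypothetical helper name. It
returns the BOM bytes together with a BOM-less codec name, so that the
output file can be written as bom + text.encode(codec) and keep the same
byte order and signature as the input. A file with no BOM tells us
nothing, and the caller still has to guess or be told.

```python
import codecs

# Longest BOMs first: the UTF-32 LE BOM starts with the UTF-16 LE BOM bytes.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def split_bom(data):
    """Return (bom, codec) for a recognised leading BOM, else (b"", None)."""
    for bom, codec in _BOMS:
        if data.startswith(bom):
            return bom, codec
    return b"", None
```

Decoding would then be data[len(bom):].decode(codec). One known
ambiguity: a UTF-16 LE file whose first character after the BOM is
U+0000 looks the same as a UTF-32 LE BOM.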

The first version of the spec naively mentioned nothing about encodings, 
and so I raised a red flag about that, seeing that we would have 
problems, and that the right thing to do in this case isn't clear.

Um, what more data do we need for this use-case?  I'm not going to 
suggest an API, other than it would be nice if I didn't have to manually 
figure out/hard code all the encodings.  (It's my belief that I will 
currently have to do that, or at least special-case XML, to read the 
encoding attribute.)  Oh, and it would be particularly horrible if I 
output a shell script in UTF-8, and it included the BOM, since I believe 
that would break the "magic number" of "#!".

(To test it in vim, set the following options before saving the script;
the first dump below is the file saved with the BOM, the second without:
:set encoding=utf-8
:set bomb
)

Jennifer:~ bwinton$ xxd test
0000000: efbb bf23 2120 2f62 696e 2f62 6173 680a  ...#! /bin/bash.
0000010: 6563 686f 204a 7573 7420 7465 7374 696e  echo Just testin
0000020: 672e 2e2e 0a                             g....
Jennifer:~ bwinton$ ./test
-bash: ./test: cannot execute binary file

Jennifer:~ bwinton$ xxd test
0000000: 2321 202f 6269 6e2f 6261 7368 0a65 6368  #! /bin/bash.ech
0000010: 6f20 4a75 7374 2074 6573 7469 6e67 2e2e  o Just testing..
0000020: 2e0a                                     ..
Jennifer:~ bwinton$ ./test
Just testing...
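
The same difference can be reproduced from Python. In this sketch the
"utf-8-sig" codec prepends the three BOM bytes seen in the first dump,
while plain "utf-8" does not. encode_script is a hypothetical helper
that also drops a leading U+FEFF left over from decoding a BOM-carrying
source, so that "#!" ends up at byte offset 0.

```python
import codecs


def encode_script(text):
    """Encode text as UTF-8 with no BOM, keeping "#!" at byte offset 0."""
    # Drop a leading U+FEFF if the text came from a BOM-carrying source.
    if text.startswith("\ufeff"):
        text = text[1:]
    return text.encode("utf-8")


script = "#! /bin/bash\necho Just testing...\n"
with_bom = script.encode("utf-8-sig")  # begins with EF BB BF
without_bom = encode_script(script)    # begins with 23 21 ("#!")
```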

> From the standpoint of a programmer writing code to process file 
> contents, there's really no such thing as a "text file" - there are only 
> various text-based file formats. There are XML files, .ini files, email 
> messages and Python source code, all of which need to be processed 
> differently.

Yeah, see, at a business level, I really need to process those all in 
the same way, and it would be annoying to have to write code to handle 
them all differently.

> For files with any kind of structure in them, common practice is that we 
> don't treat them as streams of characters, rather we generally have some 
> abstraction layer that sits on top of the character stream and allows us 
> to work with the structure directly.

Your common practice, perhaps.  I find myself treating them as streams 
of characters as often as not, because I neither need nor care to 
process the structure.  Heck, even in my source code, I grep more often 
than I use the fancy "Find Usages" button (if only because PyDev in 
Eclipse doesn't let me search for all the usages of a function).

> So my whole approach to the problem of reading and writing is to come up 
> with a collection of APIs that reflect the common use patterns for the 
> various popular file types.

That sounds great.  Can you also come up with an API for the files that 
you don't consider to be in common use?  And if so, that's the one that 
everyone is going to use.  (I'm not saying that to be contrary, but 
because I honestly believe that that's what's going to happen.  If 
there's a choice between using one API for all your files, and using n 
APIs for all your files, my money is always going to be on the one. 
Maybe XML will have enough traction to make it two, but certainly no 
more than that.)

Later,
Blake.