[Python-3000] BOM handling

Thu Sep 14 19:56:11 CEST 2006

Josiah Carlson wrote:
> Blake Winton <bwinton at latte.ca> wrote:
>> I'm not going to 
>> suggest an API, other than it would be nice if I didn't have to manually 
>> figure out/hard code all the encodings.  (It's my belief that I will 
>> currently have to do that, or at least special-case XML, to read the 
>> encoding attribute.)
> Use the XML tag/attribute <?xml ... encoding="..." ?> to discover the
> encoding and assume utf-8 otherwise as per spec:
> http://www.w3.org/TR/2000/REC-xml-20001006#NT-EncodingDecl

Yeah, but now you're requiring me to read and understand the file's 
contents, which is something I (as someone who doesn't particularly care 
about all this "encoding" stuff) am trying very hard not to do.  Does 
no-one write generic text processing programs anymore?

If I were to write a program which rotated an image using PIL, I 
wouldn't have to care whether it was a png or a jpeg.  (At least, I'm 
pretty sure I wouldn't.  I haven't tried recently.)

>> Oh, and it would be particularly horrible if I 
>> output a shell script in UTF-8, and it included the BOM, since I believe 
>> that would break the "magic number" of "#!".
> Does bash natively support utf-8?

A quick Google gives me:
-------------------------
About bash utf-8:
Bash is the shell, or command language interpreter, that will appear in 
the GNU operating system. It is default shell for BeOS.

By default, GNU bash assumes that every character is one byte long and 
one column wide. It may cause several problems for all non-english BeOS 
users, especially with file names using national characters. A patch for 
bash 2.04, by Marcin 'Qrczak' Kowalczyk and Ricardas Cepas, teaches bash 
about multibyte characters in UTF-8 encoding, and fixes those problems.
Double-width characters, combining characters and bidi are not supported 
by this patch.
-------------------------
which I'm mainly posting here because of the reference to Marcin 
'Qrczak' Kowalczyk.  Small world, but I wouldn't want to paint it.

> Is there a bash equivalent to Python coding: directives?  You may be
> attempting to fix a problem that doesn't exist.

I don't know if the magic number stuff to determine whether a file is 
executable or not is bash-specific.  Either way, when I save the file in 
UTF-8, it's fine, but when I save it in UTF-8 with a BOM, it fails.
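To make the failure concrete, here is a minimal sketch of what I believe is 
going on.  It assumes the "#!" check looks only at the first two bytes of the 
file (on Unix-like systems that check is typically done by the exec machinery 
rather than by bash itself).  The helper names has_shebang and strip_utf8_bom 
are mine, made up for illustration.

```python
import codecs

script = "#!/bin/sh\necho hello\n"

# Same text, encoded without and with a UTF-8 BOM.
plain = script.encode("utf-8")
with_bom = script.encode("utf-8-sig")  # "utf-8-sig" prepends EF BB BF on encode


def has_shebang(data):
    """Return True if the first two bytes are the '#!' magic number."""
    return data[:2] == b"#!"


def strip_utf8_bom(data):
    """Return data without a leading UTF-8 BOM, if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


print(has_shebang(plain))                     # True
print(has_shebang(with_bom))                  # False
print(has_shebang(strip_utf8_bom(with_bom)))  # True
```

With the BOM in front, the file starts with EF BB BF instead of 23 21, so 
anything matching on the raw bytes no longer sees a script.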

>> Yeah, see, at a business level, I really need to process those all in 
>> the same way, and it would be annoying to have to write code to handle 
>> them all differently.
> So you, or anyone else, can write a module for discovering the encoding
> used for a particular file based on XML tags, Python coding: directives,
> etc. It could include an extensible registry, and if it is used enough,
> could be included in the Python standard library.

Okay, so what will happen for file types which aren't in the registry, 
like Windows .rc files?
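For the sake of argument, here is a rough sketch of what such a registry 
might look like.  Nothing like this exists in the standard library as far as 
I know; the names register and guess_encoding, the three sniffers, and the 
"utf-8" fallback are all assumptions of mine.  The fallback line is exactly 
where my question bites: an .rc file matches no sniffer, so it just gets 
whatever default somebody picked.

```python
import codecs
import re

# BOM table. UTF-32 is checked before UTF-16 because the UTF-32 LE BOM
# (FF FE 00 00) begins with the UTF-16 LE BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]

_sniffers = []  # hypothetical registry: callables taking bytes, returning str or None


def register(sniffer):
    """Add a sniffer to the registry; usable as a decorator."""
    _sniffers.append(sniffer)
    return sniffer


@register
def sniff_bom(head):
    for bom, name in _BOMS:
        if head.startswith(bom):
            return name
    return None


@register
def sniff_xml(head):
    # Looks for encoding="..." inside a leading <?xml ... ?> declaration.
    m = re.match(
        rb'<\?xml[^>]*?encoding=["\']([A-Za-z][A-Za-z0-9._-]*)["\']', head
    )
    return m.group(1).decode("ascii") if m else None


@register
def sniff_python(head):
    # Looks for a "coding:" comment in the first two lines.
    for line in head.splitlines()[:2]:
        m = re.match(rb'[ \t\f]*#.*?coding[:=][ \t]*([-\w.]+)', line)
        if m:
            return m.group(1).decode("ascii")
    return None


def guess_encoding(head, default="utf-8"):
    """Try each registered sniffer in order; fall back to `default`."""
    for sniffer in _sniffers:
        encoding = sniffer(head)
        if encoding:
            return encoding
    return default  # unregistered file types (e.g. .rc) end up here


print(guess_encoding(b'<?xml version="1.0" encoding="ISO-8859-1"?><a/>'))
print(guess_encoding(b"# -*- coding: latin-1 -*-\nprint(1)\n"))
print(guess_encoding(b"STRINGTABLE\nBEGIN\nEND\n"))
```

The caller would read the first few hundred bytes of the file in binary mode, 
pass them to guess_encoding, and then reopen the file with the result.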

I was lying up above when I said that I don't care about this sort of 
thing.  I do care, but I also believe that I am, and should be, in the 
minority, and that if we can't ship something that will work for people 
who don't care about this stuff, then we've failed both them and Python.

Later,
Blake.