[Python-3000] BOM handling

Thu Sep 14 18:28:39 CEST 2006

Blake Winton <bwinton at latte.ca> wrote:
[snip]
> Um, what more data do we need for this use-case?  I'm not going to 
> suggest an API, other than it would be nice if I didn't have to manually 
> figure out/hard code all the encodings.  (It's my belief that I will 
> currently have to do that, or at least special-case XML, to read the 
> encoding attribute.)  Oh, and it would be particularly horrible if I 
> output a shell script in UTF-8, and it included the BOM, since I believe 
> that would break the "magic number" of "#!".

Use the XML declaration, <?xml ... encoding="..." ?>, to discover the
encoding, and assume UTF-8 otherwise, as per the spec:
http://www.w3.org/TR/2000/REC-xml-20001006#NT-EncodingDecl
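
As a rough illustration of that rule, here is a minimal sketch. The name
sniff_xml_encoding is made up for this example and is not an existing API.
It assumes the document is in an ASCII-compatible encoding, so it does not
handle UTF-16 or the other cases the spec covers:

```python
import codecs
import re

# Hypothetical helper for illustration; not part of any library.
# Matches an XML declaration at the very start of the data and
# captures the value of its encoding pseudo-attribute.
_XML_DECL = re.compile(
    rb'^<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def sniff_xml_encoding(head: bytes) -> str:
    """Guess an XML document's encoding from its first bytes.

    Assumes an ASCII-compatible encoding; UTF-16 is not handled.
    """
    # Skip a UTF-8 BOM if one is present.
    if head.startswith(codecs.BOM_UTF8):
        head = head[len(codecs.BOM_UTF8):]
    match = _XML_DECL.match(head)
    if match:
        return match.group(1).decode("ascii").lower()
    # No encoding declaration: fall back to UTF-8.
    return "utf-8"
```

Passing the first hundred bytes or so of the file should be enough, since
the declaration has to come first.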

Does bash natively support utf-8?  Is there a bash equivalent to Python
coding: directives?  You may be attempting to fix a problem that doesn't
exist.

> Yeah, see, at a business level, I really need to process those all in 
> the same way, and it would be annoying to have to write code to handle 
> them all differently.

So you, or anyone else, can write a module for discovering the encoding
used for a particular file based on XML tags, Python coding: directives,
etc. It could include an extensible registry, and if it is used enough,
could be included in the Python standard library.
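
To make the idea concrete, here is one way such a module might look. This
is only a sketch: register, discover_encoding and the two sniffers are
hypothetical names, nothing like this exists in the standard library, and
the patterns only cope with ASCII-compatible encodings:

```python
import re

# Hypothetical registry of sniffers, tried in registration order.
_sniffers = []


def register(sniffer):
    """Register a callable that takes the first bytes of a file and
    returns an encoding name, or None if it cannot tell."""
    _sniffers.append(sniffer)
    return sniffer


@register
def xml_declaration(head):
    # Look for <?xml ... encoding="..." ?> at the start of the data.
    m = re.match(rb'<\?xml[^>]*?encoding\s*=\s*["\']([-\w.]+)["\']', head)
    return m.group(1).decode("ascii") if m else None


@register
def python_coding_directive(head):
    # PEP 263: the directive must be on the first or second line.
    for line in head.splitlines()[:2]:
        m = re.match(rb'^[ \t\f]*#.*?coding[:=][ \t]*([-\w.]+)', line)
        if m:
            return m.group(1).decode("ascii")
    return None


def discover_encoding(head, default="utf-8"):
    """Return the first encoding any registered sniffer reports."""
    for sniffer in _sniffers:
        encoding = sniffer(head)
        if encoding is not None:
            return encoding
    return default
```

Anyone with another file format could then register their own sniffer
without touching the module itself.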

 - Josiah