[Python-Dev] Unicode byte order mark decoding

Walter Dörwald walter at livinglogic.de
Wed Apr 6 13:48:48 CEST 2005


Stephen J. Turnbull wrote:
> >>>>> "Martin" == Martin v Löwis <martin at v.loewis.de> writes:
> 
>     Martin> I can't put these two paragraphs together. If you think
>     Martin> that explicit is better than implicit, why do you not want
>     Martin> to make different calls for the first chunk of a stream,
>     Martin> and the subsequent chunks?
> 
> Because the signature/BOM is not a chunk, it's a header.  Handling the
> signature/BOM is part of stream initialization, not translation, to my
> mind.
> 
> The point is that explicitly using a stream shows that initialization
> (and finalization) matter.  The default can be BOM or not, as a
> pragmatic matter.  But then the stream data itself can be treated
> homogeneously, as implied by the notion of stream.
> 
> I think it probably also would solve Walter's conundrum about
> buffering the signature/BOM if responsibility for that were moved out
> of the codecs and into the objects where signatures make sense.

Not really. In every encoding where a sequence of more than one byte
maps to one Unicode character, a decoder that is fed its input in
chunks will always need some kind of buffering, because a chunk can end
in the middle of such a sequence. Removing the handling of initial BOMs
from the codecs (except for UTF-16, where it is required) wouldn't
change any of these buffering requirements.
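
As a minimal sketch of that point (the chunk boundaries below are made
up for illustration, and the code uses the incremental decoder API from
the codecs module in current Python), a multi-byte sequence split across
two chunks has to be held back by the decoder whether or not a BOM is
involved:

```python
import codecs

# Hypothetical input: "ö" is two bytes in UTF-8 (0xC3 0xB6), and the
# chunk boundary falls between them. No BOM is present at all.
utf8_chunks = [b"D\xc3", b"\xb6rwald"]

utf8_decoder = codecs.getincrementaldecoder("utf-8")()
utf8_pieces = [utf8_decoder.decode(chunk) for chunk in utf8_chunks]
utf8_pieces.append(utf8_decoder.decode(b"", final=True))
utf8_text = "".join(utf8_pieces)

# The first call can only return "D"; the lone 0xC3 byte stays buffered
# inside the decoder until the next chunk completes the sequence.

# The same mechanism covers a UTF-16 BOM that is split across chunks:
# the decoder keeps the single 0xFF byte until it sees the 0xFE.
utf16_chunks = [b"\xff", b"\xfea\x00"]

utf16_decoder = codecs.getincrementaldecoder("utf-16")()
utf16_pieces = [utf16_decoder.decode(chunk) for chunk in utf16_chunks]
utf16_pieces.append(utf16_decoder.decode(b"", final=True))
utf16_text = "".join(utf16_pieces)
```

In both cases the buffering lives in the decoder, so taking the BOM
handling out would leave the buffer itself in place.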

> I don't know whether that's really feasible in the short run---I
> suspect there may be a lot of stream-like modules that would need to
> be updated---but it would be saner in the long run.

I'm not exactly sure what you're proposing here. That all codecs (even
UTF-16) pass the BOM through and some other infrastructure is
responsible for dropping it?
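
If that is the idea, one possible reading could be sketched roughly as
follows. This is only my guess at the proposal, not an existing API:
decode_dropping_bom is a hypothetical helper, and "utf-16-le" merely
stands in for a codec that passes the BOM through as U+FEFF instead of
consuming it.

```python
import codecs

BOM = "\ufeff"  # what a pass-through codec would hand on as the signature


def decode_dropping_bom(chunks, encoding="utf-16-le"):
    """Hypothetical stream-level helper: the codec passes the BOM through,
    and this layer drops it from the start of the decoded text."""
    decoder = codecs.getincrementaldecoder(encoding)()
    at_start = True
    for chunk in chunks:
        text = decoder.decode(chunk)
        if at_start and text:
            # Only the very first decoded character can be a signature.
            at_start = False
            if text.startswith(BOM):
                text = text[1:]
        if text:
            yield text
    # Flush the decoder; this raises if the input ended mid-character.
    tail = decoder.decode(b"", final=True)
    if tail:
        yield tail


# Made-up chunks: the BOM bytes 0xFF 0xFE are split across two chunks.
with_bom = "".join(decode_dropping_bom([b"\xff", b"\xfea\x00", b"b\x00"]))
without_bom = "".join(decode_dropping_bom([b"a\x00", b"b\x00"]))
```

Note that the decoder still has to buffer the lone 0xFF byte from the
first chunk, so the buffering stays where it was.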

> [...]

Bye,
    Walter Dörwald

