[Python-Dev] Unicode byte order mark decoding
walter at livinglogic.de
Wed Apr 6 13:48:48 CEST 2005
Stephen J. Turnbull wrote:
> >>>>> "Martin" == Martin v Löwis <martin at v.loewis.de> writes:
> Martin> I can't put these two paragraphs together. If you think
> Martin> that explicit is better than implicit, why do you not want
> Martin> to make different calls for the first chunk of a stream,
> Martin> and the subsequent chunks?
> Because the signature/BOM is not a chunk, it's a header. Handling the
> signature/BOM is part of stream initialization, not translation, to my
> mind.
> The point is that explicitly using a stream shows that initialization
> (and finalization) matter. The default can be BOM or not, as a
> pragmatic matter. But then the stream data itself can be treated
> homogeneously, as implied by the notion of stream.
> I think it probably also would solve Walter's conundrum about
> buffering the signature/BOM if responsibility for that were moved out
> of the codecs and into the objects where signatures make sense.
Not really. In every encoding where a sequence of more than one byte
maps to one Unicode character, you will always need some kind of
buffering. If we remove the handling of initial BOMs from the codecs
(except for UTF-16, where it is required), this wouldn't change any of
that.
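
As a rough illustration of the buffering point, here is a minimal sketch using the incremental decoder API found in current Python (codecs.getincrementaldecoder, which postdates this message). It assumes nothing about BOM handling: it only shows that a decoder fed a multi-byte sequence in pieces has to hold on to the incomplete bytes until the character is complete.

```python
import codecs

# U+20AC (EURO SIGN) occupies three bytes in UTF-8.
data = "\u20ac".encode("utf-8")

decoder = codecs.getincrementaldecoder("utf-8")()

# Feed the bytes one at a time. The decoder cannot emit anything until
# it has seen the whole sequence, so it must buffer the partial input.
pieces = [decoder.decode(data[i:i + 1]) for i in range(len(data))]

# Signal end of input; nothing is left over for well-formed data.
pieces.append(decoder.decode(b"", final=True))

print(pieces)
```

The first two calls return empty strings and the character only appears once the third byte arrives, regardless of whether any signature was present at the start of the stream.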
> I don't know whether that's really feasible in the short run---I
> suspect there may be a lot of stream-like modules that would need to
> be updated---but it would be saner in the long run.
I'm not exactly sure what you're proposing here. That all codecs (even
UTF-16) pass the BOM through and some other infrastructure is
responsible for dropping it?
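
To make the two alternatives concrete, the sketch below compares codecs that drop an initial BOM with codecs that pass it through as U+FEFF, using the codec names available in current Python ("utf-8-sig" did not yet exist when this was written). The helper drop_signature is hypothetical and only stands in for the "other infrastructure" that would remove the signature outside the codec.

```python
import codecs

text = "abc"

# UTF-16: the generic "utf-16" codec consumes the BOM to pick a byte
# order, while the explicit "utf-16-le" codec passes it through.
utf16_bytes = codecs.BOM_UTF16_LE + text.encode("utf-16-le")
utf16_dropped = utf16_bytes.decode("utf-16")
utf16_passed = utf16_bytes.decode("utf-16-le")

# UTF-8: plain "utf-8" passes the signature through as U+FEFF,
# while "utf-8-sig" removes it.
utf8_bytes = codecs.BOM_UTF8 + text.encode("utf-8")
utf8_passed = utf8_bytes.decode("utf-8")
utf8_dropped = utf8_bytes.decode("utf-8-sig")


def drop_signature(chars):
    """Hypothetical stream-level helper: strip one leading U+FEFF."""
    return chars[1:] if chars.startswith("\ufeff") else chars


print(repr(utf16_dropped), repr(utf16_passed))
print(repr(utf8_passed), repr(utf8_dropped))
print(repr(drop_signature(utf8_passed)))
```

With pass-through codecs, the caller sees "\ufeffabc" and has to decide what the leading character means; with the dropping codecs, that decision is made inside the codec.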