[Python-Dev] Unicode byte order mark decoding

Stephen J. Turnbull stephen at xemacs.org
Wed Apr 6 11:31:21 CEST 2005

>>>>> "Martin" == Martin v Löwis <martin at v.loewis.de> writes:

    Martin> I can't put these two paragraphs together. If you think
    Martin> that explicit is better than implicit, why do you not want
    Martin> to make different calls for the first chunk of a stream,
    Martin> and the subsequent chunks?

Because the signature/BOM is not a chunk, it's a header.  Handling the
signature/BOM is part of stream initialization, not translation, to my
mind.

The point is that explicitly using a stream shows that initialization
(and finalization) matter.  The default can be BOM or not, as a
pragmatic matter.  But then the stream data itself can be treated
homogeneously, as implied by the notion of stream.
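
As a rough illustration of "signature handling belongs to stream
initialization", here is a minimal sketch in present-day Python.  The
helper name open_utf8_stream and its expect_signature parameter are
hypothetical, made up for this example, and the sketch assumes a
seekable binary stream.  It is one possible shape, not a proposal for
the actual API.

```python
import codecs
import io


def open_utf8_stream(raw, expect_signature=True):
    """Hypothetical helper: deal with the UTF-8 signature once, at stream
    initialization, then hand back a plain UTF-8 reader for the body."""
    if expect_signature:
        start = raw.tell()
        head = raw.read(len(codecs.BOM_UTF8))
        if head != codecs.BOM_UTF8:
            # No signature present: rewind so the body is left untouched.
            raw.seek(start)
    # From here on every chunk is decoded the same way; the codec itself
    # knows nothing about signatures.
    return codecs.getreader("utf-8")(raw)


if __name__ == "__main__":
    signed = io.BytesIO(codecs.BOM_UTF8 + b"spam and eggs")
    print(repr(open_utf8_stream(signed).read()))
```

The only place the signature is mentioned is the initializer; the
reader it returns treats the remaining data homogeneously.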

I think it probably also would solve Walter's conundrum about
buffering the signature/BOM if responsibility for that were moved out
of the codecs and into the objects where signatures make sense.

I don't know whether that's really feasible in the short run---I
suspect there may be a lot of stream-like modules that would need to
be updated---but it would be saner in the long run.

    >> Yes!  Exactly (except in reverse, we want to _read_ from the
    >> slurped stream-as-string, not write to one)!  ... and there's
    >> no need for a utf-8-sig codec for strings, since you can
    >> support the usage in exactly this way.

    Martin> However, if there is an utf-8-sig codec for streams, there
    Martin> is currently no way of *preventing* this codec to also be
    Martin> available for strings. The very same code is used for
    Martin> streams and for strings, and automatically so.

And of course it should be.  But if it's not possible to move the -sig
facility out of the codecs into the streams, that would be a shame.  I
think we should encourage people to use streams where initialization or
finalization semantics are non-trivial, as they are with signatures.

But as long as both utf-8-we-dont-need-no-steenkin-sigs-in-strings and
utf-8-sig are available, I can program as I want to (and refer those
whose strings get cratered by stray BOMs to you<wink>).
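
For concreteness, here is a small sketch of how the two codecs differ
when applied to an in-memory string.  It assumes the "utf-8-sig" codec
as it exists in present-day Python's standard library, which may not
match every detail of what was under discussion in this thread.

```python
import codecs

# A byte string that starts with the UTF-8 signature.
data = codecs.BOM_UTF8 + b"spam"

# Plain utf-8 treats the signature as ordinary data: it survives
# decoding as a leading U+FEFF character in the result.
plain = data.decode("utf-8")

# utf-8-sig strips a leading signature if present, and accepts input
# without one as well.
stripped = data.decode("utf-8-sig")
unsigned = b"spam".decode("utf-8-sig")

# When encoding, utf-8-sig prepends the signature; plain utf-8 does not.
signed_again = "spam".encode("utf-8-sig")

print(repr(plain))      # '\ufeffspam'
print(repr(stripped))   # 'spam'
```

The stray U+FEFF in the first result is the kind of thing that
"craters" a string when the plain codec is fed signed data.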

School of Systems and Information Engineering http://turnbull.sk.tsukuba.ac.jp
University of Tsukuba                    Tennodai 1-1-1 Tsukuba 305-8573 JAPAN
               Ask not how you can "do" free software business;
              ask what your business can "do for" free software.
