[Python-Dev] Unicode byte order mark decoding

Stephen J. Turnbull stephen at xemacs.org
Wed Apr 6 02:32:01 CEST 2005


>>>>> "Martin" == Martin v Löwis <martin at v.loewis.de> writes:

    Martin> So people do use the "decode-it-all" mode, where no
    Martin> sequential access is necessary - yet the beginning of the
    Martin> string is still the beginning of what once was a
    Martin> stream. This case must be supported.

Of course it must be supported.  My point is that many strings are not
the beginning of "what once was a stream".  (In my applications, that
is every string except those that result from slurping in a file or
process output in one go; an example, not a statistically valid
sample!)  Failing to make that distinction is error-prone, not to
mention unaesthetic.

"Explicit is better than implicit."
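
A minimal sketch of the distinction, assuming a present-day Python 3
in which a codec named "utf-8-sig" exists.  The sample bytes and the
variable names are invented for illustration.  The same three leading
bytes are a signature at the start of a former stream, but ordinary
content (U+FEFF) anywhere else, so the caller has to say which case
applies:

```python
import codecs

# Hypothetical data: `head` is the start of a slurped file, `middle` is a
# fragment taken from somewhere inside a stream.  Both begin with EF BB BF.
head = codecs.BOM_UTF8 + "Hallo".encode("utf-8")
middle = codecs.BOM_UTF8 + "Welt".encode("utf-8")

# Start of a former stream: the caller explicitly asks for signature handling.
head_text = head.decode("utf-8-sig")

# Not the start of a stream: plain UTF-8 keeps U+FEFF as a character.
middle_text = middle.decode("utf-8")

print(repr(head_text), repr(middle_text))
```
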

    Martin> Whether or not to use the codec would be the application's
    Martin> choice.

    >> What I think should be provided is a stateful object
    >> encapsulating the codec.  Ie, to avoid the need to write

    >> out = chunk[0].encode("utf-8-sig") + chunk[1].encode("utf-8")

    Martin> No. People who want streaming should use cStringIO, i.e.

 >>> import cStringIO, codecs
 >>> s=cStringIO.StringIO()
 >>> s1=codecs.getwriter("utf-8")(s)
 >>> s1.write(u"Hallo")
 >>> s.getvalue()
'Hallo'

Yes!  Exactly (except in reverse: we want to _read_ from the slurped
stream-as-string, not write to one)!  And then there is no need for a
utf-8-sig codec for strings, since the usage can be supported in
exactly this way.
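
For concreteness, a rough sketch of both directions.  It is written
against present-day Python 3 rather than the Python of this thread, so
io.BytesIO stands in for cStringIO, and it assumes the "utf-8-sig"
stream reader and incremental encoder found in the current standard
library.  The sample data is invented.

```python
import codecs
import io

# Hypothetical slurped input: a whole file read in one go, signature included.
slurped = codecs.BOM_UTF8 + "Hallo".encode("utf-8")

# Reading: wrap the in-memory bytes in a stream object.  The reader is
# stateful, so it consumes the signature once, at the start of the stream.
reader = codecs.getreader("utf-8-sig")(io.BytesIO(slurped))
text = reader.read()

# Writing: a stateful encoder emits the signature for the first chunk only,
# so the caller need not encode chunk[0] and chunk[1] with different codecs.
encoder = codecs.getincrementalencoder("utf-8-sig")()
chunks = ["Hal", "lo"]
out = b"".join(encoder.encode(chunk) for chunk in chunks)

print(repr(text), repr(out))
```

In both cases the "beginning of stream" state lives in the reader or
encoder object, not in the string and not in the caller's bookkeeping.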

-- 
School of Systems and Information Engineering http://turnbull.sk.tsukuba.ac.jp
University of Tsukuba                    Tennodai 1-1-1 Tsukuba 305-8573 JAPAN
               Ask not how you can "do" free software business;
              ask what your business can "do for" free software.

