[Python-Dev] Unicode byte order mark decoding
Stephen J. Turnbull
stephen at xemacs.org
Wed Apr 6 02:32:01 CEST 2005
>>>>> "Martin" == Martin v Löwis <martin at v.loewis.de> writes:
Martin> So people do use the "decode-it-all" mode, where no
Martin> sequential access is necessary - yet the beginning of the
Martin> string is still the beginning of what once was a
Martin> stream. This case must be supported.
Of course it must be supported. My point is that many strings are not
the beginning of "what once was a stream". (In my applications that is
all of them except the strings that result from slurping in a file or
process output in one go; an example, not a statistically valid
sample!) Failing to make that distinction is error-prone, not to
mention unaesthetic. "Explicit is better than implicit."
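To make the hazard concrete, here is a minimal sketch. It assumes a codec named "utf-8-sig" that strips a leading signature on decode (as in current Python 3); the sample bytes and variable names are invented for illustration.

```python
import codecs

# A chunk taken from the *middle* of a stream that happens to begin with
# U+FEFF. There it is a zero-width no-break space, not a signature.
mid_chunk = codecs.BOM_UTF8 + "text".encode("utf-8")

# Decoding as if the chunk were the start of a stream drops the character.
as_start = mid_chunk.decode("utf-8-sig")

# Decoding without that assumption preserves it.
as_middle = mid_chunk.decode("utf-8")

print(repr(as_start))   # 'text'
print(repr(as_middle))  # '\ufefftext'
```

The two results differ by one character, and only the caller knows which one is right for a given string.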
Martin> Whether or not to use the codec would be the application's
Martin> choice.
>> What I think should be provided is a stateful object
>> encapsulating the codec. I.e., to avoid the need to write
>>     out = chunk[0].encode("utf-8-sig") + chunk[1].encode("utf-8")
Martin> No. People who want streaming should use cStringIO, i.e.
>>> s=cStringIO.StringIO()
>>> s1=codecs.getwriter("utf-8")(s)
>>> s1.write(u"Hallo")
>>> s.getvalue()
'Hallo'
Yes, exactly! (Except in reverse: we want to _read_ from the slurped
stream-as-string, not write to one.) And there's no need for a
utf-8-sig codec for strings, since you can support that usage in
exactly this way.
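A sketch of the reading direction, under some assumptions: it uses io.BytesIO from current Python 3 in place of cStringIO, it assumes a "utf-8-sig" codec is registered, and the sample data is made up.

```python
import codecs
import io

# Bytes slurped in one go (hypothetical data with a leading UTF-8 signature).
slurped = codecs.BOM_UTF8 + "Hallo\nWelt\n".encode("utf-8")

# Wrap the byte string in a stream, then wrap that stream in a stateful
# reader. The reader knows it is at the start of a stream, so it can
# consume the signature once and decode everything after it as plain UTF-8.
reader = codecs.getreader("utf-8-sig")(io.BytesIO(slurped))

first = reader.readline()
second = reader.readline()

print(repr(first))   # 'Hallo\n'
print(repr(second))  # 'Welt\n'
```

Here the "beginning of a stream" knowledge lives in the reader object rather than in every call that decodes a string.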
--
School of Systems and Information Engineering http://turnbull.sk.tsukuba.ac.jp
University of Tsukuba Tennodai 1-1-1 Tsukuba 305-8573 JAPAN
Ask not how you can "do" free software business;
ask what your business can "do for" free software.