[Python-Dev] Unicode byte order mark decoding

"Martin v. Löwis" martin at v.loewis.de
Tue Apr 5 20:44:47 CEST 2005


Stephen J. Turnbull wrote:
>     Martin> With the UTF-8-SIG codec, it would apply to all operation
>     Martin> modes of the codec, whether stream-based or from strings.
> 
> I had in mind the ability to treat a string as a stream.

Hmm. A string is not a stream, but it could be the contents of a stream.

A typical application of codecs goes like this:

data = stream.read()
# analyze data, e.g. by checking whether there is encoding= in <?xml...
data = data.decode(encoding)  # encoding as determined by the analysis

So people do use the "decode-it-all" mode, where no sequential access
is necessary - yet the beginning of the string is still the beginning of
what once was a stream. This case must be supported.
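
To illustrate the decode-it-all case, here is a minimal sketch. It assumes a Python 3 interpreter in which the utf-8-sig codec is available; the helper names sniff_encoding and decode_all and the regular expression are hypothetical, chosen only for this example, and the sniffing is deliberately naive.

```python
import re

# Hypothetical, deliberately naive pattern: finds encoding="..." in an XML declaration.
_XML_DECL = re.compile(rb'<\?xml[^>]*encoding=["\']([A-Za-z0-9._-]+)["\']')

def sniff_encoding(data, default="utf-8-sig"):
    """Guess the encoding of a complete byte string (assumed helper)."""
    match = _XML_DECL.search(data[:200])
    if match is None:
        return default
    name = match.group(1).decode("ascii").lower()
    # For UTF-8, prefer the BOM-aware codec so a leading signature is dropped.
    if name in ("utf-8", "utf8"):
        return "utf-8-sig"
    return name

def decode_all(data):
    """Decode the whole string at once; no sequential access is needed."""
    return data.decode(sniff_encoding(data))

# The string is the complete contents of what once was a stream, BOM included.
raw = b'\xef\xbb\xbf<?xml version="1.0" encoding="utf-8"?><a/>'
text = decode_all(raw)
```

With the plain utf-8 codec the decoded text would begin with U+FEFF; with utf-8-sig the signature is removed even though no stream object is involved.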

>     Martin> Whether or not to use the codec would be the application's
>     Martin> choice.
> 
> What I think should be provided is a stateful object encapsulating the
> codec.  Ie, to avoid the need to write
> 
>     out = chunk[0].encode("utf-8-sig") + chunk[1].encode("utf-8")

No. People who want streaming should use cStringIO, i.e.

 >>> import cStringIO, codecs
 >>> s=cStringIO.StringIO()
 >>> s1=codecs.getwriter("utf-8")(s)
 >>> s1.write(u"Hallo")
 >>> s.getvalue()
'Hallo'
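
The same approach can be sketched for the chunked case from the quoted text. This is only an illustration under stated assumptions: it uses Python 3, where io.BytesIO takes the place of cStringIO, and it assumes the utf-8-sig codec is available. The stream writer keeps the state, so the signature is written once however many chunks follow.

```python
import codecs
import io

# In-memory byte stream standing in for cStringIO.StringIO().
buffer = io.BytesIO()

# The stream writer is stateful: it emits the BOM on the first write only.
writer = codecs.getwriter("utf-8-sig")(buffer)
writer.write("Hal")
writer.write("lo")
result = buffer.getvalue()

# Stateless encoding of each chunk would repeat the BOM instead.
stateless = "Hal".encode("utf-8-sig") + "lo".encode("utf-8-sig")
```

Here result holds a single signature followed by the encoded text, whereas stateless contains the signature twice, which is the problem the stateful writer avoids.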

Regards,
Martin

