[Python-Dev] Codecs and StreamCodecs
Fredrik Lundh
fredrik@pythonware.com
Wed, 17 Nov 1999 12:00:10 +0100
M.-A. Lemburg <mal@lemburg.com> wrote:
> > def flush(self):
> > # flush the decoding buffers. this should usually
> > # return None, unless knowing that the input stream
> > # has ended means that the state can be interpreted
> > # in a meaningful way. however, if the state indicates
> > # that the last character was not finished, this
> > # method should raise a UnicodeError exception.
>
> Could you explain the reason for having a .flush() method
> and what it should return?
in most cases, it should either return None, or
raise a UnicodeError exception:
>>> u = unicode("å i åa ä e ö", "iso-latin-1")
>>> # yes, that's a valid Swedish sentence ;-)
>>> s = u.encode("utf-8")
>>> d = decoder("utf-8")
>>> d.decode(s[:-1])
"å i åa ä e "
>>> d.flush()
UnicodeError: last character not complete
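for comparison, here is a minimal sketch of the same behaviour using the incremental decoders in the codecs module of present-day Python 3. the Decoder class and its flush() method are hypothetical names chosen to mirror the interface discussed here; they are not the API proposed in this thread, only one way it could be approximated.

```python
import codecs


class Decoder:
    """Hypothetical wrapper exposing decode()/flush() as discussed in this thread."""

    def __init__(self, encoding):
        # codecs.getincrementaldecoder returns a class, so instantiate it
        self._decoder = codecs.getincrementaldecoder(encoding)()

    def decode(self, data):
        # incomplete trailing bytes are buffered inside the decoder
        return self._decoder.decode(data, final=False)

    def flush(self):
        # final=True raises UnicodeDecodeError (a UnicodeError subclass)
        # when the buffered bytes do not form a complete character
        text = self._decoder.decode(b"", final=True)
        return text or None


s = "å i åa ä e ö".encode("utf-8")
d = Decoder("utf-8")
head = d.decode(s[:-1])  # the second byte of "ö" is withheld

try:
    d.flush()
except UnicodeError as exc:
    flush_error = exc
else:
    flush_error = None
```

here head holds everything up to the truncated character, and flush_error holds the exception raised because the final "ö" was cut in half.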
on the other hand, there are situations where it
might actually return a string. consider an "HTML
entity decoder" which uses the following pattern
to match a character entity: "&\w+;?" (note that
the trailing semicolon is optional).
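a rough sketch of such a decoder in present-day Python 3 could look like this. EntityDecoder is a hypothetical class, it works on text rather than bytes, and it only knows the named entities in html.entities.name2codepoint; holding back a trailing "&name" until more input or flush() arrives is an assumption about how the buffering might be done.

```python
import re
from html.entities import name2codepoint


class EntityDecoder:
    """Hypothetical incremental decoder for entities matching &\\w+;?"""

    _entity = re.compile(r"&(\w+);?")
    _tail = re.compile(r"&\w*\Z")  # a possibly unfinished entity at the very end

    def __init__(self):
        self._pending = ""

    def _replace(self, match):
        name = match.group(1)
        if name in name2codepoint:
            return chr(name2codepoint[name])
        return match.group(0)  # unknown entity: leave the text untouched

    def decode(self, data):
        data = self._pending + data
        tail = self._tail.search(data)
        if tail:
            # more name characters or a ";" may still arrive, so hold this back
            data, self._pending = data[:tail.start()], tail.group(0)
        else:
            self._pending = ""
        return self._entity.sub(self._replace, data)

    def flush(self):
        # end of input: the held-back text is now known to be complete
        data, self._pending = self._pending, ""
        if not data:
            return None
        return self._entity.sub(self._replace, data)


s = "&aring; i &aring;a &auml; e &ouml;"
d = EntityDecoder()
head = d.decode(s[:-1])  # everything except the final ";"
tail = d.flush()
```

the session that follows shows the same behaviour written against the interface discussed in this thread.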
>>> u = unicode("å i åa ä e ö", "iso-latin-1")
>>> s = u.encode("html-entities")
>>> d = decoder("html-entities")
>>> d.decode(s[:-1])
"å i åa ä e "
>>> d.flush()
"ö"
> Perhaps I'm missing something, but how would you define
> stream codecs using this interface ?
input: read chunks of data, decode, and
keep extra data in a local buffer.
output: encode data into suitable chunks,
and write to the output stream (that's why
there's a buffersize argument to encode --
if someone writes a 10mb unicode string to
an encoded stream, python shouldn't allocate
an extra 10-30 megabytes just to be able to
encode the darn thing...)
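both directions can be sketched with the incremental codecs in present-day Python 3. read_decoded and write_encoded are hypothetical helper names, and the chunksize and buffersize defaults are arbitrary; this is an illustration of the idea, not the stream codec interface under discussion.

```python
import codecs
import io


def read_decoded(stream, encoding, chunksize=4096):
    """Yield decoded text from a byte stream, one chunk at a time."""
    decoder = codecs.getincrementaldecoder(encoding)()
    while True:
        chunk = stream.read(chunksize)
        if not chunk:
            break
        # bytes of a character split across chunks stay in the decoder's buffer
        text = decoder.decode(chunk)
        if text:
            yield text
    # end of input: raises UnicodeError if a character was left unfinished
    tail = decoder.decode(b"", final=True)
    if tail:
        yield tail


def write_encoded(stream, text, encoding, buffersize=4096):
    """Encode text in slices so that no full-size encoded copy is allocated."""
    encoder = codecs.getincrementalencoder(encoding)()
    for start in range(0, len(text), buffersize):
        stream.write(encoder.encode(text[start:start + buffersize]))
    stream.write(encoder.encode("", final=True))


text = "å i åa ä e ö" * 5
out = io.BytesIO()
write_encoded(out, text, "utf-8", buffersize=3)
roundtrip = "".join(read_decoded(io.BytesIO(out.getvalue()), "utf-8", chunksize=1))
```

with chunksize=1 every two-byte character is split across reads, so the round trip only works because the decoder keeps the extra byte in its local buffer.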
> > Implementing stream codecs is left as an exercise (see the zlib
> > material in the eff-bot guide for a decoder example).
everybody should have a copy of the eff-bot guide ;-)
(but alright, I plan to post a complete utf-8 implementation
in the not too distant future).
</F>