[Python-Dev] Codecs and StreamCodecs
M.-A. Lemburg
mal@lemburg.com
Thu, 18 Nov 1999 09:50:36 +0100
Fredrik Lundh wrote:
>
> M.-A. Lemburg <mal@lemburg.com> wrote:
> > > def flush(self):
> > > # flush the decoding buffers. this should usually
> > > # return None, unless knowing that the input stream
> > > # has ended means that the state can be interpreted
> > > # in a meaningful way. however, if the state
> > > # indicates that the last character was not
> > > # finished, this method should raise a UnicodeError
> > > # exception.
> >
> > Could you explain the reason for having a .flush() method
> > and what it should return?
>
> in most cases, it should either return None, or
> raise a UnicodeError exception:
>
> >>> u = unicode("å i åa ä e ö", "iso-latin-1")
> >>> # yes, that's a valid Swedish sentence ;-)
> >>> s = u.encode("utf-8")
> >>> d = decoder("utf-8")
> >>> d.decode(s[:-1])
> "å i åa ä e "
> >>> d.flush()
> UnicodeError: last character not complete
>
> on the other hand, there are situations where it
> might actually return a string. consider a "HTML
> entity decoder" which uses the following pattern
> to match a character entity: "&\w+;?" (note that
> the trailing semicolon is optional).
>
> >>> u = unicode("å i åa ä e ö", "iso-latin-1")
> >>> s = u.encode("html-entities")
> >>> d = decoder("html-entities")
> >>> d.decode(s[:-1])
> "å i åa ä e "
> >>> d.flush()
> "ö"
Ah, ok. So the .flush() method checks for proper
string endings and then either returns the remaining
input or raises an error.
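
To make the two cases concrete, here is a minimal sketch of those .flush() semantics. It is only an illustration under stated assumptions: the class names Utf8Decoder and HtmlEntityDecoder are hypothetical, the UTF-8 case leans on the standard library's codecs.getincrementaldecoder() as a stand-in for the proposed decoder() factory, and the entity decoder works on text rather than on an encoded byte string to keep things short.

```python
import codecs
import re
from html.entities import name2codepoint


class Utf8Decoder:
    """Hypothetical decoder with the decode()/flush() shape discussed above."""

    def __init__(self):
        # Stand-in for the proposed decoder("utf-8") factory (an assumption).
        self._decoder = codecs.getincrementaldecoder("utf-8")()

    def decode(self, data):
        # Incomplete trailing bytes are kept in the decoder's internal state.
        return self._decoder.decode(data)

    def flush(self):
        # final=True raises UnicodeDecodeError (a UnicodeError subclass)
        # if the last character was never completed.
        tail = self._decoder.decode(b"", final=True)
        return tail or None


class HtmlEntityDecoder:
    """Hypothetical entity decoder; the trailing semicolon is optional."""

    _entity = re.compile(r"&(\w+);?")
    _unfinished = re.compile(r"&\w*\Z")

    def __init__(self):
        self._pending = ""

    def _replace(self, match):
        name = match.group(1)
        if name in name2codepoint:
            return chr(name2codepoint[name])
        return match.group(0)  # unknown entity: pass through unchanged

    def decode(self, data):
        text = self._pending + data
        self._pending = ""
        # Hold back a trailing "&name": more of it may still arrive.
        match = self._unfinished.search(text)
        if match:
            text, self._pending = text[:match.start()], match.group()
        return self._entity.sub(self._replace, text)

    def flush(self):
        # At end of input, a held-back "&name" is a complete entity.
        pending, self._pending = self._pending, ""
        if not pending:
            return None
        return self._entity.sub(self._replace, pending)


if __name__ == "__main__":
    s = "å i åa ä e ö".encode("utf-8")
    d = Utf8Decoder()
    print(repr(d.decode(s[:-1])))
    try:
        d.flush()
    except UnicodeError as exc:
        print("UnicodeError:", exc)

    h = HtmlEntityDecoder()
    print(repr(h.decode("&aring; i &aring;a &auml; e &ouml")))
    print(repr(h.flush()))
```

In the UTF-8 case the held-back byte can never become a character, so .flush() has nothing sensible to return and raises. In the entity case the held-back "&ouml" is already valid once the input is known to have ended, so .flush() returns it.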
> > Perhaps I'm missing something, but how would you define
> > stream codecs using this interface?
>
> input: read chunks of data, decode, and
> keep extra data in a local buffer.
>
> output: encode data into suitable chunks,
> and write to the output stream (that's why
> there's a buffersize argument to encode --
> if someone writes a 10mb unicode string to
> an encoded stream, python shouldn't allocate
> an extra 10-30 megabytes just to be able to
> encode the darn thing...)
So the stream codecs would be wrappers around the
string codecs.

Have you read my latest version of the Codec interface?
Wouldn't that be a reasonable approach? Note that I have
integrated your ideas into the new API -- it's basically
only missing the .flush() methods, which I can add now
that I know what you meant.
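
As a rough sketch of what such wrappers could look like (an illustration only, not the interface from my proposal): the names ChunkedReader and ChunkedWriter and the chunksize parameter are hypothetical, and the standard library's incremental codecs stand in for the string codecs. The reader decodes chunk by chunk and lets the decoder hold on to incomplete trailing bytes; the writer encodes buffersize characters at a time, so a huge string never needs a second, fully encoded copy in memory.

```python
import codecs
import io


class ChunkedReader:
    """Hypothetical input wrapper: read chunks, decode, buffer leftovers."""

    def __init__(self, stream, encoding, chunksize=4096):
        self._stream = stream
        self._decoder = codecs.getincrementaldecoder(encoding)()
        self._chunksize = chunksize

    def read(self):
        parts = []
        while True:
            chunk = self._stream.read(self._chunksize)
            if not chunk:
                break
            # Bytes of a split character stay buffered inside the decoder.
            parts.append(self._decoder.decode(chunk))
        # End of input: the equivalent of .flush(); raises UnicodeError
        # if the last character is incomplete.
        parts.append(self._decoder.decode(b"", final=True))
        return "".join(parts)


class ChunkedWriter:
    """Hypothetical output wrapper: encode and write in bounded pieces."""

    def __init__(self, stream, encoding, buffersize=4096):
        self._stream = stream
        self._encoder = codecs.getincrementalencoder(encoding)()
        self._buffersize = buffersize

    def write(self, text):
        # Encode at most buffersize characters per step.
        for start in range(0, len(text), self._buffersize):
            piece = text[start:start + self._buffersize]
            self._stream.write(self._encoder.encode(piece))


if __name__ == "__main__":
    text = "å i åa ä e ö\n" * 1000
    raw = io.BytesIO()
    ChunkedWriter(raw, "utf-8", buffersize=7).write(text)
    raw.seek(0)
    # A 3-byte chunk size forces characters to be split across reads.
    assert ChunkedReader(raw, "utf-8", chunksize=3).read() == text
```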
> > > Implementing stream codecs is left as an exercise (see the zlib
> > > material in the eff-bot guide for a decoder example).
>
> everybody should have a copy of the eff-bot guide ;-)
Sure, but the format, the format... print it and add
a CD and you would probably have a best-selling book
there ;-)
> (but alright, I plan to post a complete utf-8 implementation
> in the not too distant future).
--
Marc-Andre Lemburg
______________________________________________________________________
Y2000: 43 days left
Business: http://www.lemburg.com/
Python Pages: http://www.lemburg.com/python/