[Python-Dev] Codecs and StreamCodecs

M.-A. Lemburg mal@lemburg.com
Thu, 18 Nov 1999 09:50:36 +0100


Fredrik Lundh wrote:
> 
> M.-A. Lemburg <mal@lemburg.com> wrote:
> > >     def flush(self):
> > >         # flush the decoding buffers.  this should usually
> > >         # return None, unless knowing that the input
> > >         # stream has ended means that the state can be
> > >         # interpreted in a meaningful way.  however, if the
> > >         # state indicates that the last character was not
> > >         # finished, this method should raise a UnicodeError
> > >         # exception.
> >
> > Could you explain the reason for having a .flush() method
> > and what it should return?
> 
> in most cases, it should either return None, or
> raise a UnicodeError exception:
> 
>     >>> u = unicode("å i åa ä e ö", "iso-latin-1")
>     >>> # yes, that's a valid Swedish sentence ;-)
>     >>> s = u.encode("utf-8")
>     >>> d = decoder("utf-8")
>     >>> d.decode(s[:-1])
>     "å i åa ä e "
>     >>> d.flush()
>     UnicodeError: last character not complete
> 
> on the other hand, there are situations where it
> might actually return a string.  consider a "HTML
> entity decoder" which uses the following pattern
> to match a character entity: "&\w+;?" (note that
> the trailing semicolon is optional).
> 
>     >>> u = unicode("å i åa ä e ö", "iso-latin-1")
>     >>> s = u.encode("html-entities")
>     >>> d = decoder("html-entities")
>     >>> d.decode(s[:-1])
>     "å i åa ä e "
>     >>> d.flush()
>     "ö"

Ah, ok. So the .flush() method checks that the input ended
properly and then either returns whatever decoded data is
still pending (or None if there is none) or raises an error.
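
To make the two outcomes concrete, here is a minimal sketch in
present-day Python. It is not the API proposed in this thread:
the class names Utf8Decoder and EntityDecoder are hypothetical,
and the UTF-8 case leans on the standard library's incremental
decoder as a stand-in for the decoding buffer, treating
decode(b"", final=True) as the equivalent of .flush().

```python
import codecs
import re
from html.entities import name2codepoint


class Utf8Decoder:
    """Hypothetical decoder: flush() returns None or raises UnicodeError."""

    def __init__(self):
        # Assumption: the stdlib incremental decoder stands in for the
        # decoding buffer discussed in this thread.
        self._decoder = codecs.getincrementaldecoder("utf-8")()

    def decode(self, data):
        # Incomplete trailing bytes are held back until more input arrives.
        return self._decoder.decode(data)

    def flush(self):
        # final=True raises UnicodeDecodeError (a UnicodeError subclass)
        # if a character is still unfinished.
        return self._decoder.decode(b"", final=True) or None


class EntityDecoder:
    """Hypothetical decoder: flush() may return a string."""

    _entity = re.compile(r"&(\w+);?")  # trailing semicolon is optional
    _tail = re.compile(r"&\w*\Z")      # entity that may still be growing

    def __init__(self):
        self._pending = ""

    @staticmethod
    def _replace(match):
        code = name2codepoint.get(match.group(1))
        return chr(code) if code is not None else match.group(0)

    def decode(self, data):
        data = self._pending + data
        tail = self._tail.search(data)
        cut = tail.start() if tail else len(data)
        data, self._pending = data[:cut], data[cut:]
        return self._entity.sub(self._replace, data)

    def flush(self):
        # End of input: the held-back entity can now be interpreted.
        rest, self._pending = self._pending, ""
        return self._entity.sub(self._replace, rest) or None


text = "å i åa ä e ö"

d = Utf8Decoder()
head = d.decode(text.encode("utf-8")[:-1])  # last character is cut in half
try:
    d.flush()
    error = None
except UnicodeError as exc:
    error = exc

e = EntityDecoder()
entity_head = e.decode("&aring; i &aring;a &auml; e &ouml")  # no final ";"
entity_tail = e.flush()
```

In the UTF-8 case the dangling byte cannot be interpreted, so the
flush raises; in the entity case the end of input settles the
question and the flush hands back the last character.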
 
> > Perhaps I'm missing something, but how would you define
> > stream codecs using this interface ?
> 
> input: read chunks of data, decode, and
> keep extra data in a local buffer.
> 
> output: encode data into suitable chunks,
> and write to the output stream (that's why
> there's a buffersize argument to encode --
> if someone writes a 10mb unicode string to
> an encoded stream, python shouldn't allocate
> an extra 10-30 megabytes just to be able to
> encode the darn thing...)

So the stream codecs would be wrappers around the
string codecs.
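
A rough sketch of such wrappers, again in present-day Python and
under assumed names: ChunkedReader, ChunkedWriter and chunksize
are hypothetical, while buffersize is the argument name used in
the quoted text. The standard library's incremental codecs stand
in for the string codecs here; this illustrates the wrapping
idea only, not the interface under discussion.

```python
import codecs
import io


class ChunkedReader:
    """Hypothetical input wrapper: read chunks, decode, buffer leftovers."""

    def __init__(self, stream, encoding, chunksize=4096):
        self.stream = stream
        self.chunksize = chunksize
        # The incremental decoder keeps incomplete bytes between chunks.
        self.decoder = codecs.getincrementaldecoder(encoding)()

    def read(self):
        parts = []
        while True:
            chunk = self.stream.read(self.chunksize)
            if not chunk:
                break
            parts.append(self.decoder.decode(chunk))
        # End of stream: this plays the role of .flush() and raises
        # UnicodeError if the last character is incomplete.
        parts.append(self.decoder.decode(b"", final=True))
        return "".join(parts)


class ChunkedWriter:
    """Hypothetical output wrapper: encode and write in bounded pieces."""

    def __init__(self, stream, encoding, buffersize=4096):
        self.stream = stream
        self.buffersize = buffersize
        self.encoder = codecs.getincrementalencoder(encoding)()

    def write(self, text):
        # Encode at most buffersize characters at a time, so a huge
        # string never needs a second full-size encoded copy in memory.
        for start in range(0, len(text), self.buffersize):
            piece = text[start:start + self.buffersize]
            self.stream.write(self.encoder.encode(piece))


raw = io.BytesIO()
ChunkedWriter(raw, "utf-8", buffersize=3).write("å i åa ä e ö")
raw.seek(0)
# chunksize=1 forces every two-byte character to be split across reads.
roundtrip = ChunkedReader(raw, "utf-8", chunksize=1).read()
```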

Have you read my latest version of the Codec interface ?
Wouldn't that be a reasonable approach ? Note that I have
integrated your ideas into the new API -- it's basically
only missing the .flush() methods, which I can add now
that I know what you meant.
 
> > > Implementing stream codecs is left as an exercise (see the zlib
> > > material in the eff-bot guide for a decoder example).
> 
> everybody should have a copy of the eff-bot guide ;-)

Sure, but the format, the format... print it, add a CD,
and you would probably have a good-selling book
there ;-)
 
> (but alright, I plan to post a complete utf-8 implementation
> in the not too distant future).

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    43 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/