[Python-Dev] Decoding incomplete unicode

Wed Jul 28 14:46:47 CEST 2004

On Wed, 28 Jul 2004 11:38:16 +0200, Walter Dörwald
<walter at livinglogic.de> wrote:
> Hye-Shik Chang wrote:
> 
> > On Tue, 27 Jul 2004 22:39:45 +0200, Walter Dörwald
> > <walter at livinglogic.de> wrote:
> >
> > >>Python's unicode machinery currently has problems when decoding
> >>incomplete input.
> >>
> >>When codecs.StreamReader.read() encounters a decoding error it
> >>reads more bytes from the input stream and retries decoding.
> >>This is broken for two reasons:
> >>1) The error might be due to a malformed byte sequence in the input,
> >>    a problem that can't be fixed by reading more bytes.
> >>2) There may be no more bytes available at this time. Once more
> >>    data is available decoding can't continue because bytes from
> >>    the input stream have already been read and thrown away.
> >>(sio.DecodingInputFilter has the same problems)
> >
> > StreamReaders and -Writers from CJK codecs do not suffer from
> > this problem because they have an internal buffer for keeping state
> > and the incomplete bytes of a sequence. In fact, CJK codecs has its
> > own implementation of UTF-8 and UTF-16 based on its multibytecodec
> > system.  It already provides a "working" StreamReader/Writer. :)
> 
> Seems you had the same problems with the builtin stream readers! ;)
> 
> BTW, how do you solve the problem that incomplete byte sequences
> are retained in the middle of a stream, but should generate errors
> at the end?
> 

Rough pseudocode follows (the real implementation in CJKCodecs is written in C):

class StreamReader:

    pending = '' # incomplete bytes left over from the previous read

    def read(self, size=-1):
        while True:
            data = self.fp.read(size)
            r = self.pending + data
            self.pending = ''
            outputbuffer = u''

            if r:
                try:
                    # decodes as much of r as possible into outputbuffer
                    outputbuffer = r.decode('utf-8')
                except MBERR_TOOFEW: # incomplete multibyte sequence
                    pass
                except MBERR_ILLSEQ: # illegal sequence
                    raise UnicodeDecodeError, "illseq"

            if not data or size == -1: # end of the stream
                if r was not completely consumed for the output:
                    raise UnicodeDecodeError, "toofew"

            if r was not completely consumed for the output:
                self.pending = remainder of r

            if (size == -1 or               # one-shot read of everything
                len(outputbuffer) > 0 or    # output buffer isn't empty
                len(data) == 0):            # the end of the stream
                    break

            size = 1 # read 1 byte on the next try

        return outputbuffer

CJKCodecs' multibytecodec structure has distinct internal error
codes for "illegal sequence" and "incomplete sequence".  Each
internal codec also receives a flag that indicates whether an immediate
flush is needed at that time (at the end of a stream, and in the simple
one-shot decode functions).
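
For comparison, here is a minimal runnable sketch of the same idea. It is
not the CJKCodecs code: it assumes a present-day Python whose standard
library provides codecs.getincrementaldecoder(), and the class name
IncrementalReader is made up for this illustration. The decoder object
keeps the incomplete bytes itself, and its "final" argument plays the role
of the flush flag described above.

```python
import codecs
import io


class IncrementalReader:
    """Hypothetical wrapper for illustration; not the CJKCodecs reader."""

    def __init__(self, fp, encoding="utf-8"):
        self.fp = fp
        # The incremental decoder buffers incomplete trailing bytes
        # between calls, like "pending" in the pseudocode.
        self.decoder = codecs.getincrementaldecoder(encoding)(errors="strict")

    def read(self, size=-1):
        while True:
            data = self.fp.read(size)
            # No data, or a read-everything call, means end of stream:
            # leftover bytes must now be reported as an error.
            final = not data or size == -1
            # Illegal sequences raise UnicodeDecodeError right away;
            # incomplete ones raise only when final is true.
            text = self.decoder.decode(data, final=final)
            if text or final:
                return text
            size = 1  # nothing decoded yet: fetch one more byte and retry


# U+4E2D takes three bytes in UTF-8, so two-byte reads split it.
reader = IncrementalReader(io.BytesIO("a\u4e2db".encode("utf-8")))
chunks = [reader.read(2), reader.read(2), reader.read(2)]
```

A sequence split across two reads is completed by the second read, while a
stream that ends in the middle of a sequence raises UnicodeDecodeError on
the read that hits the end of the stream.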

Regards,
Hye-Shik