On Wed, 28 Jul 2004 11:38:16 +0200, Walter Dörwald email@example.com wrote:
Hye-Shik Chang wrote:
On Tue, 27 Jul 2004 22:39:45 +0200, Walter Dörwald firstname.lastname@example.org wrote:
Python's Unicode machinery currently has problems when decoding incomplete input.
When codecs.StreamReader.read() encounters a decoding error it reads more bytes from the input stream and retries decoding. This is broken for two reasons:
- The error might be due to a malformed byte sequence in the input, a problem that can't be fixed by reading more bytes.
- There may be no more bytes available at this time. Once more data is available decoding can't continue because bytes from the input stream have already been read and thrown away.
(sio.DecodingInputFilter has the same problems)
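To make the two failure modes above concrete, here is a minimal sketch. It assumes Python 3 and plain `bytes.decode`, not the `StreamReader` code under discussion, and `try_decode` is a hypothetical helper. A one-shot decode fails for both a truncated and a malformed sequence, but only the truncated one is fixed by more bytes:

```python
# The euro sign encodes to three bytes in UTF-8: b'\xe2\x82\xac'.
euro = "\u20ac".encode("utf-8")
first, second = euro[:2], euro[2:]


def try_decode(data):
    """Hypothetical helper: return the decoded text, or None on a decoding error."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        return None


# Incomplete input: fails now, succeeds once the missing byte arrives,
# but only if the first two bytes were kept around.
incomplete_alone = try_decode(first)
incomplete_completed = try_decode(first + second)

# Malformed input: 0xFF is never valid in UTF-8, so reading more bytes cannot help.
malformed_alone = try_decode(b"\xff")
malformed_extended = try_decode(b"\xff" + second)
```

Both cases fail the same way in a one-shot decode, so a reader that simply retries with more bytes needs some other way to tell them apart.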
StreamReaders and -Writers from the CJK codecs don't suffer from these problems because they keep an internal buffer for state and for the incomplete bytes of a sequence. In fact, CJKCodecs has its own implementation of UTF-8 and UTF-16 built on its multibytecodec system. It already provides a "working" StreamReader/Writer. :)
Seems you had the same problems with the builtin stream readers! ;)
BTW, how do you solve the problem that incomplete byte sequences are retained in the middle of a stream, but should generate errors at the end?
Rough pseudo code here: (it's written in C in CJKCodecs)
    pending = ''  # incomplete

    def read(self, size=-1):
        while True:
            r = fp.read(size)
            if self.pending:
                r = self.pending + r
                self.pending = ''

            if r:
                try:
                    outputbuffer = r.decode('utf-8')
                except MBERR_TOOFEW:  # incomplete multibyte sequence
                    pass
                except MBERR_ILLSEQ:  # illegal sequence
                    raise UnicodeDecodeError, "illseq"

            if not r or size == -1:  # end of the stream
                if r have not consumed up for the output:
                    raise UnicodeDecodeError, "toofew"

            if r have not consumed up for the output:
                self.pending = remainders of r

            if (size == -1 or  # one time read up
                len(outputbuffer) > 0 or  # output buffer isn't empty
                original length of r == 0):  # the end of the stream
                break

            size = 1  # read 1 byte in next try
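For readers who want something they can run, here is a rough Python 3 sketch of the same idea. It is not the CJKCodecs C code: `PendingUTF8Reader` and `split_incomplete_tail` are hypothetical names, and the sketch assumes that an empty read means "nothing available right now" while `size == -1` means "read to the end".

```python
import io


def split_incomplete_tail(data):
    """Split data into (complete, tail); tail is a possibly unfinished trailing sequence."""
    for back in range(1, min(3, len(data)) + 1):
        byte = data[-back]
        if byte & 0xC0 == 0x80:      # continuation byte: keep scanning backwards
            continue
        if 0xC2 <= byte <= 0xDF:
            need = 2
        elif 0xE0 <= byte <= 0xEF:
            need = 3
        elif 0xF0 <= byte <= 0xF4:
            need = 4
        else:
            need = 1                 # ASCII or invalid lead byte: let the decoder judge it
        if need > back:              # the sequence is started but not finished
            return data[:-back], data[-back:]
        break
    return data, b""


class PendingUTF8Reader:
    def __init__(self, stream):
        self.stream = stream
        self.pending = b""           # incomplete bytes kept between calls

    def read(self, size=-1):
        while True:
            chunk = self.stream.read(size)
            data = self.pending + chunk
            self.pending = b""
            if size == -1:
                # Reading to the end: a truncated tail is now a real error.
                return data.decode("utf-8")
            complete, self.pending = split_incomplete_tail(data)
            text = complete.decode("utf-8")   # raises on illegal sequences
            if text or not chunk:
                return text
            size = 1                 # read 1 byte in the next try
```

With `io.BytesIO("a\u20acb".encode("utf-8"))` as the stream, `read(2)` should return only `"a"` and hold back the first byte of the euro sign until the rest of the sequence has been read.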
CJKCodecs' multibytecodec structure has distinct internal error codes for "illegal sequence" and "incomplete sequence". Each internal codec also receives a flag that indicates whether an immediate flush is needed at that point (at the end of a stream, and in the simple decode functions).
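As an analogy only (this is not the CJKCodecs C interface), the incremental decoders in the Python 3 `codecs` module take a `final` argument that plays a comparable role to such a flush flag. A small sketch, where `decode_chunks` is a hypothetical helper:

```python
import codecs


def decode_chunks(chunks):
    """Decode an iterable of byte chunks, flushing only after the last one."""
    decoder = codecs.getincrementaldecoder("utf-8")()
    parts = []
    for chunk in chunks:
        # final=False: an incomplete trailing sequence is buffered, not an error.
        parts.append(decoder.decode(chunk, final=False))
    # final=True: anything still buffered is now reported as an error.
    parts.append(decoder.decode(b"", final=True))
    return "".join(parts)


split_ok = decode_chunks([b"\xe2", b"\x82\xac"])  # sequence split across chunks
```

An illegal byte such as `b"\xff"` raises as soon as it is seen, while a truncated sequence such as `b"\xe2\x82"` raises only at the final flush, which matches the "illegal" versus "incomplete" distinction described above.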