Hye-Shik Chang wrote:
BTW, how do you solve the problem that incomplete byte sequences are retained in the middle of a stream, but should generate errors at the end?
Rough pseudo code here: (it's written in C in CJKCodecs)
pending = '' # incomplete

def read(self, size=-1):
    while True:
        r = fp.read(size)
        if self.pending:
            r = self.pending + r
            self.pending = ''

        if r:
            try:
                outputbuffer = r.decode('utf-8')
            except MBERR_TOOFEW: # incomplete multibyte sequence
                pass
            except MBERR_ILLSEQ: # illegal sequence
                raise UnicodeDecodeError, "illseq"

        if not r or size == -1: # end of the stream
            if r have not consumed up for the output:
                raise UnicodeDecodeError, "toofew"
Here's the problem: I'd like the streamreader to be able to continue even when there is no input available *now*. Perhaps there should be an additional argument to read() named final? If final is true, the stream reader makes sure that all pending bytes have been used up.
        if r have not consumed up for the output:
            self.pending = remainders of r

        if (size == -1 or # one time read up
            len(outputbuffer) > 0 or # output buffer isn't empty
            original length of r == 0): # the end of the stream
            break

        size = 1 # read 1 byte in next try

    return outputbuffer
CJKCodecs' multibytecodec structure has distinct internal error codes for "illegal sequence" and "incomplete sequence". Each internal codec also receives a flag that indicates whether an immediate flush is needed at that point (at the end of a stream, and for the simple decode functions).
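
For illustration only, here is a rough sketch of how a read() with a final argument might behave. It is not the CJKCodecs implementation: the class name FinalAwareReader is made up for this example, and it leans on the standard library's incremental decoders, whose decode() method takes a final flag, to do the buffering of incomplete sequences.

```python
import codecs
import io


class FinalAwareReader:
    """Hypothetical reader: incomplete bytes are kept unless final=True."""

    def __init__(self, stream, encoding="utf-8"):
        self.stream = stream
        # The incremental decoder buffers a trailing incomplete sequence itself.
        self.decoder = codecs.getincrementaldecoder(encoding)()

    def read(self, size=-1, final=False):
        data = self.stream.read(size)
        # final=False: an incomplete tail is retained for the next call.
        # final=True: leftover bytes raise UnicodeDecodeError instead.
        return self.decoder.decode(data, final=final)


euro = "\u20ac".encode("utf-8")            # a three-byte UTF-8 sequence
reader = FinalAwareReader(io.BytesIO(euro))
first = reader.read(2)                     # only two of three bytes so far
second = reader.read(2)                    # the last byte completes the character
```

With this shape the caller, not the reader, decides when the stream has really ended: a read that returns nothing is not an error by itself, and pending bytes only become an error once final=True is passed.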
Bye, Walter Dörwald