[I18n-sig] XML and codecs

Martin v. Loewis martin@loewis.home.cs.tu-berlin.de
Wed, 6 Jun 2001 20:33:07 +0200


> > Well, either the codec keeps state or your application;
> > here's some pseudo code to illustrate the first situation:
> > 
> > def do_something(data):
> > 
> >     StreamWriter = codecs.lookup('myencoding')[3]
> >     output = cStringIO.StringIO()
> >     writer = StreamWriter(output, 'break')
> >     while 1:
> >         try:
> >             writer.write(data)
> >         except UnicodeBreakError, (reason, position, work):
> >             # Write data converted so far
> >             output.write(work)
> >             # Roll back 10 chars in the input and retry
> >             data = data[position - 10:]
> >         else:
> >             break
> >     return output.getvalue()

I missed Marc's original posting of this code fragment. How can rolling
back 10 characters possibly be the right thing? Couldn't this cause the
characters just before position, which were already converted and
written, to be written to the stream a second time?

I would expect that, when calling .write(), all correctly encoded data
is written to the stream and that position points to the first
character that cannot be encoded.
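
To make that expectation concrete, here is a minimal sketch in present-day
Python 3. It does not use the 'break' error mode or UnicodeBreakError from
the quoted pseudo code, which are only proposals; it relies on the existing
UnicodeEncodeError.start attribute instead. The helper name
encode_with_resume and its replacement parameter are my own assumptions for
illustration, and the sketch only applies to stateless codecs.

```python
import io


def encode_with_resume(text, encoding, replacement=b"?"):
    """Encode text, resuming exactly at the failing character.

    Hypothetical helper: everything before the first unencodable
    character is written once, and nothing is rolled back.
    """
    output = io.BytesIO()
    pos = 0
    while pos < len(text):
        try:
            # Try to encode everything that is still pending.
            output.write(text[pos:].encode(encoding, "strict"))
            break
        except UnicodeEncodeError as exc:
            # exc.start is relative to the slice passed to encode(),
            # so translate it back to an index into the full text.
            bad = pos + exc.start
            # Write the part that is known to be encodable, once.
            output.write(text[pos:bad].encode(encoding, "strict"))
            # Stand-in for whatever the application wants to do here.
            output.write(replacement)
            # Continue right after the offending character.
            pos = bad + 1
    return output.getvalue()
```

Because the input only ever advances past the reported position, no
character can reach the stream twice.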

Regards,
Martin