[I18n-sig] XML and codecs

Martin v. Loewis martin@loewis.home.cs.tu-berlin.de
Tue, 5 Jun 2001 19:26:15 +0200


> How would UTF-16 be handled? I guess without additional
> code multiple BOMs would be generated for a string that
> contains unencodable characters.

When you generate or decode UTF-16, this is not a problem: There won't
be any unencodable characters.
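As a small illustration of that point (a sketch only; the sample string
is made up for this example): any well-formed Unicode text encodes to
UTF-16, while a narrower charset such as ISO-8859-1 rejects the same
text.

```python
# Sample text chosen for illustration: Latin-1 letter, euro sign, and a
# character outside the Basic Multilingual Plane.
text = "caf\u00e9 \u20ac \U0001F600"

# UTF-16 can represent all of it; the round trip is lossless.
data = text.encode("utf-16")
round_trip_ok = data.decode("utf-16") == text

# ISO-8859-1 cannot represent the euro sign, so strict encoding raises.
try:
    text.encode("iso-8859-1")
    latin1_failed = False
except UnicodeEncodeError:
    latin1_failed = True

print(round_trip_ok, latin1_failed)
```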

Even if that were a problem: merely raising the exception does not
produce multiple BOMs. You have to provide additional code to handle
the exception anyway, so you had better make sure that code is correct.
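To make the concern concrete, here is a sketch of the kind of
"additional code" that would go wrong: it encodes a string in separate
pieces with independent, stateless calls. The helper name
encode_in_pieces is hypothetical and only serves this example.

```python
import codecs

def encode_in_pieces(pieces, encoding="utf-16"):
    """Hypothetical helper: encode each piece with an independent call."""
    # Every call starts from scratch, so every call emits its own BOM.
    return b"".join(piece.encode(encoding) for piece in pieces)

bom = codecs.BOM_UTF16  # BOM in the platform's native byte order

stitched = encode_in_pieces(["ab", "cd"])
whole = "abcd".encode("utf-16")

stitched_boms = stitched.count(bom)  # one BOM per piece
whole_boms = whole.count(bom)        # a single BOM at the start
print(stitched_boms, whole_boms)
```

The stitched result carries a second BOM in the middle of the data,
which is exactly what correct handling code has to avoid.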

The problem becomes real for codecs that preserve state: You'll need
to maintain the state of the codec from the time the exception
occurred, so that subsequent .encode calls will continue in the shift
state they were in previously.

So for codecs that preserve state across .encode calls, codecs.lookup
will need to return bound methods as the encode and decode functions,
not simple functions; see the iconv codec for an example.
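A rough sketch of what "a bound method as encode function" could look
like. This is not the iconv codec; ToyShiftEncoder and its shift scheme
are invented here purely for illustration. Uppercase letters are
written in a "shifted" mode, and the shift state lives on the object so
that it survives between calls.

```python
SO, SI = b"\x0e", b"\x0f"  # shift-out / shift-in control bytes

class ToyShiftEncoder:
    """Hypothetical stateful encoder: uppercase letters use a shifted set."""

    def __init__(self):
        self.shifted = False  # shift state carried across encode() calls

    def encode(self, text, errors="strict"):
        # Validate first, so a failed call leaves the shift state untouched.
        for pos, ch in enumerate(text):
            if ord(ch) > 127:
                raise UnicodeEncodeError(
                    "toy-shift", text, pos, pos + 1, "not ASCII")
        out = bytearray()
        for ch in text:
            want_shift = ch.isupper()
            if want_shift != self.shifted:
                out += SO if want_shift else SI  # switch modes only on change
                self.shifted = want_shift
            out += ch.lower().encode("ascii")
        return bytes(out), len(text)  # (data, characters consumed)

encoder = ToyShiftEncoder()
encode = encoder.encode  # bound method: carries its state along

first, _ = encode("aB")   # ends in shifted mode
second, _ = encode("Cd")  # continues shifted; no new shift-out needed
print(first, second)
```

Because encode is bound to one encoder object, the second call picks up
in the shift state the first call left behind. A plain module-level
function would have nowhere to keep that state.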

In some sense, one can argue that the UTF-16 Codec also preserves
state: whether it has yet emitted a BOM.
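That BOM state can be observed with the incremental encoder interface
that later Python versions added to the codecs module (it is not the
interface under discussion in this thread, so take this as an
after-the-fact sketch): the encoder object remembers that it has
already written the BOM.

```python
import codecs

bom = codecs.BOM_UTF16  # BOM in the platform's native byte order

# One encoder object, used for several chunks of the same output.
encoder = codecs.getincrementalencoder("utf-16")()

first = encoder.encode("ab")               # first chunk: BOM + data
second = encoder.encode("cd", final=True)  # later chunk: data only

first_has_bom = first.startswith(bom)
second_has_bom = second.startswith(bom)
same_as_one_shot = first + second == "abcd".encode("utf-16")
print(first_has_bom, second_has_bom, same_as_one_shot)
```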

Regards,
Martin