[XML-SIG] HTML<->UTF-8 'codec'?

Martin v. Loewis Martin.v.Loewis@t-online.de
Sat, 20 Oct 2001 11:11:39 +0200


> Perhaps you'd be kind enough to review my sample code at
> ftp://ftp.parc.xerox.com/transient/janssen/htmlcodec.py, and advise of
> glaring errors or any interesting improvements that occur to you?

Hi Bill,

The script looks quite alright, AFAICT. There is one issue of
correctness: If you also want to support XHTML, you may encounter
CDATA sections, in which case "&" does not denote markup.
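
To illustrate the CDATA point, here is a minimal sketch in current Python
syntax. It is not taken from htmlcodec.py: the name decode_outside_cdata is
hypothetical, and html.unescape stands in for whatever entity table the codec
uses. It expands references only outside CDATA sections.

```python
import html
import re

# Matches one complete CDATA section, including its delimiters.
CDATA_RE = re.compile(r"<!\[CDATA\[.*?\]\]>", re.DOTALL)

def decode_outside_cdata(text):
    """Expand character references everywhere except inside CDATA sections."""
    parts = []
    pos = 0
    for match in CDATA_RE.finditer(text):
        # Text before the CDATA section is ordinary markup: decode it.
        parts.append(html.unescape(text[pos:match.start()]))
        # Inside CDATA, "&" is literal, so copy the section through untouched.
        parts.append(match.group(0))
        pos = match.end()
    # Decode whatever follows the last CDATA section.
    parts.append(html.unescape(text[pos:]))
    return "".join(parts)
```

An unterminated CDATA section is simply treated as ordinary text here; a real
parser would probably want to report it.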

Another correctness issue is the role of UTF-8 here; it appears that
your Codec does not deal with UTF-8 at all. On encoding, there
wouldn't be any need to ever use HTML entities, since you could encode
everything as UTF-8. Not doing so is fine - except that you could then
declare that the output is US-ASCII as well. On decoding, you might
need to pay attention to UTF-8. While doing so, it is advisable not to
mix Unicode and byte strings in a single operation. E.g. when you
write

  if input[i] == u'&'

then I believe input is a byte string, so this would be better

  if input[i] == '&'

The former will fail if ord(input[i])>127.
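
In current Python, where bytes and text are separate types, the same advice
might look like the sketch below: decode once at the boundary, then compare
text with text only. The function names are hypothetical, and I am assuming
the input really is UTF-8. The second function shows the encoding side, where
the output can honestly be declared US-ASCII.

```python
def find_ampersands(data):
    """Return the positions of '&' in UTF-8 encoded input.

    Positions are indexes into the decoded text, not into the bytes.
    """
    # Decode once, up front, so every later comparison is text against text.
    text = data.decode("utf-8")
    return [i for i, ch in enumerate(text) if ch == "&"]

def to_ascii_html(text):
    """Encode text as pure ASCII bytes.

    Characters outside ASCII become numeric character references via the
    standard "xmlcharrefreplace" error handler.
    """
    return text.encode("ascii", "xmlcharrefreplace")
```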

There is a (perhaps more important) issue of efficiency: Building up a
large string by adding a character at a time is not particularly
efficient. Assuming that non-ASCII characters are rare, you may try to
find large substrings of your input that need no processing. For
example, in decode, doing input.find("&") might be better: you can add
large chunks of input to the output.
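
A rough sketch of that scanning idea, in current Python syntax: jump from one
"&" to the next with find, and pass everything in between through in one
piece. The name iter_decoded_chunks is hypothetical, and html.unescape is only
a stand-in for the codec's own entity lookup.

```python
import html

def iter_decoded_chunks(text):
    """Yield pieces of text with character references expanded."""
    pos = 0
    while True:
        amp = text.find("&", pos)
        if amp == -1:
            # No further references: emit the remainder in one chunk.
            yield text[pos:]
            return
        # The plain run before the "&" needs no processing at all.
        yield text[pos:amp]
        semi = text.find(";", amp)
        if semi == -1:
            # Unterminated reference: pass the tail through unchanged.
            yield text[amp:]
            return
        # Decode just the span from "&" through ";". If it is not a known
        # reference, html.unescape returns it unchanged.
        yield html.unescape(text[amp:semi + 1])
        pos = semi + 1
```

The caller can then assemble the result with "".join(iter_decoded_chunks(s)).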

Even with these improvements, building up a string by adding tail
segments requires repeated copying of the string head. This can be
avoided with a pre-allocated list L, using string.join(L,"") when
done.
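
For instance, an encoder along these lines collects segments in a list and
joins them once at the end. This is only a sketch in current Python syntax,
where "".join(L) is the usual spelling of string.join(L, ""). The name
encode_entities is hypothetical, the list grows by append rather than being
sized in advance, and escaping of "&" itself is left out for brevity.

```python
def encode_entities(text):
    """Replace non-ASCII characters with numeric character references."""
    pieces = []   # output segments, joined once at the end
    start = 0     # start of the current run of plain ASCII
    for i, ch in enumerate(text):
        if ord(ch) > 127:
            # Flush the plain run, then add the reference for this character.
            pieces.append(text[start:i])
            pieces.append("&#%d;" % ord(ch))
            start = i + 1
    # Flush whatever plain text follows the last non-ASCII character.
    pieces.append(text[start:])
    return "".join(pieces)
```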

HTH,
Martin