[XML-SIG] Re: [4suite] Output encodings again

Martin v. Loewis martin@loewis.home.cs.tu-berlin.de
Tue, 12 Sep 2000 00:48:15 +0200


[for i18n readers: the issue is to convert u"\u00A9\u01A9" to latin-1,
 so that it comes out as "\251&#x1A9;"]

> Currently, on output to XML (and HTML), we first convert the UTF-8 that
> the DOM uses into Martin von Lowis's wchar type.  

It may be time to slowly retire this type. It is still needed for
1.5 installations, but the 1.6/2.0 type has a comparable feature set
and an interface that is here to stay; plus it offers quite a few
additional features.

Still, I believe it shares this problem with my type.

> So I'm rather at a loss as to how to efficiently escape such characters
> for XML output.  I know I want to render them as &#???;, but every
> method I see for doing so is rather wasteful.

In principle, the approach should be to introduce new encodings. That is,
you get latin-1-xml, latin-2-xml, koi-8r-xml, utf-8-xml, and so on.

These encodings are the same as the original ones, except that they
have different error handling: a character that cannot be encoded
becomes a character reference instead of raising an error. This
approach is possible both with my type and with the 2.0 type - however,
implementing these encodings is quite a lot of effort.
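
A minimal sketch of the "same encoding, different error handling" idea,
written against present-day Python 3 rather than the 1.6/2.0 API
discussed here. It assumes the standard codecs.register_error hook; the
handler name "xmlhexref" and the function names are made up for
illustration.

```python
import codecs


def xml_hex_ref(exc):
    """Error handler: replace unencodable characters with &#x...; references."""
    if not isinstance(exc, UnicodeEncodeError):
        raise exc
    # exc.object is the input string; [exc.start:exc.end] is the failing run.
    bad = exc.object[exc.start:exc.end]
    replacement = "".join("&#x%X;" % ord(ch) for ch in bad)
    # Return the replacement text and the position at which to resume encoding.
    return replacement, exc.end


# "xmlhexref" is a hypothetical handler name chosen for this sketch.
codecs.register_error("xmlhexref", xml_hex_ref)


def encode_xml(text, encoding):
    """Encode text, escaping whatever the target encoding cannot represent."""
    return text.encode(encoding, "xmlhexref")


print(encode_xml("\u00a9\u01a9", "latin-1"))
```

With this, one handler serves latin-1, koi8-r, utf-8 and so on, so no
per-encoding "-xml" variants are needed. Python's built-in
"xmlcharrefreplace" handler does the same with decimal references.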

I'm sure you've thought of catching the exception and then retrying
with a smaller string. That may not be too bad, provided it uses a
binary search to work efficiently. E.g.

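def latin1_xml(s):
    try:
        # Fast path: the whole string is representable in latin-1.
        return s.encode("latin-1")
    except UnicodeError:
        if len(s) == 1:
            # A single unencodable character becomes a hex character reference.
            return "&#x%X;" % ord(s)
        # Otherwise split the string in half and retry each half.
        m = len(s) / 2
        return latin1_xml(s[:m]) + latin1_xml(s[m:])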
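
To make the cost of the bisecting approach concrete, here is a sketch in
present-day Python 3 (where encode() returns bytes) that also counts the
encode attempts. The stats dictionary and the test string are
assumptions added purely for illustration.

```python
def latin1_xml(text, stats):
    """Encode to latin-1, bisecting on failure; stats counts encode attempts."""
    stats["attempts"] += 1
    try:
        return text.encode("latin-1")
    except UnicodeEncodeError:
        if len(text) == 1:
            # Single unencodable character: emit a hex character reference.
            return ("&#x%X;" % ord(text)).encode("ascii")
        mid = len(text) // 2
        return latin1_xml(text[:mid], stats) + latin1_xml(text[mid:], stats)


stats = {"attempts": 0}
# 1023 encodable characters followed by one that latin-1 cannot represent.
out = latin1_xml("a" * 1023 + "\u01a9", stats)
print(stats["attempts"])  # 21 attempts for a 1024-character string
```

With a single bad character the number of attempts grows with the
logarithm of the string length; with many bad characters it grows
roughly in proportion to their number.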

It could be implemented more efficiently if the UnicodeError reported
at exactly what offset the problem occurred, or at least which character
was causing the problem, e.g.

def latin1_xml(s):
    try:
        return s.encode("latin-1")
    except UnicodeError, e:
        if len(s) == 1:
            return "&#x%X;" % ord(s)
        # e.bad_char is the proposed attribute: the offending character.
        m = s.find(e.bad_char)
        r = "&#x%X;" % ord(e.bad_char)
        # s[:m] stops just before the bad character; s[m+1:] resumes after it.
        return latin1_xml(s[:m]) + r + latin1_xml(s[m+1:])
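
For comparison, a sketch of the offset-based variant in present-day
Python 3, where UnicodeEncodeError carries start and end offsets. This is
an illustration under that assumption, not the 2.0 API discussed above;
the loop structure is one possible arrangement.

```python
def latin1_xml(text):
    """Encode to latin-1, using the error's offsets to skip to each bad run."""
    chunks = []
    pos = 0
    while pos < len(text):
        try:
            chunks.append(text[pos:].encode("latin-1"))
            break
        except UnicodeEncodeError as exc:
            # exc.start/exc.end are relative to the slice passed to encode().
            start, end = pos + exc.start, pos + exc.end
            # Everything before the first error is known to be encodable.
            chunks.append(text[pos:start].encode("latin-1"))
            for ch in text[start:end]:
                chunks.append(("&#x%X;" % ord(ch)).encode("ascii"))
            pos = end
    return b"".join(chunks)


print(latin1_xml("\u00a9\u01a9"))
```

Each failure costs one extra encode attempt, and no searching for the bad
character is needed.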

I think such advanced error reporting could be useful; it is
questionable whether it could go into 2.0 if implemented. In any case,
it would probably be reasonable not to require a bad_char attribute in
every UnicodeError instance - perhaps UnicodeError should be subclassed
further:

def latin1_xml(s):
    try:
        return s.encode("latin-1")
    except ConversionError, e:
        # Proposed subclass: reports the offset and the offending character.
        m = e.offset
        r = "&#x%X;" % ord(e.bad_char)
        return latin1_xml(s[:m]) + r + latin1_xml(s[m+1:])
    except UnicodeError:
        # Plain UnicodeError: fall back to the binary search.
        if len(s) == 1:
            return "&#x%X;" % ord(s)
        m = len(s) / 2
        return latin1_xml(s[:m]) + latin1_xml(s[m:])
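
A runnable sketch of the subclassing idea in present-day Python 3. The
ConversionError class, its offset and bad_char attributes, and the
encode_latin1 stand-in codec are all hypothetical, defined here only so
that both except clauses can be exercised.

```python
class ConversionError(UnicodeError):
    """Hypothetical subclass that records where encoding failed."""

    def __init__(self, offset, bad_char):
        super().__init__("cannot encode %r at offset %d" % (bad_char, offset))
        self.offset = offset
        self.bad_char = bad_char


def encode_latin1(text):
    """Stand-in for a codec that raises the richer exception."""
    for offset, ch in enumerate(text):
        if ord(ch) > 0xFF:
            raise ConversionError(offset, ch)
    return text.encode("latin-1")


def to_xml_bytes(text, encode):
    try:
        return encode(text)
    except ConversionError as exc:
        # The subclass must be caught before the more general UnicodeError.
        ref = ("&#x%X;" % ord(exc.bad_char)).encode("ascii")
        head = to_xml_bytes(text[:exc.offset], encode)
        tail = to_xml_bytes(text[exc.offset + 1:], encode)
        return head + ref + tail
    except UnicodeError:
        # Codecs that give no details fall back to the binary search.
        if len(text) == 1:
            return ("&#x%X;" % ord(text)).encode("ascii")
        mid = len(text) // 2
        return to_xml_bytes(text[:mid], encode) + to_xml_bytes(text[mid:], encode)


print(to_xml_bytes("\u00a9\u01a9", encode_latin1))
```

Passing a plain encoder such as lambda t: t.encode("latin-1") takes the
second clause instead and yields the same bytes.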

Regards,
Martin