Is it possible to consume UTF8 XML documents using xml.dom.pulldom?

Wed Jul 30 11:43:58 EDT 2008

On 30 Jul, 16:32, Simon Willison <si... at simonwillison.net> wrote:
> I'm having a horrible time trying to get xml.dom.pulldom to consume a
> UTF8 encoded XML file. Here's what I've tried so far:
>
> >>> xml_utf8 = """<?xml version="1.0" encoding="UTF-8" ?>
> <msg>Simon\xe2\x80\x99s XML nightmare</msg>
> """
> >>> from xml.dom import pulldom
> >>> parser = pulldom.parseString(xml_utf8)
> >>> parser.next()
> ('START_DOCUMENT', <xml.dom.minidom.Document instance at 0x6f06c0>)
> >>> parser.next()
> ('START_ELEMENT', <DOM Element: msg at 0x6f0710>)
> >>> parser.next()
> ...
> UnicodeEncodeError: 'ascii' codec can't encode character u'\u2019' in
> position 21: ordinal not in range(128)

I can't reproduce this on Python 2.3.6 or 2.4.4 on RHEL 4. Instead, I
get the usual...

('CHARACTERS', <DOM Text node "Simon\u2019s XM...">)

And I can get the content of the text node as a proper Unicode object.
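
For reference, here is a minimal sketch of the same check. It is written
for current Python 3 rather than the 2.3/2.4 interpreters mentioned
above, which is an assumption about what a reader would run today. It
hands the UTF-8 bytes to pulldom.parse via io.BytesIO and collects the
text nodes. The helper name text_of is made up for illustration.

```python
import io
from xml.dom import pulldom

# The same document as in the quoted session, as UTF-8 encoded bytes.
xml_utf8 = (
    b'<?xml version="1.0" encoding="UTF-8" ?>\n'
    b'<msg>Simon\xe2\x80\x99s XML nightmare</msg>\n'
)

def text_of(data):
    """Return all character data in the document as one decoded string."""
    events = pulldom.parse(io.BytesIO(data))
    # Character data may arrive in more than one CHARACTERS event,
    # so collect every chunk and join them.
    return "".join(
        node.data for event, node in events if event == pulldom.CHARACTERS
    )

text = text_of(xml_utf8)
print(repr(text))
```

The text comes back as an ordinary decoded string, with the three bytes
\xe2\x80\x99 turned into the single character U+2019.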

[...]

> Is it possible to consume utf8 or unicode using xml.dom.pulldom or
> should I try something else?

Yes, it is possible, at least in Python 2.3.6 and 2.4.4 configured
with --enable-unicode=ucs4 (which is what Red Hat does and expects).

Paul

P.S. You shouldn't try to pass Unicode to the parser, since XML
parsing in its entirety deals with byte sequences and character
encodings, although I suppose that there's some kind of
character-based (i.e. Unicode value-based) parsing method defined
somewhere by some committee or other.
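
To illustrate the point about bytes and encodings, here is a small
sketch, assuming Python 3. The names decoded_text, latin1_doc and the
sample content are hypothetical. The text is encoded to bytes in the
encoding named in the XML declaration, and the parser does the decoding
itself.

```python
import io
from xml.dom import pulldom

def decoded_text(data):
    """Parse encoded XML bytes and return the character data as str."""
    events = pulldom.parse(io.BytesIO(data))
    return "".join(
        node.data for event, node in events if event == pulldom.CHARACTERS
    )

# Hypothetical document: the declaration names ISO-8859-1, so the text
# is encoded to bytes with that same encoding before parsing.
latin1_doc = '<?xml version="1.0" encoding="ISO-8859-1"?><msg>caf\u00e9</msg>'
latin1_bytes = latin1_doc.encode("iso-8859-1")

# The same content declared and encoded as UTF-8 yields different bytes
# but the same decoded text.
utf8_bytes = latin1_doc.replace("ISO-8859-1", "UTF-8").encode("utf-8")

latin1_text = decoded_text(latin1_bytes)
utf8_text = decoded_text(utf8_bytes)
```

The two byte sequences differ, yet both parse to the same characters,
because the declared encoding tells the parser how to read the bytes.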