[XML-SIG] Re: checking a string for well-formedness

Fredrik Lundh fredrik@pythonware.com
Thu, 8 May 2003 11:54:57 +0200


Paul Tremblay wrote:

> import xml.parsers.expat
> parser = xml.parsers.expat.ParserCreate()
> import sys
>
> def validate(data):
>     parser.Parse(data)
>     try:
>         parser.Parse(data)
>         return 0
>     except xml.parsers.expat.ExpatError:
>         sys.stderr.write('tagging text will result in invalid XML\n')
>         return 1
>
> data = '<doc><tag>text</tag><tag>text,</tag></doc>'
> validate(data)
>
> The function validate returns 0 in this case.

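or it reports an error, if you don't remove the first call to
parser.Parse(data): the second call then feeds the same document
to the parser again, and expat rejects the extra data after the
root element.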

unfortunately, even if you remove that line, the function may
still return 0 for invalid XML snippets, e.g.:

    data = '<doc><tag>text</tag><tag>text,</tag>'

to fix this, you have to tell the parser that you won't call
it again with more data:

    parser.Parse(data, 1)
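
a minimal sketch of the difference, assuming a reasonably recent
Python. the helper name is_well_formed is made up for this example,
and it creates a fresh parser for every check so that one test can't
influence the next:

```python
import xml.parsers.expat

def is_well_formed(data, final):
    # hypothetical helper, not part of the original code
    parser = xml.parsers.expat.ParserCreate()
    try:
        parser.Parse(data, final)
        return True
    except xml.parsers.expat.ExpatError:
        return False

truncated = '<doc><tag>text</tag><tag>text,</tag>'  # </doc> is missing

# without the final flag, the parser simply waits for more data
without_final = is_well_formed(truncated, False)

# with the final flag, the parser knows the input ends here
with_final = is_well_formed(truncated, True)
```

in this sketch, only the second check should flag the truncated
document.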

> However, if I try this:
>
> data = u'<doc><tag>text</tag><tag>text\u201c</tag></doc>'
>
> I get the following error:
>
> Traceback (most recent call last):
>   File "/home/paul/lib/python/paul/xml/expat.py", line 50, in ?
>     parser.Parse(data)
> UnicodeError: ASCII encoding error: ordinal not in range(128)
>
> Any idea what is going on here?

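the Parse method requires an 8-bit string, and Python defaults
to ASCII when converting Unicode to 8-bit data. \u201c (a curly
quote) is outside the ASCII range, so that implicit conversion
fails before the parser ever sees the data.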
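
the conversion problem can be reproduced without the parser. the
sketch below does the encoding explicitly, which is only an
illustration of what the implicit conversion attempts; the variable
names are made up for the example:

```python
text = u'<tag>text\u201c</tag>'

# ASCII cannot represent U+201C, so this conversion fails
try:
    text.encode('ascii')
    ascii_ok = True
except UnicodeError:
    ascii_ok = False

# UTF-8 can represent any Unicode character
utf8_bytes = text.encode('utf-8')
```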

the simplest way to work around this is to convert the string to
the XML default encoding (utf-8) on the way in:

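import sys
import xml.parsers.expat

def validate(data):
    # an expat parser can't be reused after the final Parse call,
    # so create a fresh one for each check
    parser = xml.parsers.expat.ParserCreate()
    try:
        if isinstance(data, type(u"")):
            data = data.encode("utf-8")
        parser.Parse(data, 1)
        return 0
    except xml.parsers.expat.ExpatError:
        sys.stderr.write('tagging text will result in invalid XML\n')
        return 1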
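
if you'd rather see why a snippet was rejected, something like the
following might do. it's a sketch, not a drop-in replacement: the
name check is made up, and it returns the expat error message (or
None) instead of writing to stderr:

```python
import xml.parsers.expat

def check(data):
    # hypothetical variant of validate(): returns None for well-formed
    # data, or the parser's error message as a string
    parser = xml.parsers.expat.ParserCreate()
    if isinstance(data, type(u"")):
        data = data.encode("utf-8")
    try:
        parser.Parse(data, 1)
        return None
    except xml.parsers.expat.ExpatError as err:
        return str(err)

# the two examples from this thread
good = check(u'<doc><tag>text</tag><tag>text\u201c</tag></doc>')
bad = check('<doc><tag>text</tag><tag>text,</tag>')
```

here good should come back as None, while bad should hold a message
pointing at the missing end tag.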

</F>