[XML-SIG] Re: checking a string for well-formedness
Fredrik Lundh
fredrik@pythonware.com
Thu, 8 May 2003 11:54:57 +0200
Paul Tremblay wrote:
> import xml.parsers.expat
> parser = xml.parsers.expat.ParserCreate()
> import sys
>
> def validate(data):
>     parser.Parse(data)
>     try:
>         parser.Parse(data)
>         return 0
>     except xml.parsers.expat.ExpatError:
>         sys.stderr.write('tagging text will result in invalid XML\n')
>         return 1
>
> data = '<doc><tag>text</tag><tag>text,</tag></doc>'
> validate(data)
>
> The function validate returns 0 in this case.
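or raises an exception, unless you remove the first call to
parser.Parse(data): the second call feeds the same document to
the parser again, after the root element has already been closed.
unfortunately, even if you remove that line, the function may
still return 0 for invalid XML snippets, e.g.: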
> data = '<doc><tag>text</tag><tag>text,</tag>'
to fix this, you have to tell the parser that you won't call
it again with more data:
    parser.Parse(data, 1)
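
A minimal sketch of the difference the final flag makes, assuming a current Python 3 interpreter rather than the Python 2 of the original post. The variable names `truncated` and `complete` are illustrative only, and a fresh parser is created for each attempt because an expat parser is not meant to be fed again once it has been told the input is finished.

```python
import xml.parsers.expat

# The closing </doc> tag is missing.
truncated = '<doc><tag>text</tag><tag>text,</tag>'

# Without the final flag, expat assumes more data may follow,
# so the missing end tag is not reported.
parser = xml.parsers.expat.ParserCreate()
parser.Parse(truncated)

# With the final flag set, the same input is rejected.
parser = xml.parsers.expat.ParserCreate()
try:
    parser.Parse(truncated, True)
    complete = True
except xml.parsers.expat.ExpatError:
    complete = False
```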
> However, if I try this:
>
> data = u'<doc><tag>text</tag><tag>text\u201c</tag></doc>'
>
> I get the following error:
>
> Traceback (most recent call last):
>   File "/home/paul/lib/python/paul/xml/expat.py", line 50, in ?
>     parser.Parse(data)
> UnicodeError: ASCII encoding error: ordinal not in range(128)
>
> Any idea what is going on here?
the parse function requires an 8-bit string, and Python defaults
to ASCII when converting Unicode to 8-bit data.
the simplest way to work around this is to convert the string to
the XML default encoding (utf-8) on the way in:
    def validate(data):
        try:
            if isinstance(data, type(u"")):
                data = data.encode("utf-8")
            parser.Parse(data, 1)
            return 0
        except xml.parsers.expat.ExpatError:
            sys.stderr.write('tagging text will result in invalid XML\n')
            return 1
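
For readers on Python 3, the same idea might look roughly like the sketch below. This is an adaptation under stated assumptions, not the code from this thread: `is_well_formed` is a hypothetical name, it returns a boolean instead of 0/1, and it creates a new parser on every call so the function can be used more than once.

```python
import xml.parsers.expat


def is_well_formed(data):
    """Return True if data parses as a complete, well-formed XML document.

    Hypothetical helper for illustration; accepts str or bytes.
    """
    # Encode text to UTF-8, the default encoding for XML documents.
    if isinstance(data, str):
        data = data.encode("utf-8")
    # A fresh parser per call: a parser that has seen its final
    # chunk of input should not be fed again.
    parser = xml.parsers.expat.ParserCreate()
    try:
        parser.Parse(data, True)  # True: no more data will follow
        return True
    except xml.parsers.expat.ExpatError:
        return False
```

Calling `is_well_formed('<doc><tag>text\u201c</tag></doc>')` exercises the non-ASCII case from the question, and passing the truncated snippet shown earlier exercises the final-flag check.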
</F>