[XML-SIG] Re: Re: checking a string for well-formedness

Fredrik Lundh fredrik@pythonware.com
Fri, 9 May 2003 00:16:26 +0200


(please don't top-post)

Paul Tremblay wrote:

> > the parse function requires an 8-bit string, and Python defaults
> > to ASCII when converting Unicode to 8-bit data.
>
> I must be dense when it comes to unicode. So Python converts unicode
> to a 7-bit (ASCII) string?

if you're using a Unicode string where Python expects an 8-bit
string, Python refuses to guess, and raises an exception if the
Unicode string contains anything that's not plain ASCII.
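
A minimal sketch of that behaviour, assuming a current Python 3 interpreter, where the implicit conversion no longer exists and has to be spelled out as an explicit encode step. The helper name to_ascii_bytes is made up for illustration:

```python
def to_ascii_bytes(text):
    """Encode text with the ASCII codec, the default discussed above."""
    return text.encode("ascii")

# plain ASCII converts without trouble
plain = to_ascii_bytes(u"<doc>text</doc>")

# U+201C (left double quotation mark) is outside ASCII, so this raises
try:
    to_ascii_bytes(u"<doc>text\u201c</doc>")
    failed = False
except UnicodeEncodeError:
    failed = True

# naming an encoding that can represent the character avoids the error
utf8 = u"<doc>text\u201c</doc>".encode("utf-8")
```

Encoding to UTF-8 yourself, as in the last line, is what the isinstance check in the function below does before handing the data to the parser.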

> Your solution worked, but then I immediately came up with a new problem
> when I tried to test the speed of this function:
>
> # assume the same exact function from below, which I cut and pasted
> for j in range(10):
>     data = u'<doc><tag>text\u201c</tag><tag>thext,</tag></doc>'
>     validate(data)
>
> The first time the string is tested, it comes out as valid. But every
> single instance afterwards comes out as ill-formed XML.

You have to create a new parser for each run (my mistake; I'd already
fixed two bugs in your code, and missed the third one ;-)
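
To see why reuse fails, here is a small sketch, assuming Python 3 and the standard xml.parsers.expat module. The variable names (shared, fresh, reuse_failed) are illustrative only. Once Parse has been called with a true second argument, the parser is finished, and a further Parse call on it is expected to raise ExpatError no matter what data it gets:

```python
import xml.parsers.expat

doc = b"<doc><tag>text</tag></doc>"

# one parser, created once and then reused: the pattern that fails
shared = xml.parsers.expat.ParserCreate()
shared.Parse(doc, True)  # first document parses fine

try:
    shared.Parse(doc, True)  # second call on the already-finished parser
    reuse_failed = False
except xml.parsers.expat.ExpatError:
    reuse_failed = True

# a fresh parser accepts the very same document again
fresh = xml.parsers.expat.ParserCreate()
fresh.Parse(doc, True)
```

That matches the symptom in the loop above: the first run succeeds, and every later run reports an error even though the data is unchanged.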

> > def validate(data):
> >     try:
> >         if isinstance(data, type(u"")):
> >             data = data.encode("utf-8")

+ +         parser = xml.parsers.expat.ParserCreate()

> >         parser.Parse(data, 1)
> >         return 0
> >     except xml.parsers.expat.ExpatError:
> >         sys.stderr.write('tagging text will result in invalid XML\n')
> >         return 1
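
Put together as a self-contained sketch, assuming Python 3 (so the check is against str rather than type(u""), and the imports are written out), followed by the ten-run loop from the question. The names results and broken are added here for illustration:

```python
import sys
import xml.parsers.expat

def validate(data):
    """Return 0 if data is well-formed XML, 1 otherwise."""
    try:
        if isinstance(data, str):
            # hand expat UTF-8 bytes instead of relying on a default codec
            data = data.encode("utf-8")
        # a new parser for every call; a finished parser cannot be reused
        parser = xml.parsers.expat.ParserCreate()
        parser.Parse(data, True)
        return 0
    except xml.parsers.expat.ExpatError:
        sys.stderr.write("tagging text will result in invalid XML\n")
        return 1

# the timing loop from the question: every run should now agree
results = [
    validate(u'<doc><tag>text\u201c</tag><tag>thext,</tag></doc>')
    for j in range(10)
]

# a deliberately ill-formed document (unclosed <tag>) for comparison
broken = validate(u"<doc><tag>text</doc>")
```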

</F>