[XML-SIG] Re: Re: checking a string for well-formedness
Fredrik Lundh
fredrik@pythonware.com
Fri, 9 May 2003 00:16:26 +0200
(please don't top-post)
Paul Tremblay wrote:
> > the parse function requires an 8-bit string, and Python defaults
> > to ASCII when converting Unicode to 8-bit data.
>
> I must be dense when it comes to unicode. So Python converts unicode
> to a 7-bit (ASCII) string?
If you're using a Unicode string where Python expects an 8-bit
string, Python refuses to guess, and raises an exception if the
Unicode string contains anything that's not plain ASCII.
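For illustration, here is a minimal sketch of what that conversion amounts to. It spells out the encode step explicitly (an assumption made so the snippet runs on any Python version, rather than relying on the implicit conversion); the variable names are made up for the example.

```python
# A Unicode string containing a non-ASCII character (U+201C, left double quote).
text = u"text\u201c"

# Encoding with the ASCII codec fails, which is what the implicit
# Unicode-to-8-bit conversion runs into.
try:
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# Encoding explicitly as UTF-8 works and yields an 8-bit string.
utf8_bytes = text.encode("utf-8")
```

Picking the encoding yourself, as the UTF-8 call does, avoids the exception.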
> Your solution worked, but then I immediately came up with a new problem
> when I tried to test the speed of this function:
>
> # assume the exact same function from below, which I cut and pasted
> for j in range(10):
>     data = u'<doc><tag>text\u201c</tag><tag>thext,</tag></doc>'
>     validate(data)
>
> The first time the string is tested, it comes out as valid. But every
> single instance afterwards comes out as ill-formed XML.
You have to create a new parser for each run (my mistake; I'd already
fixed two bugs in your code, and missed the third one ;-)
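A small sketch of the underlying behaviour, assuming a current Python where bytes literals are written b"...": once an expat parser has been given its final chunk, feeding it another document raises ExpatError, while a fresh parser accepts the same data. The names reused_ok and fresh are made up for the example.

```python
import xml.parsers.expat

data = b"<doc><tag>text</tag></doc>"

parser = xml.parsers.expat.ParserCreate()
parser.Parse(data, True)          # first document: parses fine

# The same parser object has already finished, so a second
# document is rejected even though it is well-formed.
try:
    parser.Parse(data, True)
    reused_ok = True
except xml.parsers.expat.ExpatError:
    reused_ok = False

# A newly created parser handles the same data without complaint.
fresh = xml.parsers.expat.ParserCreate()
fresh.Parse(data, True)
```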
> > def validate(data):
> >     try:
> >         if isinstance(data, type(u"")):
> >             data = data.encode("utf-8")
+ +         parser = xml.parsers.expat.ParserCreate()
> >         parser.Parse(data, 1)
> >         return 0
> >     except xml.parsers.expat.ExpatError:
> >         sys.stderr.write('tagging text will result in invalid XML\n')
> >         return 1
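For reference, a self-contained sketch of the same function with its imports, written under the assumption of a current Python where str is the Unicode type (the quoted code uses type(u"") for the same check). It is one way to arrange the code, not the only one.

```python
import sys
import xml.parsers.expat


def validate(data):
    """Return 0 if data is well-formed XML, 1 otherwise."""
    # Encode Unicode text explicitly instead of relying on a default codec.
    if isinstance(data, str):
        data = data.encode("utf-8")
    # Create a new parser on every call; a finished parser cannot be reused.
    parser = xml.parsers.expat.ParserCreate()
    try:
        parser.Parse(data, True)
    except xml.parsers.expat.ExpatError:
        sys.stderr.write('tagging text will result in invalid XML\n')
        return 1
    return 0


# The timing loop from the quoted message now gives the same answer each time.
data = u'<doc><tag>text\u201c</tag><tag>thext,</tag></doc>'
results = [validate(data) for _ in range(10)]
```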
</F>