[XML-SIG] unicode

Mark McEahern marklists@mceahern.com
Fri, 10 Aug 2001 10:23:55 -0700


>If you have the need to parse Unicode strings, I'd recommend to encode
>them first. If you have an encoding declaration in the document, you
>should encode them using that declaration; otherwise you should encode
>them as UTF-8.

Yes, that's the workaround I've been using:

	foo = u'<foo/>'
	foo = foo.encode('utf-8')
	fooDoc = parseString(foo)

By the way, I discovered a clearer demonstration of the problem:

	import xml.dom.minidom

	foo = '<foo/>'
	fooDoc = xml.dom.minidom.parseString(foo)
	fooXml = fooDoc.toxml()  # comes back as a unicode string
	try:
	    fooDoc2 = xml.dom.minidom.parseString(fooXml)
	except TypeError:
	    print 'Round-tripping failed.'
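
For what it's worth, here is a rough sketch of the same round trip with the encoding workaround folded into a helper. The name parse_text is made up for illustration and is not part of minidom, and the sketch assumes a Python where the bytes type exists (which is not the case on 2.1). It encodes as UTF-8 because toxml() called without arguments emits no encoding declaration.

```python
import xml.dom.minidom


def parse_text(text, encoding='utf-8'):
    """Hypothetical helper (not a minidom API): encode, then parse."""
    # Hand the parser an encoded byte string rather than unicode text,
    # which is the workaround described above.
    if not isinstance(text, bytes):
        text = text.encode(encoding)
    return xml.dom.minidom.parseString(text)


foo_doc = parse_text(u'<foo/>')
foo_xml = foo_doc.toxml()        # text with no encoding declaration
foo_doc2 = parse_text(foo_xml)   # second parse goes through the helper too
round_trip_tag = foo_doc2.documentElement.tagName
```

With the helper in place, the second parse gets an encoded string, so the TypeError from the demonstration above should not come up.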

>If you can come up with a patch that gets this right, it would be much
>appreciated.

Well, I noticed the error was happening in pulldom.py:

	  File "c:\python21\_xmlplus\dom\pulldom.py", line 316, in parseString
	    buf = StringIO(string)
	TypeError: expected string, unicode found

parseString looks like this:

	def parseString(string, parser=None):
	    try:
	        from cStringIO import StringIO
	    except ImportError:
	        from StringIO import StringIO

	    bufsize = len(string)
	    buf = StringIO(string)
	    if not parser:
	        parser = xml.sax.make_parser()
	    return DOMEventStream(buf, parser, bufsize)

I thought, "Gee, what if I force this to use StringIO instead of
cStringIO..."

Guess what?  It worked without manually encoding the string as 'utf-8'.
Does that mean the problem is in cStringIO?  I guess I'm still not clear on
what the exact problem is.  Is it that cStringIO only accepts ... what?

Thanks,

// mark