[XML-SIG] Unicode support problems in parsers

Wed, 14 Feb 2001 13:09:51 +0200

Dear XML/Python-gurus,

I've encountered a slight problem while using the otherwise quite
excellent PyXML package (version 0.6.2, IIRC). One of my functions
iterates through a long list of long XML files with varying encodings,
which makes it quite sensitive to both memory use and Unicode issues.
I'm using the DOM interface and read the XML data using

import xml.dom.ext.reader.Sax2
f = open('myfile')
doc = xml.dom.ext.reader.Sax2.FromXMLStream(f)
f.close()

Unfortunately the default parser seems to have serious memory
management problems: the total amount of used memory grows by 1-2
megabytes for each processed file. A forced garbage collection (this
is Py2.0) doesn't help at all. The most obvious solution was to use a
different parser; we needed a validating parser anyhow. And adding the
keyword 'validate=1' to the 'FromXMLStream' call did indeed solve the
memory leak bug.
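
For comparison, here is a minimal sketch of the per-file loop described
above, written against the standard library's xml.dom.minidom rather
than the PyXML Sax2 reader. The names count_elements and process_all are
hypothetical, and it is only an assumption that an explicit release
step of this kind would also cure the growth seen with the default
PyXML parser.

```python
import gc
import io
from xml.dom import minidom


def count_elements(stream):
    """Parse one XML stream, count its elements, then release the DOM tree."""
    doc = minidom.parse(stream)
    try:
        # "*" matches every element in the document, including the root.
        return len(doc.getElementsByTagName("*"))
    finally:
        # unlink() breaks the parent/child reference cycles inside the tree,
        # so the nodes can be freed without waiting for the cycle collector.
        doc.unlink()


def process_all(streams):
    """Handle the streams one at a time so only one DOM tree is alive at once."""
    counts = []
    for stream in streams:
        counts.append(count_elements(stream))
        gc.collect()  # optional: force a collection between files
    return counts


# Two tiny in-memory documents stand in for the long list of XML files.
samples = [io.BytesIO(b"<a><b/><b/></a>"), io.BytesIO(b"<root/>")]
print(process_all(samples))
```

The point of the sketch is only that each tree is released before the
next file is parsed, so that memory use can be compared file by file.
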
However, an even more serious problem was now encountered: the default
*validating* parser returns normal Python strings, while the default
parser returns Unicode strings, as any sensible XML-processing tool
should do. This behaviour causes any amount of trouble elsewhere in
the code: the PrettyPrinter, for example, doesn't work at all with
normal strings containing non-ASCII chars.
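
One possible stopgap is to coerce every string coming out of the parser
to Unicode before it reaches the rest of the code. The sketch below
uses a hypothetical helper, to_text, and is written with the bytes/str
types of current Python; under Py2.0 the corresponding types would be
str and unicode. The encoding of the byte strings is an assumption here,
since it depends on what the parser actually hands back.

```python
def to_text(value, encoding="utf-8"):
    """Return value as a Unicode string, decoding byte strings if needed.

    The default encoding is an assumption; pass the real one if it differs.
    """
    if isinstance(value, bytes):
        return value.decode(encoding)
    # Already a Unicode string: return it unchanged.
    return value


# A UTF-8 byte string and a Unicode string both end up as Unicode text.
print(to_text(b"Kahanp\xc3\xa4\xc3\xa4") == "Kahanp\u00e4\u00e4")
print(to_text("plain ascii"))
```

This only papers over the symptom: it works if the encoding of the
returned byte strings is known for every file.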

I don't have the names of the parsers with problems right here, but the
test runs were
done on a Linux box with PyXML 0.6.2.

Yours
	Jere Kahanpää
	jere.kahanpaa@helsinki.fi