[XML-SIG] Unicode support problems in parsers
Wed, 14 Feb 2001 13:09:51 +0200
I've encountered a slight problem while using the otherwise quite
excellent PyXML package (version 0.6.2, IIRC). One of my functions
iterates through a long list of long XML files with varying encodings,
which makes it quite sensitive to both memory use and Unicode issues.
I'm using the DOM interface and read the XML data using:
    import xml.dom.ext.reader.Sax2
    f = open('myfile')
    doc = xml.dom.ext.reader.Sax2.FromXMLStream(f)
Unfortunately the default parser seems to have serious memory
management problems: the total amount of used memory grows by 1-2
megabytes for each processed file. A forced garbage collection (this is
Py2.0) doesn't help at all. The most obvious solution was to use a
different parser; we needed a validating parser anyhow. And adding the
keyword 'validate=1' to the 'FromXMLStream' call did indeed solve the
memory leak bug.
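For anyone hitting similar per-file memory growth, here is a minimal
sketch of one thing worth trying: explicitly releasing each DOM tree
once the needed data has been copied out. It assumes the growth comes
from DOM trees kept alive by parent/child reference cycles, which I have
not confirmed for the leak described above. It uses the standard
library's xml.dom.minidom (not the PyXML reader) and modern Python
syntax; process_stream and process_all are hypothetical names. If I
remember correctly, PyXML offers xml.dom.ext.ReleaseNode for the same
purpose.

```python
import gc
import io
import xml.dom.minidom


def process_stream(stream):
    """Parse one XML stream, copy out what is needed, then free the tree."""
    doc = xml.dom.minidom.parse(stream)
    try:
        # Return plain data only, so nothing keeps a reference to DOM nodes.
        return doc.documentElement.tagName
    finally:
        # DOM nodes reference both parents and children (reference cycles);
        # unlink() breaks those cycles so the tree can be reclaimed promptly.
        doc.unlink()


def process_all(streams):
    """Process many documents, releasing each tree before parsing the next."""
    results = []
    for stream in streams:
        results.append(process_stream(stream))
        gc.collect()  # optional: sweep any remaining cyclic garbage
    return results


if __name__ == "__main__":
    docs = [io.BytesIO(b"<root><a/></root>"), io.BytesIO(b"<other/>")]
    print(process_all(docs))
```

The point of the design is that only plain values leave process_stream,
so no DOM node outlives the iteration that created it.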
However, an even more serious problem was now encountered: the default
*validating* parser returns normal Python strings, while the default
parser returns Unicode strings, as any sensible XML-processing tool
should do. This behaviour causes any amount of problems in the code:
the PrettyPrinter, for example, doesn't work at all with non-ASCII
chars.
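As a possible stopgap, the values coming out of the parser could be
normalised to Unicode before they reach the rest of the code. The
following is only a sketch in modern Python syntax, where bytes plays
the role of the "normal Python string" above. The helper names to_text
and normalize_attributes are hypothetical, and the UTF-8 default is an
assumption: I don't know which encoding the validating parser actually
hands back.

```python
def to_text(value, encoding="utf-8"):
    """Return value as a Unicode string, decoding byte strings if needed."""
    if isinstance(value, bytes):
        # The encoding is an assumption; use whatever the parser really emits.
        return value.decode(encoding)
    return value


def normalize_attributes(attrs, encoding="utf-8"):
    """Return a copy of an attribute mapping with text keys and values."""
    return {
        to_text(name, encoding): to_text(val, encoding)
        for name, val in attrs.items()
    }


if __name__ == "__main__":
    raw = {b"name": b"p\xc3\xa4iv\xc3\xa4"}
    print(normalize_attributes(raw))
```

Values that are already Unicode pass through unchanged, so the same
helper can be used regardless of which parser produced the data.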
I don't have the names of the parsers with problems right here, but the
test runs were done on a Linux box with PyXML 0.6.2.