[XML-SIG] Unicode support problems in parsers
Jere Kahanpää
jere.kahanpaa@helsinki.fi
Wed, 14 Feb 2001 13:09:51 +0200
Dear XML/Python-gurus,
I've encountered a slight problem while using the otherwise quite
excellent PyXML package (version 0.6.2, IIRC). One of my functions
iterates through a long list of long XML files with varying encodings,
which makes it quite sensitive to both memory use and Unicode issues.
I'm using the DOM interface and read the XML data like this:
import xml.dom.ext.reader.Sax2
f = open('myfile')
doc = xml.dom.ext.reader.Sax2.FromXMLStream(f)
f.close()
Unfortunately the default parser seems to have serious memory
management problems: the total amount of used memory grows by 1-2
megabytes for each processed file. A forced garbage collection (this
is Py2.0) doesn't help at all. The most obvious solution was to use a
different parser, since we needed a validating parser anyhow. Adding
the keyword 'validate=1' to the 'FromXMLStream' call did indeed solve
the memory leak.
However, an even more serious problem then appeared: the default
*validating* parser returns normal Python strings, while the default
(non-validating) parser returns Unicode strings, as any sensible
XML-processing tool should. This behaviour causes no end of trouble
elsewhere in the code: the PrettyPrinter, for example, doesn't work at
all with normal strings containing non-ASCII characters.
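
One possible stopgap for the mixed string types described above is to decode any byte strings to Unicode before handing them on to the rest of the code. The sketch below only illustrates that idea and is not part of PyXML: the names to_text and normalize_attrs are hypothetical, and it is written for present-day Python 3, where a "normal" string from that era corresponds to bytes. On Python 2.0 the equivalent conversion would be unicode(value, encoding). It also assumes the encoding of each file is already known, for example from its XML declaration.

```python
def to_text(value, encoding="utf-8"):
    """Return value as a Unicode string, decoding byte strings if needed.

    Hypothetical helper: 'encoding' must match the encoding the bytes
    were actually written in, otherwise decoding fails or yields
    the wrong characters.
    """
    if isinstance(value, bytes):
        return value.decode(encoding)
    # Already a Unicode string (or some other object): pass it through.
    return value


def normalize_attrs(attrs, encoding="utf-8"):
    """Decode every key and value of an attribute-like mapping."""
    return {
        to_text(key, encoding): to_text(val, encoding)
        for key, val in attrs.items()
    }
```

For example, to_text(b"Kahanp\xc3\xa4\xc3\xa4") decodes the UTF-8 bytes to the text "Kahanpää", while a value that is already Unicode is returned unchanged. This only papers over the inconsistency; having both parsers return Unicode would still be the real fix.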
I don't have the names of the parsers with problems right here, but
the test runs were done on a Linux box with PyXML 0.6.2.
Yours
Jere Kahanpää
jere.kahanpaa@helsinki.fi