[XML-SIG] Unicode support problems in parsers

Martin v. Loewis martin@loewis.home.cs.tu-berlin.de
Sat, 17 Feb 2001 17:05:37 +0100


> Unfortunately the default parser seems to have serious memory
> management problems: the total amount of used memory grows by 1-2
> megabytes for each processed file. A forced garbage collection (this
> is Py2.0) doesn't help at all.

pyexpat in 0.6.2 had a number of memory leaks, most of which got fixed
in 0.6.3, although some are only fixed in the CVS. So if you take the
pyexpat.c from CVS, things should look much better.

There were two problems: the SAX reader created cyclic garbage (which
it shouldn't), and pyexpat would not participate in garbage
collection, which caused cycles involving Parser objects not to be
collected.
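
To illustrate the second problem, here is a minimal sketch in present-day
Python (not the 2.0 code discussed above). The Handler and Parser classes
are hypothetical stand-ins, not the real SAX reader or pyexpat types. It
shows that a reference cycle survives plain reference counting and is only
freed once the cycle collector can see every object in it, which is what a
C type that does not participate in garbage collection prevents.

```python
import gc
import weakref


class Handler:
    """Hypothetical stand-in for a SAX content handler."""

    def __init__(self):
        self.parser = None


class Parser:
    """Hypothetical stand-in for a parser that references its handler."""

    def __init__(self, handler):
        self.handler = handler
        handler.parser = self  # back-reference closes the cycle


def make_cycle():
    # Build a Parser <-> Handler cycle and return only a weak reference,
    # so nothing outside the cycle keeps the objects alive.
    parser = Parser(Handler())
    return weakref.ref(parser)


gc.disable()  # ensure only the explicit collection below runs
ref = make_cycle()
alive_before = ref() is not None  # cycle survives reference counting
gc.collect()
alive_after = ref() is not None   # cycle collector has reclaimed it
gc.enable()
```

Both classes here are ordinary Python classes, so the collector can
traverse them. An extension type that lacks garbage collection support
hides its references from the collector, and any cycle through it stays
uncollected.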

> However, an even more serious problem was now encountered; the
> default *validating* parser returns normal Python strings, while the
> default parser returns Unicode strings as any sensible
> XML-processing tool should do.

Yes, this is a known problem with xmlproc in the Python CVS; I hope
Lars Marius will contribute an updated version soon.

> This behaviour does cause any amount of trouble elsewhere in the code:
> The PrettyPrinter, for example, doesn't work at all with normal
> strings with non-ascii chars.

Which, in turn, is a bug in the pretty printer - since we are
attempting backwards compatibility with 1.5.2, it *should* support
plain strings.
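
Until both parsers return the same string type, calling code can normalize
at the boundary. The following is only a sketch in present-day Python,
where byte strings are `bytes` and text is `str`. The helper name `to_text`
and the default encoding are assumptions for illustration and not part of
PyXML; the caller has to know the actual encoding of the byte strings.

```python
def to_text(value, encoding="utf-8"):
    """Return value as text, decoding byte strings with the given encoding.

    Hypothetical helper: the encoding is an assumption supplied by the
    caller, since a byte string does not record how it was encoded.
    """
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value  # already a text string, pass through unchanged


# The same character arrives in three forms and ends up as the same text.
from_utf8 = to_text(b"caf\xc3\xa9")
from_latin1 = to_text(b"caf\xe9", "iso-8859-1")
from_text = to_text("caf\u00e9")
```

Decoding with the wrong encoding either raises UnicodeDecodeError or
silently yields the wrong characters, so this only helps when the source
encoding is known.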

> I don't have the names of the parsers with problems right here, but
> the test runs were done on a Linux box with PyXML 0.6.2.

Sorry for the inconvenience. If you need a fix right away, I suggest
you either use the PyXML CVS, or the 4Suite 0.10.2 beta, which has
many of the components updated. If you can wait somewhat longer - I
hope that I can release PyXML 0.6.4 in the near future.

Regards,
Martin