xml input sanitizing method in standard lib?
Stefan Behnel
stefan_ml at behnel.de
Tue Mar 10 03:40:10 EDT 2009
Gabriel Genellina wrote:
> En Mon, 09 Mar 2009 15:30:31 -0200, Petr Muller <afri at afri.cz> escribió:
>
>> Thanks for response and sorry for I wasn't clear first time. I have a
>> heap of data (logs), from which I build a XML document using
>> xml.dom.minidom. In this data, some xml invalid characters may occur -
>> form feed (\x0c) character is one example.
>>
>> I don't know what else is illegal in xml, so I've searched if there's
>> some method how to prepare strings for insertion to a xml doc before I
>> start research on a xml spec and write such function on my own.
>
> You don't have to; Python already comes with xml support. Using
> ElementTree to build the document is usually easier and faster:
> http://effbot.org/zone/element-index.htm
While I usually second that, this isn't the problem here. This thread is
about unallowed characters in XML. The set of allowed characters is defined
here:
http://www.w3.org/TR/xml/#charsets
And, as Terry Reedy pointed out, the "unicode.translate" method should get
you where you want. Just define a dict that maps the characters that you
want to remove to whatever character you want to use instead (or None) and
pass that into .translate().
Stefan
More information about the Python-list
mailing list