[issue5166] ElementTree and minidom don't prevent creation of not well-formed XML

Denis S. Otkidach report at bugs.python.org
Tue Nov 24 18:26:35 CET 2009


Denis S. Otkidach <denis.otkidach at gmail.com> added the comment:

Here is a regexp I use to clean up text (note that I don't touch
"compatibility characters", which are also discouraged in XML; some
other developers remove them too):

import re
import sys

# http://www.w3.org/TR/REC-xml/#NT-Char
# Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] |
#          [#x10000-#x10FFFF]
# (any Unicode character, excluding the surrogate blocks, FFFE, and FFFF)

# The range above U+FFFF can only be expressed on wide (UCS-4) builds;
# on narrow builds the surrogate pairs get replaced like any surrogate.
_char_tail = u''
if sys.maxunicode > 0xFFFF:
    _char_tail = u'%s-%s' % (unichr(0x10000),
                             unichr(min(sys.maxunicode, 0x10FFFF)))
_nontext_sub = re.compile(
    ur'[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD%s]' % _char_tail,
    re.U).sub

def replace_nontext(text, replacement=u'\uFFFD'):
    # Replace every character outside the XML 1.0 Char production.
    return _nontext_sub(replacement, text)
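
For anyone reading this on Python 3, here is a rough sketch of the same
idea. It is not part of the original comment: it assumes Python 3.3+,
where strings are always full Unicode and the narrow/wide build check is
no longer needed. The helper roundtrip() is a hypothetical name added
only to illustrate feeding cleaned text through ElementTree.

```python
import re
import xml.etree.ElementTree as ET

# Matches any code point outside the XML 1.0 Char production:
# Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD]
#          | [#x10000-#x10FFFF]
_nontext_sub = re.compile(
    '[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]'
).sub


def replace_nontext(text, replacement='\uFFFD'):
    """Replace characters that XML 1.0 does not allow in text."""
    return _nontext_sub(replacement, text)


def roundtrip(text):
    """Hypothetical helper: clean text, serialize it, and parse it back."""
    elem = ET.Element('note')
    elem.text = replace_nontext(text)
    return ET.fromstring(ET.tostring(elem)).text
```

Without the cleanup step, ElementTree will serialize a string such as
'bad\x00char' without complaint, and the result cannot be parsed back.
That is the problem this issue is about.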

----------

_______________________________________
Python tracker <report at bugs.python.org>
<http://bugs.python.org/issue5166>
_______________________________________
