[XML-SIG] stripping 8-bit ASCII from an XML stream using encode?
Martin v. Loewis
martin@v.loewis.de
Sat, 5 Jan 2002 08:11:08 +0100
> The code at the end of demonstrates a problem I'm having parsing XML files
> downloaded from SourceForge.
Please understand that the file you have downloaded is *not* an XML
file, even though it may look like one at a shallow glance. The bytes
with values of 128 and above are the precise cause of this
ill-formedness: the file does not declare an encoding, so it ought to
be UTF-8 (or UTF-16); yet it is not (most likely, it is meant to be
iso-8859-1).
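To illustrate the point, here is a minimal sketch in Python 3 syntax (an assumption on my part; the code quoted in this message is Python 2). The sample bytes are hypothetical and merely stand in for a name stored as iso-8859-1. They are rejected as UTF-8 but decode cleanly as iso-8859-1:

```python
# Hypothetical sample: "Löwis" as it would be stored in an iso-8859-1 file.
raw = b"L\xf6wis"

# The byte 0xf6 cannot occur in well-formed UTF-8, so strict decoding fails.
try:
    raw.decode("utf-8")
    is_utf8 = True
except UnicodeDecodeError:
    is_utf8 = False

# iso-8859-1 maps every byte value to a character, so this always succeeds.
as_latin1 = raw.decode("iso-8859-1")

print(is_utf8)
print(as_latin1)
```

Note that a successful iso-8859-1 decode proves nothing by itself, since any byte sequence decodes in that encoding; it is only the most likely guess for this file.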
Whatever processing you are performing, I'd be sorry to hear that it
ignores the exact spelling of my name :-)
> t = t.encode('ascii', 'ignore')
> UnicodeError: ASCII decoding error: ordinal not in range(128)
>
> I haven't done much XML processing, so this could be a FAQ, but I haven't
> been able to find the answer so far. What is the proper way to strip the
> 8-bit values?
It's not really an XML issue. The .encode call is equivalent to
    import codecs
    encoder = codecs.lookup('ascii')[0]
    t = encoder(t, 'ignore')[0]
Now, encoder is ascii_encode, which is a function expecting a Unicode
string, returning the ASCII byte string. The first argument to encoder
is a (byte) string, so that is auto-converted to Unicode first, using
the system default encoding, in strict mode. Since the system default
encoding is also 'ascii', your code becomes equivalent to
    encoder, decoder, _, _ = codecs.lookup('ascii')
    t = encoder(decoder(t, 'strict')[0], 'ignore')[0]
    # decoder(t, 'strict')[0] is the same as unicode(t, 'ascii')
It is the conversion to Unicode that fails. If you write
    t = unicode(t, 'ascii', 'ignore').encode('ascii')
you will strip the non-ASCII characters.
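The same two steps can be sketched in Python 3 syntax (an assumption; the snippets in this message are Python 2, where the decoding step happens implicitly). The sample string is hypothetical. In Python 3 the decoding must be written out explicitly, which makes it easy to see that the strict decoding is the step that fails:

```python
import codecs

# Hypothetical sample: a byte string holding one iso-8859-1 character.
t = b"Martin v. L\xf6wis"

encoder, decoder, _, _ = codecs.lookup("ascii")

# Strict decoding to Unicode: this is the step that raises.
try:
    decoder(t, "strict")
    strict_decode_ok = True
except UnicodeDecodeError:
    strict_decode_ok = False

# Ignoring errors while decoding drops the byte 0xf6; the result is
# pure ASCII, so encoding it again cannot fail.
stripped = t.decode("ascii", "ignore").encode("ascii")

print(strict_decode_ok)
print(stripped)
```

The 'ignore' handler belongs on the decoding side; putting it on the encoding side has no effect, because the error is raised before the encoder ever runs.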
> Is there another issue at work here?
Definitely. It would be much better if what you are trying to parse
were proper XML. If you add an XML declaration at the very start of
the file, i.e.
    <?xml version="1.0" encoding="iso-8859-1"?>
minidom.parse will process it just fine. The carriage-return
characters do no harm, either, since an XML processor is supposed to
deal with various line endings; I could not find occurrences of other
control characters in that document.
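A minimal sketch of this, in Python 3 syntax (an assumption; this message otherwise uses Python 2), with a small hypothetical document standing in for the downloaded file. The element names are made up for illustration. Without a declaration the parser assumes UTF-8 and rejects the input; with the declaration prepended, the same bytes parse, carriage returns included:

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError

# Hypothetical stand-in for the downloaded file: no XML declaration,
# CR LF line endings, one iso-8859-1 byte (0xf6).
body = b"<project>\r\n  <author>Martin v. L\xf6wis</author>\r\n</project>\r\n"

# Without a declaration, the byte 0xf6 is not well-formed.
try:
    minidom.parseString(body)
    parsed_without_decl = True
except ExpatError:
    parsed_without_decl = False

# Prepend the declaration so the parser knows the actual encoding.
decl = b'<?xml version="1.0" encoding="iso-8859-1"?>\n'
doc = minidom.parseString(decl + body)
author = doc.getElementsByTagName("author")[0].firstChild.data

print(parsed_without_decl)
print(author)
```

The name now arrives as a proper Unicode string, so nothing needs to be stripped before further processing.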
Regards,
Martin