[XML-SIG] stripping 8-bit ASCII from an XML stream using encode?

Martin v. Loewis martin@v.loewis.de
Sat, 5 Jan 2002 08:11:08 +0100

> The code at the end of demonstrates a problem I'm having parsing XML files
> downloaded from SourceForge. 

Please understand that the file you have downloaded is *not* an XML
file, even though it may look like one at a shallow glance. The bytes
with values of 128 and above are the precise cause of this
ill-formedness: the file does not declare an encoding, so a parser
must treat it as UTF-8; yet it is not valid UTF-8 (most likely, it is
meant to be iso-8859-1).
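
A quick way to check this, sketched in Python 3 rather than the
Python 2 used elsewhere in this message. The sample bytes and the
helper name decodes_as are made up for illustration; they are not
taken from the SourceForge file.

```python
# Hypothetical sample: "Löwis" as it would be stored in iso-8859-1.
raw = b"L\xf6wis"

def decodes_as(data, encoding):
    """Return True if data is a valid byte sequence in the given encoding."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError:
        return False
    return True

print(decodes_as(raw, "utf-8"))       # False: a lone 0xF6 is not valid UTF-8
print(decodes_as(raw, "iso-8859-1"))  # True: every byte maps to a character
```

Note that success under iso-8859-1 proves little by itself, since any
byte sequence decodes in that encoding; it is the UTF-8 failure that
shows the file is not what an XML parser must assume it to be.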

For any processing you are performing, I'd be sorry to hear that you
are ignoring the exact spelling of my name :-)

>     t = t.encode('ascii', 'ignore')
> UnicodeError: ASCII decoding error: ordinal not in range(128)
> I haven't done much XML processing, so this could be a FAQ, but I haven't
> been able to find the answer so far. What is the proper way to strip the
> 8-bit values? 

It's not really an XML issue. The .encode call is equivalent to

   encoder = codecs.lookup('ascii')[0]
   t = encoder(t, 'ignore')[0]

Now, encoder is ascii_encode, which is a function expecting a Unicode
string, returning the ASCII byte string. The first argument to encoder
is a (byte) string, so that is auto-converted to Unicode first, using
the system encoding, in strict mode. Since the system encoding is
'ascii' also, your code becomes equivalent to

  encoder, decoder, _, _ = codecs.lookup('ascii')
  t = encoder(decoder(t, 'strict')[0], 'ignore')[0]
  # decoder(t, 'strict')[0] is the same as unicode(t, 'ascii')

It is the conversion to Unicode that fails. If you write

  t = unicode(t, 'ascii', 'ignore').encode('ascii')

you will strip the non-ASCII characters.
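
For comparison, a minimal sketch of the same idea in Python 3, where
byte strings and Unicode strings are separate types and no implicit
conversion takes place. The input bytes are a made-up example.

```python
# Hypothetical iso-8859-1 input containing one 8-bit byte (0xF6).
raw = b"Martin v. L\xf6wis"

# Strict decoding is the step that fails, as in the traceback above.
try:
    raw.decode("ascii", "strict")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# Ignoring errors while *decoding* drops the 8-bit bytes; the result
# is pure ASCII, so encoding it back cannot fail.
stripped = raw.decode("ascii", "ignore").encode("ascii")

print(strict_failed)  # True
print(stripped)       # b'Martin v. Lwis'
```

The 'ignore' handler has to be given to the decoding step; passing it
only to the encoding step comes too late, which is what happened in
the original code.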

> Is there another issue at work here?

Definitely. It would be much better if what you are trying to parse
were proper XML. If you add an XML declaration, i.e.

<?xml version="1.0" encoding="iso-8859-1"?>

minidom.parse will process it just fine. The carriage-return
characters do no harm, either, since an XML processor is supposed to
deal with various line endings; I could not find occurrences of other
control characters in that document.
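
A rough sketch of that fix, again in Python 3. It assumes the
downloaded bytes carry no XML declaration at all (if the file already
starts with one that lacks an encoding, edit that line instead of
prepending a second one). The document body below is invented for
the example.

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError

# Hypothetical iso-8859-1 document body with no XML declaration.
body = b"<authors><author>Martin v. L\xf6wis</author></authors>"

# Without a declaration the parser assumes UTF-8 and rejects byte 0xF6.
try:
    minidom.parseString(body)
    parsed_without_declaration = True
except ExpatError:
    parsed_without_declaration = False

# Prepending the declaration tells the parser the real encoding.
declaration = b'<?xml version="1.0" encoding="iso-8859-1"?>\n'
doc = minidom.parseString(declaration + body)
name = doc.getElementsByTagName("author")[0].firstChild.data

print(parsed_without_declaration)  # False
print(name)
```

With the declaration in place, the name comes back as a Unicode
string with the o-umlaut intact, so nothing has to be stripped.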