[XML-SIG] xmlproc bug?

Lars Marius Garshol larsga@garshol.priv.no
04 May 2001 22:22:07 +0200


* Rich Salz
|
| If you feed() a unicode string into an xmlproc parser, Python barfs at
| line 234
|      # ignore unusal byte orders 2143 and 3412
|      elif new_data[:2] == '\xfe\xff':
|          enc = "utf-16-be" # with BOM
| 
| because apparently it is trying to convert the string to unicode and
| it's got 8bit characters.

The problem here is that we are trying to autodetect the encoding of a
Unicode string, but a Unicode string is already in Unicode and so
needs no decoding.
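
To make the failure concrete, here is a small sketch in modern Python, not taken from xmlproc. It assumes the implicit conversion Rich ran into used the ASCII codec, which was the Python 2 default; the helper name implicit_coercion_fails is made up for illustration. Comparing a Unicode string against '\xfe\xff' forced the byte string through that codec, and the BOM bytes are outside ASCII:

```python
def implicit_coercion_fails(byte_string):
    """Return True if the bytes cannot be decoded as ASCII.

    This makes explicit the conversion that old Python applied
    implicitly when a byte string met a Unicode string.
    """
    try:
        byte_string.decode("ascii")
    except UnicodeDecodeError:
        return True
    return False


# The UTF-16 big-endian BOM contains bytes above 0x7f.
print(implicit_coercion_fails(b"\xfe\xff"))
# Plain ASCII markup converts without trouble.
print(implicit_coercion_fails(b"<?xml"))
```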

You can work around this by setting the decoded parameter of feed to 1,
but it would be better if you did not have to.

I fixed it with the following change:

Index: xml/parsers/xmlproc/xmlutils.py
===================================================================
RCS file: /cvsroot/pyxml/xml/xml/parsers/xmlproc/xmlutils.py,v
retrieving revision 1.16
diff -c -r1.16 xmlutils.py
***************
*** 285,290 ****
--- 285,295 ----
  
          new_data = new_data+self.encoded_data
          self.encoded_data = ""
+ 
+         if not decoded and using_unicode and \
+            type(new_data) == types.UnicodeType:
+             decoded = 1
+         
          if not decoded and not self.charset_converter:
              self.autodetect_encoding(new_data)
              # If this returns with no auto-detected encoding, i.e.  if
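
The same guard can be sketched as a standalone function in modern Python. This is only an illustration of the idea in the patch, not the xmlproc code: the names feed and autodetect_encoding are reused for readability, but the BOM table, the UTF-8 fallback, and the return values are assumptions of this sketch.

```python
import codecs

# Hypothetical BOM table for this sketch; xmlproc's own detection differs.
_BOMS = (
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF8, "utf-8"),
)


def autodetect_encoding(data):
    """Guess an encoding from a leading BOM.

    Returns (encoding name, number of BOM bytes to skip).
    Falling back to UTF-8 is an assumption of this sketch.
    """
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return "utf-8", 0


def feed(new_data, decoded=False):
    """Return new_data as text, decoding only when it is still bytes."""
    # The guard from the patch: text that is already Unicode is
    # treated as decoded, so autodetection never sees it.
    if not decoded and isinstance(new_data, str):
        decoded = True

    if not decoded:
        encoding, skip = autodetect_encoding(new_data)
        new_data = new_data[skip:].decode(encoding)
    return new_data
```

With this guard the caller can pass either bytes or text to feed without setting decoded by hand, which is the behaviour the patch is aiming for.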

I need to test it before committing it, but this should solve the
problem. (I am waiting for glibc to download so that I can compile
Python 2.1 and actually test this. The download is going slowly, so
I am posting before the commit.)

--Lars M.