codecs, Swedish characters, and XML...don't mix?

Mike Hammill mike at pdc.kth.se
Thu May 10 21:51:52 CEST 2001


Hi,

A Web search shows that there used to be some problems with Python, Swedish 
characters (two with umlauts, ö and ä, and one with a ring, å; all are part of 
ISO8859-1), and XML.  I'm still having the problem with Python 2.1.  Does 
anyone know what's wrong?

Brief description of the problem:
(1) Read in an XML file containing Swedish characters using Python's LATIN1 
codec.
(2) Write the same file out using Python's UTF-8 codec.
(3) Read the file with xml.dom.minidom, but get an error when writing using
dom.toxml():
"UnicodeError: ASCII encoding error: ordinal not in range(128)"

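Steps (1) and (2) can be sketched as follows. This is only an illustration, assuming a current Python 3 interpreter rather than the Python 2.1 used in this post, and it works on in-memory bytes instead of the files named below. The sample string and the names latin1_bytes, reader, sink and writer are made up for the example.

```python
import codecs
import io

# Hypothetical sample: the title from the test file, as Latin-1 bytes.
latin1_bytes = 'Demo slidesh\u00f6w'.encode('iso-8859-1')

# Step (1): decode the Latin-1 bytes into a Unicode string.
reader = codecs.getreader('iso-8859-1')(io.BytesIO(latin1_bytes))
text = reader.read()

# Step (2): encode the same text as UTF-8.
sink = io.BytesIO()
writer = codecs.getwriter('utf-8')(sink)
writer.write(text)
utf8_bytes = sink.getvalue()

# The text is unchanged, but the UTF-8 form is one byte longer,
# because the umlauted o now takes two bytes instead of one.
print(len(latin1_bytes), len(utf8_bytes))
```
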
More detailed:
(1) Python version:
Python 2.1 (#1, Apr 17 2001, 20:20:54) 
[GCC 2.96 20000731 (Red Hat Linux 7.0)] on linux2

(2) Code:
#!/usr/bin/python2.1
import codecs
import xml.dom.minidom
import string

infile = 'doc.swedish.xml'   # "try" is a reserved word, so it cannot be a name
out = 'doc.utf8.xml'
out2 = 'del.me'
outout = 'doc.utf8.after.xml'

def main():
    # codecs.lookup() returns (encoder, decoder, stream reader, stream writer)
    (LATIN1_encode, LATIN1_decode, LATIN1_streamreader,
     LATIN1_streamwriter) = codecs.lookup('ISO8859-1')
    (UTF8_encode, UTF8_decode, UTF8_streamreader,
     UTF8_streamwriter) = codecs.lookup('UTF-8')

    input = LATIN1_streamreader(open(infile, 'r'))
    s = input.read()
    input.close()
   
    output = UTF8_streamwriter( open(out, 'w') )
    output.write(s)
    output.close()

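    # Replace the first line (the XML declaration) so that it names UTF-8.
    f = open(out, 'r')
    g = open(out2, 'w')
    f_list = f.readlines()
    f.close()
    del f_list[0]
    f_list.insert(0, '<?xml version="1.0" encoding="UTF-8"?>\n')
    g.writelines(f_list)
    g.close()

    # Parse the UTF-8 file and write the DOM back out.
    ff = open(out2, 'r')
    dom = xml.dom.minidom.parse(ff)
    ff.close()
    gg = open(outout, 'w')
    gg.write(dom.toxml())   # the UnicodeError is raised here
    gg.close()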

if __name__ == '__main__':
    main()

(3) File "doc.swedish.xml":
<?xml version="1.0" encoding="iso-8859-1"?>
<slideshow>
<title>Demo slideshöw
</title>
</slideshow>

(4) Traceback:
Traceback (most recent call last):
  File "./q_mini4.py", line 49, in ?
    main()
  File "./q_mini4.py", line 46, in main
    gg.write(dom.toxml())
UnicodeError: ASCII encoding error: ordinal not in range(128)
lxl01:/public/www/snac/Spring_2001/adm/crontab>
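
One possible reading of this traceback, offered as an assumption and not something confirmed in this post: dom.toxml() returns a Unicode string, and writing that to a plain file object converts it with the default ASCII codec, which cannot represent ö. The sketch below reproduces that conversion failure in a current Python 3 interpreter (not Python 2.1) and shows that an explicit UTF-8 encode succeeds. The document string and the names utf8_doc, ascii_failed and utf8_out are made up for the example.

```python
import xml.dom.minidom

# Hypothetical stand-in for the UTF-8 file produced by the script.
utf8_doc = ('<?xml version="1.0" encoding="UTF-8"?>\n'
            '<slideshow><title>Demo slidesh\u00f6w</title></slideshow>'
            ).encode('utf-8')

dom = xml.dom.minidom.parseString(utf8_doc)
text = dom.toxml()   # a Unicode string, not bytes

# Converting the string with the ASCII codec fails on the umlauted o.
try:
    text.encode('ascii')
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

# Choosing the encoding explicitly avoids the implicit ASCII conversion.
utf8_out = text.encode('utf-8')
```

If this reading is right, the write would need an explicit encoding step, for example by passing the string through a UTF-8 stream writer like the one already used for the first output file.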

(5) Observations:
(a) The characters look fine in the input file.  A simple round trip, decoding 
and then encoding with the LATIN1 codec, produces the same file as the 
original.
(b) The UTF-8 output file does change the umlauted o into a pair of 
different-looking characters.
(c) If the Swedish character is replaced by a regular "o" in the input file, 
everything works fine.
(d) Yikes!  This shouldn't be that hard!
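
Observations (b) and (c) match how the two encodings represent these characters. A small sketch, assuming a current Python 3 interpreter and using made-up variable names: ö is one byte in ISO8859-1 but two bytes in UTF-8, and those two bytes show up as a different pair of characters when a viewer treats them as Latin-1. A plain "o" is the same single byte in ASCII, ISO8859-1 and UTF-8, so it never hits the problem.

```python
o_umlaut = '\u00f6'                       # the Swedish character ö

latin1 = o_umlaut.encode('iso-8859-1')    # a single byte
utf8 = o_umlaut.encode('utf-8')           # two bytes

# What a Latin-1 viewer shows for the UTF-8 bytes: two other characters.
misread = utf8.decode('iso-8859-1')

# A plain "o" encodes to the same byte in all three encodings.
plain = 'o'
same_everywhere = (plain.encode('ascii')
                   == plain.encode('iso-8859-1')
                   == plain.encode('utf-8'))

print(latin1, utf8, misread, same_everywhere)
```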





