Parsing unicode (devanagari) text with xml.dom.minidom

rparimi at gmail.com rparimi at gmail.com
Sun Mar 8 02:24:03 CET 2009


Hello,

I am trying to process an xml file that contains unicode characters
(see http://vyakarnam.wordpress.com/). Wordpress allows exporting the
entire content of the website into an xml file. Using
xml.dom.minidom,  I wrote a few lines of python code to parse out the
xml file, but am stuck with the following error:

>>> import xml.dom.minidom
>>> dom = xml.dom.minidom.parse("wordpress.2009-02-19.xml")
>>> titles = dom.getElementsByTagName("title")
>>> for title in titles:
...    print "childNode = ", title.childNodes
...
childNode =  [<DOM Text node "Sanskrit N...">]
childNode =  [<DOM Text node "Sanskrit N...">]
childNode =  []
childNode =  []
childNode =  [<DOM Text node "1-1-1">]
childNode =  Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position
16-18: ordinal not in range(128)
>>>

Python exited when it was trying to parse the following node:
<title>अन् </title>

The xml header tells me that the document is UTF-8:
<?xml version="1.0" encoding="UTF-8"?>

I am running python 2.5.1 on Mac OSX 10.5.6 and my local settings are
as below:
$locale
LANG="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_CTYPE="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_ALL=


I googled around for similar errors, and tried using unicode but that
didn't help either:
>>> foo = unicode(titles[5].childNodes)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position
16-18: ordinal not in range(128)

I'm a novice with unicode, and am not not sure about how best to
handle the unicode  text I'm dealing with (devanagari). Any
suggestions will be helpful.

Thanks



More information about the Python-list mailing list