Parsing unicode (devanagari) text with xml.dom.minidom

Sun Mar 8 03:42:35 EDT 2009

rparimi at gmail.com wrote:
> I am trying to process an xml file that contains unicode characters
> (see http://vyakarnam.wordpress.com/). Wordpress allows exporting the
> entire content of the website into an xml file. Using
> xml.dom.minidom,  I wrote a few lines of python code to parse out the
> xml file, but am stuck with the following error:
> 
> >>> import xml.dom.minidom
> >>> dom = xml.dom.minidom.parse("wordpress.2009-02-19.xml")
> >>> titles = dom.getElementsByTagName("title")
> >>> for title in titles:
> ...    print "childNode = ", title.childNodes
> ...
> childNode =  [<DOM Text node "Sanskrit N...">]
> childNode =  [<DOM Text node "Sanskrit N...">]
> childNode =  []
> childNode =  []
> childNode =  [<DOM Text node "1-1-1">]
> childNode =  Traceback (most recent call last):
>   File "<stdin>", line 2, in <module>
> UnicodeEncodeError: 'ascii' codec can't encode characters in position
> 16-18: ordinal not in range(128)

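That's because you are printing it out to your console, so the unicode
text has to be encoded first, and the default 'ascii' codec cannot
represent the Devanagari characters. Encode the text explicitly (for
example as UTF-8) before printing it. repr() might also help.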
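
A minimal sketch of the explicit encoding step mentioned above. The
SAMPLE document and the helper names title_texts() and
encode_for_console() are made up for illustration, they are not taken
from your WordPress export, and UTF-8 is only an assumption about what
your terminal can display. It should run under both Python 2.7 and
Python 3.

```python
# -*- coding: utf-8 -*-
import xml.dom.minidom

# Hypothetical stand-in for the exported file: three <title> elements,
# one of them containing Devanagari text, one of them empty.
SAMPLE = (u'<rss><title>Sanskrit Notes</title>'
          u'<title>\u0935\u0943\u0926\u094d\u0927\u093f</title>'
          u'<title/></rss>')


def title_texts(xml_bytes):
    """Return the text content of every <title> element as unicode."""
    dom = xml.dom.minidom.parseString(xml_bytes)
    texts = []
    for title in dom.getElementsByTagName("title"):
        # Join the data of the text nodes instead of printing the node list.
        texts.append(u"".join(node.data for node in title.childNodes
                              if node.nodeType == node.TEXT_NODE))
    return texts


def encode_for_console(text, encoding="utf-8"):
    """Encode unicode text to bytes; unencodable characters become '?'."""
    return text.encode(encoding, "replace")


titles = title_texts(SAMPLE.encode("utf-8"))
for t in titles:
    # repr() of the encoded bytes is plain ASCII, so it prints anywhere.
    print(repr(encode_for_console(t)))
```

Passing "replace" as the error handler is a design choice here: with a
codec that cannot represent the characters you get "?" placeholders
instead of a UnicodeEncodeError.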

Regarding minidom, you might be happier with the xml.etree package that
comes with Python 2.5 and later (it's also available for older versions).
It's a lot easier to use, more memory friendly and also much faster.
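
For comparison, a rough sketch of the same title extraction with
xml.etree.ElementTree. The SAMPLE string is again a made-up stand-in
for the real export, and I'm assuming the titles are plain text
children of <title> elements. For a file on disk you would use
ET.parse(filename).getroot() instead of ET.fromstring(). Note that
iter() needs Python 2.7 or later; on 2.5 and 2.6 the equivalent method
is getiterator().

```python
# -*- coding: utf-8 -*-
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the exported file, encoded as UTF-8 bytes.
SAMPLE = (u"<rss><channel><title>Sanskrit Notes</title>"
          u"<item><title>\u0935\u0943\u0926\u094d\u0927\u093f</title></item>"
          u"<item><title/></item>"
          u"</channel></rss>").encode("utf-8")

root = ET.fromstring(SAMPLE)

# elem.text is None for an empty element, so fall back to an empty string.
titles = [elem.text or u"" for elem in root.iter("title")]

for t in titles:
    # Encode explicitly before printing; repr() keeps the output ASCII-only.
    print(repr(t.encode("utf-8")))
```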

Stefan