Unicode strings -> xml.dom.minidom Text elements?

Patrick Surry Patrick.Surry at quadstone.com
Thu Oct 24 10:59:19 EDT 2002


Thanks for the replies on this (quoted below). It took me a while to see them
because I subscribe to this list by email and our mail server seems to be
eating most of the digests (please CC any replies to pds at quadstone.com if
possible).

First off, I apologize that my original symptom is a Jython issue, not a Python
one (as a newcomer I'm still a bit confused about how the two overlap, since I
have to rely on the Python docs for almost everything I do in Jython).  Jython
(2.1) does indeed emit an ASCII question-mark character in the output rather
than raising a UnicodeError exception as Python (2.2) does.

However, that still doesn't really solve my problem: I want to write an XML
document to a file with a text element containing a Greek capital letter sigma
(\u03A3).  xml.dom.minidom._write_data() escapes the &, <, ", and > symbols,
but does nothing to non-ASCII characters, so the writer still tries to emit a
unicode value to my output file and barfs.  I clearly can't escape manually
beforehand (since _write_data will re-escape my & character).  I suppose I
could derive a custom 'writer' that escapes after the fact, but this seems like
hard work; I assumed there would be a standard way of doing this?

i.e. basically I want to write something like this:

from xml.dom.minidom import Document
d = Document()
e = d.createElement('foo')
t = d.createTextNode(u'ABC\u03a3DEF')
d.appendChild(e)
e.appendChild(t)
d.writexml(somewriter)

If I look at d.toxml() I get this:

>>> d.toxml()
u'<?xml version="1.0" ?>\n<foo>ABC\u03a3DEF</foo>'

but now I want to serialize it to a correct XML file (I don't care what
encoding) that preserves the Unicode Greek capital sigma.
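
One way to get such a file, following the advice to save the document as
UTF-8, might look like the sketch below. It assumes a current Python 3
interpreter (not the Python 2.2 or Jython 2.1 discussed here), where the
built-in open() takes an encoding argument and writexml() takes an encoding=
parameter that only affects the XML declaration. The file name sigma.xml and
the helper names build_doc and write_utf8 are made up for illustration.

```python
import os
import tempfile
from xml.dom.minidom import Document, parse

def build_doc():
    # Same document as in the snippet above: <foo>ABC(sigma)DEF</foo>
    d = Document()
    e = d.createElement('foo')
    d.appendChild(e)
    e.appendChild(d.createTextNode(u'ABC\u03a3DEF'))
    return d

def write_utf8(doc, path):
    # The file object encodes every str it is handed as UTF-8;
    # encoding= here only adds encoding="utf-8" to the XML declaration.
    with open(path, 'w', encoding='utf-8') as out:
        doc.writexml(out, encoding='utf-8')

path = os.path.join(tempfile.mkdtemp(), 'sigma.xml')  # hypothetical file name
write_utf8(build_doc(), path)

with open(path, 'rb') as f:
    raw = f.read()          # the sigma is stored as the two bytes 0xCE 0xA3

# Parse the file back to check that the character survived the round trip
text = parse(path).documentElement.firstChild.data
```

The encoding step happens in the file object, not in minidom, so the same
pattern should work with any encoding that can represent the characters in the
document.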

Apologies again if this is a unicode/Python FAQ, but I'm relatively new to both
and am stuck...
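
The "custom writer that escapes after the fact" idea can be sketched in a few
lines, assuming a current Python 3 and its 'xmlcharrefreplace' codec error
handler (which, I believe, was added after the Python 2.2 used in this thread).
The class name CharRefWriter is hypothetical; it is not part of minidom.

```python
import io
from xml.dom.minidom import Document, parseString

class CharRefWriter:
    """Hypothetical writer: emits ASCII, turning anything else into &#NNN;."""

    def __init__(self, stream):
        self.stream = stream

    def write(self, s):
        # This runs after minidom has escaped the markup characters,
        # so the '&' of each character reference is not escaped again.
        self.stream.write(s.encode('ascii', 'xmlcharrefreplace'))

d = Document()
e = d.createElement('foo')
d.appendChild(e)
e.appendChild(d.createTextNode(u'ABC\u03a3DEF'))

buf = io.BytesIO()
d.writexml(CharRefWriter(buf))
result = buf.getvalue()     # pure ASCII bytes containing ABC&#931;DEF

# An XML parser turns the reference back into the original character
roundtrip = parseString(result).documentElement.firstChild.data
```

This only makes sense for text and attribute values: a non-ASCII element or
attribute name cannot be written as a character reference, so a document with
such names would still need a real encoding such as UTF-8.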



Note re my original symptom:

In Jython, unicode chars get written as bogus characters to the output stream:

Jython 2.1 on java1.3.1_04 (JIT: null)
Type "copyright", "credits" or "license" for more information.
>>> s = u'ABC\u03a3DEF'
>>> s
u'ABC\u03A3DEF'
>>> import sys
>>> sys.stdout.write(s)
ABC?DEF>>>
>>> ^Z

Note that this really is an ASCII '?' character: if you write to a file in
text mode, you get the following from 'od':

D:\pds\jython>od -Ax -c -b foo-text-mode
000000   A   B   C   ?   D   E   F
       101 102 103 077 104 105 106
000007

If you write the file in binary mode, you get the low-order byte of the unicode
char (0xA3 == 0243):

D:\pds\jython>od -Ax -c -b foo-binary-mode
000000   A   B   C   ú   D   E   F
       101 102 103 243 104 105 106
000007
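
The byte values in the two dumps can be checked with a little arithmetic. The
sketch below (Python 3 syntax) only reproduces the numbers; that Jython
internally does a lossy replace in text mode and a low-byte truncation in
binary mode is an assumption based on the output above.

```python
sigma = u'\u03a3'

# Binary-mode dump: keep only the low-order 8 bits of the code point
low_byte = ord(sigma) & 0xFF

# Text-mode dump: a lossy ASCII encode substitutes '?' for the character
replaced = sigma.encode('ascii', 'replace')

print(oct(low_byte), hex(low_byte))   # compare with the 243 in the od output
print(oct(ord('?')), replaced)        # compare with the 077 in the od output
```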


In Python you get an exception:

Python 2.2.1 (#34, Apr  9 2002, 19:34:33) [MSC 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> s = u'ABC\u03a3DEF'
>>> s
u'ABC\u03a3DEF'
>>> import sys
>>> sys.stdout.write(s)
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeError: ASCII encoding error: ordinal not in range(128)
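
The exception comes from the implicit conversion of the unicode string with
the default (ASCII) codec. A small sketch of the same failure with explicit
encode() calls, written for a current Python 3 (where writing to sys.stdout
behaves differently from 2.2); the helper name try_encode is made up:

```python
s = u'ABC\u03a3DEF'

def try_encode(text, codec):
    # Return the encoded bytes, or None when the codec cannot represent the text
    try:
        return text.encode(codec)
    except UnicodeError:
        return None

ascii_bytes = try_encode(s, 'ascii')         # fails: sigma is outside range(128)
latin1_bytes = try_encode(s, 'iso-8859-1')   # fails: sigma is outside range(256)
utf8_bytes = try_encode(s, 'utf-8')          # succeeds: sigma becomes 0xCE 0xA3
```

Any codec that lacks the character fails the same way, so changing the default
encoding to another 8-bit code page would not help here.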




> Subject: Re: Unicode strings -> xml.dom.minidom Text elements?
> Date: 21 Oct 2002 19:31:21 +0200
> From: martin at v.loewis.de (Martin v. Loewis)
> Organization: Linux Private Site
> Newsgroups: comp.lang.python
> References: <mailman.1035217827.1293.python-list at python.org>
> 
> Patrick Surry <Patrick.Surry at quadstone.com> writes:
> 
> > and am stuffing it into an xml.dom.minidom Text() element.  But when I
> > serialize the document with doc.writexml(), it turns into:
> >
> > <text>ABC?DEF</text>
> 
> I find that hard to believe. Are you sure it really puts a question
> mark in there? Or is it just that your email program is not capable of
> sending GREEK CAPITAL LETTER SIGMA?
> 
> Have you, by any chance, modified sys.setdefaultencoding?
> 
> > This seems to be because writexml() effectively does
> >
> > writer.write('%s' % a)
> >
> > making the unicode character turn into a '?'
> 
> Extremely unlikely. Can you show a complete program that demonstrates
> this problem?
> 
> > Am I doing something dumb and/or is there a workaround I could use
> > other than writing my own XML unicode character escaper...
> 
> As a starting point, I recommend that you refrain from setting the
> default encoding to "mbcs". As the next step, I recommend that you try
> to save the XML document in UTF-8.
> 
> As it is, writexml is not capable of escaping characters itself. So
> you will find that writexml gives you a Unicode string, which you need
> to encode as UTF-8 yourself.
> 
> Depending on where exactly you got writexml from, you may find that it
> has an encoding= parameter. It still won't produce character
> references, though.
> 
> Regards,
> Martin
> 
>   -------------------------------------------------------------------------------
> 
> Subject: Re: Unicode strings -> xml.dom.minidom Text elements?
> Date: Mon, 21 Oct 2002 18:52:09 +0100
> From: Alan Kennedy <alanmk at hotmail.com>
> Organization: xhaus.com
> Newsgroups: comp.lang.python
> References: <mailman.1035217827.1293.python-list at python.org>
> 
> Patrick Surry wrote:
> >
> > I've got a unicode string like:
> >
> > a = u'ABC\u03A3DEF'
> >
> > and am stuffing it into an xml.dom.minidom Text() element.  But when I
> > serialize the document with doc.writexml(), it turns into:
> >
> > <text>ABC?DEF</text>
> 
> Where are you writing the xml to? To a file? To a character terminal?
> 
> If you're writing to a file, then the following questions are also
> important
> 
>  o What encoding are you using when writing the file?
>  o Is that encoding correctly declared in the file?
>  o How are you viewing the contents of the file? (e.g. browser, text
> editor, etc)
> 
> If you are viewing it on a character terminal, what character set does
> the terminal use? (On windows (for example), use the command "chcp" to
> see the "code page" in use).
> 
> What is the default encoding of your python installation? Check this
> with "import sys; sys.getdefaultencoding()"
> 
> I have my default python encoding set to "iso-8859-1", and observe the
> following behaviour.
> 
> X:\alan\pytal\test>python
> Python 2.2.1 (#34, Apr  9 2002, 19:34:33) [MSC 32 bit (Intel)] on win32
> Type "help", "copyright", "credits" or "license" for more information.
> >>> s = u"ABC\u03A3DEF"
> >>> s
> u'ABC\u03a3DEF'
> >>> import sys
> >>> sys.stdout.write(s)
> Traceback (most recent call last):
>   File "<stdin>", line 1, in ?
> UnicodeError: Latin-1 encoding error: ordinal not in range(256)
> 
> regards,
> 
> alan kennedy
> -----------------------------------------------------
> check http headers here: http://xhaus.com/headers
> email alan:              http://xhaus.com/mailto/alan



