[Tutor] UnicodeEncodeError

Albert-Jan Roskam fomcl at yahoo.com
Thu Nov 26 14:13:06 CET 2009

OK, thanks a lot Spir and Kent for your replies. I converted element.text to str because some of the element.text values were integers, and these caused TypeErrors later on in the program. I don't have the program here (it's in the office) so I can't tell you the exact details. It's a search-and-replace program where users can enter a search text (or regex pattern) and a replacement text. The source file is an XML file. Currently, strings with non-ASCII letters still need to be entered as Unicode literals, e.g. u'enqu\xeate' instead of "enquête". Kinda ugly. I'll try to fix that later. Thanks again!
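One possible way to handle both issues (non-string values and search text that arrives as encoded bytes) is to normalise everything to Unicode in one place. This is only a sketch: the helper name to_text and the UTF-8 default are assumptions, not taken from the actual program, and it relies on the bytes name, which does not exist on Python 2.4.

```python
# Hypothetical helper: coerce any value to Unicode text instead of calling str().
text_type = type(u'')  # unicode on Python 2, str on Python 3

def to_text(value, encoding='utf-8'):
    """Return value as Unicode text; the encoding default is an assumption."""
    if value is None:
        return u''                     # element.text can be None for empty elements
    if isinstance(value, text_type):
        return value                   # already Unicode, leave untouched
    if isinstance(value, bytes):
        return value.decode(encoding)  # encoded input, e.g. typed at a terminal
    return text_type(value)            # integers and other objects

search_text = to_text(b'enqu\xc3\xaate')  # UTF-8 bytes for "enquête"
number_text = to_text(2009)
```

With a helper like this, the user could type "enquête" directly and the program would decode it once, at the boundary, rather than requiring u'enqu\xeate'.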




In the face of ambiguity, refuse the temptation to guess.


--- On Wed, 11/25/09, Kent Johnson <kent37 at tds.net>:

From: Kent Johnson <kent37 at tds.net>
Subject: Re: [Tutor] UnicodeEncodeError
To: "Albert-Jan Roskam" <fomcl at yahoo.com>
Cc: tutor at python.org
Date: Wednesday, November 25, 2009, 5:55 PM

On Wed, Nov 25, 2009 at 8:44 AM, Albert-Jan Roskam <fomcl at yahoo.com> wrote:

I'm parsing an xml file using elementtree, but it seems to get stuck on certain non-ascii characters (for example: "ê"). I'm using Python 2.4. Here's the relevant code fragment:
for element in doc.getiterator():
  try:
    m = re.match(search_text, str(element.text))
  except UnicodeEncodeError:
    raise # I want to get rid of this exception.

    m = re.match(search_text, str(element.text))
UnicodeEncodeError: 'ascii' codec can't encode character u'\xea' in position 4: ordinal not in range(128)

You can't convert element.text to a str because it contains non-ascii characters. Why are you converting it? re.match() will accept a unicode string as its argument.

How can I get rid of this unicode encode error? I tried:
s = str(element.text)
(and then feeding it into the regex)
This fails because it is the str() that won't work. To get UTF-8 use
  s = element.text.encode('utf-8')
but I don't think this is the correct solution.
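To illustrate the difference: on Python 2, str() applied to a unicode object performs an implicit ASCII encode, which is what raises the error in the traceback. The sketch below makes that encode explicit so the behaviour is visible; the variable names are only for illustration.

```python
text = u'enqu\xeate'  # "enquête" as a Unicode string

# An ASCII encode (what str() does implicitly on Python 2) cannot represent u'\xea'.
try:
    text.encode('ascii')
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# An explicit UTF-8 encode succeeds and yields a byte string.
utf8_bytes = text.encode('utf-8')
```

The UTF-8 result is a byte string, so matching against it would mean the pattern has to be UTF-8 bytes too, which is one reason it is probably not the right fix here.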


The xml file is in UTF-8. Somehow I need to tell the program not to use ascii but utf-8, right?

No, just pass Unicode to re.match().
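A minimal sketch of that advice, assuming a Python version where xml.etree.ElementTree and Element.iter() are available (on Python 2.4 the external elementtree package and getiterator() would take their place). The sample XML and the find_matches name are made up for illustration.

```python
import re
import xml.etree.ElementTree as ET

# Made-up sample document, stored as UTF-8 bytes like the real file.
XML = u'<root><item>enqu\xeate 2009</item><item/></root>'.encode('utf-8')

def find_matches(root, pattern):
    """Return the text of every element whose text matches pattern."""
    regex = re.compile(pattern, re.UNICODE)
    hits = []
    for element in root.iter():
        text = element.text
        if text is None:          # empty elements have no text at all
            continue
        if regex.match(text):     # text is passed as-is, no str() conversion
            hits.append(text)
    return hits

root = ET.fromstring(XML)
matches = find_matches(root, u'enqu\xeate')
```

The parser has already decoded the UTF-8 file, so no encoding needs to be named when matching; only the None check is needed in place of the str() call.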


