[Tutor] UnicodeEncodeError
Albert-Jan Roskam
fomcl at yahoo.com
Thu Nov 26 14:13:06 CET 2009
OK, thanks a lot Spir and Kent for your replies. I converted element.text to str because some of the element.text values were integers, and these caused TypeErrors later on in the program. I don't have the program here (it's in the office), so I can't tell you the exact details. It's a search-and-replace program where users can enter a search text (or regex pattern) and a replacement text. The source file is an XML file. Currently, strings with non-ASCII letters still need to be entered as Unicode literals, e.g. u'enqu\xeate' instead of "enquête". Kinda ugly. I'll try to fix that later. Thanks again!
Cheers!!
Albert-Jan
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In the face of ambiguity, refuse the temptation to guess.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
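One way to avoid str() while still coping with values that are integers, byte strings, or None is a small conversion helper. The sketch below is only an illustration, not the program from the office: the name to_text and the default 'utf-8' encoding are assumptions, and it relies on the bytes name, so it needs Python 2.6 or later (on Python 2.4, str would take the place of bytes).

```python
# On Python 2 this is the unicode type; on Python 3 it is plain str.
text_type = type(u"")


def to_text(value, encoding="utf-8"):
    """Return value as a Unicode string without going through str().

    to_text is a hypothetical helper name; the encoding default is an
    assumption based on the XML file being UTF-8.
    """
    if value is None:
        # Elements without text have .text set to None.
        return u""
    if isinstance(value, text_type):
        # Already Unicode: nothing to convert.
        return value
    if isinstance(value, bytes):
        # Byte strings (e.g. user input on Python 2) are decoded explicitly.
        return value.decode(encoding)
    # Integers and other objects: convert with the Unicode constructor.
    return text_type(value)
```

Passing user input through such a helper would also let users type "enquête" directly, provided the encoding matches what the terminal or input source actually uses.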
--- On Wed, 11/25/09, Kent Johnson <kent37 at tds.net> wrote:
From: Kent Johnson <kent37 at tds.net>
Subject: Re: [Tutor] UnicodeEncodeError
To: "Albert-Jan Roskam" <fomcl at yahoo.com>
Cc: "tutor at python.org" <tutor at python.org>
Date: Wednesday, November 25, 2009, 5:55 PM
On Wed, Nov 25, 2009 at 8:44 AM, Albert-Jan Roskam <fomcl at yahoo.com> wrote:
Hi,
I'm parsing an XML file using ElementTree, but it seems to get stuck on certain non-ASCII characters (for example: "ê"). I'm using Python 2.4. Here's the relevant code fragment:
# CODE:
for element in doc.getiterator():
    try:
        m = re.match(search_text, str(element.text))
    except UnicodeEncodeError:
        raise # I want to get rid of this exception.
# PRINTBACK:
m = re.match(search_text, str(element.text))
UnicodeEncodeError: 'ascii' codec can't encode character u'\xea' in position 4: ordinal not in range(128)
You can't convert element.text to a str because it contains non-ascii characters. Why are you converting it? re.match() will accept a unicode string as its argument.
How can I get rid of this Unicode encode error? I tried:
s = str(element.text)
s.encode("utf-8")
(and then feeding it into the regex)
This fails because it is the str() that won't work. To get UTF-8 use
s = element.text.encode('utf-8')
but I don't think this is the correct solution.
The xml file is in UTF-8. Somehow I need to tell the program not to use ascii but utf-8, right?
No, just pass Unicode to re.match().
Kent
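Kent's suggestion of passing the Unicode text straight to re.match() can be sketched as follows. This is a minimal illustration under stated assumptions: the sample XML and the search pattern are made up, and the code uses Element.iter(), which replaced getiterator() in later Python versions (on Python 2.4, doc.getiterator() from the original fragment would be used instead).

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical sample document; the real file is not shown in the thread.
xml_bytes = u"<root><item>enqu\xeate 2009</item><item/></root>".encode("utf-8")
root = ET.fromstring(xml_bytes)

# Hypothetical search pattern, given as a Unicode string.
search_text = u"enqu\xeate"

matches = []
for element in root.iter():
    text = element.text
    if text is None:
        # Elements without text have .text set to None; skip them
        # instead of turning None into the string 'None' with str().
        continue
    # No str() call: the Unicode text goes to re.match() unchanged.
    if re.match(search_text, text):
        matches.append(text)
```

Because nothing is forced through the ASCII codec, the UnicodeEncodeError from the original fragment does not arise here.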