[Tutor] UnicodeEncodeError

Wed Nov 25 15:12:55 CET 2009

Albert-Jan Roskam <fomcl at yahoo.com> wrote:

> # CODE:
> for element in doc.getiterator():
>   try:
>     m = re.match(search_text, str(element.text))
>   except UnicodeEncodeError:
>     raise # I want to get rid of this exception.

First, you should split the two operations that share a single statement, so you can see which one is the source of the error:
for element in doc.getiterator():
  try:
    source = str(element.text)
  except UnicodeEncodeError:
    raise # I want to get rid of this exception.
  else:
    m = re.match(search_text, source)
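
A minimal sketch of the same separation, for illustration only. It assumes the texts are already unicode objects, and it spells out the ASCII encoding that Python 2's str() performs implicitly on them, so that the failing step is visible. The function name match_all and the sample strings are made up for this example.

```python
import re

def match_all(texts, search_text):
    """Check the conversion and the match as two separate steps."""
    matched, failed = [], []
    for text in texts:
        try:
            # Explicit form of what Python 2's str(text) does to a unicode object.
            text.encode("ascii")
        except UnicodeEncodeError:
            # The conversion is the culprit, not the regex.
            failed.append(text)
        else:
            matched.append(re.match(search_text, text))
    return matched, failed

# Hypothetical sample data: one pure-ASCII text, one with a non-ASCII character.
matched, failed = match_all([u"abc", u"caf\xe9"], u"a")
```

Only the second sample ends up in failed, which points at the conversion rather than at re.match.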

I guess
   source = unicode(element.text, "utf8")
should do the job if element.text is an encoded byte string and you actually know the elements are utf8-encoded (otherwise try latin1 or, better, get proper information on the origin of your doc files).
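
As a rough sketch of the "utf8, else latin1" idea: the helper below tries each encoding in turn on a byte string. The name decode_text and the sample bytes are hypothetical. It uses raw.decode(encoding), which is the method form of unicode(raw, encoding) in Python 2 and also exists on bytes in Python 3.

```python
def decode_text(raw, encodings=("utf8", "latin1")):
    """Return raw (a byte string) decoded with the first encoding that fits."""
    for encoding in encodings:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue  # wrong guess, try the next candidate
    raise ValueError("none of the candidate encodings fit")

# The same word, encoded two different ways (made-up sample data).
from_utf8 = decode_text(b"caf\xc3\xa9")
from_latin1 = decode_text(b"caf\xe9")
```

Note that latin1 maps every possible byte to some character, so it never fails. It is a last resort that always returns something, not a proof that the guess was right, which is why knowing the real encoding of the files is the better option.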

PS: I just discovered the built-in file attribute file.encoding, which should give you the proper encoding to pass to unicode(..., encoding).
PPS: Shouldn't you in fact decode the whole source before parsing it? (That is, parse a unicode object rather than encoded text.)
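
One possible way to do the "decode first" step is sketched below, assuming the file is known to be utf8. io.open (available since Python 2.6) decodes while reading, so everything downstream sees unicode text. The sketch covers only the decoding and a regex match, not the XML parsing itself. The helper name read_decoded and the sample file are invented for the example.

```python
import io
import os
import re
import tempfile

def read_decoded(path, encoding="utf8"):
    """Read a whole file and return its content as decoded (unicode) text."""
    with io.open(path, encoding=encoding) as handle:
        return handle.read()

# Write a small utf8 sample file so the sketch is self-contained.
fd, sample_path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "wb") as handle:
    handle.write(u"caf\xe9 au lait".encode("utf8"))

text = read_decoded(sample_path)
os.remove(sample_path)

# The regex works directly on the decoded text; no str() conversion is involved.
m = re.match(u"caf\xe9", text)
```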

Denis
________________________________

la vita e estrany

http://spir.wikidot.com/