[Tutor] Confusing Unicode Conversion Problem.

Wed Dec 13 10:32:07 CET 2006

[Chris Hengge]

| 'ascii' codec can't encode character u'\xa0' in position 11: 
| ordinal not in range(128)
| Error with: FRAMEMRISER  of type: <type 'unicode'>
| Excel Row : 6355

OK. Let's get to the basics first:

<code>
import unicodedata
print unicodedata.name (u'\xa0')
# outputs: NO-BREAK SPACE

</code>

So somewhere in your unicode string is a non-breaking
space, and "position 11" puts it right at the end, just
past the eleven letters. (Notice that extra space between
"FRAMEMRISER" and "of" in the message above.)
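
If you want to check where such characters sit in any
given value, something like the following would do. It's
only a sketch: find_non_ascii is a name I've made up, and
the sample value is my guess at what the cell holds, going
by the error message.

```python
import unicodedata

def find_non_ascii(text):
    """Return (position, name) pairs for every character outside ASCII."""
    return [
        (index, unicodedata.name(char, "UNKNOWN"))
        for index, char in enumerate(text)
        if ord(char) > 127
    ]

# Assumed cell contents, reconstructed from the error message above.
sample = u"FRAMEMRISER\xa0"
for position, name in find_non_ascii(sample):
    print("%d: %s" % (position, name))
# prints: 11: NO-BREAK SPACE
```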

Next, when you print to the screen, you're implicitly
using the sys.stdout encoding, which on my XP machine
is cp437:

<code>
import sys
print sys.stdout.encoding
# outputs: cp437

print u'\xa0'.encode (sys.stdout.encoding)
# outputs a blank line, presumably including a non-breaking space

</code>

But when you convert to a str using str (...) Python
will encode with its default encoding, which is ascii.
So let's try that:

<code>
print str (u'\xa0')
# sure enough: UnicodeEncodeError, blah, blah

</code>
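
Asking for the ascii codec explicitly fails in the same
way, and it also shows the ways out: pick an encoding which
can represent the character, or tell the codec what to do
with characters it can't handle. A small sketch using
nothing but the standard encode method; the sample string
is invented.

```python
text = u"FRAMEMRISER\xa0"

# Encoding to ascii explicitly fails just as the implicit conversion does.
try:
    text.encode("ascii")
except UnicodeEncodeError as error:
    print("failed at position %d" % error.start)

# utf-8 can represent every Unicode character, so this cannot fail.
as_utf8 = text.encode("utf-8")

# If the target really must be ascii, say what should happen to the rest.
replaced = text.encode("ascii", "replace")   # '?' in place of the character
dropped = text.encode("ascii", "ignore")     # character silently removed
```

Whether "replace" or "ignore" is acceptable depends entirely
on what the data is for; both lose information.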

In essence, when you're using Unicode data, you need
either to encode immediately to a consistent encoding of
your choice (or one forced upon you), or to retain
Unicode data throughout until you need to output to
screen, database or file, and convert only then.

Let's take your code (snipped a bit):

1             while xlSht.Cells(row,col).Value != None:
2                      tempValue = xlSht.Cells(row,col).Value
3                      tempString = str(tempValue).split('.')[0] 
4                      ExcelValues.append(tempString)
5                      row = 1 + row # Increment Rows.

It's not clear what ExcelValues is, but let's assume
it's a list of things you're going to output later
to a file. Your line 3 is doing an implicit conversion
when it doesn't look like it needs to. Have a look
at this trivial example:

<code>
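import codecs

# Stand-ins for the Excel cell values; note the non-breaking spaces.
fake_excel_data = [u"Stuff.0", u"\xa0and\xa0.1", u"nonsense.2"]
values = []

for data in fake_excel_data:
  # split works directly on unicode objects: no str () needed
  pre, post = data.split (".")
  values.append (pre)

#
# later...
#
# The encoding to utf-8 happens only here, as the data is written out.
f = codecs.open ("excel_values.txt", "w", "utf-8")
try:
  # writelines adds no line endings of its own, so supply them
  f.writelines ([value + u"\n" for value in values])
finally:
  f.close ()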

</code>

Notice I haven't done the encoding until I finally
output to a file, where I've used the codecs module
to specify an encoding. You could do this string by
string or some other way.
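
Doing it string by string might look something like this:
each value is encoded just before it's written to a file
opened in binary mode. Again only a sketch; the file
location and the one-value-per-line layout are my choices,
not anything your code requires.

```python
import os
import tempfile

values = [u"Stuff", u"\xa0and\xa0", u"nonsense"]

# Hypothetical output location, placed in the temp directory for the demo.
path = os.path.join(tempfile.gettempdir(), "excel_values.txt")

f = open(path, "wb")
try:
    for value in values:
        # The only place an encoding is mentioned: at the point of output.
        f.write(value.encode("utf-8") + b"\n")
finally:
    f.close()
```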

If I were simply writing back to, say, another
Excel sheet, or any other target which was expecting
Unicode data, I wouldn't encode it anywhere. The Unicode
objects offer nearly all the same methods as the
string objects so you just use them as you would strings.

What you have to look out for is situations like
your str () conversion where an implicit encoding-to-ascii
goes on.
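
For your line 3, one way round it is to format the value
into a Unicode template instead of calling str (). This
is a sketch which assumes (I haven't seen your sheet) that
numeric cells come back as floats and text cells as
Unicode; cell_to_text is a made-up name.

```python
def cell_to_text(value):
    """Return the cell value as Unicode text, cut at the first '.'."""
    # Formatting into a Unicode template keeps Unicode input as Unicode,
    # and turns numbers such as 6355.0 into u"6355.0".
    text = u"%s" % (value,)
    return text.split(u".")[0]

print(cell_to_text(6355.0))
# prints: 6355
```

The result is still Unicode, non-breaking space and all,
so any encoding is left until you write the values out.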

HTH
TJG
