[Tutor] Handling a Unicode Return using Pyodbc

Tue Nov 15 10:16:56 CET 2011

On 14/11/2011 21:43, Tony Pelletier wrote:
> Good Afternoon,
>
> I'm writing a program that is essentially connecting to MS SQL Server
> and dumping all the contents of the tables to separate csv's.  I'm
> almost complete, but now I'm running into a Unicode issue and I'm not
> sure how to resolve it.
>
> I have a ridiculous amount of tables but I managed to figure out it was
> my Contact and a contact named Robert Bock.  Here's what I caught.
>
> (127, None, u'Robert', None, u'B\xf6ck', 'uCompany Name', None, 1, 0,
> 327, 0)
>
> The u'B\xf6ck' is actually Böck.  Notice the ö
>
> My problem is I'm not really sure how to handle it and whether or not
> it's failing on the query or the insert to the csv.  The Exception is:
>
> 'ascii' codec can't encode character u'\xf6' in position 1: ordinal not
> in range(128)

Thanks for producing a thinned-down example. If I may take this at
face value, I assume you're doing something like this:

<code>
import csv

#
# Obviously from database, but for testing...
#
data = [
   (127, None, u'Robert', None, u'B\xf6ck', 'uCompany Name', None, 1, 0, 327, 0),
]

with open ("temp.csv", "wb") as f:
   writer = csv.writer (f)
   writer.writerows (data)

</code>

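which gives the error you describe. So the query itself is fine: the
row comes back from the database intact, and it's the write to the csv
that fails.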

In short, the csv module in Python 2.x is unicode-unaware (the 3.x
version works with unicode strings natively). You're passing it a
unicode object and it has no way of knowing which codec to use to
encode it. So it doesn't try to guess: it falls back to the default
codec (ascii), which can't represent ö, and fails.
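
To see that implicit step in isolation, here is a minimal sketch (not
part of your program) which encodes the same surname by hand. The
variable names are made up for illustration; the u'' literal is
accepted by Python 2 and by Python 3.3 and later.

```python
# Illustrative only: reproduce the implicit ascii encode by hand.
name = u'B\xf6ck'

try:
    name.encode("ascii")
    ascii_error = None
except UnicodeEncodeError as exc:
    # Same kind of failure as the one reported from the csv writer.
    ascii_error = exc

# An explicit codec that can represent the character succeeds.
utf8_bytes = name.encode("utf8")

print(ascii_error)
print(repr(utf8_bytes))
```

The first print should show essentially the message from your
traceback; the second shows the two bytes that utf8 uses for ö.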

And this is where it gets just a little bit messy. Depending on how much 
control you have over your data and how important the unicodeiness of it 
is, you need to encode things explicitly before they get to the csv module.

One (brute force) option is this:

<code>
def encoded (iterable_of_stuff):
   return tuple (
     (i.encode ("utf8") if isinstance (i, unicode) else i)
     for i in iterable_of_stuff
   )

#
# ... other code
#
writer.writerows ([encoded (row) for row in data])

</code>

This will encode anything unicode as utf8 and leave everything else 
untouched. It will slow down your csv generation, but that might well 
not matter (especially if you're basically IO-bound).
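
For comparison, here is a hedged sketch of how the same dump might look
on Python 3, where the csv module accepts text strings directly and the
encoding is chosen when the file is opened. It assumes Python 3, which
is not what your program is running on; the sample row and filename are
carried over from the test above, and utf-8 is an assumed choice of
encoding.

```python
import csv

# Sample row standing in for a database result (Python 3 text strings).
data = [
    (127, None, 'Robert', None, 'B\xf6ck', 'uCompany Name', None, 1, 0, 327, 0),
]

# Open in text mode, name the encoding explicitly, and pass newline=""
# so that the csv module controls the line endings itself.
with open("temp.csv", "w", encoding="utf-8", newline="") as f:
    writer = csv.writer(f)
    writer.writerows(data)

# Read the file back with the same encoding to check the round trip.
with open("temp.csv", "r", encoding="utf-8", newline="") as f:
    rows = list(csv.reader(f))

print(rows[0][4])
```

Note that everything comes back from the reader as a string, and the
None values come back as empty strings.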

TJG