csv module and unicode, when or workaround?

Sat Mar 12 17:41:18 EST 2005

Chris wrote:
> hi,
> thanks for all replies, I try if I can at least get the work done.
>
> I guess my problem mainly was the rather mindflexing (at least for me)
> coding/decoding of strings...
>
> But I guess it would be really helpful to put the UnicodeReader/Writer
> in the docs

UNFORTUNATELY the solution of saving the Excel .XLS to a .CSV doesn't
work if you have Unicode characters that are not in your Windows
code-page. Nor would it work in a CJK environment if the file was saved
in an MBCS encoding (e.g. Big5). A work-around appears possible, with
some more effort:

I have extended the previous sample XLS; there is now a last line with
IVANOV in Cyrillic letters [pardon my spelling etc etc if necessary].
My code-page is cp1252, which sure don't grok Russki :-)

I've saved it as CSV [no complaint from Excel] and as "Unicode text".

>>> buffc = file('csvtest2.csv', 'rb').read()
>>> buffc
'Name,Amount\r\nM\xfcller,"\x801234,56"\r\nM\xf6ller,"\x809876,54"\r\nKawasaki,\xa53456.78\r\n??????,"?5678,90"\r\n'
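
The '?' bytes in that dump are what you would expect when characters outside cp1252 are replaced on the way out. A minimal sketch in modern Python 3 (not the Python 2 used in this session), assuming Excel does something equivalent to the codec's errors='replace' mode, which is a guess rather than documented behaviour:

```python
# Python 3 sketch. errors='replace' is an ASSUMPTION about what Excel
# effectively does when saving as CSV under the cp1252 code-page.
names = ['M\u00fcller', '\u20ac1234,56', '\u00a53456.78',
         '\u0418\u0412\u0410\u041d\u041e\u0412']  # last one is IVANOV in Cyrillic

# Characters that cp1252 can represent survive; the rest become '?'.
encoded = [s.encode('cp1252', errors='replace') for s in names]

for s, b in zip(names, encoded):
    print(ascii(s), '->', b)
```

The umlaut, euro and yen signs all have cp1252 code points, so only the Cyrillic name is lost, and it is lost irreversibly.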

>>> buffu16 = file('csvtest2.txt', 'rb').read()
>>> buffu16
'\xff\xfeN\x00a\x00m\x00e\x00\t\x00A\x00m\x00o\x00u\x00n\x00t\x00\r\x00\n\x00 [snip] \x18\x04\x12\x04\x10\x04\x1d\x04\x1e\x04\x12\x04\t\x00"\x00 \x045\x006\x007\x008\x00,\x009\x000\x00"\x00\r\x00\n\x00'
>>> buffu = buffu16.decode('utf16')
>>> buffu
u'Name\tAmount\r\nM\xfcller\t"\u20ac1234,56"\r\nM\xf6ller\t"\u20ac9876,54"\r\nKawasaki\t\xa53456.78\r\n\u0418\u0412\u0410\u041d\u041e\u0412\t"\u04205678,90"\r\n'

Aside: this has removed the BOM. I understood (possibly incorrectly)
from a recent thread that Python codecs left the BOM in there, but hey
I'm not complaining :-)
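
On the BOM question, the difference seems to come down to which codec name is used: the generic utf16 codec reads the BOM to pick a byte order and drops it, while a codec with an explicit byte order treats it as an ordinary character. A small Python 3 sketch (the variable names are only for illustration):

```python
import codecs

# 'Name' as UTF-16 little-endian, preceded by the BOM bytes \xff\xfe.
raw = codecs.BOM_UTF16_LE + 'Name'.encode('utf-16-le')

with_generic = raw.decode('utf-16')      # BOM consumed to choose the byte order
with_explicit = raw.decode('utf-16-le')  # byte order given, BOM kept as U+FEFF

print(repr(with_generic))
print(repr(with_explicit))
```

So a leftover BOM is most likely to show up when the byte order is spelled out in the codec name.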

As expected, this looks OK. The extra step required in the work-around
is to convert the utf16 file to utf8 and feed that to the csv reader.
Why utf8? (1) Every Unicode character can be represented, not just the
ones that are in your code-page. (2) ASCII bytes can't appear as part
of the representation of any other character -- i.e. the ones that are
significant to csv (tab, comma, quote, \r, \n) can't cause errors by
showing up as part of another character, e.g. a CJK character.
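
Point (2) can be checked directly. The sketch below is Python 3; the helper ascii_bytes and the choice of sample character are illustrative, not anything from the csv module. It encodes one CJK character in Big5 and in utf8 and lists any bytes that fall in the ASCII range. In this particular case the stray Big5 byte is a backslash, which would matter if backslash were in use as the csv escapechar.

```python
def ascii_bytes(encoded):
    """Return the bytes in `encoded` that fall in the ASCII range (< 0x80)."""
    return [b for b in encoded if b < 0x80]

ch = '\u8a31'  # a common CJK character, chosen as an example

big5_ascii = ascii_bytes(ch.encode('big5'))   # trail byte is 0x5C, a backslash
utf8_ascii = ascii_bytes(ch.encode('utf-8'))  # every byte is >= 0x80

print('big5 :', ch.encode('big5'), big5_ascii)
print('utf-8:', ch.encode('utf-8'), utf8_ascii)

# Broader check over the Basic Multilingual Plane: no non-ASCII character
# produces an ASCII byte in UTF-8 (surrogates cannot be encoded, so skip them).
leaks = [cp for cp in range(0x80, 0x10000)
         if not 0xD800 <= cp <= 0xDFFF
         and ascii_bytes(chr(cp).encode('utf-8'))]
print('UTF-8 leaks:', leaks)
```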

>>> buffu8 = buffu.encode('utf8')
>>> buffu8
'Name\tAmount\r\nM\xc3\xbcller\t"\xe2\x82\xac1234,56"\r\nM\xc3\xb6ller\t"\xe2\x82\xac9876,54"\r\nKawasaki\t\xc2\xa53456.78\r\n\xd0\x98\xd0\x92\xd0\x90\xd0\x9d\xd0\x9e\xd0\x92\t"\xd0\xa05678,90"\r\n'
>>> x = file('csvtest2.u8', 'wb')
>>> x.write(buffu8)
>>> x.close()
>>> import csv
>>> rdr = csv.reader(file('csvtest2.u8', 'rb'), delimiter='\t')
>>> for row in rdr:
...     print row
...     print [x.decode('utf8') for x in row]
...
['Name', 'Amount']
[u'Name', u'Amount']
['M\xc3\xbcller', '\xe2\x82\xac1234,56']
[u'M\xfcller', u'\u20ac1234,56']
['M\xc3\xb6ller', '\xe2\x82\xac9876,54']
[u'M\xf6ller', u'\u20ac9876,54']
['Kawasaki', '\xc2\xa53456.78']
[u'Kawasaki', u'\xa53456.78']
['\xd0\x98\xd0\x92\xd0\x90\xd0\x9d\xd0\x9e\xd0\x92', '\xd0\xa05678,90']
[u'\u0418\u0412\u0410\u041d\u041e\u0412', u'\u04205678,90']
>>>
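
For comparison, here is a rough Python 3 equivalent of the session above. In Python 3 the csv module works on text rather than bytes, so the detour through utf8 should not be needed: the file can be opened with its real encoding and handed straight to the reader. The function name read_unicode_text and the temporary file are stand-ins for illustration, and the sample is a cut-down copy of the data above.

```python
import codecs
import csv
import os
import tempfile

def read_unicode_text(path):
    """Read a tab-delimited UTF-16 file (with BOM) into a list of rows."""
    # newline='' leaves \r\n handling to the csv module.
    with open(path, encoding='utf-16', newline='') as f:
        return list(csv.reader(f, delimiter='\t'))

# Stand-in for Excel's "Unicode text" output: UTF-16 LE with a BOM.
sample = ('Name\tAmount\r\n'
          'M\u00fcller\t"\u20ac1234,56"\r\n'
          'Kawasaki\t\u00a53456.78\r\n'
          '\u0418\u0412\u0410\u041d\u041e\u0412\t"\u04205678,90"\r\n')

path = os.path.join(tempfile.mkdtemp(), 'csvtest2.txt')
with open(path, 'wb') as f:
    f.write(codecs.BOM_UTF16_LE + sample.encode('utf-16-le'))

rows = read_unicode_text(path)
for row in rows:
    print(row)
```

Each field comes back as a str already, so there is no per-field decode step.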

Howzat?

Cheers,
John