csv module and unicode, when or workaround?

John Machin sjmachin at lexicon.net
Sat Mar 12 05:38:10 EST 2005


Chris wrote:
> hi,
> to convert excel files via csv to xml or whatever I frequently use the
> csv module which is really nice for quick scripts. problem are of course
> non ascii characters like german umlauts, EURO currency symbol etc.

The umlauted characters should not be a problem; they're all in the
first 256 Unicode code points (the Latin-1 range), so any 8-bit Western
European code page can represent them. What makes you say they are a
problem "of course"?
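A quick way to check which characters fall inside that range is to try
encoding them as Latin-1. This is only a sketch; fits_in_latin1 is a
made-up helper name, not something from the standard library:

```python
def fits_in_latin1(text):
    """Return True if every character in text is below U+0100."""
    try:
        text.encode('latin-1')
    except UnicodeEncodeError:
        return False
    return True


print(fits_in_latin1(u'M\xfcller'))   # u-umlaut is U+00FC
print(fits_in_latin1(u'\u20ac1234'))  # euro sign is U+20AC
```

The umlauts pass and the euro sign does not, which is why the euro is
the character that needs a code page such as cp1252 or full Unicode.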

> the current csv module cannot handle unicode the docs say, is there any
> workaround or is unicode support planned for the near future? in most
> cases support for characters in iso-8859-1(5) would be ok for my
> purposes but of course full unicode support would be great...

Here's a perambulation through some of the alternatives:

Alternative A: if you save the file from Excel as "Unicode text", you
can pretty much DIY:

>>> buff = file('csvtest.txt', 'rb').read()
>>> lines = buff.decode('utf16').split(u'\r\n')
>>> lines
[u'M\xfcller\t"\u20ac1234,56"', u'M\xf6ller\t"\u20ac9876,54"',
u'Kawasaki\t\xa53456.78', u'']
>>> for line in lines:
...     print line.split(u'\t')
...
[u'M\xfcller', u'"\u20ac1234,56"']
[u'M\xf6ller', u'"\u20ac9876,54"']
[u'Kawasaki', u'\xa53456.78']
[u'']
>>>

All you have to do is (1) strip Excel's unnecessary quoting of the
comma in the money amounts [see first two lines above; what it quotes
is probably locale-dependent] (2) undo the doubling of any embedded
quotes [no example given] (3) ignore the empty "line" introduced by
split().

Problem (3) is easy: if lines and not lines[-1]: del lines[-1]

Hmmm ... by the time you finish this (and generalise it) you will have
done the Unicode extension to the csv module ...
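For what it's worth, points (1) to (3) could be sketched roughly as
below. This assumes no field contains an embedded line break, and the
names split_fields and parse_unicode_text are invented for the example;
neither is part of the csv module:

```python
def split_fields(line, delimiter=u'\t', quote=u'"'):
    """Split one line into fields, honouring quotes and doubled quotes."""
    fields = []
    current = []
    in_quotes = False
    i = 0
    while i < len(line):
        ch = line[i]
        if in_quotes:
            if ch == quote:
                if line[i + 1:i + 2] == quote:
                    current.append(quote)  # doubled quote -> one literal quote
                    i += 1
                else:
                    in_quotes = False      # closing quote
            else:
                current.append(ch)         # delimiters are literal in here
        elif ch == quote:
            in_quotes = True               # opening quote
        elif ch == delimiter:
            fields.append(u''.join(current))
            current = []
        else:
            current.append(ch)
        i += 1
    fields.append(u''.join(current))
    return fields


def parse_unicode_text(data, encoding='utf16'):
    """Decode Excel "Unicode text" bytes and return a list of rows."""
    lines = data.decode(encoding).split(u'\r\n')
    if lines and not lines[-1]:
        del lines[-1]  # drop the empty trailing "line" left by split()
    return [split_fields(line) for line in lines]
```

Calling parse_unicode_text() on the bytes read from csvtest.txt should
give the rows with the quotes around the money amounts removed.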

Alternative B: you can do ODBC access to Excel spreadsheets; hmmm ...
yuk ... no better than CSV i.e. you get the data in your current code
page, not in Unicode:

[('M\xfcller', '\x801234,56'), ('M\xf6ller', '\x809876,54'),
('Kawasaki', '\xa53456.78')]

Alternative C: why not save your file as local-code-page .csv, use the
csv module, and DIY decode:

>>> rdr = csv.reader(file('csvtest.csv', 'rb'))
>>> for row in rdr:
...    print row
...    urow = [x.decode('cp1252') for x in row]
...    print urow
...
['Name', 'Amount']
[u'Name', u'Amount']
['M\xfcller', '\x801234,56']
[u'M\xfcller', u'\u20ac1234,56']
['M\xf6ller', '\x809876,54']
[u'M\xf6ller', u'\u20ac9876,54']
['Kawasaki', '\xa53456.78']
[u'Kawasaki', u'\xa53456.78']
>>>

Looks good to me, including the euro sign.
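The decode step in alternative C can be wrapped in a small generator.
A minimal sketch, assuming (as in the session above) that the reader
hands back rows of byte strings; decode_rows is a hypothetical name and
the raw_rows list merely stands in for the csv.reader object:

```python
def decode_rows(rows, encoding='cp1252'):
    """Yield each row with every byte-string field decoded to Unicode."""
    for row in rows:
        yield [field.decode(encoding) for field in row]


# Stand-in for the rows that csv.reader produced in the session above.
raw_rows = [
    [b'Name', b'Amount'],
    [b'M\xfcller', b'\x801234,56'],
    [b'Kawasaki', b'\xa53456.78'],
]

for urow in decode_rows(raw_rows):
    print(urow)
```

The encoding matters: cp1252 maps the byte 0x80 to the euro sign,
whereas decoding the same byte as latin-1 gives the control character
U+0080 instead.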

HTH,

John



