csv module and unicode, when or workaround?
John Machin
sjmachin at lexicon.net
Sat Mar 12 05:38:10 EST 2005
Chris wrote:
> hi,
> to convert excel files via csv to xml or whatever I frequently use the
> csv module which is really nice for quick scripts. problem are of course
> non ascii characters like german umlauts, EURO currency symbol etc.
The umlauted characters should not be a problem; they are all in the
first 256 Unicode code points, i.e. in Latin-1. What makes you say they
are a problem "of course"?
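To make the distinction concrete, here is a small sketch. It is written for present-day Python 3, which is an assumption on my part (the sessions in this post are Python 2), and the variable names are only illustrative. It shows that the umlauts fit in Latin-1, while the euro sign does not and needs cp1252 or a Unicode encoding:

```python
# Python 3 sketch: which of the characters in question fit in one byte?
# Escapes are used so the example does not depend on the source encoding.
umlauts = "\xe4\xf6\xfc\xc4\xd6\xdc\xdf"   # a-, o-, u-umlaut (both cases) and sharp s
euro = "\u20ac"

# Every one of these has a code point below 256, so Latin-1 can encode it.
all_below_256 = all(ord(ch) < 256 for ch in umlauts)
latin1_bytes = umlauts.encode("latin-1")

# The euro sign is U+20AC; Latin-1 has no slot for it ...
try:
    euro.encode("latin-1")
    euro_fits_latin1 = True
except UnicodeEncodeError:
    euro_fits_latin1 = False

# ... but Windows code page 1252 maps it to the single byte 0x80.
euro_cp1252 = euro.encode("cp1252")
```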
> the current csv module cannot handle unicode the docs say, is there any
> workaround or is unicode support planned for the near future? in most
> cases support for characters in iso-8859-1(5) would be ok for my
> purposes but of course full unicode support would be great...
>
Here's a perambulation through some of the alternatives:
Alternative A: if you save the file from Excel as "Unicode text", you
can pretty much DIY:
>>> buff = file('csvtest.txt', 'rb').read()
>>> lines = buff.decode('utf16').split(u'\r\n')
>>> lines
[u'M\xfcller\t"\u20ac1234,56"', u'M\xf6ller\t"\u20ac9876,54"',
u'Kawasaki\t\xa53456.78', u'']
>>> for line in lines:
... print line.split(u'\t')
...
[u'M\xfcller', u'"\u20ac1234,56"']
[u'M\xf6ller', u'"\u20ac9876,54"']
[u'Kawasaki', u'\xa53456.78']
[u'']
>>>
All you have to do is handle (1) Excel's unnecessary quoting of the
comma in the money amounts [see first two lines above; what it quotes
is probably locale-dependent], (2) its doubling of any quote characters
inside a quoted field [no example given], and (3) the empty "line"
introduced by split().
Problem (3) is easy: if lines and not lines[-1]: del lines[-1]
Hmmm ... by the time you finish this (and generalise it) you will have
done the Unicode extension to the csv module ...
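For what it's worth, here is a hedged sketch of that route in present-day Python 3, where the csv module accepts Unicode text directly. Python 3 is an assumption (the session above is Python 2), the function name parse_unicode_text is made up for illustration, and the sketch assumes the export is UTF-16 with a BOM, tab-delimited, with Excel-style quote doubling:

```python
import csv
import io

def parse_unicode_text(raw):
    """Parse the bytes of an Excel "Unicode text" export into rows of str.

    Hypothetical helper: assumes UTF-16 with a BOM and tab-separated fields.
    """
    text = raw.decode("utf-16")
    # newline="" hands the \r\n line endings to the csv module untouched.
    reader = csv.reader(io.StringIO(text, newline=""), dialect="excel-tab")
    # The reader undoes the quoting and the quote doubling; drop empty rows.
    return [row for row in reader if row]

# Sample data mirroring two of the rows shown above.
sample = 'M\u00fcller\t"\u20ac1234,56"\r\nKawasaki\t\u00a53456.78\r\n'.encode("utf-16")
rows = parse_unicode_text(sample)
```

The reader takes care of problems (1) and (2), and filtering out empty rows covers (3).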
Alternative B: you can do ODBC access to Excel spreadsheets; hmmm ...
yuk ... no better than CSV i.e. you get the data in your current code
page, not in Unicode:
[('M\xfcller', '\x801234,56'), ('M\xf6ller', '\x809876,54'),
('Kawasaki', '\xa53456.78')]
Alternative C: why not save your file as local-code-page .csv, use the
csv module, and DIY decode:
>>> import csv
>>> rdr = csv.reader(file('csvtest.csv', 'rb'))
>>> for row in rdr:
... print row
... urow = [x.decode('cp1252') for x in row]
... print urow
...
['Name', 'Amount']
[u'Name', u'Amount']
['M\xfcller', '\x801234,56']
[u'M\xfcller', u'\u20ac1234,56']
['M\xf6ller', '\x809876,54']
[u'M\xf6ller', u'\u20ac9876,54']
['Kawasaki', '\xa53456.78']
[u'Kawasaki', u'\xa53456.78']
>>>
Looks good to me, including the euro sign.
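The same idea can be sketched for present-day Python 3, where the decoding moves from each field to the file object. Python 3 is an assumption here (the session above is Python 2), read_cp1252_csv is a hypothetical helper name, and the sample bytes simply mirror the rows above:

```python
import csv
import io

def read_cp1252_csv(raw, encoding="cp1252"):
    """Decode a code-page CSV byte string and return its rows as lists of str."""
    # Wrap the bytes in a decoding text layer; newline="" is what the csv
    # module expects so that it can handle line endings itself.
    stream = io.TextIOWrapper(io.BytesIO(raw), encoding=encoding, newline="")
    return list(csv.reader(stream))

# 0x80 is the euro sign in cp1252, 0xfc is u-umlaut, 0xa5 is the yen sign.
sample = b'Name,Amount\r\nM\xfcller,"\x801234,56"\r\nKawasaki,\xa53456.78\r\n'
rows = read_cp1252_csv(sample)
```

With a real file, open('csvtest.csv', encoding='cp1252', newline='') plays the same role as the wrapper.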
HTH,
John