csv module and unicode, when or workaround?
sjmachin at lexicon.net
Sat Mar 12 11:38:10 CET 2005
> to convert excel files via csv to xml or whatever I frequently use
> csv module which is really nice for quick scripts. problem are of
> non ascii characters like german umlauts, EURO currency symbol etc.
The umlauted characters should not be a problem; they're all in the
first 256 characters of Unicode (Latin-1). What makes you say they are a problem "of
> the current csv module cannot handle unicode the docs say, is there
> workaround or is unicode support planned for the near future? in most
> cases support for characters in iso-8859-1(5) would be ok for my
> purposes but of course full unicode support would be great...
Here's a perambulation through some of the alternatives:
Alternative A: if you save the file from Excel as "Unicode text", you can pretty
>>> buff = file('csvtest.txt', 'rb').read()
>>> lines = buff.decode('utf16').split(u'\r\n')
>>> for line in lines:
...     print line.split(u'\t')
All you have to do is handle (1) Excel's unnecessary quoting of the
comma in the money amounts [see first two lines above; what it quotes
is probably locale-dependent], (2) its doubling of any embedded quotes
[no example given] and (3) the empty "line" introduced by split().
Problem (3) is easy: if lines and not lines[-1]: del lines[-1]
Hmmm ... by the time you finish this (and generalise it) you will have
done the Unicode extension to the csv module ...
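For comparison, here is a minimal sketch of Alternative A in present-day Python 3, where the csv module accepts text directly and can deal with the quoting itself. The function name read_excel_unicode_text is made up for illustration, and the sketch assumes the export is UTF-16 with a byte-order mark and tab-separated fields, as in the session above:

```python
import csv
import io


def read_excel_unicode_text(raw_bytes):
    """Parse the bytes of an Excel "Unicode text" export into rows of str.

    Hypothetical helper; assumes UTF-16 with a BOM and tab-separated fields.
    """
    # The 'utf-16' codec reads the byte-order mark and strips it.
    text = raw_bytes.decode('utf-16')
    # newline='' leaves the \r\n line endings for the csv module to handle.
    stream = io.StringIO(text, newline='')
    # 'excel-tab' splits on tabs and undoes quoting and doubled quotes.
    rows = csv.reader(stream, dialect='excel-tab')
    # csv.reader yields [] for a blank line; skip those.
    return [row for row in rows if row]


# Made-up sample data in the same shape as the export discussed above.
sample = 'M\u00fcller\t"\u20ac1234,56"\r\n'.encode('utf-16')
rows = read_excel_unicode_text(sample)
```

Letting the csv module do the splitting means points (1) and (2) need no hand-written code, and there is no trailing empty "line" to delete.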
Alternative B: you can do ODBC access to Excel spreadsheets; hmmm ...
yuk ... no better than CSV i.e. you get the data in your current code
page, not in Unicode:
[('M\xfcller', '\x801234,56'), ('M\xf6ller', '\x809876,54'),
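Byte strings like these can still be decoded by hand once you know the code page. The sketch below is written for Python 3 and assumes the code page is cp1252; decode_rows is a hypothetical helper, and the rows are just the two tuples visible above. It also shows why the umlauts are harmless while the euro sign is not: the umlaut bytes mean the same thing in Latin-1 and cp1252, but byte 0x80 is the euro sign only in cp1252.

```python
# The first two tuples from the ODBC result, as Python 3 bytes literals.
rows = [(b'M\xfcller', b'\x801234,56'), (b'M\xf6ller', b'\x809876,54')]


def decode_rows(byte_rows, encoding='cp1252'):
    """Decode every field of every row (hypothetical helper; cp1252 is assumed)."""
    return [tuple(field.decode(encoding) for field in row) for row in byte_rows]


decoded = decode_rows(rows)

# Umlauts decode to the same character under both encodings ...
same_umlaut = b'\xfc'.decode('latin-1') == b'\xfc'.decode('cp1252')
# ... but 0x80 is the euro sign in cp1252 and a C1 control character in Latin-1.
euro_cp1252 = b'\x80'.decode('cp1252')
euro_latin1 = b'\x80'.decode('latin-1')
```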
Alternative C: why not save your file as local-code-page .csv, use the
csv module, and DIY decode:
>>> import csv
>>> rdr = csv.reader(file('csvtest.csv', 'rb'))
>>> for row in rdr:
...     print row
...     urow = [x.decode('cp1252') for x in row]
...     print urow
Looks good to me, including the euro sign.
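In Python 3 the same idea gets shorter, because the decoding can be handed to the file object and the csv module then works on text. This is only a sketch: read_codepage_csv is a name invented here, and cp1252 is assumed to be the code page the file was saved in.

```python
import csv


def read_codepage_csv(path, encoding='cp1252'):
    """Read a code-page-encoded .csv file into rows of str (hypothetical helper)."""
    # The file object decodes the bytes; newline='' is what the csv docs
    # recommend so that the reader handles line endings itself.
    with open(path, encoding=encoding, newline='') as handle:
        return list(csv.reader(handle))
```

Called as read_codepage_csv('csvtest.csv'), it returns a list of rows whose fields are already decoded, so no per-row decode step is needed.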