getting data with proper encoding to the finish

John Machin sjmachin at lexicon.net
Wed Mar 16 16:21:42 EST 2005


Ksenia Marasanova wrote:
> John, Serge, thanks for your help!

Thank *you* for having interesting problems :-)

>
> utf-16le  encoding didn't help. I had however to solve it yesterday,
> so I used csv module to create CSV file and then import it in Excel.
> Excel still had troubles with accented characters, but this is another
> story: it seems that Office 2004 Excel (for Mac, but I assume the PC
> version is no better) cannot import UTF-8 encoded text files. Encoding
> CSV file with Latin1 encoding finally did work.

Yes, Excel appears not to understand UTF-8. It interprets CSV files
according to the current locale / codepage / whatever -- the "old bad
way" that Unicode is meant to save us from.
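
Just for the record, the plain-CSV route you took needs nothing fancier
than encoding each unicode field before it reaches the csv module -- a
rough sketch only (the filename and data are invented):

>>> import csv
>>> rows = [[u'caf\xe9', u'r\xe9sum\xe9']]   # invented sample data
>>> f = file('C:/junk/km_latin1.csv', 'wb')
>>> csv.writer(f).writerows(
...     [[field.encode('latin1') for field in row] for row in rows])
>>> f.close()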

An alternative, if you need to represent more than one codepage, or
want a "new good way" of doing it: Excel allows "Save As" to "Unicode
Text" format. This uses the tab character u'\t' as delimiter. Excel
quotes cells containing tabs, quotes cells containing quote characters
(doubling the embedded quotes), and [weirdly] also quotes cells which
contain merely a comma [maybe locale-dependent]. It quite happily
opens such files without data loss. You should be able to make such
files easily with Python.

Here's a dump of such a file created by Excel 2002 on Windows -- pls
pardon any Cyrillic spelling errors :-)

>>> file('C:/junk/km_u16.txt', 'rb').read().decode('utf16')
u'\u041c\u0430\u0440\u0430\u0441\u0430\u043d\u043e\u0432\u0430\t\u041a\u0441\u0435\u043d\u044f\r\n"comma, comma, comma"\t\r\n"quote ""Hello UniWorld"" unquote"\t\r\n"tab\ttab"\t\r\n'

>>> print file('C:/junk/km_u16.txt', 'rb').read().decode('utf16')
Марасанова	Ксеня

"comma, comma, comma"

"quote ""Hello UniWorld"" unquote"

"tab	tab"

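Going the other way, once you've decoded the UTF-16, the csv module
will undo the quoting for you; a rough sketch, round-tripping through
UTF-8 because the csv module in Python 2.x doesn't accept unicode
directly:

>>> import csv
>>> text = file('C:/junk/km_u16.txt', 'rb').read().decode('utf16')
>>> lines = text.encode('utf8').splitlines()
>>> rows = [[cell.decode('utf8') for cell in row]
...         for row in csv.reader(lines, delimiter='\t')]
>>> rows[0]
[u'\u041c\u0430\u0440\u0430\u0441\u0430\u043d\u043e\u0432\u0430', u'\u041a\u0441\u0435\u043d\u044f']
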
To make such a file, you would need a quoter function something like
this; you would apply it to each field:

>>> def unitextquoter(s, quote=u'"', alt_delim=u','):
...     if quote in s:
...         return quote + s.replace(quote, quote+quote) + quote
...     if alt_delim in s or u'\t' in s:
...         return quote + s + quote
...     return s

>>> unitextquoter(u'comma, comma, comma')
u'"comma, comma, comma"'
>>> unitextquoter(u'tab\ttab')
u'"tab\ttab"'
>>> unitextquoter(u'quote "Hello UniWorld" unquote')
u'"quote ""Hello UniWorld"" unquote"'
>>>

Then you would do u'\t'.join(fields), add on u'\r\n' [or whatever line
ending is needed in your environment], .encode('utf16') the result and
.write() it to your 'wb' file.
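
Putting those pieces together, a helper along these lines (the name
write_unitext and the output path are just my own inventions) would
produce a complete "Unicode Text"-style file. One small trap: every
separate .encode('utf16') call prepends a BOM, so build the whole text
first and encode it in one go:

>>> def write_unitext(filename, rows):
...     # join quoted fields with tabs, rows with CRLF, and encode the
...     # lot in one go so there's a single BOM at the start of the file
...     lines = [u'\t'.join([unitextquoter(f) for f in row]) for row in rows]
...     text = u'\r\n'.join(lines) + u'\r\n'
...     out = file(filename, 'wb')
...     out.write(text.encode('utf16'))
...     out.close()

>>> write_unitext('C:/junk/km_out.txt',
...               [[u'Hello, world', u'quote "me"', u'tab\there']])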

>
> Now back to the Excel story, I also think that there is something
> wrong with pyExcelWriter or the way I use it. CSV file generation was
> okay, so I think there is nothing wrong with my data,  or XML parser.
>
> I will resume in a few days with pyExcelWriter and will post the
> results here, but anyway, many thanks for your time and explanation!

I've been reading the source and looking at the Excel file format
specs [available from openoffice.org if you're very short of reading
material!]. Apparently pyXLWriter doesn't handle Unicode at all.
Although Unicode support arrived with Excel 97 (the BIFF8 file
format), pyXLWriter appears to support only Excel 5(?) (the BIFF5
format). As Serge suggested, appeal to the porter to appeal to the
author of the Perl module it's ported from; but don't hold your breath
in the meantime.

Cheers,
John



