[Tutor] name shortening in a csv module output
Steven D'Aprano
steve at pearwood.info
Fri Apr 24 02:09:54 CEST 2015
On Wed, Apr 22, 2015 at 10:18:31PM -0700, Jim Mooney wrote:
> My result:
>
> Ï»¿First Name Last Name # odd characters on header line
Any time you see "odd characters" in text like that, you should
immediately think "encoding problem".
These odd characters are normally called mojibake, a Japanese term, or
sometimes gremlins. Mojibake occurs when your text file was written in
one encoding but then read back using another, e.g.:
* file was written using a Windows code page, but read back on a
different PC using a different code page (say, Greek then Russian);
* file was written on a classic Macintosh (pre-OS X), then read on a
DOS or Windows machine;
* file was written on a mainframe that uses ASCII and transferred to
a mainframe that uses EBCDIC;
* file was written using an editor that defaults to one encoding,
and read back using an editor that defaults to a different encoding;
* HTML web page was saved using one encoding, but not declared (or
declared wrongly) and the browser defaults to a different encoding.
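To make the mechanism concrete, here is a minimal sketch of the "Greek then Russian" case from the list above. The sample word and the variable names are my own, chosen only for illustration; cp1253 and cp1251 are the Windows Greek and Cyrillic code pages as named in Python's standard codecs.

```python
# Illustrative only: write text with one code page, read it back with another.
text = "Ελληνικά"                # sample Greek word (an assumption for the demo)

raw = text.encode("cp1253")      # bytes as a Greek Windows machine would store them
garbled = raw.decode("cp1251")   # the same bytes misread as Cyrillic: mojibake

print(garbled)                   # Cyrillic letters that mean nothing

# The bytes themselves are undamaged; decoding with the right
# encoding recovers the original text.
recovered = raw.decode("cp1253")
print(recovered == text)
```

The point is that nothing in the bytes says which code page produced them, so the reader has to know or guess.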
Sadly, this problem is hard to solve because text files can't, in
general, contain metadata that tells you how the text is stored. We're
reduced to conventions, hints, and guesses. Given that, the fact that
there is so little mojibake in the world compared to how much there
could be is astonishing.
In your case, those specific three characters Ï»¿ found at the start of
a text file indicate that the text file was saved by Notepad on
Windows. For no good reason (although I'm sure it seemed like a good
idea at the time, if they were smoking crack at the time), Notepad
sometimes puts a UTF-8 signature at the start of text files. Actually
Notepad tries to be way too clever but isn't clever enough:
http://en.wikipedia.org/wiki/Bush_hid_the_facts
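This UTF-8 signature is sometimes called a "Byte Order Mark", or BOM,
which is a misnomer, since UTF-8 has no byte order to mark. The solution
is to read the file using the utf-8-sig encoding instead of utf-8, which
strips the signature if it is present.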
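Here is a rough sketch of the difference, using an in-memory sample rather than your actual file. The CSV contents are made up for the demo; codecs.BOM_UTF8, the utf-8-sig codec and csv.reader are all in the standard library.

```python
import codecs
import csv
import io

# Simulate a CSV file saved with a UTF-8 signature (hypothetical sample data).
raw = codecs.BOM_UTF8 + "First Name,Last Name\r\nAda,Lovelace\r\n".encode("utf-8")

# Read with the wrong 8-bit encoding, the three signature bytes
# show up as three odd characters at the start of the header.
print(repr(raw[:3].decode("latin-1")))

# Plain utf-8 decodes the signature to U+FEFF, which stays glued
# to the first header field.
header_plain = next(csv.reader(io.StringIO(raw.decode("utf-8"), newline="")))
print(repr(header_plain[0]))

# utf-8-sig drops the signature, so the header comes out clean.
header_sig = next(csv.reader(io.StringIO(raw.decode("utf-8-sig"), newline="")))
print(header_sig)
```

With a real file the equivalent would be something like open("data.csv", newline="", encoding="utf-8-sig"), where data.csv is a placeholder name. The utf-8-sig codec also reads files that have no signature, so it is a reasonable default for input that might have passed through Notepad.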
--
Steve