[Tutor] name shortening in a csv module output

Steven D'Aprano steve at pearwood.info
Fri Apr 24 02:09:54 CEST 2015


On Wed, Apr 22, 2015 at 10:18:31PM -0700, Jim Mooney wrote:

> My result:
> 
> Ï»¿First Name        Last Name           # odd characters on header line

Any time you see "odd characters" in text like that, you should 
immediately think "encoding problem".

These odd characters are normally called mojibake, a Japanese term, or 
sometimes gremlins. Mojibake occurs when your text file was written in 
one encoding but then read back using another, e.g.:

* file was written using a Windows code page, but read back on a 
  different PC using a different code page (say, Greek then Russian);

* file was written on a classic Macintosh (pre-OS X), then read on a 
  DOS or Windows machine;

* file was written on a mainframe that uses ASCII and transferred to
  a mainframe that uses EBCDIC;

* file was written using an editor that defaults to one encoding, 
  and read back using an editor that defaults to a different encoding;

* HTML web page was saved using one encoding, but not declared (or
  declared wrongly) and the browser defaults to a different encoding.
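
To make the mechanism concrete, here is a minimal sketch in Python 3.
The sample word and the pair of encodings are my own illustration, not
anything taken from your file: the same bytes come out as different
characters depending on which encoding is used to read them back.

```python
# Illustrative only: the sample text and encodings are assumptions.
text = "café"

# Write the text out as UTF-8 bytes...
data = text.encode("utf-8")

# ...then read the very same bytes back using two different encodings.
wrong = data.decode("latin-1")  # mismatched encoding gives mojibake
right = data.decode("utf-8")    # matching encoding round-trips cleanly

print(repr(wrong))
print(repr(right))
```

The bytes never change; only the interpretation does, which is why
the fix is to read with the right encoding, not to edit the file.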


Sadly, this problem is hard to solve because text files can't, in 
general, contain metadata that tells you how the text is stored. We're 
reduced to conventions, hints, and guesses. Given that, the fact that 
there is so little mojibake in the world compared to how much there 
could be is astonishing.


In your case, those three characters Ï»¿ at the start of a text file 
indicate that the file was most likely saved by Notepad on 
Windows. For no good reason (although I'm sure it seemed like a good 
idea at the time, if they were smoking crack at the time), Notepad 
sometimes puts a UTF-8 signature at the start of text files. Actually 
Notepad tries to be way too clever but isn't clever enough:

http://en.wikipedia.org/wiki/Bush_hid_the_facts

This UTF-8 signature is sometimes called a "Byte Order Mark", or BOM, 
which is a misnomer, since UTF-8 has no byte order to mark. The 
solution is to read the file using the utf-8-sig encoding instead of 
utf-8, which strips the signature if it is present.
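
As a rough sketch of the difference, assuming Python 3 and some
made-up file contents (the bytes below and the helper read_header are
hypothetical, not your actual data or code), compare decoding the same
bytes with utf-8 and with utf-8-sig:

```python
import csv
import io

# Hypothetical file contents: a UTF-8 signature, a header line, one row.
raw = b"\xef\xbb\xbfFirst Name,Last Name\r\nJim,Mooney\r\n"

def read_header(data, encoding):
    """Decode the bytes with the given encoding; return the CSV header row."""
    text = io.StringIO(data.decode(encoding), newline="")
    return next(csv.reader(text))

plain = read_header(raw, "utf-8")      # signature survives as U+FEFF
clean = read_header(raw, "utf-8-sig")  # signature is stripped

# Reading the signature bytes with a Windows code page instead yields
# three junk characters at the start of the first header.
mojibake = raw[:3].decode("cp1252")

print(ascii(plain[0]))
print(ascii(clean[0]))
print(len(mojibake))
```

When reading from disk, the same encoding name can be passed to
open(), along the lines of open(path, encoding="utf-8-sig", newline="")
where path is whatever your file is called, before handing the file to
csv.reader. The utf-8-sig codec also reads files that have no
signature, so it is safe to use either way.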



-- 
Steve

