[Tutor] name shortening in a csv module output
Steven D'Aprano
steve at pearwood.info
Fri Apr 24 12:08:00 CEST 2015
The quoting seems to be all mangled here, so please excuse me if I
misattribute quotes to the wrong person:
On Thu, Apr 23, 2015 at 04:15:39PM -0700, Jim Mooney wrote:
> So is there any way to sniff the encoding, including the BOM (which appears
> to be used or not used randomly for utf-8), so you can then use the proper
> encoding, or do you wander in the wilderness?
>
> Pretty much guesswork.
There is no foolproof way to guess encodings, since you might happen to
have text which *looks* like it starts with a BOM but is actually not.
E.g. if you happen to have Latin-1 text which starts with the characters
þÿ, then the file starts with the bytes FE FF, which is exactly the
UTF-16 big-endian BOM, so it may be misidentified as UTF-16 even though
the author intended plain Latin-1.
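A small illustration of that ambiguity, using only the standard codecs
(the byte values are the point here):

```python
# The bytes FE FF are the UTF-16 big-endian BOM, but they are also
# the Latin-1 encoding of the two characters 'þÿ'.
data = 'þÿAB'.encode('latin-1')
assert data == b'\xfe\xffAB'

# Read as Latin-1, we get the text the author intended:
assert data.decode('latin-1') == 'þÿAB'

# But a BOM-sniffer sees FE FF, strips it, and decodes the rest as
# big-endian UTF-16, turning 'AB' into the single character U+4142:
assert data.decode('utf-16') == '\u4142'
```

Both decodings succeed without error, so nothing but context can tell
you which one the author meant.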
This is no different from any other file format. All files are made from
bytes, and there is no such thing as "JPEG bytes" and "TXT bytes" and
"MP3 bytes", they're all the same bytes. In principle, you could have a
file which was a valid JPEG, ZIP and WAV file *at the same time*
(although not likely by chance). And if not those specific three, then
pick some other combination.
*But*, while it is true that in principle you cannot guess the encoding
of files, in practice you often can guess quite successfully. Checking
for a BOM is easy. For example, you could use these two functions:
def guess_encoding_from_bom(filename, default='undefined'):
    # Read enough bytes to cover the longest BOM (four bytes).
    with open(filename, 'rb') as f:
        sig = f.read(4)
    return bom2enc(sig, default)

def bom2enc(bom, default=None):
    # Untested.
    if bom.startswith((b'\x00\x00\xFE\xFF', b'\xFF\xFE\x00\x00')):
        return 'utf_32'
    elif bom.startswith((b'\xFE\xFF', b'\xFF\xFE')):
        return 'utf_16'
    elif bom.startswith(b'\xEF\xBB\xBF'):
        return 'utf_8_sig'
    elif (bom.startswith(b'\x2B\x2F\x76') and len(bom) == 4
            and bom[3] in b'\x2B\x2F\x38\x39'):
        # bom[3] is the fourth byte of the signature.
        return 'utf_7'
    elif bom.startswith(b'\xF7\x64\x4C'):
        return 'utf_1'
    elif default is None:
        raise ValueError('no BOM found')
    else:
        return default
If you have a BOM, chances are very good that the text is encoded
correctly. If not, you can try decoding, and if it fails, try again:
def read_guessing_encoding(filename):
    for encoding in ('utf-8', 'utf-16le', 'utf-16be'):
        try:
            with open(filename, 'r', encoding=encoding) as f:
                return f.read()
        except UnicodeDecodeError:
            pass
    raise UnicodeError("I give up!")
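The same fallback idea works on in-memory bytes. Here is a sketch (the
helper name try_decode is mine, not something from the standard
library):

```python
def try_decode(data, encodings=('utf-8', 'utf-16le', 'utf-16be')):
    """Return (encoding, text) for the first candidate encoding that
    decodes data without error, or raise UnicodeError if none do."""
    for encoding in encodings:
        try:
            return encoding, data.decode(encoding)
        except UnicodeDecodeError:
            pass
    raise UnicodeError("no candidate encoding could decode the data")

# 'λ' in UTF-16LE is the bytes BB 03; the lone continuation byte BB is
# invalid at the start of a UTF-8 sequence, so UTF-8 is rejected and
# the UTF-16LE fallback succeeds:
assert try_decode('λ'.encode('utf-16le')) == ('utf-16le', 'λ')

# Pure ASCII bytes are valid UTF-8, so the first candidate wins:
assert try_decode(b'hello') == ('utf-8', 'hello')
```

Note that a successful decode only proves the bytes are *valid* in
that encoding, not that it is the right one. Latin-1, for example,
accepts any byte sequence at all, so it should never appear early in
the candidate list.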
You can use chardet to guess the encoding. That's a Python port of the
algorithm used by Firefox to guess the encoding of webpages when the
declaration is missing:
https://pypi.python.org/pypi/chardet
chardet works by statistically analysing the characters in the text and
tries to pick an encoding that minimizes the number of gremlins.
Or you can try fixing errors after they occur:
http://blog.luminoso.com/2012/08/20/fix-unicode-mistakes-with-python/
> This all sounds suspiciously like the old browser wars I suffered while
> webmastering. I'd almost think Microsoft had a hand in it ;')
Ha! In a way they did, because Microsoft Windows code pages are legacy
encodings. But given that there were no alternatives back in the 1980s,
I can't blame Microsoft for the mess in the past. (Besides, MS are one
of the major sponsors of Unicode, so they are helping to fix the problem
too.)
> If utf-8 can
> handle a million characters, why isn't it just the standard? I doubt we'd
> need more unless the Martians land.
UTF-8 is an encoding, not a character set. The character set tells us
what characters we can use:
D ñ Ƕ λ Ъ ᛓ ᾩ ‰ ℕ ℜ ↣ ⊂
are all legal in Unicode, but not in ASCII, Latin-1, ISO-8859-7, BIG-5,
or any of the dozens of other character sets.
The encoding tells us how to turn a character like A or λ into one or
more bytes to be stored on disk, or transmitted over a network. In the
non-Unicode legacy encodings, we often equate the character set with the
encoding (or codec), e.g. we say that ASCII A *is* byte 65 (decimal) or
41 (hex). But that's sloppy language. The ASCII character set includes
the character A (but not λ) while the encoding tells us character A gets
stored as byte 41 (hex).
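In Python terms, that distinction looks like this (a quick
illustration, nothing beyond the standard codecs):

```python
# The character set says which characters exist; the encoding says
# which bytes represent them.  'A' is U+0041, 'λ' is U+03BB.
assert 'A'.encode('ascii') == b'\x41'         # one byte, 41 hex
assert 'λ'.encode('utf-8') == b'\xce\xbb'     # two bytes in UTF-8
assert 'λ'.encode('utf-16le') == b'\xbb\x03'  # different bytes in UTF-16LE

# 'λ' is simply not in the ASCII character set, so no ASCII
# encoding of it exists:
try:
    'λ'.encode('ascii')
except UnicodeEncodeError:
    pass
```

The same character comes out as different bytes under different
encodings, which is exactly why "character set" and "encoding" are
separate ideas.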
To the best of my knowledge, Unicode is the first character set which
allows for more than one encoding.
Anyway, Unicode is *the* single most common character set these days,
more common than even ASCII and Latin-1. About 85% of webpages use
UTF-8.
> Since I am reading opinions that the BOM doesn't even belong in utf-8, can
> I assume just using utf-8-sig as my default encoding, even on a non-BOM
> file, would do no harm?
Apparently so. The utf-8-sig codec strips the signature if it is
present, and decodes as plain UTF-8 whether the signature is there or
not.
That surprises me.
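It is easy to check at the interactive prompt (standard codecs only):

```python
# Decoding with utf-8-sig strips a leading BOM if one is present...
assert b'\xef\xbb\xbfhello'.decode('utf-8-sig') == 'hello'
# ...and behaves exactly like plain UTF-8 when it is absent:
assert b'hello'.decode('utf-8-sig') == 'hello'
# Encoding with utf-8-sig always writes the BOM:
assert 'hello'.encode('utf-8-sig') == b'\xef\xbb\xbfhello'
# Plain utf-8 decoding instead keeps the BOM as the character U+FEFF:
assert b'\xef\xbb\xbfhello'.decode('utf-8') == '\ufeffhello'
```

So for *reading*, utf-8-sig is a safe default for UTF-8 data with or
without a signature; the asymmetry is only on the encoding side.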
--
Steve