[Tutor] name shortening in a csv module output

Steven D'Aprano steve at pearwood.info
Fri Apr 24 12:08:00 CEST 2015


The quoting seems to be all mangled here, so please excuse me if I 
misattribute quotes to the wrong person:

On Thu, Apr 23, 2015 at 04:15:39PM -0700, Jim Mooney wrote:

> So is there any way to sniff the encoding, including the BOM (which appears
> to be used or not used randomly for utf-8), so you can then use the proper
> encoding, or do you wander in the wilderness?
> 
> Pretty much guesswork.

There is no foolproof way to guess encodings, since you might happen to 
have text which *looks* like it starts with a BOM but is actually not. 
E.g. if you happen to have text which starts with þÿ then it might be 
identified as a BOM even though the author intended it to actually be þÿ 
in the Latin-1 encoding.
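
To make that concrete, here is a quick check at the interactive prompt 
(BOM_UTF16_BE comes from the codecs module in the standard library):

>>> import codecs
>>> codecs.BOM_UTF16_BE            # the UTF-16 big-endian BOM
b'\xfe\xff'
>>> b'\xfe\xff'.decode('latin-1')  # the very same bytes read as Latin-1
'þÿ'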

This is no different from any other file format. All files are made from 
bytes, and there is no such thing as "JPEG bytes" and "TXT bytes" and 
"MP3 bytes", they're all the same bytes. In principle, you could have a 
file which was a valid JPG, ZIP and WAV file *at the same time* 
(although not likely by chance). And if not those specific three, then 
pick some other combination.

*But*, while it is true that in principle you cannot be certain of a 
file's encoding, in practice you can often guess quite successfully. Checking 
for a BOM is easy. For example, you could use these two functions:


def guess_encoding_from_bom(filename, default='undefined'):
    with open(filename, 'rb') as f:
        sig = f.read(4)
    return bom2enc(sig, default)

# Untested.
def bom2enc(bom, default=None):
    if bom.startswith((b'\x00\x00\xFE\xFF', b'\xFF\xFE\x00\x00')):
        return 'utf_32'
    elif bom.startswith((b'\xFE\xFF', b'\xFF\xFE')):
        return 'utf_16'
    elif bom.startswith(b'\xEF\xBB\xBF'):
        return 'utf_8_sig'
    elif bom.startswith(b'\x2B\x2F\x76'):
        # The UTF-7 BOM is '+/v' followed by one of '8', '9', '+' or '/'.
        if len(bom) == 4 and bom[3] in b'\x2B\x2F\x38\x39':
            return 'utf_7'
    elif bom.startswith(b'\xF7\x64\x4C'):
        return 'utf_1'
    if default is None:
        raise ValueError('no BOM found and no default given')
    return default
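
You might then combine the two something like this ('data.csv' is just 
a placeholder file name):

enc = guess_encoding_from_bom('data.csv', default='utf-8')
with open('data.csv', 'r', encoding=enc) as f:
    text = f.read()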


If you have a BOM, chances are very good that the text is encoded 
correctly. If not, you can try decoding with a likely encoding and, if 
that fails, fall back to the next:

def read_any(filename):
    for encoding in ('utf-8', 'utf-16le', 'utf-16be'):
        try:
            with open(filename, 'r', encoding=encoding) as f:
                return f.read()
        except UnicodeDecodeError:
            pass
    raise UnicodeError("I give up!")


You can use chardet to guess the encoding. That's a Python port of the 
algorithm used by Firefox to guess the encoding of webpages when the 
declaration is missing:

https://pypi.python.org/pypi/chardet

chardet works by statistically analysing the characters in the text and 
trying to pick an encoding that minimizes the number of gremlins.
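
The API is tiny; something like this should work (the exact keys in the 
result dict may vary between chardet versions):

import chardet

with open('filename', 'rb') as f:
    raw = f.read()
guess = chardet.detect(raw)   # e.g. {'encoding': 'utf-8', 'confidence': 0.99, ...}
# guess['encoding'] can be None if chardet has no idea.
text = raw.decode(guess['encoding'])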

Or you can try fixing errors after they occur:

http://blog.luminoso.com/2012/08/20/fix-unicode-mistakes-with-python/


> This all sounds suspiciously like the old browser wars I suffered while
> webmastering. I'd almost think Microsoft had a hand in it ;')  

Ha! In a way they did, because Microsoft Windows code pages are legacy 
encodings. But given that there were no alternatives back in the 1980s, 
I can't blame Microsoft for the mess in the past. (Besides, MS are one 
of the major sponsors of Unicode, so they are helping to fix the problem 
too.)

> If utf-8 can
> handle a million characters, why isn't it just the standard? I doubt we'd
> need more unless the Martians land.

UTF-8 is an encoding, not a character set. The character set tells us 
what characters we can use:

D ñ Ƕ λ Ъ ᛓ ᾩ ‰ ℕ ℜ ↣ ⊂

are all legal in Unicode, but not in ASCII, Latin-1, ISO-8859-7, BIG-5, 
or any of the dozens of other character sets.

The encoding tells us how to turn a character like A or λ into one or 
more bytes to be stored on disk, or transmitted over a network. In the 
non-Unicode legacy encodings, we often equate the character set with the 
encoding (or codec), e.g. we say that ASCII A *is* byte 65 (decimal) or 
41 (hex). But that's sloppy language. The ASCII character set includes 
the character A (but not λ) while the encoding tells us character A gets 
stored as byte 41 (hex).
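
For example, at the Python prompt:

>>> 'A'.encode('ascii')
b'A'
>>> 'λ'.encode('utf-8')
b'\xce\xbb'
>>> 'λ'.encode('utf-16-le')
b'\xbb\x03'
>>> 'λ'.encode('ascii')
Traceback (most recent call last):
  ...
UnicodeEncodeError: 'ascii' codec can't encode character '\u03bb' in position 0: ordinal not in range(128)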

To the best of my knowledge, Unicode is the first character set which 
allows for more than one encoding.

Anyway, Unicode is *the* single most common character set these days, 
more common than even ASCII and Latin-1. About 85% of webpages use 
UTF-8.


> Since I am reading opinions that the BOM doesn't even belong in utf-8, can
> I assume just using utf-8-sig as my default encoding, even on a non-BOM
> file, would do no harm?

Apparently so. It looks like utf_8_sig just ignores the sig if it is 
present, and uses UTF-8 whether the signature is present or not.
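
A quick test at the interactive prompt seems to bear that out:

>>> b'\xef\xbb\xbfspam'.decode('utf_8_sig')   # BOM present: stripped
'spam'
>>> b'spam'.decode('utf_8_sig')               # no BOM: plain UTF-8
'spam'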

That surprises me.


-- 
Steve

