[Tutor] Codec lookup, was Re: name shortening in a csv module output

Sat Apr 25 10:21:11 CEST 2015

Steven D'Aprano wrote:

> On Fri, Apr 24, 2015 at 04:34:19PM -0700, Jim Mooney wrote:
> 
>> I was looking things up and although there are aliases for utf_8 (utf8
>> and utf-8) I see no aliases for utf_8_sig, so I'm surprised the utf-8-sig
>> I tried using, worked at all. Actually, I was trying to find the file
>> where the aliases are so I could change it and have utf_8_sig called up
>> when I used utf8, but it appears to be hard-coded.
> 
> I believe that Python's codecs system automatically normalises the
> encoding name by removing spaces, dashes and underscores, but I'm afraid
> that either I don't understand how it works or it is buggy:
> 
> py> 'Hello'.encode('utf___---___  -- ___8')  # Works.
> b'Hello'
> 
> py> 'Hello'.encode('ut-f8')  # Fails.
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
> LookupError: unknown encoding: ut-f8

I don't think this is a bug.

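Normalization of the name converts it to lowercase and collapses arbitrary
runs of non-alphanumeric characters into a single "_". Your first example
therefore becomes "utf_8", which is found, while "ut-f8" becomes "ut_f8",
which matches nothing.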
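
If you want to look at the normalization step in isolation, something like
the following should do. It is only a rough sketch: normalized() is a
made-up helper name, and the explicit .lower() is my stand-in for the
lowercasing, which I assume happens elsewhere in the lookup machinery rather
than inside encodings.normalize_encoding() itself.

```python
import encodings


def normalized(name):
    # Hypothetical helper, not part of the stdlib.
    # normalize_encoding() collapses runs of non-alphanumeric characters
    # into a single "_"; .lower() approximates the lowercasing that the
    # codec lookup applies (an assumption on my side).
    return encodings.normalize_encoding(name).lower()


for name in ("utf___---___  -- ___8", "ut-f8"):
    print(repr(name), "->", repr(normalized(name)))
```

The first name should come out as 'utf_8' and the second as 'ut_f8'.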

The lookup that follows maps "utf8" to "utf_8" via a table:

>>> import encodings.aliases
>>> [n for n, v in encodings.aliases.aliases.items() if v == "utf_8"]
['utf8_ucs2', 'utf8', 'u8', 'utf', 'utf8_ucs4']
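
To check a single name against that table, and to see what the full lookup
makes of it, a small sketch along these lines may help. The names
alias_table and resolves() are mine, purely for illustration;
codecs.lookup() and the aliases dict are the real thing.

```python
import codecs
import encodings.aliases

# The alias table maps normalized names to codec module names.
alias_table = encodings.aliases.aliases

print(alias_table.get("utf8"))    # alias for the utf_8 module
print(alias_table.get("ut_f8"))   # no such alias, so None


def resolves(name):
    # Hypothetical helper: True if the codec machinery can find `name`.
    try:
        codecs.lookup(name)
    except LookupError:
        return False
    return True


print(codecs.lookup("u8").name)   # name of the codec that "u8" resolves to
print(resolves("ut-f8"))          # the failing name from above
```

Even "u8" goes through the full lookup and ends up at the UTF-8 codec, while
"ut-f8" finds neither an alias nor a codec module.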

Hm, who the heck uses "u8"? I'd rather go with

>>> encodings.aliases.aliases["steven_s_preferred_encoding"] = "utf_8"
>>> "Hello".encode("--- Steven's preferred encoding ---")
b'Hello'

;)