[Python-ideas] Change magic strings to enums

Wed Apr 25 06:21:48 EDT 2018

On Wed, Apr 25, 2018 at 10:06:56AM +0200, Jacco van Dorp wrote:

> Perhaps the string encode/decode would be a better case, tho. Is it
> latin 1 or latin-1 ? utf-8 or UTF-8 ? 

py> 'abc'.encode('latin 1') == 'abc'.encode('LATIN-1')
True

py> 'abc'.encode('utf8') == 'abc'.encode('UTF 8') == 'abc'.encode('UtF_8')
True

Encoding names are normalised before being used.
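
One way to see that normalisation for yourself is to ask the codec
registry which canonical name each spelling resolves to. This is only a
small sketch using codecs.lookup(); the helper name canonical_name is
made up for illustration and is not part of the stdlib:

```python
import codecs


def canonical_name(spelling):
    """Return the registry's canonical name for any accepted spelling.

    Hypothetical helper for illustration; it just wraps codecs.lookup().
    """
    return codecs.lookup(spelling).name


# Every spelling of the same codec collapses to a single canonical name.
utf8_names = {canonical_name(s) for s in ("utf8", "UTF 8", "UtF_8", "utf-8")}
latin1_names = {canonical_name(s) for s in ("latin 1", "LATIN-1", "latin1")}

print(utf8_names)    # {'utf-8'}
print(latin1_names)  # {'iso8859-1'}
```

So the question of "latin 1 or latin-1" has no wrong answer as far as
the interpreter is concerned.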

> They might be fast to look up if
> you know where to look (probably the top result of googling "python
> string encoding utf 8", and it's the second and first option
> respectively IIRC. But I shouldn't -have- to recall correctly), but
> it's still a lot faster if you can type "Encoding.U" and it gives you
> the option.

If you did this with Encoding.ISO you would get a couple of dozen 
possibilities. 

ISO-8859-1
ISO-8859-7
ISO-8859-14
ISO-8859-15

etc, just to pick a few at random. How do you know which one you want?

In general, there's not much of a *practical* use-case for code 
completion on encodings, aside from exploratory mucking about in 
the interactive interpreter.

There are too many codecs (multiple dozen), the names are too similar 
and not self-explanatory, and they can have aliases. It would be like 
doing code-completion on an object and getting a couple of dozen methods 
looking like 

  method1245    method1246    method1247    method2390    method2395

Besides, aside from UTF-16, UTF-8 and ASCII, we shouldn't encourage 
the use of most codecs except for legacy data. And when working with 
legacy data, we really need to know ahead of time what the encoding 
is, and declare it as a constant or an application option.

(Or, worst case, we've used chardet or another encoding guesser, and 
stored the name of the encoding in a variable.)
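
As a rough sketch of what "constant or application option" might look
like in practice: the names DEFAULT_ENCODING, encoding_name and
build_parser below are invented for this example, and the --encoding
flag is just one possible spelling. Only argparse and codecs.lookup()
are real stdlib APIs here.

```python
import argparse
import codecs

# Declared once, as a constant (hypothetical name).
DEFAULT_ENCODING = "utf-8"


def encoding_name(value):
    """argparse type: accept any spelling the codec registry recognises."""
    try:
        return codecs.lookup(value).name
    except LookupError:
        raise argparse.ArgumentTypeError(f"unknown encoding: {value!r}")


def build_parser():
    """Build a parser exposing the encoding as an application option."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--encoding", type=encoding_name,
                        default=DEFAULT_ENCODING)
    return parser


args = build_parser().parse_args(["--encoding", "Latin 1"])
print(args.encoding)
```

Either way, the encoding is typed exactly once, and a misspelled name is
rejected at startup rather than halfway through the data.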

I don't really see a big advantage to completing on encodings, aside 
from laziness. And while laziness is a virtue in programmers, that only 
goes so far before it becomes silly. Having to type

    import encodings
    enc <tab> .Enc <tab> .u <tab> arrow arrow arrow arrow arrow arrow enter

(19 key presses, plus the import) to avoid having to type

    'utf8'

(six keypresses) is not what I would call efficient use of programmer 
time and effort.

(Why so many arrows? Since you'll have to tab past at least

    utf16
    utf16be
    utf16le
    utf32
    utf32be
    utf32le
    utf7

before you get to utf8.)

But the biggest problem is that the codec names aren't currently 
available for introspection anywhere. You can register new codecs, but 
there's no API for querying the list of currently registered codecs or 
their aliases. I think that problem would need to be solved first, in 
which case code completion will then be either easy, or irrelevant.

(I'd be perfectly satisfied with an API I could call from the 
interactive interpreter.)
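
In the meantime, a partial workaround is to walk the stdlib encodings
package and its alias table. This is a sketch only, and it rests on an
assumption worth stating loudly: it sees just the codecs shipped as
modules in the encodings package, and nothing added at runtime through
codecs.register(). The function names stdlib_codec_names and aliases_for
are invented for illustration.

```python
import codecs
import encodings
import encodings.aliases
import pkgutil


def stdlib_codec_names():
    """Return canonical names of codecs found in the encodings package.

    Workaround only: codecs added via codecs.register() are invisible here.
    """
    names = set()
    for module in pkgutil.iter_modules(encodings.__path__):
        try:
            names.add(codecs.lookup(module.name).name)
        except LookupError:
            # Not a codec (e.g. the 'aliases' helper module), or not
            # available on this platform (e.g. 'mbcs' outside Windows).
            pass
    return sorted(names)


def aliases_for(encoding):
    """Return the stdlib aliases that map to the same codec module."""
    table = encodings.aliases.aliases
    target = encodings.normalize_encoding(encoding).lower()
    target = table.get(target, target)
    return sorted(alias for alias, mod in table.items() if mod == target)


print(len(stdlib_codec_names()))
print(aliases_for("latin 1"))
```

That is good enough for poking around at the interactive prompt, but it
is not a substitute for a real query API on the registry itself.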

-- 
Steve