[Python-ideas] dictionary constructor should not allow duplicate keys

Steven D'Aprano steve at pearwood.info
Wed May 4 08:08:39 EDT 2016

On Tue, May 03, 2016 at 05:46:33PM -0700, Luigi Semenzato wrote:

> > Should duplicate keys be a SyntaxError at compile time, or a TypeError
> > at runtime? Or something else?
> Is there such a thing as a SyntaxWarning? 

Yes there is.

> From my perspective it
> would be fine to make it a SyntaxError, but I am not sure it would be
> overall a good choice for legacy code (i.e. as an error it might break
> old code, and I don't know how many other things a new language
> specification is going to break).
> It could also be a run-time error, but it might be nicer to detect it
> earlier.  Maybe both.

There are serious limits to what the compiler can detect at 
compile-time. So unless you have your heart set on a compile-time 
SyntaxError (or Warning) it might be best to forget all about 
compile-time detection, ignore the question of "literals", and just 
focus on run-time TypeError if a duplicate key is detected.

Otherwise, you will have the situation where Python only detects *some* 
duplicate keys, and your colleague will be cursing that Python does such 
a poor job of detecting duplicates. And it will probably be 
implementation-dependent, e.g. if your Python compiler does constant 
folding it might detect {1+1: None, 2: None} as duplicates, but if it 
doesn't have constant folding (or you have turned it off), it won't.
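For what it's worth, you can see whether your interpreter folds that
constant with the `dis` module. This is only a sketch of the situation
on current CPython; the disassembly output varies by version:

```python
import dis

# On CPython the peephole optimizer folds 1+1 into the constant 2 at
# compile time, so both keys of this literal are indistinguishable to
# any compile-time duplicate check.
code = compile("{1+1: None, 2: None}", "<demo>", "eval")
dis.dis(code)  # the disassembly shows only the folded constant

# At run time, either way, only one key survives.
assert eval(code) == {2: None}
```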

So with implementation-dependent compile-time checks, you can't even 
guarantee what will be detected. In that case, you might as well use a 
linter and leave the checking outside the language.
As far as I am concerned, a compile-time check is next-to-useless. It 
will be more annoying than useful, since it will give people a false 
sense of security, while still missing duplicates.

So I intend to only discuss a run-time solution, which has the advantage 
that Python will detect a much wider range of duplicates: not just:

    {"SPAM": None, "SPAM": None}

for example, but also:

    {"SPAM": None, ("spa".upper() + "AM"[1:]): None}

But now we're getting out of the realm of "detect obvious typos and 
duplicates" and into a more judgemental area. Once you start prohibiting 
complex expressions or function calls that happen to return duplicate 
keys, you can no longer be *quite* so sure that this is an error.

Maybe the programmer has some reason for allowing duplicates. Not 
necessarily a *good* reason, but people write all sorts of ugly, bad, 
fragile, silly code without the compiler giving them an error. 
"Consenting adults" applies, and Python generally allows us to shoot 
ourselves in the foot.

Let's say I write something like this:

with open(path) as f:
    d = {
         f.write(a): 'alpha',
         f.write(b): 'beta',
         f.write(c): 'gamma',
         f.write(d): 'delta',
         }
and purely by my bad luck, it turns out that len(b) and len(c) are 
equal, so that there are two duplicate keys. Personally, I think this is 
silly code, but there's no rule against silly code in Python, and maybe 
I have a (good/bad/stupid) reason for writing this. Maybe I don't care 
that duplicate keys are over-written.
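The collision mechanism is easy to see with an in-memory file: write() 
returns the number of characters written, so two strings of equal 
length produce equal keys. (The strings below are placeholders of my 
own, not the a/b/c/d from the example above.)

```python
import io

# write() returns the number of characters written, so strings of
# equal length yield equal dict keys.
f = io.StringIO()
keys = [f.write(s) for s in ("a", "beta", "gamm", "delta")]
print(keys)  # the two length-4 strings collide: [1, 4, 4, 5]
```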

If we introduce your rule, that's the same as saying "this code is 
so awful, that we have to ban it... but only *sometimes*, when the 
lengths happen to be equal, the rest of the time it's fine".

Are you okay with saying that? (Not a rhetorical question, and you are 
allowed to say "Yes".)

So which is worse?

- most of the time the code works fine, but sometimes it fails, 
  raising an exception and leaving the file in a half-written state; or

- the code always works fine, except that sometimes a duplicate key
  over-writes the previous value, which I may not even care about.
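The second behaviour is, of course, exactly what Python does today: the 
last value for a duplicate key silently wins.

```python
# Under current semantics the later value replaces the earlier one,
# with no warning of any kind.
d = {"SPAM": "first", "SPAM": "second"}
print(d)  # {'SPAM': 'second'}
```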

I don't know which is worse. If there is no clear answer that is 
obviously right, then the status quo wins, even if the status quo is 
less than perfect.

Even if the status quo is *awful*, it may be that all the alternatives 
are just as bad.

I think a compile-time check is just enough to give people a false sense 
of security, and so is *worse* than what we have now. And I'm right on 
the fence regarding a run-time check. 

So my judgement is: with no clear winner, the status quo stays.

> > What counts as "duplicate keys"?  I presume that you mean that two keys
> > count as duplicate if they hash to the same value, and are equal. But
> > you keep mentioning "literals" -- does this mean you care more about
> > whether they look the same rather than are the same?
> Correct.  The errors that I am guessing matter the most are those for
> which folks copy-paste a key-value pair, where the key is a literal
> string, intending to change the key, and then forget to change it.

With respect, I think that is a harmful position to take. That leaves 
the job half-done: the constructor will complain about *some* 
duplicates, but not all, and worse, which ones it detects may depend on 
the implementation you happen to be using!

If it is worth complaining about

    {0: None, 0: None}

then it must also be worth complaining about:

    {0: None, 0.0: None, int(): None, 1-1: None, len(""): None}

etc. Otherwise, I guarantee that somebody will be pulling their hair 
out as to why Python only sometimes detects duplicate keys. Better to 
never detect them at all (so you know you have to test for duplicates 
yourself) than to give a false sense of security.
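All five of those expressions are equal and hash equal, so dict already 
treats them as one key; a quick check:

```python
keys = [0, 0.0, int(), 1 - 1, len("")]

# Equal objects must hash equal, so every one of these is "the same
# key" as far as a dict is concerned.
assert all(k == 0 and hash(k) == hash(0) for k in keys)

d = {0: None, 0.0: None, int(): None, 1 - 1: None, len(""): None}
assert len(d) == 1
```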

