[Python-ideas] dictionary constructor should not allow duplicate keys

Wed May 4 11:27:14 EDT 2016

On 5/3/2016 9:09 PM, Steven D'Aprano wrote:
> On Mon, May 02, 2016 at 02:36:35PM -0700, Luigi Semenzato wrote:
>
>> The original problem description:
>>
>> lives_in = { 'lion': ['Africa', 'America'],
>>              'parrot': ['Europe'],
>>              #... 100+ more rows here
>>              'lion': ['Europe'],
>>              #... 100+ more rows here
>>            }
>>
>> The above constructor overwrites the first 'lion' entry silently,
>> often causing unexpected behavior.
> Did your colleague really have 200+ items in the dict? No matter, I 
> suppose. The same principle applies.
>
> When you have significant amount of data in a dict (or any other data 
> structure, such as a list, tree, whatever), the programmer has to take 
> responsibility for the data validation. Not the compiler. Out of all the 
> possible errors, why is "duplicate key" so special? Your colleague could 
> have caused unexpected behaviour in many ways:

I often use large literal dicts with literal string keys.  A couple of
times a month I add a duplicate key, because I am prone to making
mistakes like that.  I am lucky that I purchased an IDE that highlights
those duplicates as errors.  If I had to rely on a separate linter, with
my duplicate keys hiding in a long list of other trivial formatting
complaints, the time cost of reviewing the linting output on every code
change would be greater than the time cost of simply debugging the
program when it inevitably malfunctions.  Instead, I would use data
validation in my own code.
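
For anyone without such an IDE, a check that reports only duplicate
keys (and none of the formatting noise) can be sketched with the
standard `ast` module.  This is a rough illustration, not a real
linter: `find_duplicate_keys` is a hypothetical name, it assumes
Python 3.8+ (where the parser produces `ast.Constant` nodes), and it
only looks at literal keys such as strings and numbers.

```python
import ast

def find_duplicate_keys(source):
    """Return (line, key) pairs for repeated literal keys in dict displays."""
    duplicates = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Dict):
            continue
        seen = set()
        for key in node.keys:
            # key is None for ** unpacking; non-literal keys are skipped too.
            if isinstance(key, ast.Constant):
                if key.value in seen:
                    duplicates.append((key.lineno, key.value))
                seen.add(key.value)
    return duplicates

source = (
    "lives_in = {'lion': ['Africa', 'America'],\n"
    "            'parrot': ['Europe'],\n"
    "            'lion': ['Europe']}\n"
)
print(find_duplicate_keys(source))
```

Each reported pair gives the line of the second occurrence of the key,
which is the one that silently wins in the dict literal.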

> The data validation need not be a big burden. In my own code, unless the 
> dict is so small that I can easily see that it is correct with my own 
> eyes, I always follow it with an assertion:
>
> assert len(lives_in) == 250
>

I would not use this check; it is error prone, given the number of times
I update the dicts during development.  Either I will lose count, or
the line math will be wrong (because of extra spaces), or I will count
the duplicate.  A better consistency check requires that I construct all
literal dicts like:

def add_item(dict_, key, value):
    # Refuse to overwrite an existing entry, so a duplicate fails loudly.
    if key in dict_:
        raise ValueError("duplicate key: %r" % (key,))
    dict_[key] = value

lives_in = {}
add_item(lives_in, 'lion', ['Africa', 'America'])
add_item(lives_in, 'parrot', ['Europe'])
# ... 100+ more rows here
add_item(lives_in, 'lion', ['Europe'])  # duplicate key: raises here
# ... 100+ more rows here

This is inelegant, but necessary to minimize the programmer time wasted
trying to detect these duplicate keys manually.
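
A variant of the same idea keeps the rows together as data, assuming
they can be written as a list of (key, value) pairs.  The helper name
`unique_dict` is hypothetical, not something from the standard library,
and this is only a minimal sketch:

```python
def unique_dict(pairs):
    """Build a dict from (key, value) pairs, rejecting repeated keys."""
    result = {}
    for key, value in pairs:
        if key in result:
            raise ValueError("duplicate key: %r" % (key,))
        result[key] = value
    return result

lives_in = unique_dict([
    ('lion', ['Africa', 'America']),
    ('parrot', ['Europe']),
    # ... more rows here; a second 'lion' row would raise ValueError
])
```

The check still runs at import time rather than compile time, but the
table reads almost like the original literal.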