[ python-Bugs-1460340 ] random.sample can raise KeyError

Sat Apr 1 02:41:55 CEST 2006

Bugs item #1460340, was opened at 2006-03-28 19:05
Message generated for change (Comment added) made by tim_one
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=1460340&group_id=5470

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: Python Library
Group: Python 2.3
>Status: Closed
>Resolution: Fixed
Priority: 5
Submitted By: paul rubin (phr)
>Assigned to: Nobody/Anonymous (nobody)
Summary: random.sample can raise KeyError

Initial Comment:
I have only tested this in 2.3 and the relevant code in
random.py has changed in the current svn branch, but
from inspection it looks to me like the bug may still
be there.

If you initialize a dictionary as follows:

 a={}.fromkeys(range(10)+range(10,100,2)+range(100,110))

then

 random.sample(a,3)

raises KeyError most times that you call it.

----------------------------------------------------------------------

>Comment By: Tim Peters (tim_one)
Date: 2006-03-31 19:41

Message:
Logged In: YES 
user_id=31435

Closing again; dicts work "by brute force" now, on trunk and
2.4 branch.

----------------------------------------------------------------------

Comment By: Tim Peters (tim_one)
Date: 2006-03-31 19:05

Message:
Logged In: YES 
user_id=31435

It's as easy to "make dicts work" as it is to forbid them,
and since the test suite has implied for 30 months that
dicts do work (by virtue of testing them with sample()), I
wouldn't be comfortable making them raise an exception
without going through the deprecation business.  Fine by me
if P3K drops sample(dict) support, though.

WRT dicts, sampling from .values() doesn't make sense to me,
but sampling from .keys() does.  "for x in dict" iterates
over the keys, and sample() clearly intends to try to
generalize the documented "sequence" to "iterable" (albeit
that it's not entirely successful, it's still useful, as
especially in the case of sets).

Online algorithms aren't attractive here because they take
expected-case O(len(sequence)) time even for a sample size
of 1.  The current algorithms take O(k) time, and k (unlike
the population) is almost always small.  I suppose some
yahoo may be doing sample(seq, len(seq)) as an idiotic way
to spell shuffle(seq), but that's just natural selection in
action ;-)

----------------------------------------------------------------------

Comment By: Raymond Hettinger (rhettinger)
Date: 2006-03-31 07:29

Message:
Logged In: YES 
user_id=80475

It should be documented. Was noting that it (and dicts) 
currently are not.  That gives us the freedom to change it.

----------------------------------------------------------------------

Comment By: paul rubin (phr)
Date: 2006-03-31 06:42

Message:
Logged In: YES 
user_id=72053

This notion of supporting sets without documenting it also
seems screwy.  I defer to you guys whether sets or dicts
need to be supported, but if either is supported to the
extent that someone is going to bother maintaining the code
and test suite for it, IMO the support should be documented.  

----------------------------------------------------------------------

Comment By: Raymond Hettinger (rhettinger)
Date: 2006-03-31 06:37

Message:
Logged In: YES 
user_id=80475

Sampling dict.values is screwy.  If they want to sample a 
dictionary, they can be explicit about what they want by 
using d.keys() or d.values().

I prefer to completely forgo dict support
and stick with sequences (as documented)
and sets (supported but not documented).

----------------------------------------------------------------------

Comment By: paul rubin (phr)
Date: 2006-03-31 05:30

Message:
Logged In: YES 
user_id=72053

It can handle dicts easily enough by sampling from
population.values() instead of list(population), if it knows
it's dealing with a dict.  On clpy I suggested adding
s.values() for sets so that the values in dicts and sets
(and multisets if those get added) could be handled
uniformly, but  reception hasn't been good.  

I see there's an operator.isSequenceType function, that
presumably checks for built-in sequence types, not arbitrary
classes with __getattr__ methods that make them act like
sequences.  But maybe it's close enough to what we want,
that the docs can say that random.sample is limited to
sequences, sets, and (maybe) dicts.  If it were up to me, 
I'd check for a .values() operation on non-sequences and use
it if available, or raise TypeError otherwise.  Then I'd add
that operation to sets.

There is also a very cute algorithm for doing random
sampling from sequences of unknown length (iterators).  I'm
sure Tim and Ray know about it.  Maybe I'll try coding it. 
The k=1 case (basically random.choice) looks like:
  for n,x in enumerate(seq):
    if random()*(n+1) < 1: c=x
  return [c]

----------------------------------------------------------------------

Comment By: Raymond Hettinger (rhettinger)
Date: 2006-03-31 04:54

Message:
Logged In: YES 
user_id=80475

It probably should not support dicts at all and should 
make some effort to detect them and raise a TypeError.

----------------------------------------------------------------------

Comment By: Tim Peters (tim_one)
Date: 2006-03-30 22:55

Message:
Logged In: YES 
user_id=31435

Assigned to me.  The current situation is unacceptable,
because internal code comments and the test suite were left
implying that random.sample() "should work" for dicts -- but
it doesn't, and the failure modes are both subtle and silent.

Note that my first example was utterly vanilla, using a
small dict with a contiguous range of integer keys.  That's
not asking sample() to crack a safe, it's asking it to
borrow candy from a dead baby ;-)

I don't care a lot about the second example, but it would
would also work right if dicts were forced into sample()'s
first internal algorithm (and potential optimization be
damned in the case of a dict).

----------------------------------------------------------------------

Comment By: Raymond Hettinger (rhettinger)
Date: 2006-03-30 05:18

Message:
Logged In: YES 
user_id=80475

Bah.  I'm not overly concerned.  It is a Python fact of 
life that objects defining __getitem__ cannot aways be 
clearly categorized as being either a sequence or a 
mapping but not both.  You can add some additional checks 
like checking for a keys() method, but there is a limit to 
what you can do against these safe-cracking style efforts 
to trick the routine.  I hope this quest for theoretical 
perfection doesn't lead to throwing the baby out with the 
bathwater.  It would be ashamed to lose the automated 
choice of the best performing algorithm.  If that happens, 
somebody's real-world use cases are certain to suffer.

----------------------------------------------------------------------

Comment By: Tim Peters (tim_one)
Date: 2006-03-29 08:02

Message:
Logged In: YES 
user_id=31435

Ya, this is flaky for dicts; re-opening.  For example,

>>> d = dict((i, complex(i, i)) for i in xrange(30))
>>> random.sample(d, 5)  # ask for 5 and it samples values
[(25+25j), (4+4j), (16+16j), (13+13j), (17+17j)]
>>> random.sample(d, 6)  # ask for 6 and it samples keys
[20, 11, 9, 4, 14, 1]

That one doesn't have to do with internal exceptions, it has
only to do with which of sample()'s internal algorithms gets
used.

Paul's point about potential bias is also a worry.  Here's a
full example:

"""
from random import sample
d = dict.fromkeys(range(24))
d['x'] = 'y'
count = hits = 0
while 1:
    count += 1
    hits += sample(d, 1) == ['x']
    if count % 10000 == 0:
        print count, "%.2f%%" % (100.0 * hits / count)
"""

Since len(d) == 25, I'd hope to see 'x' selected 1/25 = 4%
of the time.  Instead it gets selected 0.16% of the time
(1/25**2 -- and Paul's analysis of why is on target).

----------------------------------------------------------------------

Comment By: paul rubin (phr)
Date: 2006-03-29 05:07

Message:
Logged In: YES 
user_id=72053

Actually the previous comment is wrong too; 99% of the time,
sample(a,1) will return None since that's the value
connected to every key in the dictionary, i.e. it's
population[j] for every j.  The other 1% of the time, the
dict gets converted to a list, and the sample returns a key
from the dict rather than a value, which is certainly wrong.
 And you can see how the probabilities are still messed up
if the values in the dict are distinct.

I think it's ok to give up on dicts, but some warning should
about it be added to the manual unless dict-like things
somehow get detected in the code.  It would be best to test
for the object really being a sequence, but I don't know if
such a test exists.  Maybe one should be added.

I'll leave it to you guys to reopen this bug if appropriate.

----------------------------------------------------------------------

Comment By: paul rubin (phr)
Date: 2006-03-29 04:46

Message:
Logged In: YES 
user_id=72053

I don't think the fix at 43421 is quite right, but I can't
easily test it in my current setup.  Suppose

a = dict.fromkeys(range(99) + ['x'])
b = random.sample(a,1)

99% of the time, there's no KeyError and b gets set to [j]
where j is some random integer.

1% of the time, there's a KeyError, random.sample is called
recursively, and the recursive call returns [some integer j]
99% of the time, and returns ['x'] 1% of the time.

So in total, ['x'] gets returned .01% of the time instead of
1% of the time.

I think it's better to not set result[i]=population[j]
inside the loop.  Instead, just build up the selected set
until it has enough indices; then try to make a result list
using those indices, and if there's a KeyError, convert the
population to a list and use the same selection set to make
the results.

gbrandl also correctly points out that a dict is not a
sequence type, so maybe it's ok to just punt on dicts.  But
it's obvious from the code comments that somebody once
wanted dicts to work, and it's reasonable for people to want
sets to work.

----------------------------------------------------------------------

Comment By: Raymond Hettinger (rhettinger)
Date: 2006-03-29 04:17

Message:
Logged In: YES 
user_id=80475

Checked-in a fix.
See revision 43420 and 43421.

----------------------------------------------------------------------

Comment By: Georg Brandl (gbrandl)
Date: 2006-03-29 04:11

Message:
Logged In: YES 
user_id=849994

random.sample is documented to take a sequence as its first
argument. A dict is not a sequence type.

I think you want
random.sample(a.keys(), 3)
or, for large dicts,
random.sample(a.iterkeys(), 3)

----------------------------------------------------------------------

You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=1460340&group_id=5470