[Python-ideas] Fwd: [Python Ideas] Random weighted choice idea to add to random lib
Steven Basart
xksteven at uchicago.edu
Wed Mar 23 16:43:13 EDT 2016
Hello
I'm going to change tack from my example code and talk about the suggestions
I would make to the issue I mentioned, because, as you said, my version
accumulates floating point errors and is a bit simplistic. Maybe
it is better to be simple, but not at the expense of gross inefficiency
(an O(n) search where O(log n) is possible).
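For reference, the sublinear search I mean can be sketched with the standard
library's bisect over a list of cumulative weights (an illustrative sketch,
not the patch's actual code):

```python
import random
from bisect import bisect
from itertools import accumulate

def weighted_choice(weights):
    """Return an index in range(len(weights)), chosen with probability
    proportional to weights[i].  Building the cumulative list is O(n);
    each search is O(log n)."""
    if not weights:
        raise ValueError("weights is empty")
    cum = list(accumulate(weights))              # running totals
    return bisect(cum, random.random() * cum[-1])
```

The O(n) setup is the part worth caching between calls; the bisect itself is
the cheap part.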
So here's how I'd change his code from the issue here:
http://bugs.python.org/issue18844
the code on here:
http://bugs.python.org/review/18844/patch/12674/46425
I've marked lines I would insert with a plus and lines I would remove
with a minus at the beginning of the line.
First, only expose the weighted_choice function and leave the generator
as the backend. Keep it simple. So from here on I'd rename
weighted_choice_generator to _weighted_choice_generator:
- "SystemRandom", "weighted_choice_generator", "weighted_choice"]
+ "SystemRandom", "weighted_choice"]
Second, we could store the previous data to avoid having to recreate
the binary tree:
def weighted_choice(self, data):
+    # We could store a copy of the data to avoid remaking the tree, but
+    # that takes O(n) space; storing a hash of the data instead costs
+    # only O(1) extra space per instance, in exchange for a sublinear
+    # search on repeated calls with the same weights.
+    if getattr(self, "hashval", None) != hash(tuple(data)):
+        self.hashval = hash(tuple(data))
+        self.gen = self._weighted_choice_generator(data)
+    return next(self.gen)
+# The other approach is to do away with the function above and simply
+# rename the function below as weighted_choice; even that would remake
+# the tree on every call, but would still find the value efficiently.

def _weighted_choice_generator(self, data):
I dislike having to add two new functions to the random lib API
when I think only one is really required. Ultimately, I think we could
do away with the top function and simply keep the generator function,
but call it weighted_choice.
On Wed, Mar 23, 2016 at 12:11 PM, Andrew Barnert <abarnert at yahoo.com> wrote:
> On Mar 23, 2016, at 09:43, Steven Basart <xksteven at uchicago.edu> wrote:
>>
>> Hello Andrew,
>>
>>> On Wed, Mar 23, 2016 at 11:13 AM, Andrew Barnert <abarnert at yahoo.com> wrote:
>>>
>>> On Mar 23, 2016, at 07:51, Steven Basart <xksteven at uchicago.edu> wrote:
>>>
>>> Hello,
>>>
>>> I had an idea that has in some form been proposed before. I think a random weighted choice function would be a very good addition to the random module in the python standard library.
>>>
>>> Here's a basic implementation of what's required albeit slightly inefficient as the search is linear as opposed to creating a binary tree and taking log n time to find the correct choice. pmf is short for probability mass function. It doesn't handle all error cases but it is more of an example to show the idea.
>>>
>>>
>>> def randweighted(self, pmf=list):
>>>     """Return an int in the range [0,len(list)] weighted by list of values.
>>>     Raises error if list is empty.
>>>     """
>>>     if(not pmf):
>>>         raise ValueError("pmf is equal to empty list")
>>>     cummassf = [sum(pmf[:i]) for i in pmf]
>>>     randVal = self._randbelow(len(pmf))
>>>     for x in cummassf:
>>>         if randVal < x:
>>>             return x
>>>
>>>
>>> Why a default value of list, especially since that's just going to raise a TypeError later on?
>>>
>>> Why the (unpythonic) test for "pmf == []" instead of just testing "if not pmf:"? Then it'll work on, say, tuples or third-party sorted lists or any other sequence. (With a tiny change, you can make it work on any iterable, but I don't know whether that's desirable.)
>>>
>>> The x[:i] is accessing an unbound local named x. What's it supposed to be?
>> It was supposed to be
>>
>> cummassf = [sum(pmf[:i]) for i in pmf]
>>
>> Sorry about that I updated the above example accordingly.
>>
>>> Whatever x is, each element of cummassf is going to be some kind of sequence, so randVal < x on each element is going to raise a TypeError. Maybe you wanted a sum somewhere?
>> Yes this is correct and I updated how it was supposed to be above.
>>> If you fix that, doesn't this accumulate floating point errors, so there's a significant chance the result won't add up to 1.0 even if all of the weights are correct to 1 ULP?
>>>
>>> If the weights _aren't_ correct, surely you'd want a ValueError or something. This code will just return None if the sum is too low, and silently choose from the first 1.0 if it's too high. (I think np.random.choice raises ValueError if not isclose(sum(p), 1.0), but I'd have to check.)
>> I think to fix this we could do one of a few options:
>>
>> 1) Normalize the inputs
>> total = 0
>> total += sum(pmf)
>
> Why not just total = sum(pmf)?
>
> Also, this isn't normalizing. Normalizing means scaling by the appropriate factor.
>
>> check if total is close to 1.0
>> then
>> ... (same as code above)
>>
>> finally set cummassf[-1] = 1.0
>>
>> 2) Scale the random variable by the total amount. In this case there
>> is no need to take into account the floating point error.
>
> There very definitely is a need to take into account the floating point error! Adding an extra operation (a division) doesn't make the accumulated error go away, it makes it worse.
>
> I believe PEP 450 (for the statistics module) has a good discussion of accumulating errors in running sums, and how to deal with it.
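To illustrate the running-sum problem being pointed at here: math.fsum from
the standard library tracks partial sums without rounding error, whereas a
naive sum accumulates it (a minimal demonstration, not what PEP 450
ultimately specified for the statistics module):

```python
import math

weights = [0.1] * 10            # ten weights that "should" total exactly 1.0
naive = sum(weights)            # accumulates binary rounding error
exact = math.fsum(weights)      # error-free floating point summation

print(naive == 1.0, exact == 1.0)   # False True
```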
>
> Meanwhile, from a very quick survey, it looks like many implementations allow integer weights (which you get directly out of Counter) and do exact rather than inexact math in those cases. That seems like a good idea.
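With integer weights (e.g. the counts straight out of a Counter) the
cumulative sums are exact, so no tolerance checking is needed at all. A
sketch, with random.randrange standing in for the Random method's internal
_randbelow:

```python
import random
from bisect import bisect
from collections import Counter
from itertools import accumulate

def weighted_choice_counter(counter):
    """Choose a key from a Counter with probability proportional to its
    (integer) count.  All of the arithmetic here is exact."""
    keys = list(counter)
    cum = list(accumulate(counter[k] for k in keys))  # exact integer sums
    return keys[bisect(cum, random.randrange(cum[-1]))]
```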
>
>> 3) The last idea I had was to instead of returning the value we could
>> return the index of which position to get as well as the actual value.
>> index = 0
>> for x in cummassf:
>>     if randVal < x:
>>         return (index, pmf[index])
>>     index += 1
>
> Use enumerate; it's a builtin for a reason.
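For what it's worth, enumerate makes the index bookkeeping disappear and
sidesteps the off-by-one risk of a manual counter (a sketch of the same
loop as a standalone helper):

```python
def first_exceeding(cummassf, rand_val):
    """Return the index of the first cumulative mass exceeding rand_val."""
    for index, mass in enumerate(cummassf):
        if rand_val < mass:
            return index
    raise ValueError("rand_val is not below the total mass")
```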
>
> Meanwhile, what use is returning the weight at all? What would a caller do with it?
>
> But again, why are you trying to write this from scratch if someone already wrote a working version?
>
>>> Finally, this is clearly a weighted randrange (with a fixed min of 0), but it claims to be a weighted choice. Obviously you can build the latter out of the former (e.g., take a val:weight mapping, call randweighted on the values, then do next(islice(keys, result, None)), but you do need to actually build it (and show what the API for it will be) if that's the motivating reason. (Also, I think an _unnormalized_ val:weight mapping might be a better API--after all, that means you can immediately call it on a Counter--but you probably want to look at what numpy, R, Mathematica, etc. do...)
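The wrapping described there can be sketched concretely (names are
illustrative; randweighted here is a fixed-up, index-returning version of
the function under discussion):

```python
import random
from bisect import bisect
from itertools import accumulate, islice

def randweighted(weights):
    """Weighted randrange: an index in range(len(weights))."""
    cum = list(accumulate(weights))
    return bisect(cum, random.random() * cum[-1])

def weighted_choice(mapping):
    """Choose a key from an unnormalized val:weight mapping (e.g. a
    Counter).  keys() and values() iterate in a consistent order, so the
    index from randweighted lines up with the corresponding key."""
    index = randweighted(list(mapping.values()))
    return next(islice(mapping.keys(), index, None))
```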
>>>
>>> There was a ticket that got started before. It has since stalled because I believe the original author intended the library to get called over and over, so they came up with generators, which would be more efficient when passing in the same list. Otherwise there isn't really a point to using generators in my opinion.
>>>
>>> http://bugs.python.org/issue18844
>>>
>>>
>>> Why not just pick up from their original implementation, fix any bugs, and propose that, instead of starting over from scratch?
>> I've never contributed to Python before, so I'm honestly not
>> sure how to go about updating the previous ticket. I just knew
>> how to find it.
>> I did check out the source code from mercurial and started asking on
>> irc about it. Someone there suggested I present my idea here.
>> I wanted to give a simple version of what the code would perhaps look
>> like in the python-ideas mailing list and also discuss the main api of
>> what the function would do. I just believe it is a really useful
>> function and having it available for many users would save time.
>
> Sure, but you don't seem to be presenting a simplified version of the code you found, but instead writing completely different code that doesn't work, doesn't have the API you want, and doesn't make it clear what you're trying to do. Again, why not just post working code with the right interface by copying the existing implementation or stripping it down?
>
> Meanwhile, there definitely is a point to the generators. It's very common to want to do N random choices in a row. For unweighted choice, that's fine; just call choice N times. But for weighted choice, there's a linear-time setup followed by a sublinear-time choice function, so just calling the whole thing N times will take O(NM) time, while doing the setup only once takes O(M+NlogM) or something. Most of the functions from NumPy and other languages that I found in a quick google do this by returning N values instead of 1, which works, but it seems like a generator is a much more pythonic way of implementing the same idea.
>
> Also, it doesn't seem to meaningfully complicate things. With your implementation (once you get it working), all you have to do is wrap the last bit in a "while True:" and change the "return" to "yield" and you're done.
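That transformation looks roughly like this: the cumulative list is built
once, and every draw afterwards is only the cheap search (an illustrative
sketch):

```python
import random
from bisect import bisect
from itertools import accumulate

def weighted_choice_generator(weights):
    """Yield weighted indices forever: O(n) setup, O(log n) per value."""
    cum = list(accumulate(weights))   # built once, reused for every draw
    while True:
        yield bisect(cum, random.random() * cum[-1])

# N draws cost O(n + N log n) instead of O(N * n):
# gen = weighted_choice_generator([1, 2, 7])
# samples = [next(gen) for _ in range(5)]
```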
>
>