random number generator, entropy and pickling

Hi there,

We are currently having a discussion on the scikits.learn mailing list about which patterns to adopt for random number generation. One thing that is absolutely clear is that making the stream of random numbers reproducible is critical. We have several objects that can serve as random variate generators. So far, we instantiate these objects with an optional seed or PRNG argument, as in:

    def __init__(self, prng=None):
        if prng is None:
            prng = np.random
        self.prng = prng

The problem with this pattern is that np.random doesn't pickle, and therefore the objects do not pickle by default. A bit of pickling magic would solve this, but we'd rather avoid it.

We thought that we could simply have a PRNG per object, as in:

    def __init__(self, prng=None):
        if prng is None:
            prng = np.random.RandomState()
        self.prng = prng

I don't like this option, because it means that with a given piece of code, setting the seed of numpy's PRNG isn't enough to make it reproducible. I couldn't retrieve a handle on a picklable instance for the global PRNG.

The only option I can see would be to use the global numpy PRNG to seed an instance-specific RandomState, as in:

    def __init__(self, prng=None):
        if prng is None:
            # RandomState expects an integer seed, so draw one
            # with randint rather than random()
            prng = np.random.RandomState(np.random.randint(2**31))
        self.prng = prng

That way, seeding the global PRNG really does control the full random number generation. I am wondering if it would have an adverse consequence on the entropy of the stream of random numbers. Does anybody have suggestions? Any advice?

Cheers,
Gael
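For concreteness, a minimal sketch of the pickling asymmetry described above: the np.random module object cannot be pickled, while a RandomState instance round-trips with its exact internal state.

    import pickle
    import numpy as np

    # Pickling the module object fails: modules are not picklable.
    try:
        pickle.dumps(np.random)
    except TypeError as err:
        print("np.random does not pickle:", err)

    # A RandomState instance, by contrast, pickles fine and restores
    # its exact state, so draws continue identically after a round trip.
    prng = np.random.RandomState(42)
    first_draw = prng.randint(100)
    restored = pickle.loads(pickle.dumps(np.random.RandomState(42)))
    assert restored.randint(100) == first_draw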

On Mon, Apr 25, 2011 at 11:57, Gael Varoquaux <gael.varoquaux@normalesup.org> wrote:
Hi there,
We are currently having a discussion on the scikits.learn mailing list about which patterns to adopt for random number generation. One thing that is absolutely clear is that making the stream of random numbers reproducible is critical. We have several objects that can serve as random variate generators. So far, we instantiate these objects with an optional seed or PRNG argument, as in:
    def __init__(self, prng=None):
        if prng is None:
            prng = np.random
        self.prng = prng
The problem with this pattern is that np.random doesn't pickle, and therefore the objects do not pickle by default. A bit of pickling magic would solve this, but we'd rather avoid it.
We thought that we could simply have a PRNG per object, as in:
    def __init__(self, prng=None):
        if prng is None:
            prng = np.random.RandomState()
        self.prng = prng
I don't like this option, because it means that with a given piece of code, setting the seed of numpy's PRNG isn't enough to make it reproducible.
I couldn't retrieve a handle on a picklable instance for the global PRNG.
It's accessible as np.random.mtrand._rand, though we have kept it "private" for a reason. Option (a) from the original thread on scikits-learn-general, "use your own default global RandomState instance in scikits.learn", would be preferable.
The only option I can see would be to use the global numpy PRNG to seed an instance specific RandomState, as in:
    def __init__(self, prng=None):
        if prng is None:
            # RandomState expects an integer seed, so draw one
            # with randint rather than random()
            prng = np.random.RandomState(np.random.randint(2**31))
        self.prng = prng
That way, seeding the global PRNG really does control the full random number generation. I am wondering if it would have an adverse consequence on the entropy of the stream of random numbers. Does anybody have suggestions? Any advice?
Use a single, common default PRNG, either np.random.mtrand._rand or your own. Don't use multiple seeds from a PRNG.

Use a utility function to avoid repeating yourself, even if it's just a one-liner. In this case, it's important that everyone do exactly the same thing for consistency, both inside scikits.learn and in code that uses or extends scikits.learn. The best way to ensure that is to provide a utility function as the One, Obvious Way To Do It. Note that if you do hide the details behind a utility function, I would remove my objection to using np.random.mtrand._rand. ;-)

--
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth." -- Umberto Eco
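A sketch of the utility function Robert suggests, mapping the three inputs Gael mentions (None, an integer seed, or a RandomState instance) onto a RandomState; the name check_random_state is illustrative:

    import numbers
    import numpy as np

    def check_random_state(seed):
        """Turn seed into a RandomState instance (illustrative helper).

        None        -> the single, common default PRNG (here the global one)
        int         -> a new RandomState seeded with seed
        RandomState -> passed through unchanged
        """
        if seed is None:
            return np.random.mtrand._rand
        if isinstance(seed, (numbers.Integral, np.integer)):
            return np.random.RandomState(seed)
        if isinstance(seed, np.random.RandomState):
            return seed
        raise ValueError("%r cannot be used to seed a RandomState" % seed)

With such a helper, every object resolves its PRNG argument through the same code path, which is exactly the consistency argument above.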

On Mon, Apr 25, 2011 at 9:57 AM, Gael Varoquaux <gael.varoquaux@normalesup.org> wrote:
We thought that we could simply have a PRNG per object, as in:
    def __init__(self, prng=None):
        if prng is None:
            prng = np.random.RandomState()
        self.prng = prng
I don't like this option, because it means that with a given piece of code, setting the seed of numpy's PRNG isn't enough to make it reproducible.
I couldn't retrieve a handle on a picklable instance for the global PRNG.
The only option I can see would be to use the global numpy PRNG to seed an instance specific RandomState, as in:
    def __init__(self, prng=None):
        if prng is None:
            # RandomState expects an integer seed, so draw one
            # with randint rather than random()
            prng = np.random.RandomState(np.random.randint(2**31))
        self.prng = prng
That way, seeding the global PRNG really does control the full random number generation. I am wondering if it would have an adverse consequence on the entropy of the stream of random numbers. Does anybody have suggestions? Any advice?
If code A relies on code B (e.g., some numpy function) and code B changes, then the stream of random numbers will no longer be the same. The point here is that the user wrote code A but depended on code B, and even though code A was unchanged, their random numbers were not the same.

The situation is improved if scikits.learn used its own global RandomState instance. Then code A will at least give the same stream of random numbers for a fixed version of scikits.learn. It should be made very clear, though, that the data stream cannot be expected to be the same across versions.

As to each object having its own RandomState instance, I definitely see that it makes restoring the overall state of a piece of code harder, but perhaps utility functions could make this easier. I can imagine that users might want to arbitrarily set the seed for a particular object in the midst of a larger piece of code. Perhaps the user is testing known failure cases of an algorithm A interacting with algorithm B. If the user wants to loop over known seeds which cause algorithm A to fail but needs algorithm B to keep its seed fixed, then it seems like having a global seed makes this more difficult (see the sketch below). In that sense, it might be desirable to have "independent" PRNGs. On the other hand, maybe that is an uncommon use case that could be handled through manually setting the seed.

Post back on what you guys decide.
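A sketch of that seed-sweep scenario using per-component PRNGs; run_pipeline and the seed values are purely illustrative:

    import numpy as np

    def run_pipeline(seed_a, seed_b=0):
        # Each component gets its own stream: A's seed varies per trial,
        # B's stays fixed, so sweeping A's seeds never perturbs B.
        prng_a = np.random.RandomState(seed_a)
        prng_b = np.random.RandomState(seed_b)
        return prng_a.rand(3), prng_b.rand(3)

    for seed in (1, 7, 42):  # hypothetical known-failure seeds for A
        a_draws, b_draws = run_pipeline(seed)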

On Mon, Apr 25, 2011 at 11:05:05AM -0700, T J wrote:
If code A relies on code B (e.g., some numpy function) and code B changes, then the stream of random numbers will no longer be the same. The point here is that the user wrote code A but depended on code B, and even though code A was unchanged, their random numbers were not the same.
Yes, that's exactly why we want the different objects to be able to receive their own PRNG.
The situation is improved if scikits.learn used its own global RandomState instance. Then code A will at least give the same stream of random numbers for a fixed version of scikits.learn. It should be made very clear, though, that the data stream cannot be expected to be the same across versions.
The use case that we are trying to cater for, with the global PRNG, is Mr. Joe Average, who is used to setting the numpy PRNG to control what is going on. In my experience, the less you need to teach Mr. Joe A., the better (I am not dumbing down Joe A., just acknowledging the fact that he probably has many other things to worry about).
As to each object having its own RandomState instance, I definitely see that it makes restoring the overall state of a piece of code harder, but perhaps utility functions could make this easier.
That's what we are leaning toward: a utility function that by default returns the numpy PRNG object, but enables the use of specific PRNGs or seeds (sketched below). In other words, we are thinking of following Robert's suggestion (option 'a' in the original mail, but enriched with Robert's input on mtrand._rand). We'll probably wait a bit for feedback before making a decision.

Thanks for all your input,
G
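With a helper like the check_random_state sketch earlier in the thread, constructors reduce to one line; a sketch of the intended usage (Estimator is an illustrative name):

    import numpy as np
    # assumes the check_random_state helper sketched earlier

    class Estimator(object):
        def __init__(self, prng=None):
            # one call resolves all three accepted inputs uniformly
            self.prng = check_random_state(prng)

    Estimator().prng                            # -> the global numpy PRNG
    Estimator(0).prng                           # -> a fresh RandomState(0)
    Estimator(np.random.RandomState(3)).prng    # -> the caller's own instance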

On Mon, Apr 25, 2011 at 13:15, Gael Varoquaux <gael.varoquaux@normalesup.org> wrote:
On Mon, Apr 25, 2011 at 11:05:05AM -0700, T J wrote:
If code A relies on code B (e.g., some numpy function) and code B changes, then the stream of random numbers will no longer be the same. The point here is that the user wrote code A but depended on code B, and even though code A was unchanged, their random numbers were not the same.
Yes, that's exactly why we want the different objects to be able to receive their own PRNG.
But seriously, they are running A+B, the combination of A and B. If A+B changes to A+B', then the results may be different. That's to be expected.
The situation is improved if scikits.learn used its own global RandomState instance. Then code A will at least give the same stream of random numbers for a fixed version of scikits.learn. It should be made very clear, though, that the data stream cannot be expected to be the same across versions.
The use case that we are trying to cater for, with the global PRNG, is Mr. Joe Average, who is used to setting the numpy PRNG to control what is going on.
Honestly, they really shouldn't be, except as a workaround to poorly-written functions that don't let you pass in your own PRNG. Someone snuck in the module-level alias to the global PRNG's seed() method when I wasn't paying attention. :-)

--
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth." -- Umberto Eco

On Mon, Apr 25, 2011 at 01:23:12PM -0500, Robert Kern wrote:
Yes, that's exactly why we want the different objects to be able to receive their own PRNG.
But seriously, they are running A+B, the combination of A and B. If A+B changes to A+B', then the results may be different. That's to be expected.
Fair enough. Let's say the example was not ideal, but we still want to be able to control the random number generation of an algorithm independently of what's going on elsewhere. That's why we are happy to be able to have a PRNG dedicated to that processing pipeline. I think that everybody agrees with that.
The use case that we are trying to cater for, with the global PRNG, is Mr. Joe Average, who is used to setting the numpy PRNG to control what is going on.
Honestly, they really shouldn't be, except as a workaround to poorly-written functions that don't let you pass in your own PRNG.
Right, but many users transition from Matlab, where they learn this pattern. I am not interested in fighting against users' behavior unless I have a very good reason. What they do in their code is their problem.

G