[Tutor] A file containing a string of 1 billion random digits.
Steven D'Aprano
steve at pearwood.info
Sun Jul 18 14:49:29 CEST 2010
On Sun, 18 Jul 2010 08:30:05 pm Richard D. Moores wrote:
> > Taking the string '555', you should get two digraphs: 55_ and _55.
>
> That seems wrong to me. When I search on '999999' and there's a
> '9999999' I don't want to think I've found 2 instances of '999999'.
> But that's just my preference. Instances should be distinct, IMO,
> and not overlap.

I think we're talking about different things here. You're (apparently)
interested in searching for patterns, in which case looking for
non-overlapping patterns is perfectly fine. I'm talking about testing
the randomness of the generator by counting the frequency of digraphs
and trigraphs, in which case you absolutely do want them to overlap.
Otherwise, you're throwing away every second digraph, or two out of
every three trigraphs, which could potentially hide a lot of
non-randomness.

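
To make the distinction concrete, here is a minimal sketch of both ways of counting. The helper names ngram_counts and nonoverlapping_ngram_counts are made up for this illustration, and the slicing approach is only one of several possibilities; it has not been tuned for a billion-digit string.

```python
from collections import Counter

def ngram_counts(digits, n):
    """Count every overlapping run of n consecutive characters."""
    return Counter(digits[i:i + n] for i in range(len(digits) - n + 1))

def nonoverlapping_ngram_counts(digits, n):
    """Count only the runs that start at multiples of n (no overlap)."""
    return Counter(digits[i:i + n] for i in range(0, len(digits) - n + 1, n))

sample = "555"
# Overlapping windows start at index 0 and index 1: 55_ and _55.
overlapping = ngram_counts(sample, 2)
# Non-overlapping windows start at index 0 only; the second digraph is lost.
disjoint = nonoverlapping_ngram_counts(sample, 2)

# str.count() counts non-overlapping matches, which suits pattern
# searching: seven 9s contain only one distinct run of six 9s.
hits = "9999999".count("999999")
```

For a randomness test you would compare the overlapping counts against the expected frequency (roughly 1 in 100 for each digraph of decimal digits), whereas str.count() already gives the distinct, non-overlapping matches wanted when searching.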
> >> I was surprised that I could read in the whole billion file with
> >> one gulp without running out of memory.
> >
> > Why? One billion bytes is less than a GB. It's a lot, but not
> > *that* much.
>
> I earlier reported that my laptop couldn't handle even 800 million.

What do you mean, "couldn't handle"? Couldn't handle 800 million of
what? Obviously not bytes, because your laptop *can* handle well over
800 million bytes. It has 4GB of memory, after all :)

There's a big difference in memory usage between (say):

    data = "1"*10**9  # a single string of one billion characters

and

    data = ["1"]*10**9  # a list of one billion one-character strings

or even

    number = 10**(1000000000)-1  # a one billion digit longint

This is just an example, of course. As they say, the devil is in the
details.

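
If you want to see the difference for yourself, here is a rough sketch using sys.getsizeof, scaled down to one million items so it runs quickly. The exact figures are an implementation detail of CPython and vary between versions and between 32-bit and 64-bit builds, so treat them as indicative only. Note also that getsizeof on a list reports the list's own array of references, not the objects it refers to.

```python
import sys

n = 10**6  # scaled down from 10**9 so that this runs quickly

as_string = "1" * n    # one string object holding n characters
as_list = ["1"] * n    # one list holding n references to a string
as_int = 10**n - 1     # one integer with n decimal digits

string_bytes = sys.getsizeof(as_string)
list_bytes = sys.getsizeof(as_list)  # the list itself, not its elements
int_bytes = sys.getsizeof(as_int)

print("string:", string_bytes, "bytes")
print("list:  ", list_bytes, "bytes")
print("int:   ", int_bytes, "bytes")
```

On a typical CPython build the list comes out several times larger than the string, because each element costs a full pointer rather than a single byte.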
> >> Memory usage went to 80% (from
> >> the usual 35%), but no higher except at first, when I saw 98% for
> >> a few seconds, and then a drop to 78-80% where it stayed.
> >
> > That suggests to me that your PC probably has 2GB of RAM. Am I
> > close?
>
> No. 4GB.

Interesting. Presumably the rest of the memory is being used by the
operating system and other running applications and background
processes.

--
Steven D'Aprano