[Tutor] A file containing a string of 1 billion random digits.

Steven D'Aprano steve at pearwood.info
Sun Jul 18 14:49:29 CEST 2010


On Sun, 18 Jul 2010 08:30:05 pm Richard D. Moores wrote:

> > Taking the string '555', you should get two digraphs: 55_ and _55.
>
> That seems wrong to me. When I search on '999999' and there's a
> '9999999' I don't want to think I've found 2 instances of '999999'.
> But that's just my preference.  Instances should be distinct, IMO,
> and not overlap.

I think we're talking about different things here. You're (apparently) 
interested in searching for patterns, in which case looking for 
non-overlapping patterns is perfectly fine. I'm talking about testing 
the randomness of the generator by counting the frequency of digraphs 
and trigraphs, in which case you absolutely do want them to overlap. 
Otherwise, you're throwing away every second digraph, or two out of 
every three trigraphs, which could potentially hide a lot of 
non-randomness.
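
To make the difference concrete, here is a minimal sketch of both ways 
of counting. The function name count_ngrams and its overlap flag are my 
own invention for illustration, not something from earlier in this 
thread, and for a billion-digit file you would want to process the data 
in chunks rather than in one pass like this.

```python
from collections import Counter

def count_ngrams(digits, n, overlap=True):
    """Count the n-character substrings of digits.

    With overlap=True the window advances one character at a time;
    with overlap=False it advances n characters, so windows never share
    a character.
    """
    step = 1 if overlap else n
    return Counter(digits[i:i + n]
                   for i in range(0, len(digits) - n + 1, step))

# '555' contains two overlapping digraphs, but only one non-overlapping one.
print(count_ngrams("555", 2))                 # Counter({'55': 2})
print(count_ngrams("555", 2, overlap=False))  # Counter({'55': 1})

# A strictly alternating sequence: the non-overlapping count never sees '10'.
print(count_ngrams("0101010101", 2))                 # '01' five times, '10' four times
print(count_ngrams("0101010101", 2, overlap=False))  # '01' five times only
```

The last pair shows the kind of thing that gets hidden: counted without 
overlap, the alternating sequence appears to contain no '10' at all.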


> >> I was surprised that I could read in the whole billion file with
> >> one gulp without running out of memory.
> >
> > Why? One billion bytes is less than a GB. It's a lot, but not
> > *that* much.
>
> I earlier reported that my laptop couldn't handle even 800 million.

What do you mean, "couldn't handle"? Couldn't handle 800 million of 
what? Obviously not bytes, because your laptop *can* handle well over 
800 million bytes. It has 4GB of memory, after all :)

There's a big difference in memory usage between (say):

data = "1"*10**9  # a single string of one billion characters

and 

data = ["1"]*10**9  # a list of one billion references to a string

or even

number = 10**(1000000000)-1  # a one billion digit longint

This is just an example, of course. As they say, the devil is in the 
details.
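
If you want to see the difference without exhausting your memory, a 
scaled-down sketch along these lines should do. The helper name 
object_sizes is made up for this example, and I'm assuming CPython; the 
exact byte counts depend on the Python version and on whether the build 
is 32-bit or 64-bit.

```python
import sys

def object_sizes(n):
    """Return sys.getsizeof() for three ways of holding n digits.

    Note that getsizeof() reports only the object itself. For the list
    that means the array of pointers, not the string they point to.
    """
    return {
        "str": sys.getsizeof("1" * n),     # one string of n characters
        "list": sys.getsizeof(["1"] * n),  # n pointers to one string
        "int": sys.getsizeof(10**n - 1),   # one integer with n digits
    }

# 100,000 digits rather than one billion, so this runs in a moment.
for name, size in object_sizes(10**5).items():
    print(name, size, "bytes")
```

On CPython the list comes out several times larger than the string, 
because each item costs a whole pointer rather than a single byte, and 
the integer comes out smaller than the string, because it is stored in 
binary rather than one character per digit.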


> >> Memory usage went to 80% (from
> >> the usual 35%), but no higher except at first, when I saw 98% for
> >> a few seconds, and then a drop to 78-80% where it stayed.
> >
> > That suggests to me that your PC probably has 2GB of RAM. Am I
> > close?
>
> No. 4GB.

Interesting. Presumably the rest of the memory is being used by the 
operating system and other running applications and background 
processes.



-- 
Steven D'Aprano

