[Tutor] A file containing a string of 1 billion random digits.

Richard D. Moores rdmoores at gmail.com
Sun Jul 18 15:22:15 CEST 2010


On Sun, Jul 18, 2010 at 05:49, Steven D'Aprano <steve at pearwood.info> wrote:
> On Sun, 18 Jul 2010 08:30:05 pm Richard D. Moores wrote:
>
>> > Taking the string '555', you should get two digraphs: 55_ and _55.
>>
>> That seems wrong to me. When I search on '999999' and there's a
>> '9999999' I don't want to think I've found 2 instances of '999999'.
>> But that's just my preference.  Instances should be distinct, IMO,
>> and not overlap.
>
> I think we're talking about different things here.

Yes. I was as interested in finding non-overlapping patterns as in
testing randomness, I suppose because we wouldn't have been sure about
the randomness anyway.

>You're (apparently)
> interested in searching for patterns, in which case looking for
> non-overlapping patterns is perfectly fine. I'm talking about testing
> the randomness of the generator by counting the frequency of digraphs
> and trigraphs, in which case you absolutely do want them to overlap.
> Otherwise, you're throwing away every second digraph, or two out of
> every three trigraphs, which could potentially hide a lot of
> non-randomness.
>
>
>> >> I was surprised that I could read in the whole billion file with
>> >> one gulp without running out of memory.
>> >
>> > Why? One billion bytes is less than a GB. It's a lot, but not
>> > *that* much.
>>
>> I earlier reported that my laptop couldn't handle even 800 million.
>
> What do you mean, "couldn't handle"? Couldn't handle 800 million of
> what? Obviously not bytes,

I meant what the context implied. Bytes. Look back in this thread to
see my description of my laptop's problems.
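
If memory ever did become the bottleneck, the file wouldn't have to be
read in one gulp anyway; the digraph counts can be accumulated chunk by
chunk. A sketch (the filename and chunk size are just placeholders):

from collections import defaultdict

counts = defaultdict(int)
leftover = ""                   # carry the last char so digraphs spanning chunks aren't missed
with open("billion_digits.txt") as f:
    while True:
        chunk = f.read(2**20)   # about 1 MB at a time
        if not chunk:
            break
        data = leftover + chunk
        for i in range(len(data) - 1):
            counts[data[i:i+2]] += 1
        leftover = data[-1]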

>because your laptop *can* handle well over
> 800 million bytes. It has 4GB of memory, after all :)
>
> There's a big difference in memory usage between (say):
>
> data = "1"*10**9  # a single string of one billion characters
>
> and
>
> data = ["1"]*10**9  # a list of one billion separate strings
>
> or even
>
> number = 10**(1000000000)-1  # a one billion digit longint
>
> This is just an example, of course. As they say, the devil is in the
> details.

Overkill, Steve.
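
The point stands, though. sys.getsizeof() shows the difference on
smaller versions of the same objects (a sketch; exact figures depend on
the Python version and on a 32- vs 64-bit build):

import sys

n = 10**6   # a million instead of a billion, so this runs in a moment
print(sys.getsizeof("1" * n))      # one string: roughly n bytes plus a small header
print(sys.getsizeof(["1"] * n))    # the list alone: millions of bytes of pointers
print(sys.getsizeof(10**n - 1))    # a million-digit longint: a few hundred KB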

>> >> Memory usage went to 80% (from
>> >> the usual 35%), but no higher except at first, when I saw 98% for
>> >> a few seconds, and then a drop to 78-80% where it stayed.
>> >
>> > That suggests to me that your PC probably has 2GB of RAM. Am I
>> > close?
>>
>> No. 4GB.
>
> Interesting. Presumably the rest of the memory is being used by the
> operating system and other running applications and background
> processes.

I suppose so.

Dick

