[Tutor] A file containing a string of 1 billion random digits.

Sun Jul 18 12:30:05 CEST 2010

On Sun, Jul 18, 2010 at 02:26, Steven D'Aprano <steve at pearwood.info> wrote:
> On Sun, 18 Jul 2010 06:49:39 pm Richard D. Moores wrote:
>
>> I might try
>> trigraphs where the 2nd digit is 2 more than the first, and the third
>> 2 more than the 2nd. E.g. '024', '135', '791', '802'.
>
> Why the restriction? There are only 1000 different trigraphs (10*10*10),
> which is nothing.

Just to see if I could do it.  It seemed interesting.
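
For reference, the restricted set is small. Here is a minimal sketch that lists it, assuming the "2 more" rule wraps around modulo 10 (which is what '791' and '802' imply). The function name step_trigraphs is made up for illustration and is not from my script.

```python
def step_trigraphs(step=2):
    """Return the ten trigraphs whose digits each increase by `step`, modulo 10."""
    return [
        "".join(str((first + k * step) % 10) for k in range(3))
        for first in range(10)
    ]


if __name__ == "__main__":
    print(step_trigraphs())
```

With the default step, the list includes the four examples quoted above: '024', '135', '791' and '802'.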

>> Or maybe I've
>> had enough. BTW Steve, my script avoids the problem you mentioned, of
>> counting 2 '55's in a '555' string. I get only one, but 2 in '5555'.
>
> Huh? What problem did I mention?

Sorry, that was Luke.

> Taking the string '555', you should get two digraphs: 55_ and _55.

That seems wrong to me. When I search on '999999' and there's a
'9999999' I don't want to think I've found 2 instances of '999999'.
But that's just my preference.  Instances should be distinct, IMO, and
not overlap.
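
To make the two conventions concrete, here is a small sketch. It is not necessarily how my script works; it relies only on the fact that str.count() counts non-overlapping occurrences, while a str.find() loop that advances one position at a time counts overlapping ones. Both function names are hypothetical.

```python
def count_non_overlapping(text, pattern):
    """Count distinct, non-overlapping occurrences (what str.count does)."""
    return text.count(pattern)


def count_overlapping(text, pattern):
    """Count every position where `pattern` starts, overlaps included."""
    count = 0
    start = text.find(pattern)
    while start != -1:
        count += 1
        # Resume one character later so overlapping matches are found too.
        start = text.find(pattern, start + 1)
    return count


if __name__ == "__main__":
    for text in ("555", "5555"):
        print(text, count_non_overlapping(text, "55"), count_overlapping(text, "55"))
```

Under the first convention '555' holds one '55' and '5555' holds two; under the second they hold two and three.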

> In '5555' you should get three: 55__, _55_, __55. I'd do something like
> this (untested):
>
> trigraphs = {}
> with open('digits') as f:
>     trigraph = f.read(3)  # read the first three digits
>     trigraphs[trigraph] = 1
>     while True:
>         c = f.read(1)  # read one more digit
>         if not c:
>             break  # end of file
>         # slide the window: drop the oldest digit, append the new one
>         trigraph = trigraph[1:] + c
>         trigraphs[trigraph] = trigraphs.get(trigraph, 0) + 1
>
>> See line 18, in the while loop.
>>
>> I was surprised that I could read in the whole billion file with one
>> gulp without running out of memory.
>
> Why? One billion bytes is less than a GB. It's a lot, but not *that*
> much.

I earlier reported that my laptop couldn't handle even 800 million.
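
For a machine that can't take the whole file in one gulp, a middle road between reading everything and reading one character at a time is to read fixed-size chunks and carry the last two characters over to the next chunk. The sketch below counts overlapping trigraphs that way. The function name, the chunk_size parameter and its default are assumptions for illustration, and it assumes the stream contains only digits (no trailing newline).

```python
import io
from collections import Counter


def count_trigraphs(stream, chunk_size=1 << 20):
    """Count overlapping trigraphs in a text stream, one chunk at a time."""
    counts = Counter()
    tail = ""  # last two characters of the data seen so far
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        data = tail + chunk
        # Every window here ends inside the new chunk, so none is counted twice.
        for i in range(len(data) - 2):
            counts[data[i:i + 3]] += 1
        tail = data[-2:]
    return counts


if __name__ == "__main__":
    # A tiny in-memory stream stands in for the real digits file.
    print(count_trigraphs(io.StringIO("5555"), chunk_size=3))
```

Only one chunk plus two characters is held in memory at a time, so peak usage depends on chunk_size rather than on the size of the file.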

>> Memory usage went to 80% (from
>> the usual 35%), but no higher except at first, when I saw 98% for a
>> few seconds, and then a drop to 78-80% where it stayed.
>
> That suggests to me that your PC probably has 2GB of RAM. Am I close?

No. 4GB.
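
A rough check of the numbers, as a sketch only. It assumes "4GB" means 4 * 2**30 bytes and that the percentages I saw are fractions of physical RAM; neither is confirmed anywhere in this thread.

```python
GIB = 2 ** 30                      # bytes in one GiB

file_bytes = 10 ** 9               # one billion digits, one byte each on disk
ram_bytes = 4 * GIB                # assumption: "4GB" means 4 GiB

file_gib = file_bytes / float(GIB)                 # size of the file in GiB
file_share = file_bytes / float(ram_bytes)         # file as a fraction of RAM
observed_jump_gib = (0.80 - 0.35) * ram_bytes / GIB  # 35% -> 80% of RAM

if __name__ == "__main__":
    print("file size in GiB: %.2f" % file_gib)
    print("file as share of RAM: %.0f%%" % (file_share * 100))
    print("observed jump in GiB: %.1f" % observed_jump_gib)
```

On those assumptions the raw file is a bit under a quarter of RAM, while the rise from 35% to 80% is about twice the file's size.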