[Tutor] A file containing a string of 1 billion random digits.
Richard D. Moores
rdmoores at gmail.com
Sun Jul 18 02:19:06 CEST 2010
On Sat, Jul 17, 2010 at 15:36, ALAN GAULD <alan.gauld at btinternet.com> wrote:
>> Now that you see what I want to do with 1 billion random digits,
>> please give me your suggestion(s). As I mentioned before,
>> I'm very new to reading from and writing to files.
>
> The way old text editors used to do this (in the days when we only had 16K of RAM!)
> was to use buffers and pages. Let's say you have a 100K buffer and read in the
> first buffer full of text. When you get down to, say, 75K, you delete the first 50K and
> load the next 50K from the file. Your cursor is now 25K into a new set of 100K.
>
> This allows you to page back a certain amount and page forward a lot. And so
> you progress, losing 50% from the top or bottom and loading the next/previous chunk
> into RAM. It's non-trivial to get right, but in your case you only want to
> process sequentially, so you only need to look forward; not quite so bad...
>
> You need markers to keep your location, and other markers to keep track of
> the buffer's location within the file.
>
> If you really are only doing sequential processing you can dispense with
> the buffer/pages and just read small chunks such that you always have two
> in memory at once. The chunks might only be a few KB in size, depending
> on the longest sequence you are looking for (a chunk of twice that size is
> a good guide).
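Just so I'm sure I follow the two-chunks-at-once idea before I ask my
question below, here's how I picture the reading loop (the chunk size
and file name are made up):

CHUNK = 4096  # made-up size; should be at least twice the longest sequence sought

with open('digits.txt') as f:    # hypothetical name for the big digit file
    prev = f.read(CHUNK)         # first chunk
    while prev:
        curr = f.read(CHUNK)     # next chunk ('' at end of file)
        window = prev + curr     # two chunks in memory at once
        # ... search window here ...
        prev = curr              # slide forward one chunk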
The longest sequence I can imagine finding with, say, a 0.1 probability
is less than 30 bytes, I'd guess.
To quote myself from my earlier reply to you:
"If I processed in chunks, they would need to overlap
slightly, because the sequences to be found, such as 05251896 (my
father's birth date) could overlap the boundary between 2 neighboring
chunks, and thus not be located."
What I don't fully understand about your "you always have two [chunks]
in memory at once" is whether it takes care of the overlap I think I
need. I don't want you or anyone to give me the whole answer, but
could you clarify a bit about keeping 2 chunks in memory? Let's say
the chunks, in order, and without overlap, are A, B, C, D, E, ... So
I'd search AB, then BC, then CD, then DE, ... ? But then what if I
find an instance of, say, my father's birthdate in B while searching
AB? Then I'd find the same instance again in BC. How do I prevent it
from being counted as 2 instances?
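(Thinking out loud: maybe the answer is to count only the matches that
start in the first chunk of each pair? Extending my made-up sketch
from above; since a date like 05251896 can't overlap itself,
re.finditer should count each occurrence exactly once this way.)

import re

CHUNK = 4096                        # made-up size, at least twice the pattern length
pattern = re.compile('05251896')    # the sequence to count

count = 0
with open('digits.txt') as f:       # hypothetical file name
    prev = f.read(CHUNK)
    while prev:
        curr = f.read(CHUNK)
        window = prev + curr        # two chunks in memory at once
        for m in pattern.finditer(window):
            if m.start() < len(prev):   # count a match only when it starts
                count += 1              # in the first chunk, so a match on
        prev = curr                     # the boundary isn't counted twice
print(count)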
> The seek() tell() and read() methods are your friends for this kind of work.
I hope they're not shy about telling me their secrets. Would your
tutorial be a good place to start to inquire of them?
There's this table in chapter 9 of Learning Python, 4th ed., but no
tell(). And a search on 'tell(' gets no hits in the whole book
(the PDF). Is tell() a Python 2.x thing? It's documented as a file
built-in for 2.x
(<http://docs.python.org/library/stdtypes.html?highlight=tell#file.tell>),
but not, it seems, for 3.x. I'm not complaining -- just sayin'.
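(A quick check in the 3.1 interpreter suggests file objects do still
have tell() in 3.x; it just seems to be documented under the io module
now rather than as a built-in file type. File name and contents below
are made up.)

>>> f = open('digits.txt')   # hypothetical file of digits
>>> f.tell()                 # current position, counted from the start
0
>>> f.read(8)                # read the next 8 characters
'05251896'
>>> f.tell()
8
>>> f.seek(0)                # jump back to the start; 3.x returns the new offset
0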
Table 9-2. Common file operations

Operation                              Interpretation
output = open(r'C:\spam', 'w')         Create output file ('w' means write)
input = open('data', 'r')              Create input file ('r' means read)
input = open('data')                   Same as prior line ('r' is the default)
aString = input.read()                 Read entire file into a single string
aString = input.read(N)                Read up to next N characters (or bytes) into a string
aString = input.readline()             Read next line (including \n newline) into a string
aList = input.readlines()              Read entire file into list of line strings (with \n)
output.write(aString)                  Write a string of characters (or bytes) into file
output.writelines(aList)               Write all line strings in a list into file
output.close()                         Manual close (done for you when file is collected)
output.flush()                         Flush output buffer to disk without closing
anyFile.seek(N)                        Change file position to offset N for next operation
for line in open('data'): use line     File iterators read line by line
open('f.txt', encoding='latin-1')      Python 3.0 Unicode text files (str strings)
open('f.bin', 'rb')                    Python 3.0 binary bytes files (bytes strings)
Thanks again, Alan.
Dick