[Tutor] reading random line from a file

Tue Jul 24 03:26:32 CEST 2007

> Significance of number 4096 :
> file is stored in blocks of size 2K/4K/8K (depending
> upon the machine). file seek for an offset goes block
> by block rather than byte by byte. Hence for file size
> < 4096 (assuming you have 4K block size), you will
> anyway end up scanning it entirely so as well load it
> up in memory.

Mmmm... It depends on the file system. FAT/FAT32 will
read a block as small as one sector, i.e. 512 bytes. I think
I read somewhere that NTFS is 4K. A ridiculous waste, I think.
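
Rather than guess, on many systems you can ask the OS what block size
it prefers. A rough sketch, assuming Python 3: st_blksize is only
reported on Unix-like platforms, so the 4096 fallback and the helper
name preferred_block_size are my own assumptions.

```python
import os

def preferred_block_size(path, default=4096):
    """Return the I/O block size the OS reports for path, or default.

    st_blksize exists on Unix-like systems only; elsewhere this falls
    back to the (assumed) default.
    """
    st = os.stat(path)
    return getattr(st, "st_blksize", default)
```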

Anyway... It's dangerous to think one way or the other about
boundaries like that. The only way that 4096 can help you is if you
only start reading on block boundaries, and disable buffering on the OS
level. Otherwise, you will get double and triple buffering occurring.
Perhaps Python takes care of this, but I doubt it. C doesn't by default,
and since Python programmers often don't have the background to
grok how the OS caches reads, it would be extra overhead for a
special case that most aren't aware of.
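
For illustration only, here is roughly what a block-aligned read with
Python's own buffer switched off could look like. Note that
buffering=0 removes only Python's buffer; the OS cache still sits
underneath, so this is not the OS-level disabling mentioned above.
BLOCK and read_block are names I made up, and 4096 is an assumed size.

```python
BLOCK = 4096  # assumed block size; the real value depends on the file system

def read_block(path, block_number, block_size=BLOCK):
    """Read one block-aligned chunk with Python's own buffering turned off."""
    # buffering=0 is only allowed in binary mode and gives a raw file object.
    with open(path, "rb", buffering=0) as raw:
        raw.seek(block_number * block_size)
        # May return fewer bytes than block_size near the end of the file.
        return raw.read(block_size)
```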

Mmmm... The OS will read all of those characters in anyway, right? All 4K.
But if you ask for the data byte by byte, it will copy it to your pointer
byte by byte from the cache instead of copying the whole block at once.

Anyway... all this is making my head hurt because I can't quite remember
how it works. (When I last read about this, I didn't understand its
significance to my programming.)

> But I
> just want to add that since index creation is quite a
> laborious task (in terms of CPU/time) one should do it
> only once (or till file is changed).

Agreed, but it is still better to make the index once at program
start, rather than search through each time a line is requested.

> Thus it should be
> kept on disk and ensure that index is re-created in
> case file changes.

That's a good idea. Especially for large files.
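
Something along these lines might do for the on-disk part. It is only
a sketch: it assumes Python 3, stores the index as JSON next to the
data file, and treats a change in size or modification time as "the
file changed". The ".idx" suffix and the function names are
hypothetical.

```python
import json
import os

def build_index(path):
    """Return the byte offset of each line start, plus the end-of-file offset."""
    offsets = [0]
    with open(path, "rb") as fobj:   # binary mode so len(line) counts bytes
        for line in fobj:
            offsets.append(offsets[-1] + len(line))
    return offsets

def load_index(path, index_path=None):
    """Load a saved index if it still matches the file, else rebuild and save it."""
    index_path = index_path or path + ".idx"   # hypothetical naming scheme
    st = os.stat(path)
    stamp = [st.st_size, st.st_mtime_ns]       # cheap "has the file changed?" check
    try:
        with open(index_path) as f:
            saved = json.load(f)
        if saved["stamp"] == stamp:
            return saved["offsets"]
    except (OSError, ValueError, KeyError, TypeError):
        pass  # missing or unreadable index: fall through and rebuild
    offsets = build_index(path)
    with open(index_path, "w") as f:
        json.dump({"stamp": stamp, "offsets": offsets}, f)
    return offsets
```

The size-and-mtime check is cheap but not bulletproof; a checksum
would be safer at the cost of reading the whole file again.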

> I would like suggestions on index
> creation.

Creating an index is easy. There are many ways. Here is one.

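# fobj must be opened in binary mode ('rb') so that len(line) counts
# bytes; in text mode newline translation can throw the offsets off.
file_index = [0]    # file_index[n] is the byte offset where line n starts
for line in fobj:
    file_index.append(len(line) + file_index[-1])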
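
Once you have the offsets, getting a random line is just a seek and a
readline. A minimal sketch, assuming Python 3, that fobj was opened
with 'rb', and that file_index was built as above; read_line and
random_line are names I picked for illustration.

```python
import random

def read_line(fobj, file_index, n):
    """Return line n (0-based) as bytes, using the offsets in file_index."""
    fobj.seek(file_index[n])
    return fobj.readline()

def random_line(fobj, file_index):
    """Pick one line uniformly at random; fobj must be opened with 'rb'."""
    # The last entry is the end-of-file offset, so there are len - 1 lines.
    n = random.randrange(len(file_index) - 1)
    return read_line(fobj, file_index, n)
```

For an empty file, random.randrange raises ValueError, so check for
that case if it can happen.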