What strategy for random accession of records in massive FASTA file?

Steve Holden steve at holdenweb.com
Sat Jan 15 21:24:56 CET 2005

Bulba! wrote:

> On 14 Jan 2005 12:30:57 -0800, Paul Rubin
> <http://phr.cx@NOSPAM.invalid> wrote:
>>Mmap lets you treat a disk file as an array, so you can randomly
>>access the bytes in the file without having to do seek operations
> Cool!
>>Just say a[234]='x' and you've changed byte 234 of the file to the
>>letter x.  
> However, suppose this element, located more or less in the middle
> of the array, occupies more space after the change, say 2 bytes
> instead of 1. Will flush() need to rewrite half of the mmapped
> file just to add that one byte?
Nope. If you try a[234] = 'banana' you'll get an error message. The mmap 
protocol doesn't support insertion and deletion, only overwriting.

Of course, it's far too complicated to actually *try* this stuff before
pontificating [not]:

  >>> import mmap
  >>> f = file("/tmp/Xout.txt", "r+")
  >>> mm = mmap.mmap(f.fileno(), 200)
  >>> mm[1:10]
'elcome to'
  >>> mm[1] = "banana"
Traceback (most recent call last):
   File "<stdin>", line 1, in ?
IndexError: mmap assignment must be single-character string
  >>> mm[1:10] = 'ishing ::'
  >>> mm[1:10]
'ishing ::'
  >>> mm[1:10] = 'a'
Traceback (most recent call last):
   File "<stdin>", line 1, in ?
IndexError: mmap slice assignment is wrong size
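
The session above is Python 2. On Python 3, file() is gone and mmap
deals in bytes rather than str, so a rough equivalent might look like
the sketch below. The temporary file and its contents are hypothetical
stand-ins for /tmp/Xout.txt, and the function name is made up for
illustration. The exact exception types can differ between versions,
so the sketch catches them broadly.

```python
import mmap
import os
import tempfile


def overwrite_demo():
    """Show that mmap allows same-size overwrites but not insertion."""
    # Hypothetical stand-in for /tmp/Xout.txt: a temp file with known contents.
    fd, path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, "r+b") as f:
            f.write(b"Welcome to the mmap demo file\n")
            f.flush()  # the file must be non-empty on disk before mapping
            with mmap.mmap(f.fileno(), 0) as mm:  # length 0 maps the whole file
                before = mm[1:10]
                mm[1:10] = b"ishing ::"  # same length: plain overwrite
                after = mm[1:10]
                try:
                    mm[1] = b"banana"  # a single index takes one int in Python 3
                    multi_byte_ok = True
                except (TypeError, ValueError, IndexError):
                    multi_byte_ok = False
                try:
                    mm[1:10] = b"a"  # wrong size: would shrink the file
                    resize_ok = True
                except (ValueError, IndexError):
                    resize_ok = False
                size = len(mm)
                mm.flush()
        with open(path, "rb") as f:
            on_disk = f.read()
    finally:
        os.remove(path)
    return before, after, multi_byte_ok, resize_ok, size, on_disk
```

Both rejected assignments leave the mapping at its original length, so
nothing after the changed bytes ever has to move.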

> flush() definitely makes updating less of an issue,  I'm just 
> curious about the cost of writing small changes scattered all 
> over the place back to the large file.
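Some of this depends on whether the mmap is shared or private, of
course: only a shared mapping is ever written back to the file. But
generally speaking you can ignore the overhead, because the operating
system writes back only the pages you actually touched, and you rarely
need to call flush() yourself as long as you don't mix file operations
and mmap (string-style) operations on the same data. The programming
convenience is amazing.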
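
To make the shared/private distinction concrete, here is a minimal
Python 3 sketch, assuming a scratch file of 1 MiB. The function name,
file size and offsets are made up for illustration. It scatters a few
single-byte writes through a shared mapping (mmap.ACCESS_WRITE), then
makes one change through a private copy-on-write mapping
(mmap.ACCESS_COPY) and checks what reached the file.

```python
import mmap
import os
import tempfile


def scattered_writes(size=1 << 20, offsets=(0, 4096, 500_000, (1 << 20) - 1)):
    """Write single bytes at scattered offsets, then test a private mapping."""
    fd, path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, "r+b") as f:
            f.truncate(size)  # extend the empty scratch file to `size` bytes
            # Shared mapping: changes are written through to the file.
            with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_WRITE) as mm:
                for off in offsets:
                    mm[off] = ord("x")
                mm.flush()  # explicit here only so the later read is certain
            # Private mapping: changes stay in memory, the file is untouched.
            with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_COPY) as mm:
                mm[1] = ord("y")
                private_view = mm[1:2]
        with open(path, "rb") as f:
            data = f.read()
    finally:
        os.remove(path)
    written = [data[off:off + 1] for off in offsets]
    return written, data[1:2], private_view, len(data)
```

The file length never changes. The shared writes land at their offsets,
while the byte changed through the private mapping is visible only in
memory.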

> --
> I have come to kick ass, chew bubble gum and do the following:
> from __future__ import py3k
> And it doesn't work.

So make it work :-)

Steve Holden               http://www.holdenweb.com/
Python Web Programming  http://pydish.holdenweb.com/
Holden Web LLC      +1 703 861 4237  +1 800 494 3119
