What strategy for random accession of records in massive FASTA file?
Steve Holden
steve at holdenweb.com
Sat Jan 15 15:24:56 EST 2005
Bulba! wrote:
> On 14 Jan 2005 12:30:57 -0800, Paul Rubin
> <http://phr.cx@NOSPAM.invalid> wrote:
>
>
>>Mmap lets you treat a disk file as an array, so you can randomly
>>access the bytes in the file without having to do seek operations
>
>
> Cool!
>
>
>>Just say a[234]='x' and you've changed byte 234 of the file to the
>>letter x.
>
>
> However.. however.. suppose this element located more or less
> in the middle of an array occupies more space after changing it,
> say 2 bytes instead of 1. Will flush() need to rewrite the half of
> mmaped file just to add that one byte?
>
Nope. If you try a[234] = 'banana' you'll get an error message. The mmap
protocol doesn't support insertion and deletion, only overwriting.
Of course, it's far too complicated to actually *try* this stuff before
pontificating [not]:
>>> import mmap
>>> f = file("/tmp/Xout.txt", "r+")
>>> mm = mmap.mmap(f.fileno(), 200)
>>> mm[1:10]
'elcome to'
>>> mm[1] = "banana"
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
IndexError: mmap assignment must be single-character string
>>> mm[1:10] = 'ishing ::'
>>> mm[1:10]
'ishing ::'
>>> mm[1:10] = 'a'
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
IndexError: mmap slice assignment is wrong size
>>>
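For anyone reading this on a current interpreter, here is a rough Python 3 sketch of the same experiment. It is an illustration, not the session above: the temporary file and its contents ("Welcome to the machine") are made up, the function name demo_overwrite is hypothetical, and in Python 3 the mapping holds bytes rather than str.

```python
import mmap
import os
import tempfile


def demo_overwrite():
    """Overwrite bytes through an mmap and show that resizing is refused."""
    # Assumed sample file; the original /tmp/Xout.txt is not available.
    fd, path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, "r+b") as f:
            f.write(b"Welcome to the machine\n")
            f.flush()
            # Length 0 maps the whole file.
            with mmap.mmap(f.fileno(), 0) as mm:
                before = bytes(mm[1:10])
                # Same-size slice assignment overwrites in place.
                mm[1:10] = b"ishing ::"
                after = bytes(mm[1:10])
                try:
                    # A replacement of a different length is rejected:
                    # the mapping cannot insert or delete bytes.
                    mm[1:10] = b"a"
                    error = None
                except (IndexError, ValueError) as exc:
                    error = type(exc).__name__
                size = len(mm)
        return before, after, error, size
    finally:
        os.remove(path)


if __name__ == "__main__":
    print(demo_overwrite())
```

The mapping keeps its original length throughout, which is the point: overwriting works, growing or shrinking a region does not.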
> flush() definitely makes updating less of an issue, I'm just
> curious about the cost of writing small changes scattered all
> over the place back to the large file.
>
Some of this depends on whether the mmap is shared or private, of
course, but generally speaking you can ignore the overhead: with a shared
mapping the operating system writes modified pages back for you, so you
rarely need to call flush() yourself as long as you don't mix ordinary
file operations with access through the mapping. The programming
convenience is amazing.
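A small Python 3 sketch of the scattered-update case, under some assumptions: the 1 MiB file size, the 4096-byte spacing and the function name scattered_updates are all invented for illustration, and this measures nothing about cost. It only shows single-byte writes spread across a shared mapping ending up in the file.

```python
import mmap
import os
import tempfile


def scattered_updates(size=1 << 20, step=4096):
    """Write one byte every `step` bytes through a shared mapping."""
    fd, path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, "r+b") as f:
            # Extend the empty file to `size` zero bytes.
            f.truncate(size)
            # The default mapping is shared, so changes reach the file.
            with mmap.mmap(f.fileno(), 0) as mm:
                for offset in range(0, size, step):
                    # Indexing a Python 3 mmap takes an int, not a str.
                    mm[offset] = ord("x")
                # Explicit flush, so the read below does not rely on
                # when the operating system writes the pages back.
                mm.flush()
        # Read back with ordinary file I/O only after the mapping is closed.
        with open(path, "rb") as f:
            data = f.read()
        return data.count(b"x"), len(data)
    finally:
        os.remove(path)


if __name__ == "__main__":
    print(scattered_updates())
```

A private mapping (mmap.ACCESS_COPY) would accept the same assignments but leave the file on disk unchanged.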
> --
> I have come to kick ass, chew bubble gum and do the following:
>
> from __future__ import py3k
>
> And it doesn't work.
So make it work :-)
regards
Steve
--
Steve Holden http://www.holdenweb.com/
Python Web Programming http://pydish.holdenweb.com/
Holden Web LLC +1 703 861 4237 +1 800 494 3119