file ops on the fly ?

Alex Martelli aleax at aleax.it
Sat Jul 20 04:56:49 EDT 2002


Shagshag13 wrote:

> 
> "Alex Martelli" <aleax at aleax.it> a écrit dans le message de news:
> 21TZ8.101254$vm5.3505549 at news2.tin.it...
>> Shagshag13 wrote:
>>
>> > hello,
>> >
>> > is it possible (and how) to do "on the fly" reading and updating on
>> > file ?
>>
>> You can open a file for 'r+b' and then call both read and write on it,
>> moving around with seek and possibly tell.  This is hardly ever useful
>> on a text file, though.  But I guess much depends on what you mean by
>> "on the fly" in this context -- maybe the simulation of inplace edits
>> for text files offered by standard module fileinput is all you need?
> 
> thanks, I'm going to look at fileinput...
> 
> and yes, I'm working on a big file (>1 GB).  From a sequential reading (line
> by line) I need to do line insertions and/or line deletions in my file.  And I
> don't wish to create a new file (wasting space).
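
Just to make concrete what that 'r+b' approach buys you, here's a
minimal sketch (file name and offsets are invented, of course) -- note
that it only lets you OVERWRITE bytes in place, never insert or remove
any:

f = open('big.txt', 'r+b')
f.seek(1024)                # move to some offset
old = f.read(10)            # read ten bytes there
f.seek(1024)                # seek again before switching back to writing
f.write('X' * 10)           # overwrite exactly ten bytes -- no more, no less
f.close()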

fileinput's _simulation_ of inplace edits actually creates a new
file.  If you can afford it -- i.e. if you have enough free space
on disk for the new-file creation -- it's no waste at all: the old
file will disappear as soon as the sequential pass is done and
the new one will take the old one's name.  It's an *investment*
of temporary disk use to gain huge speed and simplicity.
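
For example, a minimal sketch of how you might drive it (the filtering
conditions below are just placeholders for whatever your real logic
is): whatever you print goes to the new file, which then takes the old
file's name when the pass is done.

import fileinput

for line in fileinput.input('big.txt', inplace=1):
    if line.find('DELETE-ME') >= 0:   # "delete" a line by not printing it
        continue
    print line,                       # lines you print go to the new file
    if line.startswith('HEADER'):     # "insert" by printing an extra line
        print 'an inserted line'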

Think of a file as an array of bytes on disk.  What does it
imply to "insert a line" somewhere in the middle, actually
operating on that disk space rather than using auxiliary
space elsewhere on the disk (given that your file does not
fit in memory)?  Think about it.  It implies inserting an
arbitrary number of bytes without damaging the following
ones.  How do you do that?

Answer: you MOVE all the following bytes forward by just
enough.  So you're reading on average half a gigabyte (the
part of your file that follows the insertion point) and
writing half a gigabyte back to disk.  For ANY insertion,
be it even seven bytes or whatever.  How many insertions
are you going to do?  A dozen?  That's about 12 gigabytes
of data you're moving.  *shudder*.
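
Concretely, a true in-place insertion has to look something like this
(a rough sketch; the helper's name and buffer size are mine) -- note
the loop that reads and rewrites the whole tail of the file:

def insert_bytes(filename, pos, newbytes, bufsize=64*1024):
    f = open(filename, 'r+b')
    f.seek(0, 2)                     # find the file's size
    size = f.tell()
    todo = size - pos                # the whole tail must move forward
    while todo > 0:
        chunk = min(bufsize, todo)
        f.seek(pos + todo - chunk)   # work backwards, chunk by chunk,
        data = f.read(chunk)         # so nothing gets clobbered too soon
        f.seek(pos + todo - chunk + len(newbytes))
        f.write(data)
        todo = todo - chunk
    f.seek(pos)
    f.write(newbytes)                # finally drop the new bytes in the gap
    f.close()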

If you do a sequential pass, writing to a new file, you're
reading 1 gig and writing 1 gig -- a total of 2 gigabytes of data
moved rather than 12.  If you do 100 insertions, it's still 2
gigabytes rather than about 100.  And so on.  You could speed your
program up by 100, 1000, or ten thousand times, depending on the
number of insertions.

Are you STILL inclined to call it a "waste" ... ?-)

Deletions are much the same.  Coding in-place insertions and
deletions is tricky and bug-prone (while doing them during a
sequential copy is trivial), but THAT might perhaps be
finessed by a refined implementation in a library... there is
no way, though, to finesse the ENORMOUS slow-down you'd incur.
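
By contrast, the sequential-copy version is just one loop -- a minimal
sketch, where should_drop and extra_lines_after are hypothetical
stand-ins for whatever your real deletion and insertion logic is:

import os

def edit_by_copying(oldname, should_drop, extra_lines_after):
    tmpname = oldname + '.tmp'
    old = open(oldname, 'r')
    new = open(tmpname, 'w')
    for line in old:
        if not should_drop(line):          # deletion: just don't copy it
            new.write(line)
        for extra in extra_lines_after(line):
            new.write(extra)               # insertion: write extra lines here
    old.close()
    new.close()
    os.rename(tmpname, oldname)            # the new file takes the old name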

Some INCREDIBLY clever buffering algorithm might perhaps be
able to DOUBLE the speed of your desired in-place-editing
approach depending on your patterns of deletions and
insertions, by using (not wasting:-) memory most cleverly.
So you'd only slow down by, say, 500 times instead of by 1000.
Do you think it would be worth the huge coding effort?


Buy a new disk.  They're cheap, VERY cheap.


Or, change your data's structure.  A flat text file does
not really make much sense when dealing with gigabytes,
particularly for data that (conceptually) needs to be updated
in place, such as yours.  *INVEST* (not a waste!-)
some disk space into structuring your data in a way that
is a closer match for your application's needs...!
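
For instance, if your "lines" are really keyed records, even the
humble standard shelve module already gives you cheap in-place updates
and deletions (a minimal sketch; the file name, keys and values are
invented):

import shelve

db = shelve.open('records')
db['key-000017'] = 'an updated record'     # rewrite just this record
if db.has_key('key-000042'):
    del db['key-000042']                   # remove just this record
db.close()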


Alex



