[Python-3000] iostack, second revision

Josiah Carlson jcarlson at uci.edu
Fri Sep 15 05:01:39 CEST 2006


"Anders J. Munch" <2006 at jmunch.dk> wrote:
> Josiah Carlson wrote:
>  > You were also talking about buffering writes to reduce the overhead of
>  > the underlying seeks and tells because of apparent "optimizations" you
>  > wanted to make. Here is a data integrity optimization you can make for
>  > me: flush when accessing the file non-sequentially, any other behavior
>  > could corrupt the data of users who have been relying on "seek implies
>  > flush".
> 
> Again, that's what explicit calls to flush are for.  And you can't
> violate expectations as to what the seek method does, when there's no
> seek method and no concept of a file pointer.

People who have experience using Python 2.x file objects and/or
underlying platform file handles may have come to expect "seek implies
flush".  Since you claim that offering an unbuffered version is easy,
I'll assume that one would be offered to the user as an option.
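
As a minimal sketch of what "seek implies flush" means for a buffering
layer, assuming a hypothetical wrapper class (FlushOnSeekWriter and
demo_flush_on_seek are illustrative names, not part of any proposed io
library): writes are held in a Python-level buffer and handed to the
underlying unbuffered file before any seek is performed.

```python
import os
import tempfile

class FlushOnSeekWriter(object):
    """Hypothetical wrapper: buffers writes, flushes before any seek."""
    def __init__(self, raw):
        self.raw = raw          # an unbuffered binary file object
        self.pending = []       # bytes not yet handed to the OS

    def write(self, data):
        self.pending.append(bytes(data))

    def flush(self):
        if self.pending:
            self.raw.write(b"".join(self.pending))
            self.pending = []

    def seek(self, pos, whence=os.SEEK_SET):
        self.flush()            # non-sequential access: flush first
        return self.raw.seek(pos, whence)

    def close(self):
        self.flush()
        self.raw.close()

def demo_flush_on_seek():
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        w = FlushOnSeekWriter(open(path, "r+b", buffering=0))
        w.write(b"hello world")
        before = os.path.getsize(path)   # data still buffered in Python
        w.seek(0)
        after = os.path.getsize(path)    # the seek pushed it to the OS
        w.write(b"J")
        w.close()
        with open(path, "rb") as f:
            final = f.read()
        return before, after, final
    finally:
        os.remove(path)
```

Note that this only moves data from the Python-level buffer to the
operating system; whether it has reached the disk is a separate question.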

> Sprinkling extra flushes out here and there does not help data
> integrity: Only a flush that is part of a well thought out plan to
> recover partially written data in case of a crash, will help you do
> that.  Anything less, and you're just a power failure and a disk that
> reorders writes away from unrecoverable corruption.

Indeed, whether or not extra flushes help data integrity depends on the
file structure.  But for those who have the know-how to properly
recover structured data files after a power outage, skipping flushes
as an optimization is a larger sin than actively flushing, since data may
very well have a better chance of getting to disk when you flush more
often.


>  > With that said, I'm not sure your FileBytes object is really necessary
>  > or desired for the future io library.  If people want that kind of an
>  > interface, they can use mmap (and push for the various mmap bugs/feature
>  > requests to be fixed), otherwise they should be using readable /
>  > writable / both streams, something that Tomer has been working towards.
> 
> mmap has limitations that cannot be fixed.  It takes up virtual
> memory, limiting the size of files you can work with.  You need to
> specify the size in advance (note the potential race condition in
> f=mmap.mmap(f.fileno(),os.fstat(f.fileno()))).  To what extent does it
> work over networked file systems?  If you map a file on a file system
> that is subsequently unmounted, a core dump may be the result.  All
> this assuming the operating system supports mmap at all.

Some of your concerns can be addressed with mmap + starting offset, and
length parameter of -1.  This results in being able to map arbitrary
portions of the file, as well as a Python-level race-free construction
of an mmap.  Then the FileBytes interface essentially becomes...

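import mmap

class FileBytes(object):
    def __init__(self, fname, mode='r+b'):
        self.f = open(fname, mode)
    def __getitem__(self, key):
        start, stop = self._parseposition(key)
        # start/stop are the proposed mmap arguments, not the current signature
        return mmap.mmap(self.f.fileno(), start=start, stop=stop)
    def __setitem__(self, key, value):
        # write through the mapping; "self[key] = value" here would recurse forever
        self[key][:] = value
    #_parseposition as you specify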

With a non-broken platform mmap implementation, multiple identical calls
to __getitem__ will return identical data pointers, or at least the
underlying OS will make sure that the two pointers actually point to the
same physical memory region.
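
For comparison, here is a sketch of the same idea using the mmap signature
that actually exists in current Python 3, mmap.mmap(fileno, length,
access=..., offset=...).  The names MappedFileBytes, _map and
demo_mapped_filebytes are made up for illustration, and the
_parseposition shown is an assumption (plain, non-empty slices with
step 1 only).  Unlike the sketch above, __getitem__ returns a bytes copy
rather than the mapping itself, so that each mapping can be closed
promptly.

```python
import mmap
import os
import tempfile

class MappedFileBytes(object):
    """Sketch of the FileBytes idea on top of the real mmap signature."""
    def __init__(self, fname, mode='r+b'):
        self.f = open(fname, mode)

    def _map(self, start, stop):
        # mmap offsets must be multiples of ALLOCATIONGRANULARITY, so map
        # from the aligned offset at or below start and remember the gap.
        base = start - (start % mmap.ALLOCATIONGRANULARITY)
        m = mmap.mmap(self.f.fileno(), stop - base,
                      access=mmap.ACCESS_WRITE, offset=base)
        return m, start - base

    def __getitem__(self, key):
        start, stop = self._parseposition(key)
        m, skip = self._map(start, stop)
        try:
            return m[skip:skip + (stop - start)]
        finally:
            m.close()

    def __setitem__(self, key, value):
        start, stop = self._parseposition(key)
        if len(value) != stop - start:
            raise ValueError("replacement must keep the slice length")
        m, skip = self._map(start, stop)
        try:
            m[skip:skip + len(value)] = value
            m.flush()
        finally:
            m.close()

    def _parseposition(self, key):
        # Assumption: only plain, non-empty slices with step 1 are supported.
        size = os.fstat(self.f.fileno()).st_size
        start, stop, step = key.indices(size)
        if step != 1 or stop <= start:
            raise ValueError("need a non-empty slice with step 1")
        return start, stop

    def close(self):
        self.f.close()

def demo_mapped_filebytes():
    fd, path = tempfile.mkstemp()
    os.write(fd, b"0123456789")
    os.close(fd)
    fb = MappedFileBytes(path)
    try:
        first = fb[2:5]
        fb[2:5] = b"abc"
        return first, fb[0:10]
    finally:
        fb.close()
        os.remove(path)
```

Creating and closing a mapping per access keeps the sketch short; a real
implementation would more likely cache mappings, which is where the
"identical calls share the same physical memory" behaviour matters.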


NFS issues are a pain.  These, and the lack of mmap support on smaller or
less developed platforms, may be the only situations where not using
mmaps could offer superior failure conditions.


> mmap is for use where speed is paramount, and pretty much only then.
> The reason people don't use sequence-based file interfaces as much is
> that robust, portable, practical sequence-based file interfaces aren't
> available.  Probably most people who would have liked a sequence
> interface do what I do: slurp up the whole file in one read and deal
> with the string.  Or use mmap and live with the fragility.

I've found the opposite to be true.  Every time I've wanted a
sequence-based file interface, I've used an mmap, because it is faster and
far more reliable for all the use-cases I've been confronted with (if your
process crashes, all of your writes are still flushed).  But I suppose I
spend time with 512M and 1G mmaps, for which constant slicing of strings
and/or a file-based interface is about 100 times too slow (and useless
when a C extension wants to write to the file; mmaps do this for free).


 - Josiah


