Atomic file.get(offset, length)

I wish Python binary file objects had an atomic seek-read method, so I wouldn't have to perform my own locking everywhere to prevent other threads from moving the file pointer between seek and read. Is this something that can be bubbled up from the underlying platform? I think the Linux C equivalent is pread. I also think Java has something like this but can't find a reference now. Or does this exist and I missed it? (On a mmap file this is trivial, of course.) Has this been discussed before? Cheers, Matt

On 7/21/2012 2:59 PM, Matt Chaput wrote:
If you are reading a file from multiple threads, I suggest you write your own seek_and_read_with_locks function that does exactly what you need in one place. Or add a .readx method to a subclass.
> Is this something that can be bubbled up from the underlying platform? I think the Linux C equivalent is pread.
If there is a standard POSIX function that is not yet wrapped in os, you can propose its addition. But do some research first to see how widespread and actually standardized it is. -- Terry Jan Reedy
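A minimal sketch of the lock-based helper suggested above, so the seek and the read happen as one step. The class name LockedReader is hypothetical; only the method name seek_and_read_with_locks comes from the message, and the single threading.Lock per file object is an assumption about how one might implement it.

```python
import threading


class LockedReader:
    """Wrap a binary file object so seek+read run under one lock."""

    def __init__(self, fileobj):
        self._file = fileobj
        self._lock = threading.Lock()

    def seek_and_read_with_locks(self, offset, length):
        # Holding the lock keeps other threads that go through this
        # wrapper from moving the file pointer between seek() and read().
        with self._lock:
            self._file.seek(offset)
            return self._file.read(length)
```

This only protects callers that go through the wrapper; any code that touches the underlying file object directly can still move the file pointer.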

On Sat, Jul 21, 2012 at 12:35 PM, Terry Reedy <tjreedy@udel.edu> wrote:
"man pread" on OS/X suggests it exists there too. I presume the use case is to have a large data file open for reading by multiple threads. This is a reasonable use case and it makes some sense to extend our binary readable streams (buffered and unbuffered) with an API for this purpose. However, it's probably just as efficient to have a separate open stream per thread -- I doubt that open file descriptors are scarcer resources than threads, and I presume the kernel will happily share any buffering it does on behalf of multiple open files referencing the same file. If you're worried about the buffer space, the default buffer size is 8K, which is hardly worth mentioning compared to the default thread stack allocation. Depending on your use case you may get away with an unbuffered stream just fine. This approach seems better than implementing something using locks (since the locks create contention that is not inherent in the problem) and is available right now, without waiting for Python 3.4 to be released... -- --Guido van Rossum (python.org/~guido)

On Sun, 22 Jul 2012 15:36:44 -0700 Guido van Rossum <guido@python.org> wrote:
It doesn't. I guess we could add an "offset" keyword-only argument to read() and write(), but then we need to provide a Windows implementation as well (it seems using overlapped I/O with ReadFile() / WriteFile() could make it possible). Also, I'm not sure it makes sense for buffered I/O, or only unbuffered. Regards Antoine. -- Software development and contracting: http://pro.pitrou.net

On Sun, Jul 22, 2012 at 4:47 PM, Antoine Pitrou <solipsis@pitrou.net> wrote:
Given that the use case is to avoid race conditions when more than one thread is doing random-access reads on the same open file, I think it makes some sense to implement it for both buffered and unbuffered streams -- and even for text streams, since those support seek() as well, so the race condition exists for those too. But note that the pread() man page (at least the one I checked :-) specifies that pread() doesn't affect the file pointer. So I suppose it should also not affect the buffer. That may make it hard to implement it for text streams (which IIRC rely quite heavily on buffering for their implementation), but it should be easy for buffered streams: it should just be passed on to the underlying unbuffered stream. (For those jumping in the middle of the thread: I know it's past the feature freeze, so these considerations are for 3.4. Also, os.pread() is in 3.3.) -- --Guido van Rossum (python.org/~guido)
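A short sketch of what the os.pread() route can look like at the file-descriptor level, on platforms that provide it (Unix, Python 3.3+). The helper name pread_exact is hypothetical; the loop is there because a single os.pread() call may return fewer bytes than requested.

```python
import os


def pread_exact(fd, offset, length):
    """Read up to `length` bytes at `offset` without moving fd's position.

    Returns fewer bytes only if end of file is reached.
    """
    chunks = []
    while length > 0:
        data = os.pread(fd, length, offset)
        if not data:  # end of file
            break
        chunks.append(data)
        offset += len(data)
        length -= len(data)
    return b"".join(chunks)
```

Because the descriptor's own file position is never touched, several threads can call this on the same fd without a lock. It bypasses any Python-level buffer, so mixing it with buffered writes on the same file object needs care.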

On Sun, 22 Jul 2012 17:16:57 -0700 Guido van Rossum <guido@python.org> wrote:
Indeed, it should not affect the buffer. That's why I'm questioning the addition of this feature to buffered streams (whose whole point is their implicit buffer management). Also, there are implementation subtleties when e.g. reading from an area which overlaps the current buffer :-) As you pointed out, I think a reasonable solution to the race condition problem is to use several file descriptors. It may not work so well if you also write to the file, though. Regards Antoine. -- Software development and contracting: http://pro.pitrou.net
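The several-file-descriptors approach (one open stream per thread) could be sketched roughly as follows. The class name PerThreadReader and its methods are hypothetical; threading.local is used here as one possible way to give each thread its own stream, not something the thread prescribes.

```python
import threading


class PerThreadReader:
    """Give each thread its own read-only stream on the same path."""

    def __init__(self, path, buffering=-1):
        self._path = path
        self._buffering = buffering  # pass 0 for an unbuffered stream
        self._local = threading.local()

    def _stream(self):
        # Lazily open one stream per thread; each has its own file
        # pointer, so no locking is needed around seek() + read().
        f = getattr(self._local, "file", None)
        if f is None:
            f = open(self._path, "rb", buffering=self._buffering)
            self._local.file = f
        return f

    def get(self, offset, length):
        f = self._stream()
        f.seek(offset)
        return f.read(length)

    def close_current_thread(self):
        # Each thread is responsible for closing its own stream.
        f = getattr(self._local, "file", None)
        if f is not None:
            f.close()
            self._local.file = None
```

With buffering=0 a read() may return fewer bytes than requested, so callers that need an exact length would have to loop.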

On Mon, Jul 23, 2012 at 6:00 PM, Guido van Rossum <guido@python.org> wrote:
> If you write and read a file from multiple threads you're crazy.
Perhaps, though I could imagine a fine-grained-locking DB doing this with constant sized data structures. Though that might be a good time to pull this one out: http://en.wikipedia.org/wiki/Readers%E2%80%93writer_lock
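For reference, a readers-writer lock of the kind linked above can be sketched in a few lines. This is a simple reader-preferring variant built on threading.Condition (writers can starve under constant read load); the class and method names are hypothetical, and the standard library does not ship such a lock.

```python
import threading


class ReadWriteLock:
    """Many concurrent readers, or one exclusive writer."""

    def __init__(self):
        self._cond = threading.Condition()
        self._readers = 0       # number of active readers
        self._writing = False   # True while a writer holds the lock

    def acquire_read(self):
        with self._cond:
            while self._writing:
                self._cond.wait()
            self._readers += 1

    def release_read(self):
        with self._cond:
            self._readers -= 1
            if self._readers == 0:
                self._cond.notify_all()  # wake a waiting writer

    def acquire_write(self):
        with self._cond:
            while self._writing or self._readers:
                self._cond.wait()
            self._writing = True

    def release_write(self):
        with self._cond:
            self._writing = False
            self._cond.notify_all()  # wake readers and writers
```

Readers would wrap their positional reads in acquire_read()/release_read(), and a writer would wrap its update in acquire_write()/release_write().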

On 21Jul2012 13:35, Guido van Rossum <guido@python.org> wrote:
| On Sat, Jul 21, 2012 at 12:35 PM, Terry Reedy <tjreedy@udel.edu> wrote:
| > On 7/21/2012 2:59 PM, Matt Chaput wrote:
| >> I wish Python binary file objects had an atomic seek-read method, so
| >> I wouldn't have to perform my own locking everywhere to prevent other
| >> threads from moving the file pointer between seek and read.
[...]
| >> Is this something that can be bubbled up from the underlying
| >> platform? I think the Linux C equivalent is pread.
[...]
| "man pread" on OS/X suggests it exists there too. I presume the use
| case is to have a large data file open for reading by multiple
| threads. This is a reasonable use case and it makes some sense to
| extend our binary readable streams (buffered and unbuffered) with an
| API for this purpose.

On most Linux boxen you can say:

  man 3p pread

which will show you the POSIX man page, if it exists. And it does! So pread will exist on pretty much every UNIX platform, and I'd be amazed if it wasn't on Windows. In fact, it remarks that pread appeared in SysVr4, which is quite old.

| However, it's probably just as efficient to just have a separate open
| stream per thread

It doubles the system call count per read (if pread is a system call, which it usually will be; it is on Linux and MacOSX, and is hard to implement otherwise without an annoying and slow locking scheme concealed inside the C library).

I'd be +1 for adding pread and pwrite to the os module. It seems reasonable and quite useful and should work on most platforms.

Cheers,
-- 
Cameron Simpson <cs@zip.com.au>

Rimmer: It will be happened; it shall be going to be happening; it will be was an event that could will have been taken place in the future. - Red Dwarf, _Future Echoes_
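As an illustration of pread and pwrite used together, here is a rough sketch of positional access to fixed-size records through os.pwrite() and os.pread() (both available on Unix from Python 3.3). RECORD_SIZE, write_record and read_record are hypothetical names chosen for the example.

```python
import os

RECORD_SIZE = 16  # hypothetical fixed record width, in bytes


def write_record(fd, index, payload):
    """Write one zero-padded record at its slot without seeking."""
    if len(payload) > RECORD_SIZE:
        raise ValueError("payload longer than RECORD_SIZE")
    data = payload.ljust(RECORD_SIZE, b"\0")
    offset = index * RECORD_SIZE
    written = 0
    # os.pwrite() returns the number of bytes written, which may be
    # less than requested, so loop until the record is complete.
    while written < len(data):
        written += os.pwrite(fd, data[written:], offset + written)


def read_record(fd, index):
    """Read one record; may come back short at end of file."""
    return os.pread(fd, RECORD_SIZE, index * RECORD_SIZE)
```

Each call is a single system call per record in the common case, compared with a seek followed by a read or write. The descriptor should not be opened with O_APPEND, since on Linux pwrite() then appends regardless of the offset.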

participants (8)
- Antoine Pitrou
- Cameron Simpson
- Guido van Rossum
- Mark Lawrence
- Matt Chaput
- Terry Reedy
- Victor Stinner
- Yuval Greenfield