
shutil.copy*() use copyfileobj():
"""
while 1:
    buf = fsrc.read(length)
    if not buf:
        break
    fdst.write(buf)
"""
This allocates and frees a lot of buffers, and could be optimized with
readinto(). Unfortunately, I don't think we can change copyfileobj(),
because it might be passed objects that don't implement readinto().

By implementing it directly in copyfile() (it would probably be better to
expose it in shutil to make it available to tarfile & Co), there's a modest
improvement:

$ dd if=/dev/zero of=/tmp/foo bs=1M count=100

Without patch:
$ ./python -m timeit -s "import shutil" "shutil.copyfile('/tmp/foo', '/dev/null')"
10 loops, best of 3: 218 msec per loop

With readinto():
$ ./python -m timeit -s "import shutil" "shutil.copyfile('/tmp/foo', '/dev/null')"
10 loops, best of 3: 202 msec per loop

(I'm using /dev/null as target because my hdd is really slow; other
benchmarks are welcome, just beware that /tmp might be tmpfs.)

I've also written a dirty patch to use sendfile(). Here, the improvement is
really significant:

With sendfile():
$ ./python -m timeit -s "import shutil" "shutil.copyfile('/tmp/foo', '/dev/null')"
100 loops, best of 3: 5.39 msec per loop

Thoughts?

cf
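For reference, the readinto() idea being discussed could be sketched roughly like this (a minimal sketch, not the actual patch attached later in the thread — the function name is made up here):

```python
def copyfile_readinto(fsrc, fdst, length=16 * 1024):
    """Copy data from fsrc to fdst, reusing a single buffer.

    Instead of allocating a fresh bytes object on every read(),
    fill one preallocated bytearray and write out only the
    filled slice -- this is what avoids the alloc/free churn
    of the read()/write() loop quoted above.
    """
    buf = bytearray(length)
    view = memoryview(buf)
    while True:
        n = fsrc.readinto(buf)
        if not n:
            break
        fdst.write(view[:n])
```

This only works for objects that actually implement readinto(), which is exactly why it cannot simply replace copyfileobj().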

Am 03.03.2013 18:02, schrieb Charles-François Natali:
sendfile() is a Linux-only syscall. It's also limited to certain kinds of file descriptors; the limitations have been lifted in recent kernel versions. http://linux.die.net/man/2/sendfile TL;DR: the input fd must support mmap(). The output fd used to be restricted to socket fds; since 2.6.33, sendfile() supports any fd as output fd.

Or we could just use: if hasattr(fileobj, 'readinto') hoping that readinto() is really a readinto() implementation and not an unrelated method :-)
No, it's not Linux-only: many BSDs also have it, although not all of them support an arbitrary output file descriptor (Solaris does allow regular files too). It would be possible to catch EINVAL/EBADF and fall back to a regular copy loop.

Note that the above benchmark is really biased by writing the data to /dev/null: with a real target file, the zero-copy wouldn't bring such a large gain, because the bottleneck will really be the I/O devices (also, a read()/write() loop is more expensive in Python than in C). But I see at least two cases where it could be interesting: when reading/writing from/to a tmpfs partition, or when the source and target files are on different disks.

I'm not sure it's worth it though, that's why I'm asking here :-) (but I do think readinto() is interesting).
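The catch-EINVAL/EBADF-and-fall-back idea could be sketched like this (a sketch under the assumption that os.sendfile() is available, i.e. Python 3.3+ on a supported platform; the function name and the `fallback` parameter are invented for illustration):

```python
import errno
import os

def copyfile_sendfile(fsrc, fdst, fallback, blocksize=1024 * 1024):
    """Try zero-copy via os.sendfile(), falling back to a plain loop.

    fsrc/fdst are regular file objects opened in binary mode;
    `fallback` is any copyfileobj-style function used when
    sendfile() refuses the descriptor combination.
    """
    infd, outfd = fsrc.fileno(), fdst.fileno()
    offset = 0
    while True:
        try:
            sent = os.sendfile(outfd, infd, offset, blocksize)
        except OSError as e:
            # Unsupported fd combination on this platform/kernel:
            # fall back to a regular copy loop, but only if nothing
            # has been transferred yet.
            if offset == 0 and e.errno in (errno.EINVAL, errno.EBADF,
                                           errno.ENOSYS):
                return fallback(fsrc, fdst)
            raise
        if sent == 0:
            break  # EOF
        offset += sent
```

Only falling back when offset is still 0 avoids duplicating data that sendfile() already transferred before failing.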

On Sun, 3 Mar 2013 20:55:15 +0100 Charles-François Natali <cf.natali@gmail.com> wrote:
Can you post your benchmark's code? I could time it on a SSD.
Attached (for readinto() and sendfile()).
Ok, the readinto() version doesn't seem to make a difference here, only the sendfile() version is beneficial (and the benefits are mostly noticeable from tmpfs to /dev/null, as you point out :-)). Regards Antoine.

IMNSHO the *time* is less relevant than the fact that it uses less memory by not repeatedly making copies. In general we should use the more recent non-copying APIs when possible within the standard library, but most of that code is pretty old and has not been looked at for conversion. Any such changes are welcome in 3.4+. On Sun, Mar 3, 2013 at 11:00 AM, Antoine Pitrou <solipsis@pitrou.net> wrote:

On Sun, 3 Mar 2013 13:38:05 -0800 "Gregory P. Smith" <greg@krypto.org> wrote:
IMNSHO the *time* is less relevant than the fact that it uses less memory by not repeatedly making copies.
Well, it doesn't repeatedly make copies, it just allocates a new buffer every loop. At best, it will consume 16 KB instead of 32 KB. Regards Antoine.

On 03.03.13 19:02, Charles-François Natali wrote:
8%. Note that in real cases the difference will be significantly less. First, output to a real file requires more time than output to /dev/null. Second, you are unlikely to copy the same input file 30 times in a row: only the first time in the test do you read from disk, the other 29 times you read from the cache. Third, sources such as tarfile have several levels between user code and the disk file: BufferedIO, GzipFile, the internal tarfile wrapper. Every level adds some overhead, and in sum this will be many times larger than the creation of one bytes object.
This looks more interesting. There are other ideas for speeding up tarfile extraction. Use the dir_fd parameter (if it is available) for opening target files; it can speed up extraction of a large number of small and deeply nested files. sendfile() should only speed up extraction of large files.
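The dir_fd idea could be sketched as follows (a sketch only — the function name is invented, and real tarfile code would also have to create nested directories and restore metadata; assumes os.open supports dir_fd on the platform, which os.supports_dir_fd can be used to check):

```python
import os

def write_member(target_dir, relname, data):
    """Create `relname` inside `target_dir` relative to a directory fd.

    Opening the directory once and then opening each member relative
    to that fd avoids a full path lookup per extracted file, which is
    where the win comes from for many small, deeply nested files.
    """
    dfd = os.open(target_dir, os.O_RDONLY)  # directory fd, opened once
    try:
        fd = os.open(relname, os.O_WRONLY | os.O_CREAT | os.O_TRUNC,
                     0o644, dir_fd=dfd)
        with os.fdopen(fd, "wb") as f:
            f.write(data)
    finally:
        os.close(dfd)
```

In a real extraction loop the directory fd would of course be held open across all members rather than reopened per file.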

I know, I said it was really biased :-) The proper way to perform a cold-cache benchmark would be to run "echo 3 > /proc/sys/vm/drop_caches" before reading the file. The goal was to highlight the reallocation cost (which can vary depending on the implementation).
Not really, because like above, the extra syscalls and copy loops aren't really the bottleneck, it's still the I/O (try replacing /dev/null with an on-disk file and the gain plummets: it might be different if the source and target files are on different disks, though). Zero-copy really shines when writing data to a socket: a more interesting usage would be in ftplib & Co. cf
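The socket case mentioned above (where zero-copy really pays off, e.g. a hypothetical ftplib change) could be sketched like this, assuming os.sendfile() is available and the socket is blocking; real code would need a read()/send() fallback for platforms where the call fails:

```python
import os

def send_file_over_socket(fileobj, sock):
    """Zero-copy the contents of a regular file to a socket.

    Uses os.sendfile() so the data never passes through userspace
    buffers; fileobj must be a real file (the input fd has to
    support mmap). Returns the number of bytes sent.
    """
    infd = fileobj.fileno()
    size = os.fstat(infd).st_size
    offset = 0
    while offset < size:
        sent = os.sendfile(sock.fileno(), infd, offset, size - offset)
        if sent == 0:
            break  # peer closed the connection
        offset += sent
    return offset
```

This is the pattern that makes the tmpfs-to-socket case so much faster than a Python-level read()/write() loop.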

participants (6)
- Antoine Pitrou
- Charles-François Natali
- Christian Heimes
- Daniel Holth
- Gregory P. Smith
- Serhiy Storchaka