Re: [Twisted-Python] Send many large files with PB

Justin Mazzola Paluska <jmp@MIT.EDU> writes:
> - Should I send the files from SRC to DEST one-by-one?
That's how I would do it. If you're talking about gigabyte-sized files, the protocol overhead will be pretty minimal compared to the data being transferred. You've got a couple of objects to keep track of for each file being sent, but on the other hand it will be a lot easier to keep track of how much progress you've made (and keep the user informed) that way.
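In untested-sketch form (the remote method names "write" and "close" and the 64 KiB chunk size are arbitrary choices, not anything PB prescribes):

    # Sketch only: push one file at a time over an existing PB connection.
    from twisted.spread import pb

    class FileReceiver(pb.Referenceable):
        """Lives on DEST; writes chunks to disk as they arrive."""
        def __init__(self, path):
            self.f = open(path, "wb")

        def remote_write(self, data):
            self.f.write(data)

        def remote_close(self):
            self.f.close()

    def send_file(rref, path, chunk_size=65536):
        """Push one local file to a remote FileReceiver, chunk by chunk."""
        f = open(path, "rb")

        def send_next(_=None):
            chunk = f.read(chunk_size)
            if not chunk:
                f.close()
                return rref.callRemote("close")
            # Send the next chunk only after the previous one is accepted.
            return rref.callRemote("write", chunk).addCallback(send_next)

        return send_next()

Waiting for each chunk's Deferred before reading the next keeps memory bounded and gives you a natural hook for progress reporting.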
> - Or, is it better to use something like the tarfile module to create a stream of bytes that I stream to the other side and decode?
I would recommend this approach if you had a bunch of small files. You want to run that 'tar cf - WHAT' child against a ProcessProtocol that reacts to outReceived(data) by doing rref.callRemote("moreDataForYou", data). You'd probably want to accumulate the data into chunks of maybe 4k or so for efficiency. At the far end, your remote_moreDataForYou() call would write that data into the untarring ProcessProtocol's stdin. Take a look at doc/core/howto/process.xhtml for details on ProcessProtocols and reactor.spawnProcess.
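Concretely, something like this untested sketch; moreDataForYou comes from the description above, while the noMoreData method, the 4k threshold, and the tar command lines are arbitrary choices:

    # Sketch only: stream 'tar cf -' output over PB, untar on the far end.
    from twisted.internet import protocol, reactor
    from twisted.spread import pb

    class TarSender(protocol.ProcessProtocol):
        """SRC side: relays 'tar cf -' stdout over a PB RemoteReference."""
        def __init__(self, rref):
            self.rref = rref
            self.buf = ""

        def outReceived(self, data):
            self.buf += data
            if len(self.buf) >= 4096:
                self.rref.callRemote("moreDataForYou", self.buf)
                self.buf = ""

        def processEnded(self, reason):
            if self.buf:
                self.rref.callRemote("moreDataForYou", self.buf)
            self.rref.callRemote("noMoreData")

    class TarReceiver(pb.Referenceable):
        """DEST side: feeds incoming chunks to 'tar xf -' on its stdin."""
        def __init__(self):
            self.pp = protocol.ProcessProtocol()
            reactor.spawnProcess(self.pp, "tar", ["tar", "xf", "-"])

        def remote_moreDataForYou(self, data):
            self.pp.transport.write(data)

        def remote_noMoreData(self):
            self.pp.transport.closeStdin()

    def start_tar(rref, what):
        """Kick off 'tar cf - WHAT' on the sending side."""
        reactor.spawnProcess(TarSender(rref), "tar", ["tar", "cf", "-", what])

A production version would also want some flow control so a fast tar can't get arbitrarily far ahead of the PB connection, but this shows the shape of it.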
> - Finally, should I be doing something completely different? Normally, outside of my application, I'd just use rsync, scp, or some such.
I'd certainly investigate this method if most of the files you are sending are already in place on the far end. The bandwidth savings are worth the extra setup hassle. Is there a way to get rsync to speak over stdout/stdin instead of using a TCP socket? If so, you could spawnProcess('rsync') and proxy it to the far end over PB as with 'tar' above. Or, you could have your PB-connection-wielding process listen on a local TCP socket, tell rsync to talk directly to that port, and do a socket-level proxy over PB to the far system (a sketch of that proxy follows at the end of this message).

Also remember that scp (or rsync-over-ssh, tar|ssh, etc.) will be doing better authentication than PB, since PB is all in cleartext. Many applications don't require confidentiality, but before you switch from ssh to straight PB you should be aware of exactly what you're giving up. <shameless plug> But if you use NewPB, you get the strong authentication and confidentiality of ssh with all of the juicy RemoteReference model you've come to know and love from PB. Check out NewPB[1] today. </shameless plug>

cheers,
 -Brian

[1]: http://twistedmatrix.com/trac/wiki/NewPB
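Here is a rough, untested sketch of that socket-level proxy; the remote method names (sockOpen, sockData, sockClosed, write) and the local port are arbitrary:

    # Sketch only: accept rsync's local TCP connection and shuttle its
    # bytes over an existing PB RemoteReference.
    from twisted.internet import protocol, reactor
    from twisted.spread import pb

    class SocketWriter(pb.Referenceable):
        """Handed to the far end so it can write reply bytes to rsync."""
        def __init__(self, transport):
            self.transport = transport

        def remote_write(self, data):
            self.transport.write(data)

    class RsyncProxy(protocol.Protocol):
        """rsync connects here; everything it sends is relayed over PB."""
        def connectionMade(self):
            self.rref = self.factory.rref
            self.rref.callRemote("sockOpen", SocketWriter(self.transport))

        def dataReceived(self, data):
            self.rref.callRemote("sockData", data)

        def connectionLost(self, reason):
            self.rref.callRemote("sockClosed")

    def listen_for_rsync(rref, port=8873):
        factory = protocol.Factory()
        factory.protocol = RsyncProxy
        factory.rref = rref
        reactor.listenTCP(port, factory, interface="127.0.0.1")

The far side would implement remote_sockOpen by opening its own TCP connection to the real rsync daemon and mirroring the same calls in the other direction.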

On Tue, May 16, 2006 at 11:41:12PM -0700, Brian Warner wrote:
> Justin Mazzola Paluska <jmp@MIT.EDU> writes:
> > - Should I send the files from SRC to DEST one-by-one?
> That's how I would do it. If you're talking about gigabyte-sized files, the protocol overhead will be pretty minimal compared to the data being transferred. You've got a couple of objects to keep track of for each file being sent, but on the other hand it will be a lot easier to keep track of how much progress you've made (and keep the user informed) that way.
OK. I could also possibly stream multiple files at once with this method, which is an added bonus.
> > - Finally, should I be doing something completely different? Normally, outside of my application, I'd just use rsync, scp, or some such.
> I'd certainly investigate this method if most of the files you are sending are already in place on the far end. The bandwidth savings are worth the extra setup hassle.
For this particular job, none of the files are initially in place on the remote end, so rsync itself won't be a big win.
> Is there a way to get rsync to speak over stdout/stdin instead of using a TCP socket? If so, you could spawnProcess('rsync') and proxy it to the far end over PB as with 'tar' above. Or, you could have your PB-connection-wielding process listen on a local TCP socket, tell rsync to talk directly to that port, and do a socket-level proxy over PB to the far system.
For future reference, I think there are ways of hacking this (these statements are conjectures, I haven't actually tried them):
 - on the side pushing data, use --rsh= with a script that just takes the output of rsync and pushes it to stdout.
 - on the side receiving the data, use --server to read from stdin.
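In the same untested spirit, the receiving half might look something like this (every rsync flag and remote method name here is a guess):

    # Untested conjecture: run 'rsync --server' on the receiving side
    # and bridge its stdio over PB.
    from twisted.internet import protocol, reactor
    from twisted.spread import pb

    class RsyncProcess(protocol.ProcessProtocol):
        """Relays everything rsync --server writes back over PB."""
        def __init__(self, rref):
            self.rref = rref

        def outReceived(self, data):
            self.rref.callRemote("rsyncData", data)

    class RsyncStdin(pb.Referenceable):
        """Handed to the pushing side so it can feed rsync's stdin."""
        def __init__(self, pp):
            self.pp = pp

        def remote_feed(self, data):
            self.pp.transport.write(data)

        def remote_done(self):
            self.pp.transport.closeStdin()

    def start_rsync_server(rref, dest_dir):
        pp = RsyncProcess(rref)
        # The '--server -r . DEST' sub-flags are a guess at what
        # rsync-over-ssh runs remotely; they would need experimenting.
        reactor.spawnProcess(pp, "rsync",
                             ["rsync", "--server", "-r", ".", dest_dir])
        return RsyncStdin(pp)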
> Also remember that scp (or rsync-over-ssh, tar|ssh, etc.) will be doing better authentication than PB, since PB is all in cleartext. Many applications don't require confidentiality, but before you switch from ssh to straight PB you should be aware of exactly what you're giving up.
Our PB connections go over SSL and we have a custom auth module, so piping everything over PB wouldn't be a big loss.
> <shameless plug> But if you use NewPB, you get the strong authentication and confidentiality of ssh with all of the juicy RemoteReference model you've come to know and love from PB. Check out NewPB[1] today. </shameless plug>
I've been reading about NewPB and it might be exactly what we'll need for the next revision of our application. We're just too close to pushing out this version to switch to a new RPC method for the core of the program.

Thanks,
 --Justin