[Twisted-Python] Send many large files with PB

Hi,
I'm using PB in a distributed application that has suddenly grown the requirement to copy directories of files between the servers. I'm probably good to go on how to transfer large amounts of data. However, I wanted to step back and ask what's the best way to actually package the files that I'm going to send.
To make things concrete, suppose I need to send data from SRC to DEST and that SRC has a PB RemoteReference to DEST. Also, most files will be huge (gigabytes) and nested in directories.
- Should I send the files from SRC to DEST one by one? That is, make a new PB request for a new Pager reference for each file, stream the file using a twisted.spread.util.FilePager instance, then repeat with the next file, and so on. This has the advantage that I think I can do it fairly easily, but the disadvantage of requiring many PB calls (with the associated bookkeeping in my application).
- Or is it better to use something like the tarfile module to create a stream of bytes that I send to the other side and decode there? There's something appealing about using tarfile: it's like the oft-seen "tar -cf - . | ssh user@host 'tar -xf -'" way of transferring files. Plus, the tarfile module takes care of making directories and the like for me. This method has the advantage of a single PB call, but the disadvantage that I can't quite figure out how to use tarfile with Twisted. The tarfile module requires a file-like object to stream to or from. I don't think the naive approach of just adding a write() method to a Pager or a read() method to a CallbackPageCollector will work without taking up all of the memory in my system or blocking in some way.
- Finally, should I be doing something completely different? Normally, outside of my application, I'd just use rsync, scp, or some such. However, the users of this application don't know how to use these tools, and I can't spawn these programs without getting into authentication issues between the machines.
Doing this within Twisted seems like a good idea because the machines are already authenticated to each other through PB, but I could be wrong.
I apologize if this is rambling. I've been thinking about this for a while and am now a bit bleary-eyed.
--Justin
[1] http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/457670
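The tarfile option described above can be sketched with the standard library alone. tarfile's stream modes ("w|" and "r|") only need an object with write() or read(), so a bounded queue between the two halves keeps memory use fixed. This is a minimal sketch, not a Twisted integration: ChunkSink, QueueReader, pack_tree, unpack_stream and copy_tree are hypothetical names, both halves run in one process here, and in the real application each chunk would travel across the PB connection instead of through a local queue. Both halves block, so under Twisted they would presumably have to run outside the reactor thread (for example with twisted.internet.threads.deferToThread).

```python
import queue
import tarfile
import threading


class ChunkSink:
    """Write-only file-like object that hands each chunk to a callback."""

    def __init__(self, on_chunk):
        self.on_chunk = on_chunk

    def write(self, data):
        if data:
            self.on_chunk(bytes(data))
        return len(data)


class QueueReader:
    """Read-only file-like object fed by a queue; None marks end of stream."""

    def __init__(self, chunks):
        self.chunks = chunks
        self.buf = b""
        self.eof = False

    def read(self, n=-1):
        # Block until n bytes are buffered or the producer signals the end.
        while not self.eof and (n < 0 or len(self.buf) < n):
            chunk = self.chunks.get()
            if chunk is None:
                self.eof = True
            else:
                self.buf += chunk
        if n < 0:
            n = len(self.buf)
        data, self.buf = self.buf[:n], self.buf[n:]
        return data


def pack_tree(src_dir, on_chunk):
    """Stream src_dir as an uncompressed tar archive to on_chunk."""
    with tarfile.open(fileobj=ChunkSink(on_chunk), mode="w|") as tar:
        tar.add(src_dir, arcname=".")


def unpack_stream(readable, dest_dir):
    """Extract a tar stream from any object with a blocking read(n)."""
    with tarfile.open(fileobj=readable, mode="r|") as tar:
        tar.extractall(dest_dir)


def copy_tree(src_dir, dest_dir, max_chunks=8):
    """Copy a tree through a bounded queue (at most max_chunks in flight)."""
    chunks = queue.Queue(maxsize=max_chunks)

    def produce():
        try:
            pack_tree(src_dir, chunks.put)  # put() blocks when the queue is full
        finally:
            chunks.put(None)

    producer = threading.Thread(target=produce)
    producer.start()
    reader = QueueReader(chunks)
    try:
        unpack_stream(reader, dest_dir)
    finally:
        while reader.read(65536):  # drain leftovers so the producer can exit
            pass
        producer.join()
```

The bounded queue is what provides backpressure: the packing side stalls as soon as max_chunks blocks are waiting, so a multi-gigabyte file never has to fit in memory.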

Justin Mazzola Paluska wrote:
In my experience, sending big files over PB takes way too much time. This is due to the serialization-deserialization process involved. Paging avoids blocking, which is good, but it still takes much longer than sending the files as-is. At the very least, optimize serialization by enabling cBanana: uncomment lines 311-318 in twisted/spread/banana.py. Why are they commented out? See http://twistedmatrix.com/pipermail/twisted-python/2004-December/009158.html
You could send the files over an HTTP connection, avoiding the serialization overhead. Setting up HTTP clients and servers is very easy in Twisted, as you surely know.
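As a rough illustration of the HTTP suggestion, the sketch below serves a directory on SRC and pulls one file to DEST in fixed-size chunks, so the bytes go over the socket without any PB serialization. It uses only the Python standard library to stay self-contained; it is not Twisted code, and serve_directory and fetch are hypothetical helper names. In a Twisted application the same role could be played by twisted.web (a Site wrapping a static File resource on one side and an HTTP client on the other).

```python
import functools
import http.server
import shutil
import threading
import urllib.request


def serve_directory(root, host="127.0.0.1", port=0):
    """Serve the files under root over HTTP in a background thread.

    port=0 lets the OS pick a free port; read it back from
    server.server_address[1].
    """
    handler = functools.partial(
        http.server.SimpleHTTPRequestHandler, directory=root
    )
    server = http.server.ThreadingHTTPServer((host, port), handler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def fetch(url, dest_path, chunk_size=1 << 20):
    """Download url to dest_path, holding at most chunk_size bytes in memory."""
    # An empty ProxyHandler bypasses any proxy configured in the environment,
    # on the assumption that SRC and DEST talk to each other directly.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    with opener.open(url) as resp, open(dest_path, "wb") as out:
        shutil.copyfileobj(resp, out, chunk_size)
```

Note that this plain server does no authentication or encryption at all; it only shows the data path.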
-- Nicola Larosa - http://www.tekNico.net/ If you know you can love work, you're in the home stretch, and if you know what work you love, you're practically there. -- Paul Graham, January 2006

On Wed, May 17, 2006 at 08:41:21AM +0200, Nicola Larosa wrote:
How bad is the slowdown? Or, to ask the question another way, how much CPU will the process actually take? I ask because the machines in question are being used as file servers for streaming applications like video and whatnot, so having Twisted suck up all of the CPU may disrupt the other streams. (We're not using Twisted to actually serve the files, but rather as the application framework for various web- and GUI-based management clients. The copy-data feature would be the first time we're moving lots of data with Twisted.)
Indeed! Using HTTP is appealing because, from an efficiency standpoint, it is closer to just stuffing bits into a socket. Is there a way of passing a RemoteReference to an HTTP server? Or is the best thing to do just to use PB to send a URL to the DEST server? --Justin

Nicola Larosa:
Justin Mazzola Paluska wrote:
> How bad is the slowdown? Or, to ask the question another way, how much CPU will the process actually take?
100% CPU for all the time it takes. Serialization is CPU- and memory-intensive.
> Is there a way of passing a RemoteReference to an HTTP server?
I don't think so. There's no overlap that I know of between PB and HTTP.
> Or is the best thing to do just use the PB to send a URL to the DEST server?
That's what I was hinting at, yes. Of course you should separately take care of any required authentication, authorization and encryption on the HTTP connection.
-- Nicola Larosa - http://www.tekNico.net/
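One possible way to handle the authorization part when a URL is handed over PB is to have SRC sign the URL with a secret both machines already share, and have the HTTP side reject anything unsigned or expired. The sketch below is an assumption about how that could look, not something proposed in the thread: sign_url and verify_url are hypothetical names, the shared secret is assumed to be distributed out of band (for example over the already-authenticated PB link), and the path is assumed to be URL-quoted already. It covers authorization only; encryption would still need TLS or similar.

```python
import hashlib
import hmac
import time
from urllib.parse import parse_qs, urlencode, urlsplit


def _signature(path, expires, secret):
    """HMAC-SHA256 over the path and the expiry timestamp."""
    message = ("%s|%d" % (path, expires)).encode()
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def sign_url(base_url, path, secret, ttl=300, now=None):
    """Return base_url + path with an expiry time and signature appended."""
    current = time.time() if now is None else now
    expires = int(current + ttl)
    query = urlencode({"expires": expires, "sig": _signature(path, expires, secret)})
    return "%s%s?%s" % (base_url, path, query)


def verify_url(url, secret, now=None):
    """Check that url carries a valid, unexpired signature for its path."""
    parts = urlsplit(url)
    query = parse_qs(parts.query)
    try:
        expires = int(query["expires"][0])
        sig = query["sig"][0]
    except (KeyError, ValueError):
        return False  # missing or malformed parameters
    current = time.time() if now is None else now
    if current > expires:
        return False  # link has expired
    expected = _signature(parts.path, expires, secret)
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(sig.encode(), expected.encode())
```

SRC would call sign_url and pass the result to DEST through a PB remote method; the HTTP handler on SRC would call verify_url on each incoming request before serving the file.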
