[Twisted-Python] Send many large files with PB
Hi,

I'm using PB in a distributed application that has suddenly grown the requirement to copy directories of files between the servers. From lurking on the mailing list archives, it seems that the best way to move large amounts of data between Twisted servers is to use a twisted.spread.util.Pager subclass to pipe the data. Between that information and the "How to use twisted pb pager" [1] document, I'm probably good to go on how to transfer large amounts of data.

However, I wanted to step back and ask what's the best way to actually package the files that I'm going to send. To make things concrete, suppose I need to send data from SRC to DEST and that SRC has a PB RemoteReference to DEST. Also, most files will be huge (gigabytes) and nested in directories.

- Should I send the files from SRC to DEST one by one? That is, make a new PB request for a new Pager reference for each file, stream the file using a twisted.spread.util.FilePager instance, then repeat with the next file, and so on. This has the advantage that I think I can do it fairly easily, but the disadvantage of requiring many PB calls (with the associated bookkeeping in my application).

- Or is it better to use something like the tarfile module to create a stream of bytes that I send to the other side and decode there? There's something appealing about using tarfile--it's like the oft-seen "tar -cf - dir | ssh user@host 'tar -xf -'" way of transferring files. Plus, the tarfile module takes care of making directories and the like for me. This method has the advantage of a single PB call, but the disadvantage that I can't quite figure out how to use tarfile with Twisted. The tarfile module requires a file-like object to stream to or from. I don't think the naive approach of just adding a write() method to a Pager or a read() method to a CallbackPageCollector will work without taking up all of the memory in my system or blocking in some way.

- Finally, should I be doing something completely different? Normally, outside of my application, I'd just use rsync, scp, or some such. However, the users of this application don't know how to use these tools, and I can't spawn these programs without getting into authentication issues between the machines. Doing this within Twisted seems like a good idea because the machines are already authenticated to each other through PB, but I could be wrong.

I apologize if this is rambling. I've been thinking about this for a while and am now a bit bleary-eyed.

--Justin

[1] http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/457670
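
To make the tarfile option concrete, here is a minimal sketch of the memory concern, using only the standard library. It relies on tarfile's stream modes ("w|" and "r|"), which need nothing more than write() or read() on the file object. The QueueWriter, QueueReader, pack, unpack, and copy_tree names are hypothetical, and the bounded queue is only a stand-in for whatever actually carries the chunks between SRC and DEST (for example, PB pages); this is not how Pager or CallbackPageCollector work internally.

```python
import os
import queue
import tarfile
import tempfile
import threading


class QueueWriter:
    """File-like object with write(); each chunk goes onto a bounded queue."""

    def __init__(self, chunks):
        self.chunks = chunks

    def write(self, data):
        # put() blocks while the queue is full, so the sender can never
        # run far ahead of the receiver and memory use stays bounded.
        self.chunks.put(bytes(data))
        return len(data)


class QueueReader:
    """File-like object with read(); pulls chunks off the queue on demand."""

    def __init__(self, chunks):
        self.chunks = chunks
        self.buffer = b""
        self.eof = False

    def read(self, size):
        while len(self.buffer) < size and not self.eof:
            chunk = self.chunks.get()
            if chunk is None:  # sentinel: the sender has finished
                self.eof = True
            else:
                self.buffer += chunk
        data, self.buffer = self.buffer[:size], self.buffer[size:]
        return data


def pack(src_dir, chunks):
    """Sender side: stream src_dir as a tar archive into the queue."""
    with tarfile.open(fileobj=QueueWriter(chunks), mode="w|") as tar:
        tar.add(src_dir, arcname=".")
    chunks.put(None)


def unpack(dest_dir, chunks):
    """Receiver side: extract the tar stream as it arrives."""
    with tarfile.open(fileobj=QueueReader(chunks), mode="r|") as tar:
        tar.extractall(dest_dir)


def copy_tree(src_dir, dest_dir, max_chunks=8):
    """Copy a directory tree through a queue holding at most max_chunks chunks."""
    chunks = queue.Queue(maxsize=max_chunks)
    sender = threading.Thread(target=pack, args=(src_dir, chunks))
    sender.start()
    unpack(dest_dir, chunks)
    sender.join()
```

Both ends block here (put() on a full queue, get() on an empty one), so inside Twisted they would presumably have to run off the reactor thread, for instance via twisted.internet.threads.deferToThread, with the queue replaced by the PB transfer. The sketch also has no error handling: if the receiver fails, the sender stays blocked.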