[Twisted-Python] Uploading multiple files using ftpclient in Twisted
Dear twisted-python expert,

I am a build engineer and I need to upload binaries to an FTP server. There are about 30 binary files, and each file is about 30M. The current script uploads using ftplib and takes about 1 hour. I want to change this script to use Twisted's asynchronous functions.

I thought that if I used Twisted's asynchronous functions as follows, the file uploads would be executed in parallel. But they were executed sequentially: uploading the second file starts after the first file upload completes.

Could you check what is wrong in my source code? Or am I misunderstanding asynchronous functions?

    for file in fileList:
        ftpClient.cwd("/tmp").addCallbacks(uploadFiles, fail, callbackArgs=(file, ftpClient))

Thanks,
Jaepyoung

Full source code:

    def filesucess(ab):
        print ab

    def uploadFiles(result, file, ftpClient):
        d1, d2 = ftpClient.storeFile('/tmp/' + file)
        d1.addCallback(cbStore, file).addErrback(fileTransferFail)
        d1.addCallback(filesucess)
        return d2

    def fileTransferFail(failure):
        failure.printTraceback()
        reactor.stop()

    def showBuffer(result, bufferProtocol):
        print 'Got data:'
        print bufferProtocol.buffer.getvalue()

    class Options(usage.Options):
        optParameters = [['host', 'h', 'localhost'],
                         ['port', 'p', 21],
                         ['username', 'u', 'user'],
                         ['password', None, 'password'],
                         ['passive', None, 0],
                         ['debug', 'd', 1],
                         ]

    def cbStore(consumer, filename):
        fs = FileSender()
        print filename + " cbstror"
        d = fs.beginFileTransfer(open(filename, 'r'), consumer)
        d.addCallback(lambda _: consumer.finish()).addErrback(fileTransferFail)
        return d

    def run():
        # Get config
        config = Options()
        config.parseOptions()
        config.opts['port'] = int(config.opts['port'])
        config.opts['passive'] = int(config.opts['passive'])
        config.opts['debug'] = int(config.opts['debug'])

        # Create the client
        FTPClient.debug = config.opts['debug']
        creator = ClientCreator(reactor, FTPClient, config.opts['username'],
                                config.opts['password'], passive=config.opts['passive'])
        print config.opts['password']
        creator.connectTCP(config.opts['host'], config.opts['port'], timeout=10).addCallback(connectionMade).addErrback(connectionFailed)
        reactor.run()

    def connectionFailed(f):
        print "Connection Failed:", f
        reactor.stop()

    def connectionMade(ftpClient):
        # Get the current working directory
        ftpClient.pwd().addCallbacks(success, fail)
        fileList = os.listdir('./temp')
        for file in fileList:
            ftpClient.cwd("/tmp").addCallbacks(uploadFiles, fail, callbackArgs=(file, ftpClient))
        print "connectionmade"
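For scale, the figures in the post can be turned into an average transfer rate. This is only a back-of-the-envelope sketch: it assumes "30M" means 30 megabytes per file and that the hour is spent almost entirely on the transfers themselves. `average_throughput` is an illustrative helper, not part of the original script.

```python
def average_throughput(file_count, file_size_mb, elapsed_seconds):
    """Return (MB/s, Mbit/s) averaged over the whole upload run."""
    total_mb = file_count * file_size_mb      # total payload in megabytes
    mb_per_s = total_mb / elapsed_seconds     # average megabytes per second
    return mb_per_s, mb_per_s * 8             # 8 bits per byte


# Figures from the post: about 30 files, about 30 MB each, about 1 hour.
mb_per_s, mbit_per_s = average_throughput(30, 30, 3600)
print("%.2f MB/s (%.1f Mbit/s)" % (mb_per_s, mbit_per_s))
# prints: 0.25 MB/s (2.0 Mbit/s)
```

Comparing that average against the expected link speed to the server gives a first hint of how much headroom there is.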
Jaepyoung Kim <jaepyoung.kim@gmail.com> writes:
The current script uploads using ftplib and takes about 1 hour. I want to change this script to use Twisted's asynchronous functions. I thought that if I used Twisted's asynchronous functions as follows, the file uploads would be executed in parallel. But they were executed sequentially: uploading the second file starts after the first file upload completes. Could you check what is wrong in my source code? Or am I misunderstanding asynchronous functions?
I'm pretty sure you'll need separate connections to an FTP server to achieve parallel transfers, regardless of how you write the client. At least as long as you stick with regular get/put commands. So while using a Twisted approach can enable you to manage those parallel streams pretty easily, you'll need distinct connections for each transfer and manage which file transfer is using which connection in your code.

Essentially a store or fetch FTP operation initiates a transfer over the dedicated data channel, so that channel is in use until the transfer completes or is aborted. The data on the data channel is neither encapsulated nor multiplexed in any way, so you can only have a single transfer using the data channel at once. Passive transfers do create new data channels, but the FTP protocol specifically says a server needs to stop accepting connections and shut down any open connections on old passive ports once a new passive request is received, so you're still limited to one at a time.

Thus, your callbacks for each store operation will only fire when the store has completed, and you'll only be able to initiate the next store request at that point, since it's only then that the channel to the server is free to transfer another file.

I believe some servers have implemented custom extensions to implement parallel operations at a finer-grained level than a file, but I don't think they're commonly implemented in FTP libraries (nor in servers commonly in use).

What I'd suggest, in terms of your code, is to instantiate a pool of FTPClients to the same server, initiate transfers on them in parallel and then, as one completes, use it to pick up the next file. You'll need to handle the distribution of files amongst the pool of clients yourself.

Is there any particular reason you expect this to yield an improvement in overall time? Unless you're transferring very large numbers of files that are very small compared to the bandwidth*latency of your network connection to the server (which doesn't sound like the case here), the overhead of the protocol itself will be quite small, and your bottleneck is going to be either the network throughput or the slower of the disk I/O on either end.

Neither of those bottlenecks will likely be improved by doing multiple transfers in parallel, and in fact your total time can worsen if the prior bottleneck was the disk I/O, since you'll have competing operations for the disks as opposed to simple sequential access. Or you may find that you get a very marginal benefit at the expense of code that is much more complicated to maintain.

You might grab an existing FTP client that supports parallel transfers and use it to run some tests before trying to re-implement things yourself. There should be several options, but for example, I believe FileZilla supports it under Windows, or lftp under Linux.

-- David
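The pool suggestion above can be sketched roughly as follows. This is an illustration under stated assumptions, not the Twisted implementation: to stay self-contained it uses the standard library's ftplib with one thread per connection instead of Twisted's FTPClient. The names `connect_ftp`, `upload_all`, `make_client` and `pool_size` are hypothetical. Each worker opens its own control connection and picks up the next file from a shared queue as soon as its previous upload finishes.

```python
import ftplib
import os
import queue
import threading


def connect_ftp(host, user, password):
    """Open one FTP control connection (hypothetical helper)."""
    ftp = ftplib.FTP(host, timeout=10)
    ftp.login(user, password)
    return ftp


def upload_all(paths, make_client, pool_size=4, remote_dir="/tmp"):
    """Upload every path using pool_size independent connections.

    make_client is a zero-argument callable returning an object with the
    ftplib.FTP methods cwd(), storbinary() and quit().
    Returns a list of (path, exception) pairs for failed uploads; a worker
    that cannot connect is recorded as (None, exception). Files left in the
    queue because no worker could connect are not retried in this sketch.
    """
    pending = queue.Queue()
    for path in paths:
        pending.put(path)

    failures = []
    lock = threading.Lock()  # guards the shared failures list

    def worker():
        try:
            client = make_client()   # one connection per worker
            client.cwd(remote_dir)
        except Exception as exc:
            with lock:
                failures.append((None, exc))
            return
        try:
            while True:
                try:
                    path = pending.get_nowait()  # next file, if any
                except queue.Empty:
                    break
                try:
                    # Binary mode, since the files are binaries.
                    with open(path, "rb") as fh:
                        client.storbinary("STOR " + os.path.basename(path), fh)
                except Exception as exc:
                    with lock:
                        failures.append((path, exc))
        finally:
            client.quit()

    threads = [threading.Thread(target=worker) for _ in range(pool_size)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return failures
```

A call such as `upload_all(paths, lambda: connect_ftp("ftp.example.com", "user", "password"), pool_size=4)` would then keep up to four uploads in flight; the host and credentials are placeholders. Whether that is any faster than a single connection still depends on where the bottleneck is, as discussed above.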
David,

Thank you very much for your excellent answer. I don't think disk I/O will be a bottleneck. There are four servers which share disks. I ran the script on separate servers and this reduced the upload time a lot. After I saw this performance improvement, I started changing the script. I will try your suggestion.

Thanks,
Jaepyoung

On Sat, Jul 10, 2010 at 3:56 PM, David Bolen <db3l.net@gmail.com> wrote:
-- Jaepyoung Kim (Cellular phone) 1-310-848-7774