[Twisted-Python] Memory leak//problem in twisted write procedures

I believe I have identified a memory problem in twisted. A quick background.. I've been developing a program that has to use mssql for reasons beyond my control. Due to the chaotic nature of mssql tds implementations, I created a program to pipe queries through to a windows box. In the beginning, I was using asyn* for my communications, however asyn* was very inefficient when writing large amounts of data as after each write the string would be spliced to remove the data written. I then moved to twisted to use it's far more efficient buffer implementation, and the performance problems went away. However, I have since been hit with an apparent memory leak. After reading through the twisted source (specifically the write routines in twisted/internet/abstract.py) I believe I have come across the problem. *** the write implementation continually increases it's write databuffer, only freeing up memory when the data buffer is all sent However, successive write operations append data to the data buffer.. So if a program is able to keep the data buffer from ever completely emptying (which mine is) then the data buffer will forever grow, resulting in a memory problem. I have added some output to print out the length of the data buffer and the offset, and after only 10 minutes of running the numbers are already offset == 76906496 dataBuffer len == 120768384 *** Now, reading through the source to fix this problem, the fastest solution (requiring the least change to the existing code) would be to splice//reduce the size of the dataBuffer after offset exceeds a certain number. However, I'm not too intimate with the source.. I will try that above solution to see if it does any good.. otherwise, i think that this is a fairly serious bug that should be addressed by someone more intimate with the source. Thanks, Joshua Moore-Oliva

On Thu, 16 Sep 2004 20:03:38 -0400, Joshua Moore-Oliva <josh@chatgris.com> wrote:
Whether or not "leak" is the appropriate term for this is debatable. That aside, it is somewhat undesirable, but it is completely avoidable without making any changes to Twisted*. There are two concepts involved, Producers and Consumers. The transport is a Consumer in this case, and whatever protocol you have that is writing to it is the Producer. The Consumer will notify the Producer when it would be a good idea to write more bytes. Since this will only happen when the write buffer is empty, it avoids the problem of an ever growing buffer in the transport. Producers and Consumers aren't _too_ well documented, but the relevant interfaces are quite simple: http://twistedmatrix.com/documents/current/api/twisted.internet.interfaces.I... http://twistedmatrix.com/documents/current/api/twisted.internet.interfaces.I... There are a few examples throughout Twisted itself. One such (FileSender, at the end of bottom of the module): http://cvs.twistedmatrix.com/cvs/trunk/twisted/protocols/basic.py?view=markup&rev=11450&root=Twisted
Now, reading through the source to fix this problem, the fastest solution (requiring the least change to the existing code) would be to splice//reduce the size of the dataBuffer after offset exceeds a certain number.
It is the easiest change to make, but it leads to detrimental string copying behavior. Which this is a minor concern in the case where many small writes are being made while the buffer is large (because huge amounts of copying is already going on), it is a noticable slowdown in the more common case when the buffer is typically empty or almost empty. Jp * A change _should_ be made to Twisted eventually. A good solution would involve a zero-copy buffering system, such as a list. There is an implementation of this, but it involves so many nasty hacks that I don't feel it is worth including. Shortly after 2.0 I plan to find time to clean up many of the low-level TCP implementation details, as they have grown increasingly crufty over the last year.

Whether or not "leak" is the appropriate term for this is debatable.
I agree there, hence my use of Memory leak//problem :)
I do not think my idea was properly communicated.. An interim change I think could work quite well along these lines from the abstract.py file def doWrite(self): ... # If there is nothing left to send, if self.offset == len(self.dataBuffer): self.dataBuffer = "" self.offset = 0 ... elif self.offset > 1000000: #This would be best pre-defined somewhere self.dataBuffer = self.dataBuffer[self.offset:] self.offset=0 return result Note that these changes are just done in 'real-time' while writing my email, no guarantees that the change will even let pythons tartup without syntax errors :) However, notice the check for offset > a particular value. This would prevent performance degredation for the more common case where the buffer is typically empty or almost empty, as the string would only be reduced in the event that too much memory was being wasted (even 10MB would be a good limit)
Jp
* A change _should_ be made to Twisted eventually. A good solution would involve a zero-copy buffering system, such as a list. There is an implementation of this, but it involves so many nasty hacks that I don't feel it is worth including. Shortly after 2.0 I plan to find time to clean up many of the low-level TCP implementation details, as they have grown increasingly crufty over the last year.
I agree with you that a list would be the best solution in the long term, but the above proposed solution would remove the current problem, as well as not affecting the more common small buffer cases. Joshua Moore-Oliva

On Thu, 16 Sep 2004 20:03:38 -0400, Joshua Moore-Oliva <josh@chatgris.com> wrote:
Whether or not "leak" is the appropriate term for this is debatable. That aside, it is somewhat undesirable, but it is completely avoidable without making any changes to Twisted*. There are two concepts involved, Producers and Consumers. The transport is a Consumer in this case, and whatever protocol you have that is writing to it is the Producer. The Consumer will notify the Producer when it would be a good idea to write more bytes. Since this will only happen when the write buffer is empty, it avoids the problem of an ever growing buffer in the transport. Producers and Consumers aren't _too_ well documented, but the relevant interfaces are quite simple: http://twistedmatrix.com/documents/current/api/twisted.internet.interfaces.I... http://twistedmatrix.com/documents/current/api/twisted.internet.interfaces.I... There are a few examples throughout Twisted itself. One such (FileSender, at the end of bottom of the module): http://cvs.twistedmatrix.com/cvs/trunk/twisted/protocols/basic.py?view=markup&rev=11450&root=Twisted
Now, reading through the source to fix this problem, the fastest solution (requiring the least change to the existing code) would be to splice//reduce the size of the dataBuffer after offset exceeds a certain number.
It is the easiest change to make, but it leads to detrimental string copying behavior. Which this is a minor concern in the case where many small writes are being made while the buffer is large (because huge amounts of copying is already going on), it is a noticable slowdown in the more common case when the buffer is typically empty or almost empty. Jp * A change _should_ be made to Twisted eventually. A good solution would involve a zero-copy buffering system, such as a list. There is an implementation of this, but it involves so many nasty hacks that I don't feel it is worth including. Shortly after 2.0 I plan to find time to clean up many of the low-level TCP implementation details, as they have grown increasingly crufty over the last year.

Whether or not "leak" is the appropriate term for this is debatable.
I agree there, hence my use of Memory leak//problem :)
I do not think my idea was properly communicated.. An interim change I think could work quite well along these lines from the abstract.py file def doWrite(self): ... # If there is nothing left to send, if self.offset == len(self.dataBuffer): self.dataBuffer = "" self.offset = 0 ... elif self.offset > 1000000: #This would be best pre-defined somewhere self.dataBuffer = self.dataBuffer[self.offset:] self.offset=0 return result Note that these changes are just done in 'real-time' while writing my email, no guarantees that the change will even let pythons tartup without syntax errors :) However, notice the check for offset > a particular value. This would prevent performance degredation for the more common case where the buffer is typically empty or almost empty, as the string would only be reduced in the event that too much memory was being wasted (even 10MB would be a good limit)
Jp
* A change _should_ be made to Twisted eventually. A good solution would involve a zero-copy buffering system, such as a list. There is an implementation of this, but it involves so many nasty hacks that I don't feel it is worth including. Shortly after 2.0 I plan to find time to clean up many of the low-level TCP implementation details, as they have grown increasingly crufty over the last year.
I agree with you that a list would be the best solution in the long term, but the above proposed solution would remove the current problem, as well as not affecting the more common small buffer cases. Joshua Moore-Oliva
participants (2)
-
exarkun@divmod.com
-
Joshua Moore-Oliva