[Twisted-Python] Way to fix memory leaks of external c module
Hello!

I'm currently trying to write a small XML-RPC server for HTML data processing. The processing is done by the HTML Tidy library, but the problem is that it has a massive memory leak. Since processing is a blocking operation, I run it in a thread, but after some time, and after processing some huge HTML documents, the daemon eats all the memory.

I'm wondering whether it's possible to load utidylib in a thread, do the processing, and then kill the thread and release the memory. Or maybe there is something like deferToProcess?

Thanks in advance!

    #!/usr/bin/env python
    # -*- coding: utf-8 -*-
    import tidy  # uTidylib is imported as "tidy"; tidyParse() below calls tidy.parseString

    # The epoll reactor must be installed before twisted.internet.reactor is imported.
    from twisted.internet import epollreactor
    epollreactor.install()

    from twisted.internet import protocol, defer, threads, reactor
    from twisted.web import xmlrpc, server
    from twisted.python import log, threadpool

    import sys
    reload(sys)
    sys.setdefaultencoding('utf-8')
    log.startLogging(sys.stdout)

    import codecs
    import gc
    gc.enable()
    gc.set_debug(gc.DEBUG_LEAK)
    gc.set_threshold(1)


    class TidyProtocol(xmlrpc.XMLRPC):

        def xmlrpc_tidify(self, data):
            # Run the blocking tidy call in the reactor's thread pool.
            deferred = threads.deferToThread(self.tidyParse, data)
            deferred.addCallback(self.returnToClient)
            return deferred

        def tidyParse(self, data):
            options = {
                'drop-proprietary-attributes': '1',
                'output-xhtml': '1',
                'wrap': '0',
                'bare': '0',
                'clean': '1',
                'doctype': 'omit',
                'show-body-only': '1',
                'word-2000': '0',
                'escape-cdata': '0',
                'hide-comments': '1',
                'force-output': '1',
                'alt-text': '',
                'show-errors': '0',
                'show-warnings': '0',
                'tidy-mark': '0',
                'char-encoding': 'utf8',
            }
            if data['html'] is None:
                return None
            else:
                htmldata = data['html'].encode()
                print "Tidy start"
                return tidy.parseString(htmldata, **options)

        def returnToClient(self, data):
            gc.collect()
            print "Tidy end, returning result"
            return data


    if __name__ == '__main__':
        r = TidyProtocol()
        reactor.listenTCP(1100, server.Site(r))
        reactor.run()
On Sat, 2009-11-28 at 15:05 +0200, MārisR wrote:
1. You should report the bug to the utidylib authors, so they can fix it or pass it on to tidy authors.
2. A thread wouldn't help... but a process certainly would. https://launchpad.net/ampoule may be helpful if you don't want to implement your own process wrapper, or you could just run a sub-process that takes input and output filenames on the command-line and pass data around that way.
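The sub-process idea in point 2 could look roughly like the sketch below. It is only an illustration under several assumptions: it is written for Python 3 (the code in this thread is Python 2), it passes data over stdin/stdout instead of filenames, and the names `WORKER_SOURCE` and `process_in_child` are hypothetical. The worker body is a placeholder that merely strips whitespace; the real tidy call would go where the comment indicates.

```python
import subprocess
import sys

# Hypothetical worker program. In real use the child would import the leaky
# library and call it here; this stand-in only strips surrounding whitespace.
WORKER_SOURCE = r"""
import sys
data = sys.stdin.buffer.read()
# Placeholder for the leaky call, e.g. the tidy.parseString(...) step.
sys.stdout.buffer.write(data.strip())
"""


def process_in_child(html_bytes, timeout=60):
    """Run the worker in a fresh interpreter and return its stdout.

    Whatever memory the worker leaks is returned to the OS when the
    child process exits, so the long-running server never accumulates it.
    """
    result = subprocess.run(
        [sys.executable, "-c", WORKER_SOURCE],
        input=html_bytes,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        timeout=timeout,
        check=True,  # raise CalledProcessError if the worker fails
    )
    return result.stdout
```

Note that `subprocess.run` blocks, so inside a Twisted server this function would still have to be called through `threads.deferToThread`, or be replaced by Twisted's own process support such as `twisted.internet.utils.getProcessOutput`.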
Itamar Turner-Trauring (aka Shtull-Trauring) wrote:
Thanks for the reply! I tried pytidylib and it has the same memory leak. Unfortunately I can't report the bug, because the tidylib we use was patched locally by a C programmer who no longer has time for it.

After a day of playing with ampoule, I got it working :) But I ran into another problem: the AMP value length is limited to 64 KB. It seems to be a struct pack limitation. I could probably split the message into chunks of less than 64 KB each and feed them to AMP, something like: pp.doWork(Tidy, html={'chunk1': data1, 'chunk2': data2...} ) or maybe there is an easier/cleaner way?
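The chunking idea described above can be sketched as follows. This is a minimal illustration, not ampoule's or AMP's own API: the helper names `split_chunks` and `join_chunks` are hypothetical, and the constant mirrors the 65535-byte cap that AMP's 16-bit length prefix places on a single value. Splitting is done on the encoded byte string, since the limit applies to bytes rather than characters. How the chunks are then declared as command arguments is left open here; each chunk has to travel as its own AMP value (or in its own call) to stay under the limit.

```python
# AMP prefixes each value with a 16-bit length, so one value holds at most
# 65535 bytes (assumed here to be the limit the message above ran into).
MAX_VALUE_LENGTH = 65535


def split_chunks(data, size=MAX_VALUE_LENGTH):
    """Split a byte string into consecutive pieces of at most `size` bytes."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def join_chunks(chunks):
    """Reassemble the pieces produced by split_chunks, in order."""
    return b"".join(chunks)
```

The receiving side only needs to get the chunks in their original order to rebuild the document with `join_chunks`.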
participants (4)
- exarkun@twistedmatrix.com
- Itamar Turner-Trauring (aka Shtull-Trauring)
- Maris Ruskulis
- MārisR