[Twisted-Python] Recursive filescanning in Twisted??
Hi,

I want to run a web server which displays information about files on my filesystems. The info will be stored in either a MySQL or SQLite database. The number of files to be scanned can be large, and the files themselves can be huge. I want to click on a link which goes to a specified page/resource on the server which starts the file scan, generates checksums (SHA/MD5) for the files processed and stores this info in the database. The process cannot block the server, at least not the file scanning itself. The database interactions are not that critical. It would be nice to have some kind of indication available for users to see that a file scan is in progress, perhaps even some kind of progress indication.

Can anybody show me how to do a non-blocking file scan like I've described, in a Twisted-based web application, which generates SHA/MD5 checksums in the process?

--
Best regards, Thomas Weholt
http://www.weholt.org - thomas krøll weholt dått org
Always fun to hear smokers talk about indoor climate and air quality.
Something like this, perhaps? (defgen.py from radix's Twisted sandbox)

Jp

    import os, sha
    import defgen
    from twisted.internet import defer
    from twisted.python import log, util

    def sleep(n):
        from twisted.internet import reactor
        d = defer.Deferred()
        reactor.callLater(n, d.callback, None)
        return d

    def recursivelyIterate(iterator, rate=0.01):
        stack = [iterator]
        while stack:
            g = stack.pop()
            try:
                v = g.next()
            except StopIteration:
                pass
            else:
                stack.append(g)
                if isinstance(v, defer.Deferred):
                    yield defgen.waitForDeferred(v)
                else:
                    try:
                        i = iter(v)
                    except TypeError:
                        pass
                    else:
                        stack.append(i)
            yield defgen.waitForDeferred(sleep(rate))
    recursivelyIterate = defgen.deferredGenerator(recursivelyIterate)

    def walk(root, visitor, *a, **kw):
        for f in os.listdir(root):
            f = os.path.join(root, f)
            if os.path.isdir(f):
                yield defgen.waitForDeferred(walk2(f, visitor, *a, **kw))
            else:
                yield defgen.waitForDeferred(visitor(f, *a, **kw))
    walk2 = defgen.deferredGenerator(walk)

    def hash(filename, result):
        if os.path.isfile(filename):
            hashObj = sha.sha()
            fObj = file(filename)
            for bytes in iter(lambda: fObj.read(8192), ''):
                hashObj.update(bytes)
                yield None
            fObj.close()
            result(filename, hashObj.digest())
    hash = defgen.deferredGenerator(hash)

    def main(path='.'):
        from twisted.internet import reactor
        w = walk(path, hash, lambda f, h: util.println("%s: %s" % (f, h.encode('hex'))))
        recursivelyIterate(w
            ).addErrback(log.err
            ).addBoth(lambda r: reactor.stop()
            )
        reactor.run()

    if __name__ == '__main__':
        main()
Ok, thanks!!! Got to get defgen from Subversion, right? And if I was to put the file info into a database, should I do this as another callback to .... eh ... what?

Thomas

On Thu, 9 Sep 2004 00:17:51 -0400, Matt Feifarek <matt.feifarek@gmail.com> wrote:
> Also, in Python 2.3, os.walk() is a generator by default. Works great.
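For readers who want to try the os.walk() suggestion, here is a minimal sketch of a generator that walks a tree and hashes each file in fixed-size chunks. It is not code from the thread: the name iter_file_hashes, the chunk size and the use of hashlib (rather than the older sha module used in the code above) are all assumptions made for illustration.

```python
import hashlib
import os


def iter_file_hashes(root, chunk_size=8192):
    """Yield (path, hex digest) for every regular file below root.

    Hypothetical helper: the name and the default chunk size are
    illustrative choices, not something taken from the thread.
    """
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if not os.path.isfile(path):
                continue  # skip broken symlinks, sockets and the like
            digest = hashlib.sha1()
            with open(path, "rb") as fobj:
                # Read in chunks so huge files never sit in memory whole.
                for chunk in iter(lambda: fobj.read(chunk_size), b""):
                    digest.update(chunk)
            yield path, digest.hexdigest()
```

Because this is a plain generator, the caller decides how fast to pull results from it. On its own it still blocks while each file is being read.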
_______________________________________________ Twisted-Python mailing list Twisted-Python@twistedmatrix.com http://twistedmatrix.com/cgi-bin/mailman/listinfo/twisted-python
On Linux, file descriptors for disk files are always "ready". Since twisted is based on select() or poll() I'm guessing you will end up actually blocking on file reads. I think you can get non-blocking file reads using the new AIO stuff but it's not really a good match for a single process server like twisted. I don't know about BSD/OSX or Windows.

I would love to be wrong about this, BTW, so if anyone has better info I'd like to hear it.
Jeff Bowden wrote:
> On Linux, file descriptors for disk files are always "ready".

Right.

> Since twisted is based on select() or poll() I'm guessing you will end up actually blocking on file reads.

Right.

> I think you can get non-blocking file reads using the new AIO stuff

On some systems, right.

> but it's not really a good match for a single process server like twisted.

Hmm. Why do you say this? It's true that AIO will require reactor support (IOCP is essentially such a reactor, except for Windows). I don't see a problem with using AIO in a single process, though, and apparently neither did Pavel when he wrote IOCP.

> I don't know about BSD/OSX or Windows.

KQueue supports on-disk files, I suspect. There is no working KQueue reactor, though, so that takes care of BSD/OSX. Windows has IO Completion Ports, which IOCP uses (hence the name ;).

Jp
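The "always ready" behaviour discussed above can be checked with a few lines of standard-library Python. This is only a sketch for POSIX systems (on Windows, select() accepts sockets only), and the helper name select_reports_ready is made up for illustration.

```python
import select


def select_reports_ready(path):
    """Return True if select() reports the file readable without waiting.

    Hypothetical helper for POSIX systems. A zero timeout makes select()
    a pure poll, so the result shows what a select()-based event loop
    would be told about this file descriptor.
    """
    with open(path, "rb") as fobj:
        readable, _, _ = select.select([fobj], [], [], 0)
        return fobj in readable
```

For a regular disk file this reports readiness regardless of whether the data is cached, which is why a select()- or poll()-based loop cannot use readiness to avoid blocking on disk reads.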
Ok, all the stuff about AIO (I guess that's async I/O) and IOCP doesn't mean much to me right now, but is it possible to just throw the heavy, blocking code into a separate thread and synchronize writing data to the database, which is the only thing shared between the separate thread and the main thread Twisted is running in? Or am I over-simplifying things?

Any such example, if doable, would be great. Thanks for your input so far.

Thomas
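A rough sketch of the separate-thread idea, assuming Twisted's twisted.internet.threads.deferToThread is used to run the blocking work in the reactor's thread pool. The helper names and the store_result callback are hypothetical, and hashlib stands in for the older sha module.

```python
import hashlib


def hash_file_blocking(path, chunk_size=65536):
    """Read and hash one file. Blocks, so it belongs in a worker thread."""
    digest = hashlib.sha1()
    with open(path, "rb") as fobj:
        for chunk in iter(lambda: fobj.read(chunk_size), b""):
            digest.update(chunk)
    return path, digest.hexdigest()


def hash_file_in_thread(path, store_result):
    """Hash path in the reactor's thread pool and return a Deferred.

    store_result(path, hexdigest) is a hypothetical callback. Callbacks
    added to the Deferred run in the reactor thread, so only that thread
    touches whatever store_result writes to.
    """
    # Imported here so the blocking helper above works without Twisted.
    from twisted.internet.threads import deferToThread

    d = deferToThread(hash_file_blocking, path)
    d.addCallback(lambda result: store_result(*result))
    return d
```

With this shape the worker thread shares nothing with the main thread except the value it returns, so no explicit locking is needed. If the database write itself is slow, twisted.enterprise.adbapi's ConnectionPool is one way to move it off the reactor thread as well.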
On Thu, 9 Sep 2004 23:50:55 +0200, Thomas Weholt <thomas.weholt@gmail.com> wrote:
> Ok, all the stuff about AIO (I guess that's async I/O) and IOCP doesn't mean much to me right now, but is it possible to just throw the heavy, blocking code into a separate thread and synchronize writing data to the database, which is the only thing shared between the separate thread and the main thread Twisted is running in? Or am I over-simplifying things?
>
> Any such example, if doable, would be great. Thanks for your input so far.
Unless you're reading off of a network filesystem or similar slow FS layer, you don't need to worry about the filesystem blocking. If the files are big, it should be fine to just read only N kilobytes per reactor iteration, or whatever. That's how twisted.web's static file serving code works, and it works fine.

--
 Twisted | Christopher Armstrong: International Man of Twistery
  Radix  |          Release Manager,  Twisted Project
---------+            http://radix.twistedmatrix.com
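The "read N kilobytes at a time" idea can be sketched as a generator that hashes one chunk per step, driven by Twisted's twisted.internet.task.cooperate. This is an illustration under assumptions, not twisted.web's actual code: the function names and the on_done callback are hypothetical.

```python
import hashlib


def hash_in_chunks(path, on_done, chunk_size=8192):
    """Generator that hashes one chunk per step, yielding between reads."""
    digest = hashlib.sha1()
    with open(path, "rb") as fobj:
        for chunk in iter(lambda: fobj.read(chunk_size), b""):
            digest.update(chunk)
            yield None  # hand control back to whoever is driving us
    # on_done is a hypothetical callback receiving (path, hexdigest).
    on_done(path, digest.hexdigest())


def hash_cooperatively(path, on_done):
    """Let the reactor drive hash_in_chunks in between other events.

    Returns a Deferred that fires once the whole file has been hashed.
    """
    # Imported here so the generator above works without Twisted.
    from twisted.internet import task

    return task.cooperate(hash_in_chunks(path, on_done)).whenDone()
```

Each individual read still blocks for as long as the disk takes, but the reactor gets to service other connections between steps, and a progress indicator could be updated from inside the loop.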
Jp Calderone wrote:
> Jeff Bowden wrote:
>> On Linux, file descriptors for disk files are always "ready".
>
> Right.
>
>> Since twisted is based on select() or poll() I'm guessing you will end up actually blocking on file reads.
>
> Right.
>
>> I think you can get non-blocking file reads using the new AIO stuff
>
> On some systems, right.
>
>> but it's not really a good match for a single process server like twisted.
>
> Hmm. Why do you say this? It's true that AIO will require reactor support (IOCP is essentially such a reactor, except for Windows). I don't see a problem with using AIO in a single process, though, and apparently neither did Pavel when he wrote IOCP.
>
>> I don't know about BSD/OSX or Windows.
>
> KQueue supports on-disk files, I suspect. There is no working KQueue reactor, though, so that takes care of BSD/OSX. Windows has IO Completion Ports, which IOCP uses (hence the name ;).

I'm confused. It sounds like you're saying that there is a reactor called IOCP that runs on Linux and uses AIO. But it also sounds like it's Windows only. Which?

Also, it occurred to me since I wrote my original response that I remember a while ago seeing support for async file i/o as a bullet item on the epoll TODO list. I can't find any reference to it at the moment though.
participants (5)
- Christopher Armstrong
- Jeff Bowden
- Jp Calderone
- Matt Feifarek
- Thomas Weholt