Design for HTTP client form login and scraping
I want to make multiple HTTP requests using the same set of cookies. Should I call client.getPage a lot thus creating multiple factories to do this? Should I subclass the HTTPClientFactory and put logic in there? I am kinda lost on how to attack this. I want the program to log in to a site and run some scraping on it.

Regards, David Bern
On Mon, 27 Jul 2009 01:05:21 -0500, David Bern <odie5533@gmail.com> wrote:
I want to make multiple HTTP requests using the same set of cookies. Should I call client.getPage a lot thus creating multiple factories to do this?
This is probably the right approach for the near term. If you're worried about the overhead of creating a lot of factories, I don't think you should be. Creating these objects isn't very expensive (particularly compared to parsing HTML).

Jean-Paul
_______________________________________________ Twisted-web mailing list Twisted-web@twistedmatrix.com http://twistedmatrix.com/cgi-bin/mailman/listinfo/twisted-web
Thank you for the fast reply. I am more worried about the difficulty of programming it this way, and was wondering if there is a better method. I want a class with functions which would correspond to different forms and pages on the web site. For instance, login(), then post_message(), or some such set of commands based on a configuration file. Would creating a class which calls client.getPage without inheriting anything be the best method to accomplish this?

When would inheritance be the right method, only when extending the functionality of HTTP client in Twisted and not to make use of it? My main question here is that of style and ease of programming. I want to get into a good programming habit with Twisted rather than have to redesign and rewrite huge portions later because I put the logic in the wrong place.

-- Thanks again, David Bern
On Mon, 27 Jul 2009 09:07:49 -0500, David Bern <odie5533@gmail.com> wrote:
Thank you for the fast reply. I am more worried about the difficulty of programming it this way, and was wondering if there is a better method. I want a class with functions which would correspond to different forms and pages on the web site. For instance, login(), then post_message(), or some such set of commands based on a configuration file. Would creating a class which calls client.getPage without inheriting anything be the best method to accomplish this?
That's probably what I'd do. There's not much to be gained by subclassing anything from twisted.web.client in this case. I would reserve that for cases where I wanted to change the behavior of something at the HTTP level, for example providing different behavior for handling redirects.
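As a rough illustration of the "plain class, no inheritance" approach described above, here is a minimal sketch. It assumes that the fetch function accepts getPage-style keyword arguments (method, postdata, headers, cookies) and that passing the same cookies dict to every call is what carries the login session forward. The class name SiteSession, the paths /login and /message, and the form field names are all hypothetical, and byte strings are used on the assumption of a Python 3 style API. The fetch function is injected so the class can be exercised without a network.

```python
from urllib.parse import urlencode


class SiteSession:
    """Groups one site's requests; inherits from nothing in twisted.web."""

    def __init__(self, fetch, base_url):
        # `fetch` is assumed to be called like twisted.web.client.getPage:
        #   fetch(url, method=..., postdata=..., headers=..., cookies=...)
        self.fetch = fetch
        self.base_url = base_url
        # One dict handed to every request, so cookies set by an earlier
        # response are available to later requests.
        self.cookies = {}

    def _post(self, path, fields):
        # Encode the form fields and delegate the HTTP work to `fetch`.
        return self.fetch(
            self.base_url + path,
            method=b"POST",
            postdata=urlencode(fields).encode("ascii"),
            headers={b"Content-Type": b"application/x-www-form-urlencoded"},
            cookies=self.cookies,
        )

    def login(self, username, password):
        # Hypothetical login form: path and field names are placeholders.
        return self._post(b"/login", {"user": username, "pass": password})

    def post_message(self, text):
        # Hypothetical message form on the same site.
        return self._post(b"/message", {"body": text})


# Possible usage inside a Twisted program (not executed here), assuming
# `fetch` returns a Deferred as getPage does:
#   session = SiteSession(getPage, b"http://example.com")
#   d = session.login("david", "secret")
#   d.addCallback(lambda _page: session.post_message("hello"))
```

Because the class only calls the fetch function and never overrides anything in twisted.web.client, the site-specific logic stays separate from the HTTP machinery, which fits the advice to reserve subclassing for changes at the HTTP level.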
When would inheritance be the right method, only when extending the functionality of HTTP client in Twisted and not to make use of it? My main question here is that of style and ease of programming. I want to get into a good programming habit with Twisted rather than have to redesign and rewrite huge portions later because I put the logic in the wrong place.
I'm coming to think that a good rule of thumb is not to subclass things unless you really need to. There are probably some exceptions (for example, I'll probably keep subclassing twisted.internet.protocol.Protocol for a while to come), but my list of such exceptions is pretty short right now. I find that subclassing mainly causes problems, mostly to do with backwards compatibility, and does not offer sufficient benefits to outweigh them.

Jean-Paul
Participants (2): David Bern, Jean-Paul Calderone