[Web-SIG] Client-side support: what are we aiming for?
John J Lee
jjl at pobox.com
Thu Oct 23 20:46:46 EDT 2003
On Thu, 23 Oct 2003, Bill Janssen wrote:
> amk writes:
> > What's the scope of improving client-side HTTP support?
> > I suggest aiming for something you could write a web browser or web scraper
> > on top of. That means storing and returning cookies from the server, writing
> > them to a file, and a page cache that handles HTTP's cache expiration rules.
> > HTML formatting is out of scope, but a specialized parser for extracting a
> > list of form elements or for picking apart a table might not be.
I've been working on that kind of stuff.
I certainly think automatic cookie handling would be appropriate for the
std lib. I've written code to do that (based on a port from libwww-perl,
but substantially changed since then), which is already integrated into
urllib2 (albeit ATM including a lot of junk for backwards-compatibility
and some cut-n-pasting necessary because it's not (yet) actually part of
the Python standard library). The only problem is that it's rather large.
I claim this is (mostly) not my fault ;-) because the cookie standards are
a royal mess. For a number of reasons, it will be significantly smaller
in the form I hope will get into the Python standard lib., but it'll still
be biggish. Still, you *could* quite easily write a much less anal
implementation that worked most of the time. One risk of that is that
you'd have to put up with a constant stream of bug reports from people
finding that website x breaks your simple implementation. At least, Ronald
Tschalar (author of one of two Java libraries both named HTTPClient) tells
me that was his experience. The fundamental problem is that the cookie
'standard' is really just Mozilla and MSIE's behaviour. For a brief
summary of the sad tale, see:
OTOH, my code goes to some effort to enforce as many restrictions as
possible to prevent cookies getting set and returned when they shouldn't.
That could be cut without losing functionality (but obviously, losing
security, for those who care about that). That seems pointless to me now
that the code is pretty stable, though.
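To give a flavour of those restrictions, here is a toy version of
Netscape-style domain matching (a deliberate simplification for
illustration, not the actual ClientCookie logic):

```python
def domain_match(request_host: str, cookie_domain: str) -> bool:
    """Toy Netscape-style check: may a cookie carrying cookie_domain
    be returned to request_host?  A simplification -- real rules must
    also handle ports, IP addresses, country-code TLDs, etc."""
    request_host = request_host.lower()
    cookie_domain = cookie_domain.lower()
    if cookie_domain.startswith("."):
        # Leading dot: any host ending with the domain matches ...
        if not request_host.endswith(cookie_domain):
            return False
        # ... but reject overly-broad domains like ".com"
        return cookie_domain.count(".") >= 2
    return request_host == cookie_domain

print(domain_match("www.example.com", ".example.com"))  # True
print(domain_match("www.evil.com", ".example.com"))     # False
print(domain_match("www.example.com", ".com"))          # False
```

Cutting checks like these makes the code smaller but lets sites set and
receive cookies they have no business seeing.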
One thing about my implementation that might seem like it should be cut
out is RFC 2965 support. It seems fairly safe to say that RFC 2965 is all
but officially dead as an internet standard (and the same goes for RFC
2109, though I'm told a few servers implement it in some form -- *clients*
have taken bits and pieces from the standard, but very few of those could
be called RFC 2109 implementations: I regard those bits of the RFC 2109
standard as simply parts of the current state of the de-facto Netscape
protocol). The one guy who was driving forward errata for RFC 2965 on the
http-state mailing list seems to have succumbed to cookie-fatigue. I
guess it's still useful on intranets. Three quarters of the reason it's still in my
code is simply that the Netscape cookie protocol is a messy de facto
standard, and it seems far easier and more secure to specify it by the
ways it differs from the RFC standard than to have it stand on its own
feet. It also allows you to easily tighten up the Netscape rules if you
feel like it (assuming that doesn't break the particular site you're
using). The remaining 25% of the reason it's there is that I don't have
the heart to rip it out ;-)
So, that's my pitch for justifying the inclusion of ClientCookie (in a
somewhat reduced form) in the standard library. Jeremy Hylton seemed to
like the idea of having it in the std lib, but I don't know if he looked
at the code :-)
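A sketch of the kind of automatic handling being pitched, expressed with
the modern stdlib descendants of this code (http.cookiejar plus a
cookie-processing opener; ClientCookie's own API differed in detail, and
the fake response below just stands in for a real server round-trip):

```python
import io
import email.message
import http.cookiejar
import urllib.request
import urllib.response

# A jar plus an opener: the opener's HTTPCookieProcessor stores cookies
# from responses and replays them on later matching requests.
jar = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(
    urllib.request.HTTPCookieProcessor(jar))

# No network here: fake a response carrying Set-Cookie, and feed it to
# the jar the same way the processor would after opener.open().
request = urllib.request.Request("http://example.com/")
headers = email.message.Message()
headers["Set-Cookie"] = "session=abc123; Path=/"
response = urllib.response.addinfourl(io.BytesIO(b""), headers,
                                      "http://example.com/")
jar.extract_cookies(response, request)

# The stored cookie is now attached to subsequent matching requests.
later = urllib.request.Request("http://example.com/page")
jar.add_cookie_header(later)
print(later.get_header("Cookie"))  # session=abc123
```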
A related issue is urllib2's 'handler' system, which I've discovered isn't
quite flexible enough to implement a number of useful features (including
automatic cookie handling). I think it's possible to fix this without
breaking anybody's code. Full details here:
Jeremy said a few months back that he'd look at it, but I've heard nothing
from him since...
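For readers who haven't poked at it, a handler is just an object slotted
into the opener's chain. A toy one in the modern urllib.request spelling
(the class name and User-Agent value here are made up for illustration):

```python
import urllib.request

class UserAgentHandler(urllib.request.BaseHandler):
    """Toy request pre-processor: stamp every outgoing HTTP request
    with a User-Agent header.  Cookie handling needs hooks on both
    the request and response sides, which is where the handler
    system's flexibility gets tested."""
    def http_request(self, request):
        if not request.has_header("User-agent"):
            request.add_header("User-Agent", "example-client/0.1")
        return request
    https_request = http_request

# build_opener slots the handler in alongside the default ones.
opener = urllib.request.build_opener(UserAgentHandler())

# Exercise the hook directly (no network):
req = urllib.request.Request("http://example.com/")
UserAgentHandler().http_request(req)
print(req.get_header("User-agent"))  # example-client/0.1
```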
As for forms, originally I thought the forms code I wrote (ClientForm --
again, based on a port of Gisle Aas' libwww-perl, and again quite
substantially changed since then) might be nice in the std lib, but I
changed my mind a long while ago for a number of reasons. But if anybody
wants to talk about HTML form parsers, of course, feel free to start a
thread. Same goes for HTML table parsing -- I'm not convinced the
standard library is the place for this.
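For the curious, the bare bones of what a form-extraction parser does can
be sketched with the stdlib HTML parser (nothing like ClientForm's real
interface, which also handles select/textarea, controls, and submission;
just an illustration):

```python
from html.parser import HTMLParser

class FormInputCollector(HTMLParser):
    """Collect the (name, value) pairs of <input> elements, grouped
    by the <form> that contains them."""
    def __init__(self):
        super().__init__()
        self.forms = []        # one dict per completed <form>
        self._current = None   # form currently being parsed

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self._current = {"action": attrs.get("action", ""),
                             "inputs": []}
        elif tag == "input" and self._current is not None:
            self._current["inputs"].append(
                (attrs.get("name"), attrs.get("value", "")))

    def handle_endtag(self, tag):
        if tag == "form" and self._current is not None:
            self.forms.append(self._current)
            self._current = None

page = '''<form action="/login">
  <input name="user" value="jjl">
  <input name="pass">
</form>'''
p = FormInputCollector()
p.feed(page)
print(p.forms)
# [{'action': '/login', 'inputs': [('user', 'jjl'), ('pass', '')]}]
```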
I certainly think a function for doing file uploads would be great,
though. Steve Purcell has some code for that in his old webunit module
(there seems to be a new Python module called webunit here
http://mechanicalcat.net/tech/webunit but the code download link is
broken), and so do I in ClientForm. My code depends on a modified version
of MimeWriter. I think it would be nice to fix MimeWriter so it could do
this job. I think that's possible without breaking old code, though I
know almost nothing about MIME.
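The job in question is building a multipart/form-data body. A hand-rolled
sketch of the encoding (modern spelling; the real code sat on a modified
MimeWriter, and this skips content-type sniffing and non-ASCII filenames):

```python
import uuid

def encode_multipart(fields, files):
    """Build a multipart/form-data request body.
    fields: dict of field name -> string value
    files:  dict of field name -> (filename, bytes content)
    Returns (content_type_header_value, body_bytes)."""
    boundary = uuid.uuid4().hex
    lines = []
    for name, value in fields.items():
        lines += [f"--{boundary}",
                  f'Content-Disposition: form-data; name="{name}"',
                  "", value]
    for name, (filename, content) in files.items():
        lines += [f"--{boundary}",
                  f'Content-Disposition: form-data; name="{name}"; '
                  f'filename="{filename}"',
                  "Content-Type: application/octet-stream",
                  "", content.decode("latin-1")]
    lines += [f"--{boundary}--", ""]
    body = "\r\n".join(lines).encode("latin-1")
    return f"multipart/form-data; boundary={boundary}", body

ctype, body = encode_multipart({"comment": "hello"},
                               {"upload": ("notes.txt", b"file data")})
```

The content-type value goes in the request's Content-Type header and the
body becomes the POST data; that is the whole trick.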
> My original idea was to look at something like cURL
> (http://curl.haxx.se/), and make sure anything you could do with that
> tool, you could do with Python. Might be a bit ambitious; here's the
> lead paragraph from the cURL web page:
> Curl is a command line tool for transferring files with URL syntax,
> supporting FTP, FTPS, HTTP, HTTPS, GOPHER, TELNET, DICT, FILE and
> LDAP. Curl supports HTTPS certificates, HTTP POST, HTTP PUT, FTP
> uploading, kerberos, HTTP form based upload, proxies, cookies,
> user+password authentication, file transfer resume, http proxy
> tunneling and a busload of other useful tricks.
I don't think it's a good idea to start on some new grand library,
certainly not in the std lib. Gradual evolution seems more appropriate.
Most of the stuff you list is either already there, or would fit quite
neatly into the current framework without any major upheavals.
> Then there are issues about handling the Web-centric formats you get
> back. There's no CSS parser, for instance. It's hard to understand a
> modern Web page without one.
What uses do you have in mind for that?
Whaaat?? You want a JS interpreter included with the Python distribution?
You're kidding, right? :-)