[XML-SIG] Ideas for web/ package
Fri, 15 Feb 2002 19:37:18 +0100
Andrew Kuchling wrote:
> On Fri, Feb 15, 2002 at 12:31:32PM -0500, Fred L. Drake, Jr. wrote:
> >Perhaps the urlparse module should be re-written in C, though. But
> >not today. I think Skip did part of this some time ago as his urlop
You might also want to take a look at mxURL which is part of
> As part of the RELAX NG stuff, I've discovered that urlparse() is
> really lenient in its parsing. For example, the fragment value is ''
> if no fragment is supplied, so you can't distinguish between
> http://www.amk.ca and http://www.amk.ca# . Unfortunately this can't
> really be fixed without changing the API of urlparse() and breaking
> old code.
Are you sure that the two URLs you gave are different in any
trick, but it is not clear to me why "index.html#" would mean
anything different from "index.html".
mxURL returns '' in both cases, since there is no fragment
definition there to be found.
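
For reference, a minimal sketch of the behaviour under discussion. It assumes a current Python, where the old urlparse module lives on as urllib.parse. The helper split_fragment() is hypothetical and not part of urlparse or mxURL; it only shows one way a stricter parser could report "no fragment" separately from "empty fragment".

```python
from urllib.parse import urlsplit

def split_fragment(url):
    """Hypothetical helper: return (rest, fragment).

    fragment is None when the URL contains no '#' at all, and ''
    when a '#' is present but nothing follows it.
    """
    rest, sep, fragment = url.partition('#')
    return rest, (fragment if sep else None)

plain = "http://www.amk.ca"
empty = "http://www.amk.ca#"

# The standard parser reports '' for both, so the two look identical.
print(repr(urlsplit(plain).fragment), repr(urlsplit(empty).fragment))

# The sketch keeps the distinction by looking for the '#' delimiter.
print(split_fragment(plain))
print(split_fragment(empty))
```

Whether the distinction matters is exactly the open question above; the sketch only shows that it can be recovered without changing the existing API.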
> So I had the idea of creating a new 'web.*' package containing updated
> tools for Web-related tasks, so we can make a clean break with the old
> APIs. The two things for the web/ package that I can think of are 1)
> a stricter URL parser, and 2) the skeleton of a Web client that
> handles cookies and caching sensibly (so you could write
> screen-scraping applications on top of it).
> Can anyone think of other things that could be part of this package?
The usual bunch of HTML tweaking functions, e.g. fast escaping,
unescaping, finding certain parts within the page (in a non-parsing
way, since this often breaks with today's HTML hackery),
link checker, link finder, etc.
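
A rough sketch of what such helpers might look like, using only the standard library of a current Python (html and re). The names find_links() and HREF_RE are made up for illustration, and the regular expression is a deliberately simple, non-parsing scan that only handles quoted href attributes.

```python
import html
import re

# Hypothetical pattern: matches href="..." or href='...' in any case,
# without trying to parse the surrounding markup.
HREF_RE = re.compile(r'href\s*=\s*["\']([^"\']+)["\']', re.IGNORECASE)

def find_links(page):
    """Return the href values found in page, with entities unescaped."""
    return [html.unescape(target) for target in HREF_RE.findall(page)]

page = '<a href="a.html">x</a> <A HREF=\'b.html?x=1&amp;y=2\'>y</A>'

print(html.escape("<a & b>"))        # escaping for output
print(html.unescape("&lt;p&gt;"))    # and the reverse
print(find_links(page))              # links found without a real parse
```

A scan like this tolerates broken markup that would stop a strict parser, at the cost of missing unquoted attributes and matching hrefs inside comments or scripts.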
Note that mxTextTools has an HTML scanner which can be very
helpful with this (and it's also very fast at what it's
CEO eGenix.com Software GmbH
Company & Consulting: http://www.egenix.com/
Python Software: http://www.egenix.com/files/python/