HTML Content Rewriting

Steve Holden sholden at holdenweb.com
Wed Jun 27 09:39:25 EDT 2001


"Merton Campbell Crockett" <mcc at TO.GD-ES.COM> wrote ...
> Several years ago, I developed a system for a customer that allowed their
> employees and customers to securely access web content on servers inside
> their firewall.  Basically, I used Apache's mod_rewrite module to implement
> what might be called a "dual reverse proxy".
>
> Unfortunately, times have changed.  Several of the customer's organizations
> have started playing with various web development tools that create dynamic
> content.  Several of these embed information from the HTTP requests in the
> documents that are generated.
>
> At a minimum this embedded information results in warnings about protocol
> changes, i.e. hard-coded links that specify an http: method when the remote
> users are using the https: method.  At worst, there are references to
> internal names and IP addresses that are not accessible from the Internet.
>
> Both PHP and Python seem to provide capabilities that would allow "fix ups"
> to be applied to the content as it is delivered to the remote user.  Python
> looks like it might have a few more tools for manipulating HTML content.
>
> What I would like to do is dynamically add a BASE tag to the document and
> convert all absolute to relative references if they involve the current web
> site.  For references to other web servers accessible through this facility,
> I would like to ensure that the references are in the external form and to
> disable the links to web servers that are not accessible by remote users.
>
> What I would like from this group is some guidance.  Can this be done with
> Python?  Are there existing Python tools that might perform some of the
> functions that I would like performed?  What pitfalls and "gotchas" should
> one be aware of?
>
Yes, Python can easily do this. You should look at htmllib (or, more
generally, sgmllib) for parsing the HTML input, and think about httplib (or
the newer urllib2, if you think its complexity might be required) for
actually reading any pages you have to pull down from a server rather than
reading from local files.
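
As a rough sketch of the parsing step, here is one way the fix-ups could look.
It assumes a current Python 3, where html.parser.HTMLParser has taken over
from sgmllib and htmllib, so it is a swapped-in equivalent rather than the
modules named above. The host names, EXTERNAL_BASE, rewrite_url, LinkRewriter
and rewrite_html are all hypothetical names made up for illustration.

```python
from html import escape
from html.parser import HTMLParser
from urllib.parse import urlsplit

# Hypothetical values, for illustration only.
INTERNAL_HOST = "intranet.example.com"          # the site being proxied
BLOCKED_HOSTS = {"10.0.0.5"}                    # not reachable by remote users
EXTERNAL_BASE = "https://gateway.example.com/"  # value for the added BASE tag
URL_ATTRS = {"href", "src", "action"}


def rewrite_url(url):
    """Return the fixed-up URL, or None if the link should be disabled."""
    parts = urlsplit(url)
    if parts.hostname == INTERNAL_HOST:
        # Same site: drop scheme and host, keep a host-relative reference.
        rel = parts.path or "/"
        if parts.query:
            rel += "?" + parts.query
        if parts.fragment:
            rel += "#" + parts.fragment
        return rel
    if parts.hostname in BLOCKED_HOSTS:
        return None
    return url  # other servers: leave the external form alone


class LinkRewriter(HTMLParser):
    """Re-emit the document, fixing URL attributes along the way."""

    def __init__(self):
        # Keep character references as written instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.out = []

    def _emit_tag(self, tag, attrs, close):
        pieces = [tag]
        for name, value in attrs:
            if name in URL_ATTRS and value is not None:
                value = rewrite_url(value)
                if value is None:
                    continue  # drop the attribute, which disables the link
            if value is None:
                pieces.append(name)
            else:
                pieces.append('%s="%s"' % (name, escape(value, quote=True)))
        self.out.append("<" + " ".join(pieces) + close)
        if tag == "head":
            # Add the BASE tag right after the opening HEAD tag.
            self.out.append('<base href="%s">' % escape(EXTERNAL_BASE, quote=True))

    def handle_starttag(self, tag, attrs):
        self._emit_tag(tag, attrs, ">")

    def handle_startendtag(self, tag, attrs):
        self._emit_tag(tag, attrs, " />")

    def handle_endtag(self, tag):
        self.out.append("</%s>" % tag)

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append("&%s;" % name)

    def handle_charref(self, name):
        self.out.append("&#%s;" % name)

    def handle_comment(self, data):
        self.out.append("<!--%s-->" % data)

    def handle_decl(self, decl):
        self.out.append("<!%s>" % decl)


def rewrite_html(text):
    """Parse text and return the re-serialized, fixed-up document."""
    parser = LinkRewriter()
    parser.feed(text)
    parser.close()
    return "".join(parser.out)
```

Calling rewrite_html(page_text) returns the fixed-up page as a string. Note
that re-serializing this way normalizes tag case and attribute quoting, so
the output is not byte-for-byte identical to the input even where no link
changed.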

To serve web pages there are various *HTTPServer modules; choose your
poison.
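
For the serving side, a minimal sketch follows. It assumes a current Python 3,
where the old BaseHTTPServer, SimpleHTTPServer and CGIHTTPServer modules live
on as http.server. UPSTREAM, fetch_upstream, fix_up, FixupHandler and
make_server are hypothetical names, and the fix-up shown is a deliberately
crude placeholder, not a real link rewriter.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import urlopen

UPSTREAM = "http://intranet.example.com"  # hypothetical internal server


def fetch_upstream(path):
    """Read one page from the (hypothetical) internal server."""
    with urlopen(UPSTREAM + path) as resp:
        return resp.read()


def fix_up(body):
    """Placeholder fix-up: turn hard-coded http: links into https: links."""
    return body.replace(b"http://", b"https://")


class FixupHandler(BaseHTTPRequestHandler):
    # Kept as a class attribute so a different fetcher can be swapped in.
    fetch = staticmethod(fetch_upstream)

    def do_GET(self):
        try:
            body = fix_up(self.fetch(self.path))
        except OSError:
            # urllib's URLError is an OSError subclass.
            self.send_error(502, "Upstream fetch failed")
            return
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def make_server(port=8080):
    """Build a server bound to localhost; call serve_forever() to run it."""
    return HTTPServer(("127.0.0.1", port), FixupHandler)
```

Running make_server().serve_forever() would start it on port 8080. A real
deployment would replace fix_up with a proper HTML-aware rewrite and pass
through the upstream status and headers, which this sketch ignores.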

Hope this helps.

regards
 Steve
--
http://www.holdenweb.com/

More information about the Python-list mailing list