HTML Content Rewriting

Merton Campbell Crockett mcc at TO.GD-ES.COM
Wed Jun 27 08:54:40 CEST 2001

Several years ago, I developed a system for a customer that allowed their
employees and customers to securely access web content on servers inside
their firewall.  Basically, I used Apache's mod_rewrite module to implement
what might be called a "dual reverse proxy".

Unfortunately, times have changed.  Several of the customer's organizations
have started playing with various web development tools that create dynamic
content.  Several of these embed information from the HTTP requests in the
documents that are generated.

At a minimum this embedded information results in warnings about protocol
changes, i.e. hard-coded links that specify an http: method when the remote
users are using the https: method.  At worse, there are references to
internal names and IP addresses that are not accessible from the Internet.

Both PHP and Python seem to provide capabilities that would allow "fix ups"
to be applied to the content as it is delivered to the remote user.  Python
looks like it might have a few more tools for manipulating HTML content.

What I would like to do is dynamically add a BASE tag to the document and
convert all absolute to relative references if they involve the current web
site.  For references to other web servers accessible through this facility,
I would like to ensure that the references are in the external form and to
disable the links to web servers that are not accessible by remote users.

What I would like from this group is some guidance.  Can this be done with
Python?  Are there existing Python tools that might perform some of the
functions that I would like performed?  What pitfalls and "gotchas" should
one be aware?

Merton Campbell Crockett

More information about the Python-list mailing list