HTML "sanitizer" in Python

tavares at connix.com
Wed Apr 28 16:54:13 EDT 1999


In article <s72703fc.021 at holnam.com>,
  "Scott Stirling" <SSTirlin at holnam.com> wrote:
> Hi,
>
> I am new to Python.  I have an idea of a work-related project I want to do,
> and I was hoping some folks on this list might be able to help me realize it.
> I have Mark Lutz' _Programming Python_ book, and that has been a helpful
> orientation.  I like his basic packer and unpacker scripts, but what I want
> to do is something in between that basic program and its later, more complex
> manifestations.
>
> I am on a Y2K project with 14 manufacturing plants, each of which has an
> inventory of  plant process components that need to be tested and/or
> replaced. I want to put each plant's current inventory on the corporate
> intranet on a weekly or biweekly basis.  All the plant data is in an Access
> database.  We are querying the data we need and importing into 14 MS Excel 97
> spreadsheets.  Then we are saving the Excel sheets as HTML.  The HTML files
> bloat out with a near 100% increase in file size over the original Excel
> files.  This is because the HTML converter in Excel adds all kinds of
> unnecessary HTML code, such as <FONT FACE="Times New Roman"> for every
> single cell in the table.  Many of these tables have over 1000 cells, and
> this code, along with its accompanying closing FONT tag, adds up quickly.
> The other main, unnecessary code is the ALIGN="left" attribute in <TD>
> tags (the default alignment _is_ left).  The unnecessary tags are
> consistent and easy to identify, and a routine should be writable that
> will automate the removal of them.
>
> I created a Macro in Visual SlickEdit that automatically opens all these
> HTML files, finds and deletes all the tags that can be deleted, saves the
> changes and closes them.  I originally wanted to do this in Python, and I
> would still like to know how, but time constraints prevented it at the
> time.  Now I want to work on how to create a Python program that will do
> this.  Can anyone help?  Has anyone written anything like this in Python
> already that they can point me to? I would really appreciate it.
>

Well, it wouldn't be that hard in Python to parse the HTML files and reformat
them in various ways. You can either go the route of straight text
substitution using regular expressions, or you could use htmllib to actually
parse the HTML files into a data structure, and then write them back out
again.
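
As a rough illustration of the regular-expression route, here is a minimal
sketch. It assumes that every FONT tag in the exported files is unwanted and
that the only attribute to drop from TD tags is ALIGN="left"; the function
name strip_excel_bloat and the two pattern names are made up for this example.

```python
import re

# Any opening or closing FONT tag, whatever its attributes (assumption:
# none of the FONT tags in the exported files are worth keeping).
FONT_TAG = re.compile(r"</?FONT\b[^>]*>", re.IGNORECASE)

# An ALIGN="left" attribute inside a <TD ...> tag; group 1 keeps the part
# of the tag that comes before the attribute.
TD_ALIGN_LEFT = re.compile(
    r'(<TD\b[^>]*?)\s+ALIGN\s*=\s*"?left"?(?=[\s>])', re.IGNORECASE
)


def strip_excel_bloat(html):
    """Return html with FONT tags and redundant ALIGN="left" removed."""
    html = FONT_TAG.sub("", html)
    html = TD_ALIGN_LEFT.sub(r"\1", html)
    return html


if __name__ == "__main__":
    cell = '<TD ALIGN="left"><FONT FACE="Times New Roman">42</FONT></TD>'
    print(strip_excel_bloat(cell))
```

Regular expressions are fine here only because the unwanted markup is so
regular; for anything less predictable, a real parser is the safer choice.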

However, may I suggest a different method?

You've got your original data in Access. There are several different ways to
talk to Access from Python. You could pull your data directly from Access
using Python and skip Excel altogether. And Python's got some great modules
for generating HTML. Heck, add CGI or Zope to the mix and you could generate
your inventory lists at the web server on the fly!
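
To give a feel for the HTML-generation half of that idea, here is a small
sketch using only the standard library. It assumes the rows have already been
fetched from the database by some other means (that part is left out), and
the function name rows_to_html_table and the sample data are hypothetical.

```python
from html import escape


def rows_to_html_table(headers, rows):
    """Build a plain HTML table with no per-cell FONT or ALIGN clutter."""
    lines = ["<TABLE>"]
    # Header row: one TH per column name.
    lines.append(
        "<TR>" + "".join("<TH>%s</TH>" % escape(str(h)) for h in headers) + "</TR>"
    )
    # Data rows: one TD per value, escaped so stray < or & stay harmless.
    for row in rows:
        lines.append(
            "<TR>" + "".join("<TD>%s</TD>" % escape(str(c)) for c in row) + "</TR>"
        )
    lines.append("</TABLE>")
    return "\n".join(lines)


if __name__ == "__main__":
    # Made-up sample rows standing in for one plant's inventory query.
    sample = [("Kiln controller", "tested"), ("Pump <A>", "to be replaced")]
    print(rows_to_html_table(["Component", "Status"], sample))
```

Since the markup is written by hand, the output contains only the tags you
ask for, so there is nothing to clean up afterwards.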

Ok, I'll calm down now.

-Chris
