HTML "sanitizer" in Python

Scott Stirling SSTirlin at holnam.com
Thu Apr 29 09:42:18 EDT 1999


Will,

Thank you.  So far you are the only person who has offered the kind of practical HOW-TO that I was mainly hoping for!  This is not to disparage the many other helpful and interesting suggestions I have received.

I should reiterate that I have 14 fairly large HTML files that I want to _batch process_, taking out a few specific HTML tags that Excel adds unnecessarily.  I don't have the time or the inclination to write an HTML generator and process the Access data from scratch.  I also have to work with a team of people who don't care at all about doing things smarter or trying out new programming languages.

Besides, someone on the team has already put a lot of effort into writing a VB program that batch processes the Excel sheets from an Access query.  And, as I said, I have a Visual SlickEdit macro that does exactly what I need very quickly.  I am out to learn a little Python more than anything.  So, while any more suggestions and comments are welcome, I will ask some more specific questions in the meantime.  And then you can see how far I am from writing even the simplest program in Python!

1) What is the Python syntax for opening a file in MS Windows?  I was following Guido's tutorial yesterday, but I could not figure out how to open a file in Windows.

2) How do I find a string of text in the open file and delete it iteratively?

3) How do I save the file in Windows after I have edited it with the Python program?  How do I close it?

4) If someone helps me out, I think I should be able to use this info. and the tutorial and the Lutz book to loop the process and make the program run until all *.htm files in a folder have been handled once.

What do you say?

Scott
>>> William Park <parkw at better.net> 04/28 3:20 PM >>>
On Wed, Apr 28, 1999 at 12:49:55PM -0400, Scott Stirling wrote:
> Hi,
> 
> I am new to Python.  I have an idea of a work-related project I want
> to do, and I was hoping some folks on this list might be able to
> help me realize it.  I have Mark Lutz' _Programming Python_ book,
> and that has been a helpful orientation.  I like his basic packer
> and unpacker scripts, but what I want to do is something in between
> that basic program and its later, more complex manifestations.
> 
> I am on a Y2K project with 14 manufacturing plants, each of which
> has an inventory of  plant process components that need to be tested
> and/or replaced.  I want to put each plant's current inventory on
> the corporate intranet on a weekly or biweekly basis.  All the plant
> data is in an Access database.  We are querying the data we need and
> importing into 14 MS Excel 97 spreadsheets.  Then we are saving the
> Excel sheets as HTML.  The HTML files bloat out with a near 100%
> increase in file size over the original Excel files.  This is
> because the HTML converter in Excel adds all kinds of unnecessary
> HTML code, such as <FONT FACE="Times New Roman"> for every single
> cell in the table.  Many of these tables have over 1000 cells, and
> this code, along with its accompanying closing FONT tag, adds up
> quickly.  The other main, unnecessary code is the ALIGN="left"
> attribute in <TD> tags (the default alignment _is_ left).  The
> unnecessary tags are consistent and easy to identify, and a routine
> should be writable that will automate the removal of them.
> 
> I created a Macro in Visual SlickEdit that automatically opens all
> these HTML files, finds and deletes all the tags that can be
> deleted, saves the changes and closes them.  I originally wanted to
> do this in Python, and I would still like to know how, but time
> constraints prevented it at the time.  Now I want to work on how to
> create a Python program that will do this.  Can anyone help?  Has
> anyone written anything like this in Python already that they can
> point me to?  I would really appreciate it.
> 
> Again, the main flow of the program is:
> 
> >> Open 14 HTML files, all in the same folder and all with the .html
> >> extension.  Find certain character strings and delete them from
> >> the files.  In one case (the <TD> tags) it is easier to find the
> >> whole tag with attributes and then _replace_ the original tag
> >> with a plain <TD>.  Save the files.  Close the files.  Exit the
> >> program.
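
The flow quoted above could be sketched roughly as follows in present-day
Python 3 (not the Python 1.5 current when this thread was written).  The
names REPLACEMENTS, clean_text and clean_folder are made up for
illustration, and the latin-1 encoding is an assumption chosen because it
round-trips every byte unchanged; it is only a starting point, not a
tested tool.

```python
import glob
import os

# Literal strings named in the description above, paired with what replaces them.
REPLACEMENTS = [
    ('<FONT FACE="Times New Roman">', ''),
    ('</FONT>', ''),
    ('<TD ALIGN="left">', '<TD>'),
]


def clean_text(text, replacements=REPLACEMENTS):
    """Apply each (old, new) literal replacement in order."""
    for old, new in replacements:
        text = text.replace(old, new)
    return text


def clean_folder(folder, pattern='*.html'):
    """Rewrite every matching file in place and return the paths that changed."""
    changed = []
    for path in sorted(glob.glob(os.path.join(folder, pattern))):
        # Assumption: latin-1 with newline='' keeps bytes and line endings as they were.
        with open(path, 'r', encoding='latin-1', newline='') as f:
            original = f.read()  # the file is closed when the with-block ends
        cleaned = clean_text(original)
        if cleaned != original:
            with open(path, 'w', encoding='latin-1', newline='') as f:
                f.write(cleaned)
            changed.append(path)
    return changed
```

Calling clean_folder on the folder that holds the 14 files would process
each one once; passing pattern='*.htm' covers the shorter extension.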

Hi Scott,

I shall assume that a <TD ...> tag occurs in one line.  Try 'sed',
    for i in *.html
    do sed -e 's/<TD ALIGN="left">/<TD>/g' "$i" > "/tmp/$i" && mv "/tmp/$i" "$i"
    done
or, in Python,
    import string
    for s in open('...', 'r').readlines():
        # string.replace takes the string to search first, then old and new
        s = string.replace(s, '<TD ALIGN="left">', '<TD>')
        print string.strip(s)

If a <TD ...> tag spans more than one line, then read the whole file
into a single string instead, like
    s = open('...', 'r').read()
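
Reading the whole file helps only if the search also tolerates the line
break inside the tag, since a literal search string will not match a tag
that is split across lines.  A minimal sketch in present-day Python 3,
assuming the break falls between "<TD" and the attribute; the names
TD_LEFT, strip_left_align and clean_file are hypothetical:

```python
import re

# \s+ matches spaces, tabs and line breaks, so the tag is still found when
# ALIGN="left" sits on the line after "<TD".
TD_LEFT = re.compile(r'<TD\s+ALIGN="left"\s*>')


def strip_left_align(text):
    """Replace every <TD ALIGN="left"> in text with a plain <TD>."""
    return TD_LEFT.sub('<TD>', text)


def clean_file(path):
    """Read the file as one string, clean it, and write it back in place."""
    with open(path, 'r') as f:
        text = f.read()  # one string holding the whole file
    with open(path, 'w') as f:
        f.write(strip_left_align(text))
```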

If the tag is not consistent, then you may have to use a regular
expression with the 're' module.
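
As a hedged illustration of the 're' approach in present-day Python 3:
the patterns below are guesses at the variations Excel might produce
(letter case, optional quotes, extra attributes), not something checked
against real Excel output.  FONT_TAG, TD_ALIGN_LEFT and sanitize are
made-up names.

```python
import re

# Any opening or closing FONT tag, whatever its attributes or letter case.
FONT_TAG = re.compile(r'</?FONT\b[^>]*>', re.IGNORECASE)

# An ALIGN=left attribute inside a TD tag, quoted or not.  Group 1 keeps the
# start of the tag and any attributes that come before ALIGN.
TD_ALIGN_LEFT = re.compile(
    r'''(<TD\b[^>]*?)\s+ALIGN\s*=\s*["']?left["']?(?=[\s>])''',
    re.IGNORECASE,
)


def sanitize(html):
    """Drop FONT tags and ALIGN=left attributes on TD tags; keep the rest."""
    html = FONT_TAG.sub('', html)
    return TD_ALIGN_LEFT.sub(r'\1', html)
```

Unlike replacing the whole tag with a plain <TD>, this keeps other
attributes such as WIDTH, which may or may not be what is wanted.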

Hope this helps.
William


> 
> More advanced options would be the ability for the user to set
> parameters for the program upon running it, to keep from hard-coding
> the find and replace parms.

To use command line parameters, like
    $ cleantd 'ALIGN="left"'
add 'import sys' and change the replace line to
        s = string.replace(s, '<TD %s>' % sys.argv[1], '<TD>')
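
A fuller sketch of such a script in present-day Python 3 might look like
this.  Taking the file name as a second argument is an assumption added
here (the snippet above leaves the file as '...'), and clean_td and main
are hypothetical names.

```python
import sys


def clean_td(text, attribute):
    """Replace every '<TD attribute>' in text with a plain '<TD>'."""
    return text.replace('<TD %s>' % attribute, '<TD>')


def main(argv):
    """argv[1] is the attribute text to strip; argv[2] is the file to read."""
    if len(argv) != 3:
        sys.stderr.write('usage: cleantd ATTRIBUTE FILE\n')
        return 2
    attribute, path = argv[1], argv[2]
    with open(path, 'r') as f:
        text = f.read()
    # The cleaned HTML goes to standard output, so it can be redirected to a file.
    sys.stdout.write(clean_td(text, attribute))
    return 0
```

Saved as a script that ends with a call to sys.exit(main(sys.argv)), it
could be run with the attribute and a file name as its two arguments,
with the output redirected to a new file.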

> 
> OK, thanks for any help you can provide.  I partly was turned on to
> Python by Eric Raymond's article, "How to Become a Hacker" (featured
> on /.).  I use Linux at home, but this program would be for use on a
> Windows 95 platform at work, if that makes any difference.  I do
> have the latest Python interpreter and editor for Windows here at
> work.
> 
> Yours truly,
> Scott
> 
> Scott M. Stirling
> Visit the HOLNAM Year 2000 Web Site: http://web/y2k 
> Keane - Holnam Year 2000 Project
> Office:  734/529-2411 ext. 2327 fax: 734/529-5066 email: sstirlin at holnam.com 
> 
> 
> -- 
> http://www.python.org/mailman/listinfo/python-list 

