clean up html document created by Word

jd chimalus at gmail.com
Sat Mar 31 03:17:19 CEST 2007


Wow, thanks for all the great responses!

Here's my summary:

- demoroniser (from John Walker) is designed to solve some very
particular problems that could be considered bugs.  However, it
doesn't remove the unnecessary html generated by Word.
http://www.fourmilab.ch/webtools/demoroniser/


- The tool from Microsoft can be used in two ways: you can copy html
to the clipboard or export to "compact html".  The former results in
slightly cleaner html but doesn't include the style sheet and so the
rendering isn't as nice; the latter does include the style sheet but
it's got slightly more junk in it.  Both approaches preserve the
"blank" paragraphs (basically, <p>&nbsp;</p>) for spacing, which is
unnecessary and clutters up the html. This tool did properly preserve
the footnotes in my test document.
http://www.microsoft.com/downloads/details.aspx?FamilyID=209ADBEE-3FBD-482C-83B0-96FB79B74DED&displaylang=EN

BTW, I didn't know this, but much of the extra html was added by
Microsoft to allow round-tripping between html and Word.

- Tidy with the Word 2000 configuration: It's already bundled in with my
editor (PSPad) so this was a nice surprise (I guess I never explored
that submenu -- that's the "problem" with modern editors and their
zillions of features).  The tidy output could use more whitespace to
improve html readability, but I assume I can change the config file to
do this.  No "blank paragraphs" (better than the Microsoft tool) but
footnotes were messed up.
http://www.w3.org/People/Raggett/tidy/
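
For the leftover "blank" paragraphs, a small post-processing step is probably enough. This is only a minimal sketch, assuming the spacers really look like <p>&nbsp;</p> (possibly with attributes on the tag); the function name strip_blank_paragraphs is my own invention, not part of any of the tools above, and a regex like this won't catch fancier markup such as a spacer wrapped in another tag.

```python
import re

# Matches a <p> element whose only content is whitespace and/or a
# non-breaking space (named entity, numeric reference, or literal U+00A0),
# plus any whitespace that follows the closing tag.
BLANK_P = re.compile(
    r"<p\b[^>]*>(?:\s|&nbsp;|&#160;|\u00a0)*</p>\s*",
    re.IGNORECASE,
)


def strip_blank_paragraphs(html):
    """Return html with spacer-only paragraphs removed (hypothetical helper)."""
    return BLANK_P.sub("", html)


if __name__ == "__main__":
    sample = (
        "<p class=MsoNormal>First</p>\n"
        "<p class=MsoNormal>&nbsp;</p>\n"
        "<p class=MsoNormal>Second</p>\n"
    )
    print(strip_blank_paragraphs(sample))
```

Running it on the html exported by the Microsoft tool should drop the spacer paragraphs and leave everything else alone; any spacing you still want is probably better handled in the style sheet.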

-- jeff




More information about the Python-list mailing list