[Tutor] Tricky parsing problem
wilson at visi.com
Thu Jun 3 10:11:15 EDT 2004
I've got a stack of old MS Word documents that need to be converted to HTML.
Unfortunately, Word's HTML export is not an option because the documents
are essentially large outlines and the original author didn't use Word's
outlining features, but created the outlines manually using spaces for
indenting. As a result, the HTML output from Word doesn't use HTML ordered
lists.
We can export as plain text, of course, but the documents can be fairly
complex and it's a time-intensive task to manually create all the proper
nested list tags.
Naturally, I thought of using a Python script to process these files and
create the nested outline structure. If you're interested, a typical
document can be viewed at
http://www.hopkins.k12.mn.us/pages/district/policies/iplcy/iiaa.htm in all
its ugliness. As you can see if you view the source, the structure of the
document is a real mess.
Can anyone recommend a module that might have some features that would
assist in this task?
Any general programming principles that would be good to keep in mind?
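For illustration, here is a minimal sketch of one way such a script could
turn a space-indented plain-text export into nested ordered lists. It is not
taken from the documents in question: the function name outline_to_html and
the indent_width parameter (spaces per outline level) are assumptions, and
real exports would likely need extra cleanup, such as stripping the typed
outline labels ("I.", "A.", and so on). It uses only the Python 3 standard
library.

```python
from html import escape


def outline_to_html(text, indent_width=4):
    """Convert a space-indented outline into nested <ol> markup.

    indent_width is an assumed number of spaces per outline level.
    """
    parts = []
    current = -1  # depth of the innermost open list; -1 means none is open
    for raw in text.expandtabs(indent_width).splitlines():
        label = raw.strip()
        if not label:
            continue  # skip blank lines
        indent = len(raw) - len(raw.lstrip(" "))
        # Never descend more than one level at a time, so a stray
        # over-indented line cannot produce unbalanced markup.
        depth = min(indent // indent_width, current + 1)
        if depth > current:
            parts.append("<ol>")  # first item of a new, deeper list
        else:
            # Close any deeper lists, then the previous item at this level.
            parts.append("</li></ol>" * (current - depth) + "</li>")
        parts.append("<li>" + escape(label))
        current = depth
    # Close every list that is still open.
    parts.append("</li></ol>" * (current + 1))
    return "".join(parts)


sample = "Purpose\n    Scope\n    Definitions\nProcedure"
print(outline_to_html(sample))
```

The key design choice is tracking only the current depth: each line either
opens one deeper list, continues the current one, or closes lists back up to
its own level, which keeps every <ol> and <li> properly paired.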
Twin Cities, Minnesota, USA
Educational technology guy, Linux and OS X fan, Grad. student, Daddy
mailto: wilson at visi.com aim: tis270 public key: 0x8C0F8813