[Tutor] Tricky parsing problem

Tim Wilson wilson at visi.com
Thu Jun 3 10:11:15 EDT 2004


Hey everyone,

I've got a stack of old MS Word documents that need to be converted to HTML.
Unfortunately, Word's HTML export is not an option because the documents
are essentially large outlines and the original author didn't use Word's
outlining features, but created the outlines manually using spaces for
indenting. As a result, the HTML output from Word doesn't use HTML ordered
lists.

We can export as plain text, of course, but the documents can be fairly
complex and it's a time-intensive task to manually create all the proper
HTML.

Naturally, I thought of using a Python script to process these files and
create the nested outline structure. If you're interested, a typical
document can be viewed at
http://www.hopkins.k12.mn.us/pages/district/policies/iplcy/iiaa.htm in all
its ugliness. As you can see if you view the source, the structure of the
document is a real mess.
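
To make the idea concrete, here is a rough sketch of the kind of script I
have in mind, not a finished converter. It assumes the plain-text export
keeps the leading spaces, that deeper indentation always means a deeper
outline level, and that each outline entry sits on a single line (wrapped
entries would need to be joined first). The function name outline_to_html
is made up for illustration, and any outline labels typed by hand are left
in the item text as-is.

```python
from html import escape


def outline_to_html(text):
    """Turn space-indented outline text into nested <ol> lists (a sketch)."""
    out = []
    stack = []  # indent widths of the <ol> levels that are currently open
    for raw in text.splitlines():
        if not raw.strip():
            continue  # ignore blank lines
        line = raw.expandtabs()
        indent = len(line) - len(line.lstrip(" "))
        item = escape(line.strip())
        # Close every level that is indented deeper than this line.
        while stack and indent < stack[-1]:
            stack.pop()
            out.append("</li></ol>")
        if stack and indent == stack[-1]:
            out.append("</li>")  # sibling entry: close the previous item
        else:
            stack.append(indent)  # deeper entry: open a nested list
            out.append("<ol>")
        out.append("<li>" + item)
    # Close whatever is still open at the end of the input.
    while stack:
        stack.pop()
        out.append("</li></ol>")
    return "\n".join(out)


if __name__ == "__main__":
    sample = "Purpose\n  Scope\n  Definitions\nProcedure"
    print(outline_to_html(sample))
```

The stack of indent widths is what keeps the nesting straight: a nested
list is opened inside the still-open parent item, and everything deeper is
closed when the indentation drops back.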

Can anyone recommend a module that might have some features that would
assist in this task?

Any general programming principles that would be good to keep in mind?

-Tim

-- 
Tim Wilson
Twin Cities, Minnesota, USA
Educational technology guy, Linux and OS X fan, Grad. student, Daddy
mailto: wilson at visi.com   aim: tis270   public key: 0x8C0F8813



