character-filtering and Word (& company)
charles.hartman at conncoll.edu
Fri Mar 25 23:54:05 CET 2005
I'm working on text-handling programs that want plain-text files as
input. It's fine to tell users to feed the programs with plain-text
only, but not all users know what this means, even after you explain
it, or they forget. So it would be nice to be able to handle gracefully
the stuff that MS Word (or any word-processor) puts into a file.
Inserting a 0-127 filter is easy but not very friendly. Typically, the
w.p. file loads OK (into a wx.StyledTextCtrl, a.k.a. the Scintilla editing
pane) and is mostly readable. Just a few characters will be wrong:
"smart" quotation marks and the like.
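
To make the "0-127 filter" concrete, here is a minimal sketch in Python 3
syntax. The function name ascii_only and the sample bytes are made up for
illustration. It shows why the filter is unfriendly: the odd characters
simply vanish instead of turning into plain equivalents.

```python
def ascii_only(data: bytes) -> str:
    """Drop every byte outside the 0-127 range and return the rest as text."""
    return bytes(b for b in data if b < 128).decode("ascii")


# Hypothetical sample: a sentence with "smart" quotes, a dash and an
# apostrophe stored as single bytes above 127.
sample = b"\x93Hello\x94 \x96 it\x92s fine"
print(ascii_only(sample))  # Hello  its fine
```

The quotes, the dash and the apostrophe are all lost, so "it's" becomes "its".
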
Is there some well-known way to filter or translate this w.p. garbage?
I don't know whether encodings are relevant; I don't know what encoding
an MSW file uses. I don't see how to use s.translate() because I don't
know how to predict what the incoming format will be.
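
For what it's worth, here is a sketch of the kind of translation I have in
mind, in Python 3 syntax. It rests on an assumption I can't confirm for
every file: that the bytes come from text saved or pasted from Word on
Windows in the Windows-1252 code page ("cp1252" in Python's codec names).
The table SMART_TO_PLAIN and the function to_plain_text are hypothetical
names, and the table covers only a few characters.

```python
# Assumed mapping from common "smart" punctuation to plain ASCII.
# This is an illustrative subset, not a complete list.
SMART_TO_PLAIN = {
    0x2018: "'",    # left single quotation mark
    0x2019: "'",    # right single quotation mark / apostrophe
    0x201C: '"',    # left double quotation mark
    0x201D: '"',    # right double quotation mark
    0x2013: "-",    # en dash
    0x2014: "--",   # em dash
    0x2026: "...",  # horizontal ellipsis
    0x00A0: " ",    # no-break space
}


def to_plain_text(data: bytes, encoding: str = "cp1252") -> str:
    """Decode bytes with an ASSUMED encoding, then replace smart punctuation."""
    # errors="replace" keeps going when a byte has no meaning in the encoding.
    text = data.decode(encoding, errors="replace")
    return text.translate(SMART_TO_PLAIN)


sample = b"\x93Hello\x94 \x96 it\x92s fine\x85"
print(to_plain_text(sample))  # "Hello" - it's fine...
```

This only helps if the guess about the encoding is right. A file in some
other encoding, or a native binary Word document rather than text, would
need different handling.
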
Any hints welcome.