character-filtering and Word (& company)

Mike Meyer mwm at mired.org
Sat Mar 26 18:22:30 EST 2005


Charles Hartman <charles.hartman at conncoll.edu> writes:

> I'm working on text-handling programs that want plain-text files as
> input. It's fine to tell users to feed the programs with plain-text
> only, but not all users know what this means, even after you explain
> it, or they forget. So it would be nice to be able to handle
> gracefully the stuff that MS Word (or any word-processor) puts into a
> file. Inserting a 0-127 filter is easy but not very
> friendly. Typically, the w.p. file loads OK (into a wx.StyledTextCtrl
> a.k.a. Scintilla editing pane) and is mostly readable. Just a few
> characters will be wrong: "smart" quotation marks and the like.
>
> Is there some well-known way to filter or translate this w.p. garbage?
> I don't know whether encodings are relevant;

Bingo. You need to figure out the encoding before you can do
intelligent translation of the non-ASCII characters in the text.
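
As a rough sketch of what that translation could look like once the
encoding is known: decode the bytes, then map the common typographic
characters to ASCII stand-ins. This assumes Python 3; the PUNCT table
and the to_plain() name are made up for illustration, and the table is
far from complete.

```python
# Hypothetical mapping from typographic code points to ASCII stand-ins.
# Illustrative only; extend it for whatever your users actually paste in.
PUNCT = {
    0x2018: "'",    # left single quotation mark
    0x2019: "'",    # right single quotation mark / apostrophe
    0x201C: '"',    # left double quotation mark
    0x201D: '"',    # right double quotation mark
    0x2013: "-",    # en dash
    0x2014: "--",   # em dash
    0x2026: "...",  # horizontal ellipsis
    0x00A0: " ",    # no-break space
}


def to_plain(raw, encoding):
    """Decode raw bytes with a known encoding, then simplify punctuation."""
    text = raw.decode(encoding)
    return text.translate(PUNCT)
```

For example, to_plain(b"\x93hi\x94", "cp1252") turns the Windows
"smart" double quotes into plain ones. Characters not in the table
pass through unchanged, so accented letters survive.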

> I don't know what encoding an MSW file uses.

Different WPs will use different encodings, especially once you start
working in a cross-platform environment.
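
To illustrate why guessing is fragile: the same curly quote is a
different byte in the Windows and classic Mac code pages, so one byte
value can mean two different characters. A common workaround, sketched
below under the assumption of Python 3, is to try a short list of
candidate encodings in order. The decode_guess() name and the default
candidate list are hypothetical, and a successful decode does not prove
the guess was right.

```python
def decode_guess(raw, candidates=("utf-8", "cp1252")):
    """Return (text, encoding) for the first candidate that decodes cleanly.

    Strict UTF-8 goes first because it rejects most non-UTF-8 input.
    Falls back to latin-1, which accepts every byte value, so the
    function always returns something (possibly wrong characters).
    """
    for enc in candidates:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue
    return raw.decode("latin-1"), "latin-1"


# The same character encodes to different bytes on different platforms.
left_quote = "\u201c"
windows_bytes = left_quote.encode("cp1252")
mac_bytes = left_quote.encode("mac_roman")
```

Here windows_bytes and mac_bytes differ, which is why a file written on
one platform shows the wrong characters when read with the other
platform's code page.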

I don't know that there is a good solution to this problem. It
certainly hasn't been solved on the web.

          <mike
-- 
Mike Meyer <mwm at mired.org>			http://www.mired.org/home/mwm/
Independent WWW/Perforce/FreeBSD/Unix consultant, email for more information.


