[Tutor] unprintable characters from MSWord

Danny Yoo dyoo at hkn.eecs.berkeley.edu
Thu Nov 6 22:25:57 EST 2003



On Thu, 6 Nov 2003, Jonathan Soons wrote:

> I have to parse text that seems to be cut and pasted from a word
> document into a web form. It looks like this:
>
> ^II shouldM-^Rnt attempt any of the great solosM-^WWagnerM-^Rs Tristan.
>
> (From the `cat -tev` output).
> I am guessing it is from a word processor.

> How can I find out what these characters are and convert them to ascii
> equivalents?



Hi Jonathan,


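Hmmm... I'm not sure what '^I' means, but I'm guessing that 'M-' could be
the start of some escape sequence.  If 'M-^R' stands in for the apostrophe
in "shouldn't", then we can measure how far 'R' sits from "'":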

###
>>> ord("'")
39
>>> ord("R")
82
>>> 82 - 39
43
###


Hmmm... If we assume that 'M-^R' is really meant to be "'", and if we're
lucky enough that every escaped character is shifted from its ASCII value
by that same 43, then perhaps something like this might work:

###
>>> chr(ord('W') - 43)
','
>>> chr(ord('R') - 43)
"'"
>>> def decodeEscape(ch):
...     return chr(ord(ch) - 43)
...
>>> decodeEscape('R')
"'"
###


It's possible that


> ^II shouldM-^Rnt attempt any of the great solosM-^WWagnerM-^Rs Tristan.


could translate to:

    ^II shouldn't attempt any of the great solos,Wagner's Tristan.
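
That translation can be reproduced mechanically.  Here is a minimal sketch,
assuming every 'M-^X' escape follows the same offset-43 guess; decodeLine
and its regular expression are names I made up for illustration.  A literal
substitution gives "should'nt" rather than "shouldn't", because the escape
sits before the "nt" in the sample.

```python
import re

OFFSET = 43  # ord("R") - ord("'"), the guess worked out above


def decodeEscape(ch):
    """Map the letter after 'M-^' to the character we guess it stands for."""
    return chr(ord(ch) - OFFSET)


def decodeLine(text):
    """Replace each 'M-^X' in text with decodeEscape('X') (hypothetical helper)."""
    return re.sub(r"M-\^(.)", lambda m: decodeEscape(m.group(1)), text)


sample = "^II shouldM-^Rnt attempt any of the great solosM-^WWagnerM-^Rs Tristan."
print(decodeLine(sample))
# prints: ^II should'nt attempt any of the great solos,Wagner's Tristan.
```

The leading '^I' is left alone, since the guess says nothing about it.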



But this is just a real wild guess here.  *grin*  We need more data.


What is the word processor that you're cutting and pasting from?  And do
you have more samples of text, as well as the proper translations for us
to test against?
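

While we wait for more samples, here is a second guess to test them
against.  It assumes the dump uses the usual `cat -v` notation, where '^X'
is a control character (the letter's code with bit 0x40 flipped) and 'M-'
means the high bit is set.  Under that reading '^I' is a tab, 'M-^R' is
byte 0x92, and 'M-^W' is byte 0x97.  It further assumes the text was saved
in the Windows cp1252 encoding, where 0x92 is a curly apostrophe and 0x97
is an em dash.  The names catvToBytes, toAscii, and ASCII_EQUIV are mine,
and the sketch is written for Python 3.

```python
import re

# Optional 'M-' prefix, optional '^' prefix, then one character.
# Assumption: a literal '^' or 'M-' in the real text would be ambiguous here.
CATV = re.compile(r"(M-)?(\^)?(.)", re.DOTALL)

# A small, incomplete table of cp1252 punctuation and plain-ASCII stand-ins.
ASCII_EQUIV = {
    "\u2018": "'", "\u2019": "'",    # curly single quotes
    "\u201c": '"', "\u201d": '"',    # curly double quotes
    "\u2013": "-", "\u2014": "--",   # en dash, em dash
}


def catvToBytes(text):
    """Undo cat -v style notation, returning the original bytes."""
    out = bytearray()
    for meta, ctrl, ch in CATV.findall(text):
        code = ord(ch)
        if ctrl:
            code ^= 0x40   # '^I' -> 0x09, '^R' -> 0x12
        if meta:
            code |= 0x80   # 'M-^R' -> 0x92
        out.append(code)
    return bytes(out)


def toAscii(text):
    """Swap the punctuation listed in ASCII_EQUIV for ASCII stand-ins."""
    return text.translate(str.maketrans(ASCII_EQUIV))


sample = "^II shouldM-^Rnt attempt any of the great solosM-^WWagnerM-^Rs Tristan."
raw = catvToBytes(sample)
print(repr(toAscii(raw.decode("cp1252"))))
```

If the decoded samples come out as sensible punctuation, that would favor
this reading over the offset-43 guess; if they don't, the encoding
assumption is the first thing to revisit.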


Talk to you later!



