[Tutor] unprintable characters from MSWord (fwd)

Danny Yoo dyoo at hkn.eecs.berkeley.edu
Fri Nov 7 13:14:37 EST 2003

Hi Jonathan,

I'm forwarding this to Python-Tutor, so that the others there might be
able to help.  In general, we try to keep the conversation on list just in
case one of us gets hit by a bus or something... *grin*

---------- Forwarded message ----------
Date: Fri, 7 Nov 2003 09:52:17 -0500
From: Jonathan Soons <jsoons at juilliard.edu>
To: Danny Yoo <dyoo at hkn.eecs.berkeley.edu>
Subject: RE: [Tutor] unprintable characters from MSWord

Here's what I'd like to do:

-- Make a dictionary of character substitutions.
-- Find all the octal values of the characters in the line.
-- Substitute.

There is no need to guess what these codes are.
I just want to find them and substitute.
(I have no knowledge of what word processors people
are using when they visit the web form.)

Thanks for the ord() tip.
I will try:

# Iterate over the characters directly; range(0, len(string)-1)
# would skip the last character.
for ch in string:
    print(ord(ch))

Then I will have the integer values. Then put
these in a dictionary = {BadChar:asciiChar, ...:..., }.
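
A minimal sketch of that survey step, written for Python 3. The sample string is an assumption: it uses \x92 and \x97 for the two odd characters, which is what the `cat -v` notation M-^R and M-^W denotes (high bit set on Ctrl-R and Ctrl-W). The helper name non_ascii_report is hypothetical.

```python
def non_ascii_report(text):
    """Map each non-ASCII character in text to its integer value."""
    return {ch: ord(ch) for ch in text if ord(ch) > 127}

# Assumed sample: the form text read so that each byte keeps its value
# (for example, decoded as Latin-1).
sample = "I should\x92nt attempt any of the great solos\x97Wagner\x92s Tristan."

report = non_ascii_report(sample)
for ch, code in sorted(report.items()):
    # Show the character, its decimal value, and its octal value.
    print(repr(ch), code, oct(code))
```

Only the characters that show up in the report need an entry in the substitution dictionary.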

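# Start from the original text and keep replacing in the same variable,
# so that every substitution is kept, not just the last one.
goodtxt = origtxt
for BadChar in dictionary.keys():
    goodtxt = goodtxt.replace(BadChar, dictionary[BadChar])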

I will know if this works later today.
But tell me if it is obviously flawed.

Thank you
jon soons

-----Original Message-----
From: Danny Yoo [mailto:dyoo at hkn.eecs.berkeley.edu]
Sent: Thursday, November 06, 2003 10:26 PM
To: Jonathan Soons
Cc: Tutor
Subject: Re: [Tutor] unprintable characters from MSWord

On Thu, 6 Nov 2003, Jonathan Soons wrote:

> I have to parse text that seems to be cut and pasted from a word
> document into a web form. It looks like this:
> ^II shouldM-^Rnt attempt any of the great solosM-^WWagnerM-^Rs Tristan.
> (From the `cat -tev` output).
> I am guessing it is from a word processor.

> How can I find out what these characters are and convert them to ascii
> equivalents?

Hi Jonathan,

Hmmm... I'm not sure what '^I' means, but I'm guessing that 'M-' could be
the start of some escape character.

>>> ord("'")
39
>>> ord("R")
82
>>> 82 - 39
43

Hmmm... If we assume that 'M-^R' is really meant to be "'", and if we're
lucky enough that the encoding is similar to ASCII, then perhaps something
like this might work:

>>> chr(ord('W') - 43)
','
>>> chr(ord('R') - 43)
"'"
>>> def decodeEscape(ch):
...     return chr(ord(ch) - 43)
...
>>> decodeEscape('R')
"'"

It's possible that

> ^II shouldM-^Rnt attempt any of the great solosM-^WWagnerM-^Rs Tristan.

could translate to:

    ^II shouldn't attempt any of the great solos,Wagner's Tristan.

But this is just a real wild guess here.  *grin*  We need more data.
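
Another reading worth testing, offered as an assumption and not a confirmed answer: in `cat -v` output, '^X' stands for a control character and 'M-' means the high bit is set, so '^I' would be a tab and 'M-^R' the single byte 0x92. The Python 3 sketch below turns the notation back into bytes and then decodes them as Windows-1252, which is only a guess at the encoding. The function name decode_cat_v is hypothetical, and the notation is ambiguous if the text contains a literal '^' or 'M-'.

```python
import re

# One token: optional "M-" prefix, then either "^X" or a single plain character.
_TOKEN = re.compile(r"(M-)?(?:\^(.)|(.))", re.DOTALL)

def decode_cat_v(text):
    """Turn cat -v style notation (^X, M-x, M-^X) back into raw bytes."""
    out = bytearray()
    for meta, ctrl, plain in _TOKEN.findall(text):
        if ctrl:
            value = ord(ctrl) ^ 0x40   # ^I -> 0x09, ^R -> 0x12, ^? -> 0x7F
        else:
            value = ord(plain)
        if meta:
            value |= 0x80              # "M-" sets the high bit
        out.append(value)
    return bytes(out)

line = "^II shouldM-^Rnt attempt any of the great solosM-^WWagnerM-^Rs Tristan."
raw = decode_cat_v(line)

# Assumption: the bytes are Windows-1252, as text pasted from MS Word often is.
decoded = raw.decode("cp1252")
print(repr(raw))
print(repr(decoded))
```

Under that assumption the two odd characters come out as a right single quotation mark (U+2019) and an em dash (U+2014), not an apostrophe and a comma.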

What is the word processor that you're cutting and pasting from?  And do
you have more samples of text, as well as the proper translations for us
to test against?

Talk to you later!
