[Python-Dev] unicode Exception messages in py2.7

Thu Nov 14 18:32:10 CET 2013

Folks,

(note this is about 2.7 -- sorry, but a lot of us still use that! I
can only assume that in 3.* this is a non-issue)

I just discovered an issue that's been around a long time:

If you create an Exception with a unicode object for the message, the
message can be silently dropped if it cannot be encoded to ASCII (or,
more properly, the default encoding).

In my use-case, I was parsing a text file (utf-8), and wanted a bit of
that text to be part of the Exception message (there was an error
reading the file, and I wanted the user to see the text surrounding
the ill-formatted part of the text file).

What I got was a blank message, and it took a lot of poking at it to
figure out why.

My solution was:

    msg = u"Problem with line %i: %s This is not a valid time slot" % (linenum, line)
    raise ValueError(msg.encode('ascii', 'ignore'))

which is really pretty painfully clunky.
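
For anyone who wants to try the workaround in isolation, here is a
minimal sketch. The names ascii_safe and check_time_slot are made up
for illustration (they are not from my code or the standard library),
and it assumes ASCII is the default encoding. It uses 'replace' rather
than 'ignore' so the dropped characters at least leave a marker, and it
decodes back to text so the same lines also run on 3.3+:

```python
def ascii_safe(text, errors="replace"):
    """Return `text` reduced to ASCII, handling other characters per `errors`.

    Hypothetical helper for illustration; not part of the standard library.
    """
    # Encode with the chosen error handler, then decode so the result is text.
    return text.encode("ascii", errors).decode("ascii")


def check_time_slot(linenum, line):
    # Hypothetical parser step that always rejects the line, for illustration.
    msg = u"Problem with line %i: %s This is not a valid time slot" % (linenum, line)
    raise ValueError(ascii_safe(msg))


try:
    check_time_slot(3, u"caf\xe9 10:00")
except ValueError as exc:
    # The message survives, with "?" standing in for the non-ASCII character.
    shown = str(exc)
```

It is still a wrapper you have to remember to call at every raise,
which is exactly the clunkiness I am complaining about.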

This is an issue brought up in various tutorial and blog posts, and
all the solutions I've seen involve some similar clunkiness.

I also found this issue in the issue tracker:

http://bugs.python.org/issue2517

Which was resolved years ago, but as far as I can tell, only solved
the problem of being able to do:

unicode(an_exception)

and get the proper unicode message object. But we still can't raise
the darn thing and expect the user to see the message.

Why is this the case? I can print a unicode object to the terminal,
why can't raising an Exception print a unicode object?

I can imagine that for backward compatibility, or maybe for non-unicode
terminals, or ???, Exceptions do need to print as ascii. However,
having a message simply get swallowed up and disappear seems like the
wrong solution.

 - auto-conversion to a default encoding is fraught with problems all
over the board -- I know that. I also know that too much code would
break too often if we didn't have auto-conversion.

 - for the most part, the auto-conversion uses 'strict' mode -- I
generally dislike this, as it means code crashes when odd stuff gets
introduced after testing, but I can see why it is done.

 - However, I can see why for raising Exceptions, the decision was
made to swallow that error, so that the actual Exception intended is
raised, rather than a new UnicodeEncodeError.

 - But combining 'strict' with ignoring the encoding exception seems
like the worst of both worlds.

So a proposal:

Use 'replace' mode for the encoding to the default, and at least the
user would see SOMETHING of the message. In a common case, it would be
a lot of ascii, and in the worst case it would be a lot of question
marks -- still better than a totally blank message.

Another option would be to use str(repr(the_message)) so the user
would get the escaped version. Though I think that would be uglier.
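
To make the options concrete, here is a rough side-by-side sketch of
what the user would see under each one. It assumes ASCII as the default
encoding, the sample message and the encode_for_display name are
invented for illustration, and it uses the 'backslashreplace' error
handler as a stand-in for the escaped form (repr() itself also adds
quotes, and its output differs between 2.7 and 3.x):

```python
message = u"Bad entry: na\xefve caf\xe9"


def encode_for_display(text, errors):
    # Hypothetical helper: encode to ASCII with the given error handler,
    # then decode so the result is text on both 2.7 and 3.x.
    return text.encode("ascii", errors).decode("ascii")


# Proposal: question marks where the non-ASCII characters were.
as_replace = encode_for_display(message, "replace")

# Escaped alternative: backslash escapes, close to what repr() shows in 2.7.
as_escaped = encode_for_display(message, "backslashreplace")

# The workaround I used: characters vanish without a trace.
as_ignored = encode_for_display(message, "ignore")
```

Here as_replace comes out as "Bad entry: na?ve caf?", as_escaped keeps
the information as \xef and \xe9 escapes, and as_ignored is
"Bad entry: nave caf". Any of the three shows the user more than a
blank message does.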

What am I missing? This seems so obvious, and easy to do (though maybe
it's buried in the C implementation of Exceptions)

-Chris

-- 

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception

Chris.Barker at noaa.gov