It's the naive user who will be surprised by these random UTF-8 decoding errors.
That's why this is NOT a convenience issue (are you listening MAL???). It's a short and long term simplicity issue. There are lots of languages where it is de rigueur to discover and work around inconvenient and confusing default behaviors. I just don't think that we should be ADDING such behaviors.
So what do you think of my new proposal of using ASCII as the default "encoding"? It takes care of "a character is a character" but also (almost) guarantees an error message when mixing encoded 8-bit strings with Unicode strings without specifying an explicit conversion -- *any* 8-bit byte with the top bit set is rejected by the default conversion to Unicode.
I think this is less confusing than Latin-1: when an unsuspecting user is reading encoded text from a file into 8-bit strings and attempts to use it in a Unicode context, an error is raised instead of producing garbage Unicode characters.
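To make the contrast concrete, here's a small sketch of the proposed behavior, written in terms of explicit decoding (the names and the file contents are illustrative, not part of the actual implementation):

```python
# An 8-bit string read from a file that happens to be Latin-1 encoded.
data = b"caf\xe9"

# Under the ASCII default, using it in a Unicode context raises an
# error instead of silently producing garbage Unicode characters:
try:
    text = data.decode("ascii")  # any byte with the top bit set is rejected
except UnicodeDecodeError:
    print("mixing encoded 8-bit text with Unicode was rejected")

# The user is forced to say what the bytes actually mean:
text = data.decode("latin-1")
print(text)  # 'café'
```

Under a Latin-1 default, the first conversion would have silently succeeded and only looked wrong later, if the file was really in some other encoding.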
It encourages the use of Unicode strings for everything beyond ASCII -- there's no way around ASCII since that's the source encoding etc., but Latin-1 is an inconvenient default in most parts of the world. ASCII is accepted everywhere as the base character set (e.g. for email and for text-based protocols like FTP and HTTP), just like English is the one natural language that we can all use to communicate (to some extent).
--Guido van Rossum (home page: http://www.python.org/%7Eguido/)