[Tutor] Trouble in dealing with special characters.

Sat Dec 8 18:40:10 EST 2018

On Sun, Dec 09, 2018 at 09:23:59AM +1100, Cameron Simpson wrote:
> On 07Dec2018 21:20, Steven D'Aprano <steve at pearwood.info> wrote:

> ># Python 2
> >>>>txt = "abcπ"
> >
> >but it is a lie, because what we get isn't the string we typed, but the
> >interpreter's *bad guess* that we actually meant this:
> >
> >>>>txt
> >'abc\xcf\x80'
> 
> Wow. I did not know that! I imagined Python 2 would have simply rejected 
> such a string (out of range characters -- ordinals >= 256 -- in a "byte" 
> string).

Nope.

Python 2 tries hard to make bytes and Unicode text work together. If 
your strings are pure ASCII, it "Just Works" and seems great, but in 
trickier cases it can lead to really confusing errors.

Behind the scenes, the interpreter uses some platform-specific codec 
(ASCII, UTF-8, or similar) to automatically encode text to bytes or 
decode bytes to text. This sort of "Do What I Mean" processing works 
right up to the point that it doesn't; then it all goes pear-shaped 
and you get silent failures and hard-to-diagnose errors.
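
Here is a rough Python 3 sketch of the conversions Python 2 was doing 
implicitly. It assumes a UTF-8 terminal supplied the bytes and that 
ASCII was the codec used for implicit conversion; both are assumptions 
for illustration, since the actual codec depends on your platform.

```python
# Python 3 sketch: spell out the steps Python 2 performed silently.
# Assumption: the terminal encodes input as UTF-8.
text = "abcπ"
raw = text.encode("utf-8")    # the bytes a Python 2 "string" really held
print(repr(raw))              # b'abc\xcf\x80'
print(len(text), len(raw))    # 4 characters, but 5 bytes

# Assumption: ASCII is the implicit codec. Decoding then fails loudly.
try:
    raw.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError as err:
    ascii_ok = False
    print("decode failed:", err)

# Guessing the wrong codec is worse: no error, just mojibake.
mojibake = raw.decode("latin-1")
print(repr(mojibake))
```

The last step is the silent failure: the decode succeeds, but the 
result is five wrong-ish characters instead of the four you typed.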

That's why Python 3 takes a hard-line policy: you cannot mix text and 
bytes (not even when one of them is empty) unless you explicitly 
convert one to the other.
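
A small sketch of that policy in Python 3. The helper try_concat is 
just a made-up name for this example, and ASCII is chosen as the codec 
only because the sample data is pure ASCII.

```python
text = "abc"
data = b"abc"

def try_concat(a, b):
    """Hypothetical helper: concatenate, or report the TypeError."""
    try:
        return a + b
    except TypeError as err:
        return "TypeError: {}".format(err)

print(try_concat(text, data))   # refused: str + bytes
print(try_concat("", b""))      # refused even when both are empty
print(text == data)             # False: text never equals bytes

# Explicit conversion works in either direction.
as_text = text + data.decode("ascii")     # bytes -> str, then join
as_bytes = text.encode("ascii") + data    # str -> bytes, then join
print(as_text, as_bytes)
```

Which direction you convert depends on what you need at the end: 
decode when you want to work with text, encode when you are about to 
write to a file or socket.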

-- 
Steve