[Tutor] Trouble in dealing with special characters.
Steven D'Aprano
steve at pearwood.info
Sat Dec 8 18:40:10 EST 2018
On Sun, Dec 09, 2018 at 09:23:59AM +1100, Cameron Simpson wrote:
> On 07Dec2018 21:20, Steven D'Aprano <steve at pearwood.info> wrote:
> ># Python 2
> >>>>txt = "abcπ"
> >
> >but it is a lie, because what we get isn't the string we typed, but the
> >interpreter's *bad guess* that we actually meant this:
> >
> >>>>txt
> >'abc\xcf\x80'
>
> Wow. I did not know that! I imagined Python 2 would have simply rejected
> such a string (out of range characters -- ordinals >= 256 -- in a "byte"
> string).
Nope.

Python 2 tries hard to make bytes and Unicode text work together. If
your strings are pure ASCII, it "Just Works" and it seems great, but in
trickier cases it can lead to really confusing errors.

Behind the scenes, what the interpreter is doing is using some
platform-specific codec (ASCII, UTF-8, or similar) to automatically
encode/decode from bytes to text or vice versa. This sort of "Do What I
Mean" processing can work, up to the point that it doesn't; then it all
goes pear-shaped and you have silent failures and hard-to-diagnose errors.

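Here is a rough Python 3 sketch of that guessing game, with the implicit
steps written out by hand. It assumes the terminal sent UTF-8, which is
what the '\xcf\x80' in the quoted session suggests; the names `data`,
`wrong` and `ascii_error` are made up for illustration.

```python
# Python 3: do explicitly what Python 2 did behind the scenes.
text = "abc\u03c0"           # 'abc' followed by GREEK SMALL LETTER PI
data = text.encode("utf-8")  # the bytes a UTF-8 terminal would send

print(len(text), len(data))  # four characters, but five bytes
print(data)                  # b'abc\xcf\x80'

# Guessing the wrong codec may not raise at all: Latin-1 accepts every
# byte, so the result is silently wrong (mojibake).
wrong = data.decode("latin-1")
print(ascii(wrong))

# Guessing ASCII fails loudly instead, because 0xcf is outside its range.
ascii_error = None
try:
    data.decode("ascii")
except UnicodeDecodeError as err:
    ascii_error = err
print(type(ascii_error).__name__)
```

Which of the two failure modes you hit depends entirely on the codec
that was guessed, and that is what makes the errors so hard to diagnose.
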
That's why Python 3 takes a hard-line policy that you cannot mix text
and bytes (except, possibly, if one is the empty string) without
explicitly converting from one to the other.
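
A minimal Python 3 sketch of that policy, using concatenation as the
example of "mixing". The helper `try_concat` is hypothetical, written
only for this illustration. As far as concatenation goes, the empty
string gets no special treatment.

```python
# Python 3: text (str) and bytes do not mix implicitly.
text = "abc"
data = b"abc"

def try_concat(a, b):
    """Return a + b, or the TypeError message if Python refuses."""
    try:
        return a + b
    except TypeError as err:
        return "TypeError: {}".format(err)

print(try_concat(text, data))  # refused
print(try_concat("", b""))     # refused as well, even though both are empty

# Converting explicitly, in either direction, works fine.
joined_text = text + data.decode("ascii")
joined_bytes = text.encode("ascii") + data
print(joined_text)   # abcabc
print(joined_bytes)  # b'abcabc'
```

The codec is named at every conversion, so nothing is left for the
interpreter to guess.
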
--
Steve