[Python-Dev] Python3 "complexity" (was RFC: PEP 460: Add bytes...)

Kristján Valur Jónsson kristjan at ccpgames.com
Thu Jan 9 01:12:57 CET 2014

Just to avoid confusion, let me state up front that I am very well aware of encodings and all that, having internationalized one largish app in Python 2.x.  I know the problems that 2.x had with tracking down the source of errors, and I understand the beautiful concept of handling encodings at the boundary.

For a lot of data processing and tools, encoding isn't an issue.  Either you assume ASCII, or you're working with something like Latin-1: a single-byte encoding.  This is because you're working with a text file that _you_ wrote, and you're not assigning any semantics to the characters.  If there is actual "text" in there, it is just English, not Norwegian or Turkish.  A byte read with the value 0xfa doesn't mean anything special.  It's just that, a byte with that value.  The file system doesn't have any default encoding.  A file on disk is just a file on disk consisting of bytes.  There can never be any wrong encoding, no mojibake.

With Python 2, you can read that file into a string object.  You can scan for your field delimiter, e.g. a comma, split up your string, interpolate some binary data, and spit it out again.  All without ever thinking about encodings.
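For illustration only, here is a minimal Python 3 sketch of that same workflow, assuming the data is treated as Latin-1 purely as a byte-preserving container (Latin-1 maps each byte value 0-255 to the code point with the same number).  The record contents are hypothetical, and this is one possible approach, not something the original message proposes.

```python
# Hypothetical record: comma-separated fields, one containing the raw byte 0xfa.
raw = b"name,\xfa\x00\x01,end\n"

# Latin-1 maps every byte value 0-255 to the code point with the same number,
# so decoding cannot fail and encoding back is lossless.
text = raw.decode("latin-1")
fields = text.rstrip("\n").split(",")

# Interpolate an extra field and turn the record back into bytes.
fields.insert(1, "extra")
out = (",".join(fields) + "\n").encode("latin-1")

print(out)
```

The byte 0xfa passes through untouched, but the program still has to name an encoding, which is the extra step being discussed.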

Even though the file is conceptually encoded in something, if you insist on attaching a particular semantic meaning to every ordinal value, that meaning, whatever it is, is in many cases irrelevant to the program.

I understand that surrogateescape allows you to do this.  But it is an awkward extra step and forces an extra layer of needless semantics onto the guy who just wants to read a file.  Sure, vegetarians and people with allergies like to read the list of ingredients on everything that they eat.  But others are just omnivores and want to be able to eat whatever is on the table, and not worry about what it is made of.
And yes, you can read the file in binary mode, but then you end up with those bytes objects that, as we have just found, are tedious to work with.
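As a rough sketch of the two Python 3 options mentioned above, assuming a small hypothetical file containing non-ASCII bytes: the first reads it in text mode with the surrogateescape error handler, the second reads it in binary mode.  The file name and contents are made up for the example.

```python
import os
import tempfile

raw = b"key,\xfa\xfb,value\n"  # hypothetical data with non-ASCII bytes

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "data.txt")
    with open(path, "wb") as f:
        f.write(raw)

    # Option 1: text mode with surrogateescape.  Undecodable bytes become
    # lone surrogates (0xfa -> U+DCFA) and are restored when encoding back.
    with open(path, encoding="ascii", errors="surrogateescape", newline="") as f:
        text = f.read()
    roundtrip = text.encode("ascii", errors="surrogateescape")

    # Option 2: binary mode.  Nothing is decoded, but every delimiter and
    # literal used with the data must be a bytes object as well.
    with open(path, "rb") as f:
        fields = f.read().rstrip(b"\n").split(b",")
```

Both variants preserve the original bytes; the difference is whether the extra arguments or the bytes literals carry the burden.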

So, what I'm saying is that at least I have a very common use case that has just become a) more confusing (having to needlessly derail the train of thought about the data processing to be done by thinking about text encodings) and b) more complicated.
Not sure if there is anything to be done about it though :)

I think there might be a different analogy:  Having to specify an encoding is like having strong typing.  In Python 2.7, we _can_ forgo that and just duck-type our strings :)
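A tiny sketch of what that typing analogy looks like in Python 3, where str and bytes do not mix implicitly (the field data is hypothetical):

```python
data = b"field1,field2"

error = None
try:
    data.split(",")          # str delimiter on a bytes object
except TypeError as exc:
    error = exc              # Python 3 refuses to mix the two types

parts = data.split(b",")     # works once the delimiter is bytes too
```

In Python 2.7 the first call would simply have worked, since a str literal there is already a byte string.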

From: Python-Dev [python-dev-bounces+kristjan=ccpgames.com at python.org] on behalf of R. David Murray [rdmurray at bitdance.com]
Sent: Wednesday, January 08, 2014 23:40
To: python-dev at python.org
Subject: Re: [Python-Dev] Python3 "complexity" (was RFC: PEP 460: Add bytes...)

Why *do* you care?  Isn't your system configured for utf-8, and all your
.txt files encoded with utf-8 by default?  Or at least configured
with a single consistent encoding?  If that's the case, Python3
doesn't make you think about the encoding.  Knowing the right encoding
is different from needing to know the difference between text and bytes;
you only need to worry about encodings when your system isn't configured
consistently to begin with.

If you do have to care, your little utilities only work by accident in
Python2, and must have produced mojibake when the encoding was wrong,
unless I'm completely confused.  So yeah, sorting that out is harder if
you were just living with the mojibake before...but if so I'm surprised
you haven't wanted to fix that before this.
