On 12 January 2014 02:33, M.-A. Lemburg <mal@egenix.com> wrote:
On 11.01.2014 16:34, Nick Coghlan wrote:
While that was an *expedient* (and, in fact, necessary) solution at the time, the fact it is still thoroughly confusing people 13 years later shows it is not a *comprehensible* solution.
FWIW: I quite liked the Python 2 model, but perhaps that's because I already knew how Unicode works, so could use it to make my life easier ;-)
Right, I tried to capture that in http://python-notes.curiousefficiency.org/en/latest/python3/questions_and_an... by pointing out that there are two *very* different kinds of code to consider when discussing text modelling. Application code lives in a nice clean world of structured data, text data and binary data, with clean conversion functions for switching between them. Boundary code, by contrast, has to deal with the messy task of translating between them all.
The Python 2 text model is a convenient model for boundary code, because it implicitly allows switching between binary and text interpretations of a data stream, and that's often useful due to the way protocols and file formats are designed. However, that kind of implicit switching is thoroughly inappropriate for *application* code. So Python 3 switches the core text model to one where implicitly switching between the binary domain and the text domain is considered a *bad* thing, and we object strongly to any proposals which suggest blurring the boundaries again, since that is going back to a boundary code model rather than an application code one.
I've been saying for years that we may need a third type, but it has been nigh on impossible to get boundary code developers to say anything more useful than "I preferred the Python 2 model, that was more convenient for me". Yes, we know it was (we do maintain both of them, after all, and did the update for the standard library's own boundary code), but application developers are vastly more common, so boundary code developers lost out on that one and we need to come up with solutions that *respect* the Python 3 text model, rather than trying to change it back to the Python 2 one.
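To make the application/boundary distinction concrete, here is a minimal sketch of how Python 3 treats the two domains. The variable names and sample data are illustrative assumptions, not anything taken from the thread.

```python
# Text and binary data are separate domains in Python 3.
text = "Length: 123 bytes\n"
data = b"\x00\x01"

# Mixing the two without naming an encoding is refused outright.
try:
    mixed = data + text
    error = None
except TypeError as exc:
    error = type(exc).__name__

# Boundary code spells out the conversion, naming the encoding each way.
wire = data + text.encode("ascii")
roundtrip = wire[len(data):].decode("ascii")
```

The failure in the middle is the point: the interpreter will not guess an encoding, so the conversion has to appear explicitly at the boundary.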
Seriously, Unicode has always caused heated discussions and I don't expect this to change in the next 5-10 years.
The point is: there is no 100% perfect solution either way and when you acknowledge this, things don't look black and white anymore, but instead full of colors :-)
It would be nice if more boundary code developers actually did that rather than coming out with accusatory hyperbole and pining for the halcyon days of Python 2 where the text model favoured their use case over that of normal application developers.
Python 3 forces people to actually use Unicode; in Python 2 they could easily avoid it. It's good to educate people on how it's used and the issues you can run into, but let's not forget that people are trying to get work done and we all love readable code.
PEP 460 just adds two more methods to the bytes object which come in handy when formatting binary data; I don't think it has potential to muddy the Python 3 text model, given that the bytes object already exposes a dozen other ASCII text methods :-)
I dropped my objections to PEP 460 once Antoine fixed it to respect the boundaries between binary and text data. It's now a pure binary interpolation proposal, and one I think is a fine idea - there's no implicit encoding or decoding involved, it's just a tool for manipulating binary data. That leaves the implicit encoding and decoding to the third party asciistr type, as it should be.
asciistr is interesting in that it coerces to bytes instead of to Unicode (as is the case in Python 2).
Not quite - the idea of asciistr is that it is designed to be a *hybrid* type, like str was in Python 2. If it interacts with binary objects, it will give a binary result; if it interacts with text objects, it will give a text result. This makes it potentially suitable for use for constants in hybrid binary/text APIs like urllib.parse, allowing them to be implemented using a shared code path once again. The initial experimental implementation only works with 7-bit ASCII, but the UTF-8 caching in the PEP 393 implementation opens up the possibility of offering a non-strict mode in the future, as does the option of allowing arbitrary 8-bit data and disallowing interoperation with text strings in that case.
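The hybrid behaviour described here can be approximated in pure Python. The class below is a hypothetical stand-in written only for illustration: the name HybridAscii and all of its methods are assumptions. It is not the real asciistr implementation and makes no claim about which operations that type actually supports.

```python
class HybridAscii(str):
    """Hypothetical sketch of a hybrid constant; NOT the real asciistr."""

    def __new__(cls, value):
        # Strict mode: reject anything outside 7-bit ASCII up front.
        value.encode("ascii")
        return super().__new__(cls, value)

    def __add__(self, other):
        # Binary operand gives a binary result.
        if isinstance(other, (bytes, bytearray)):
            return self.encode("ascii") + other
        # Text operand gives a (plain) text result.
        return str(self) + other

    def __radd__(self, other):
        if isinstance(other, (bytes, bytearray)):
            return other + self.encode("ascii")
        return other + str(self)


SCHEME_SEP = HybridAscii("://")  # one constant, usable in both domains
text_url = "http" + SCHEME_SEP + "example.com"
binary_url = b"http" + SCHEME_SEP + b"example.com"
```

A shared code path could then use the same constant whether it was handed str or bytes arguments, which is the use case suggested for APIs like urllib.parse.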
At the moment it doesn't cover the more common case bytes + str, just str + bytes, but let's assume it would,
Right, I suspect we have some overbroad PyUnicode_Check() calls in CPython that will need to be addressed before this substitution works seamlessly - that's one of the reasons I've been asking people to experiment with the idea since at least 2010 and let us know what doesn't work (nobody did though, until Benno agreed to try it out because it sounded like an interesting puzzle - I guess everyone else just found it easier to accuse us of being clueless idiots rather than considering trying to meet us halfway).
then you'd write
... headers += asciistr('Length: %i bytes\n' % 123)
If you're going to wait until *after* the formatting to do the conversion, you may as well just use encode explicitly:

    headers += ('Length: %i bytes\n' % 123).encode('ascii')

The advantage of asciistr is that it allows you to abstract away the format strings for the headers in a way explicit encoding doesn't allow:

    FMT_LENGTH = asciistr('Length: %i bytes\n')
    headers += FMT_LENGTH % 123
    headers += b'\n\n'
    body = b'...'
    socket.send(headers + body)

You could do it inline as well:

    headers += asciistr('Length: %i bytes\n') % 123

But again, that doesn't offer a lot over simply explicitly encoding that fragment as ASCII.
With PEP 460, you could write the above as:

    ...
    headers += b'Length: %i bytes\n' % 123
    headers += b'\n\n'
    body = b'...'
    socket.send(headers + body)
    ...
IMO, that's more readable.
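For comparison, the explicit-encoding spelling and the bytes-interpolation spelling can be run side by side. This sketch assumes Python 3.5 or later, where the bytes type supports % interpolation; the header text is the one from the examples above, and the variable names are illustrative assumptions.

```python
# Explicit spelling: format as text, then encode with a named codec.
explicit = ('Length: %i bytes\n' % 123).encode('ascii')

# Interpolation spelling (assumes Python 3.5+ bytes % support).
interpolated = b'Length: %i bytes\n' % 123

# Text is still not encoded implicitly: %s into bytes rejects a str.
try:
    b'Name: %s\n' % 'value'
    text_rejected = False
except TypeError:
    text_rejected = True
```

Both spellings yield the same bytes for a number, while a str argument is refused, so the remaining disagreement is about whether numbers should be rendered as ASCII digits without an explicit opt-in.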
At the cost of introducing an implicit encoding step again - it interpolates numbers into arbitrary binary sequences as ASCII text. That is thoroughly inappropriate in Python 3 - serialising semantically significant structured data (like numbers) as ASCII must always be opt-in. You can opt in through environmental configuration (which has its own problems due to some undesirable default behaviour on POSIX systems - users will "opt in" to ASCII by mistake, not because they actually intended to), by passing it as an encoding argument, or by using a third party type like asciistr that is explicitly documented as only working with ASCII compatible data. By contrast, with a couple of minor exceptions inherited from Python 2, the core bytes type is designed to work *correctly* with arbitrary binary data, and just has some *convenience* operations that assume ASCII data.
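One way to see why rendering a number as ASCII digits is a choice rather than a given: the same integer has several reasonable byte serialisations. The sketch below uses only the standard library; the variable names are illustrative assumptions.

```python
import struct

n = 123

# Decimal digits in an ASCII-compatible encoding.
as_ascii_digits = str(n).encode("ascii")

# The same decimal digits in an encoding that is not ASCII compatible.
as_utf16_digits = str(n).encode("utf-16-le")

# Fixed-width binary integers, as many wire formats expect.
as_uint32_be = struct.pack(">I", n)
as_uint16_le = n.to_bytes(2, "little")
```

Each result is a valid answer to "put this number into a byte string", which is why picking the ASCII digits form is something the surrounding format has to justify.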
Both variants essentially do the same thing: they implicitly coerce ASCII text strings to bytes, so conceptually, there's little difference.
There's all the difference in the world: asciistr is a separate third party type that is deliberately designed to only work correctly with ASCII compatible binary data. If you use it for data that *isn't* ASCII compatible, then the resulting data corruption is due to using the wrong type, rather than being an implicit behaviour of a builtin Python type.
Cheers,
Nick.
--
Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia