accept string in a2b and base64?
Two patches have been committed to 3.3 that I am very uncomfortable with. See issue 13637 and issue 13641, respectively.

It seems to me that part of the point of the byte/string split (and the lack of automatic coercion) is to make the programmer be explicit about converting between unicode and bytes. Having these functions, which convert between binary formats (ASCII-only representations of binary data and back) accept unicode strings is reintroducing automatic coercions, and I think it will lead to the same kind of bugs that automatic string coercions yielded in Python2: a program works fine until the input turns out to have non-ASCII data in it, and then it blows up with an unexpected UnicodeError.

You can see Antoine's counter arguments in the issue, and I'm sure he'll chime in here. If most people agree with Antoine I won't fight it, but it seems to me that accepting unicode in the binascii and base64 APIs is a bad idea.

I'm on vacation this week so I may not be very responsive on this thread, but unless other people agree with me (and will therefore advance the relevant arguments) the thread can die and the patches can stay in.

--David
On Mon, 20 Feb 2012 20:24:16 -0500 "R. David Murray" <rdmurray@bitdance.com> wrote:
It seems to me that part of the point of the byte/string split (and the lack of automatic coercion) is to make the programmer be explicit about converting between unicode and bytes. Having these functions, which convert between binary formats (ASCII-only representations of binary data and back) accept unicode strings is reintroducing automatic coercions,
Whether a baseXX representation is binary or text can probably be argued endlessly. As a data point, hex() returns str, not bytes, so at least base16 can be considered (potentially) text. And the point of baseXX representations is generally to embed binary data safely into text, which explains why you may commonly need to baseXX-decode some chunk of text. This occurred to me when porting Twisted to py3k; I'm sure other networking code would also benefit. Really, I think there's no problem with coercions when they are unambiguous and safe (which they are, in the committed patches). They make writing and porting code easier. For example, we already have:
>>> int("10")
10
>>> int(b"10")
10
Regards Antoine.
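As a rough illustration of the behaviour being debated, the sketch below assumes Python 3.3 or later, where the decoding functions accept ASCII-only str as well as bytes. The variable names are made up for the example; only the `base64` and `binascii` calls come from the standard library.

```python
import base64
import binascii

payload = b"\x00\xffhello"

# Encoding still returns bytes; getting text is an explicit step.
encoded_bytes = base64.b64encode(payload)
encoded_text = encoded_bytes.decode("ascii")

# Decoding accepts either representation and always returns bytes.
decoded_from_bytes = base64.b64decode(encoded_bytes)
decoded_from_text = base64.b64decode(encoded_text)

# The binascii a2b_* functions behave the same way.
hex_from_bytes = binascii.a2b_hex(b"00ff")
hex_from_text = binascii.a2b_hex("00ff")

# The int() precedent mentioned above.
int_from_text = int("10")
int_from_bytes = int(b"10")
```

Note that the coercion only runs one way: nothing here turns bytes into str implicitly.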
On Tue, Feb 21, 2012 at 11:24 AM, R. David Murray <rdmurray@bitdance.com> wrote:
If most people agree with Antoine I won't fight it, but it seems to me that accepting unicode in the binascii and base64 APIs is a bad idea.
I see it as essentially the same as the changes I made in urllib.urlparse to support pure ASCII bytes->bytes in many of the APIs (which work by doing an implicit ascii+strict decode at the beginning of the function, and then reversing that at the end). For those, if your byte sequence has non-ASCII data in it, they'll throw a UnicodeDecodeError and it's up to you to figure out where those non-ASCII bytes are coming from. Similarly, if one of these updated APIs throws ValueError, then you'll have to figure out where the non-ASCII code points are coming from.

Yes, it's a niggling irritation from a purist point of view, but it's also an acknowledgement of the fact that whether a pure ASCII sequence should be treated as a sequence of bytes or a sequence of code points is going to be application and context dependent. Sometimes it will make more sense to treat it as binary data, other times as text.

The key point is that any multimode support that depends on implicit type conversion from bytes->str (or vice-versa) really needs to be limited to *strict* ASCII only (if no other information on the encoding is available). If something is 7-bit ASCII pure, then odds are very good that it really *is* ASCII text. As soon as that high-order bit gets set though, all bets are off and we have to push the text encoding problem back on the API caller to figure out.

The reason Python 2's implicit str<->unicode conversions are so problematic isn't just because they're implicit: it's because they effectively assume *latin-1* as the encoding on the 8-bit str side. That means reliance on implicit decoding can silently corrupt non-ASCII data instead of triggering exceptions at the point of implicit conversion. If you're lucky, some *other* part of the application will detect the corruption and you'll have at least a vague hope of tracking it down. Otherwise, the corrupted data may escape the application and you'll have an even *thornier* debugging problem on your hands.
My one concern with the base64 patch is that it doesn't test that mixing types triggers TypeError. While this shouldn't require any extra code (the error should arise naturally from the method implementation), it should still be tested explicitly to ensure type mismatches fail as expected. Checking explicitly for mismatches in the code would then just be a matter of wanting to emit nice error messages explaining the problem rather than being needed for correctness reasons (e.g. urlparse uses pre-checks in order to emit a clear error message for type mismatches, but it has significantly longer function signatures to deal with). Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
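The urlparse behaviour Nick refers to can be sketched as follows, assuming the Python 3 `urllib.parse` module. The `classify` helper and the example URLs are hypothetical and exist only to show which exception each case produces.

```python
from urllib.parse import urljoin, urlparse

# str in, str out; ASCII-only bytes in, bytes out.
text_result = urlparse("http://example.com/path")
bytes_result = urlparse(b"http://example.com/path")


def classify(call):
    """Hypothetical helper: report which exception (if any) a call raises."""
    try:
        call()
    except UnicodeDecodeError:
        return "UnicodeDecodeError"
    except TypeError:
        return "TypeError"
    return "ok"


# Non-ASCII bytes fail at the implicit strict ASCII decode.
non_ascii_outcome = classify(lambda: urlparse(b"http://ex\xe9mple.com/"))

# Mixing str and bytes arguments is rejected up front.
mixed_outcome = classify(lambda: urljoin("http://example.com/", b"path"))
```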
Nick Coghlan wrote:
The reason Python 2's implicit str<->unicode conversions are so problematic isn't just because they're implicit: it's because they effectively assume *latin-1* as the encoding on the 8-bit str side.
The implicit conversion in Python2 only works with ASCII content, pretty much like what you describe here. Note that e.g. UTF-16 is not an ASCII superset, but the ASCII assumption still works:
>>> u'abc'.encode('utf-16-le').decode('ascii')
u'a\x00b\x00c\x00'
Apart from that nit (which can be resolved in most cases by disallowing 0 bytes), I still believe that the Python2 implicit conversion between Unicode and 8-bit strings is a very useful feature in practice. -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Feb 21 2012)
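The same observation can be reproduced in Python 3, where the decode has to be written explicitly. This is only a sketch: `decode_ascii_no_nul` is a hypothetical helper illustrating the "disallow 0 bytes" idea, not something from the patches under discussion.

```python
# UTF-16-LE bytes for ASCII-only text are all below 0x80, so a strict
# ASCII decode succeeds and yields interleaved NUL characters.
utf16_bytes = "abc".encode("utf-16-le")
decoded = utf16_bytes.decode("ascii")


def decode_ascii_no_nul(data: bytes) -> str:
    """Hypothetical helper: strict ASCII decode that also rejects NUL bytes."""
    if b"\x00" in data:
        raise ValueError("NUL byte found; data is probably not ASCII text")
    return data.decode("ascii")
```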
On Tue, 21 Feb 2012 12:51:08 +1000 Nick Coghlan <ncoghlan@gmail.com> wrote:
My one concern with the base64 patch is that it doesn't test that mixing types triggers TypeError. While this shouldn't require any extra code (the error should arise naturally from the method implementation), it should still be tested explicitly to ensure type mismatches fail as expected.
I don't think mixing types is a concern. The extra parameters to the base64 functions aren't mixed into the original string, they are used to modify the decoding algorithm. So it's like typing `open(b"LICENSE", "r")`: the fact that `b"LICENSE"` is bytes while `"r"` is str isn't really a problem. Regards Antoine.
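A small sketch of that point, assuming Python 3.3 or later: the `altchars` argument of `base64.b64decode` only selects which two characters stand in for `+` and `/`, so its type need not match the type of the data being decoded. The sample values are made up for the example.

```python
import base64

raw = b"\xfb\xff"

# Standard alphabet encoding of the two bytes above.
standard = base64.b64encode(raw)

# The same data written with "-" and "_" in place of "+" and "/".
# The data is str while altchars is bytes, and the other way round.
from_str_data = base64.b64decode("-_8=", altchars=b"-_")
from_bytes_data = base64.b64decode(b"-_8=", altchars="-_")
```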
On Tue, Feb 21, 2012 at 10:28 PM, Antoine Pitrou <solipsis@pitrou.net> wrote:
So it's like typing `open(b"LICENSE", "r")`: the fact that `b"LICENSE"` is bytes while `"r"` is str isn't really a problem.
Ah, right - I misunderstood how the different arguments were being used. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
R. David Murray writes:
If most people agree with Antoine I won't fight it, but it seems to me that accepting unicode in the binascii and base64 APIs is a bad idea.
First, I agree with David that this change should have been brought up on python-dev before committing it. The distinctions Python 3 has made between APIs for bytes and those for str are both obviously controversial and genuinely delicate.

Second, if Unicode is to be accepted in these APIs, there is a doc issue (which I haven't checked). It must be made clear that the "printable ASCII" in question is the set represented by the *integers* 33 to 126, *not* the ASCII characters ! to ~. Those characters are present in the Unicode repertoire in many other places (specifically the "full-width ASCII" compatibility character set around U+FF20, but also several Greek and Cyrillic characters, and possibly others.)

I'm going to side with Antoine and Nick on these particular changes because in practice (except maybe in the email module :-( ) the BASE-encoded "text" to be decoded is going to be consistently defined by the client as either str or bytes, but not both. The fact that the repr of the encoded text is identical (except for the presence or absence of a leading "b") is very suggestive here.

I do harbor a slight niggle that I think there is more room for confusion here than in Nick's urllib work. However, once we clarify that confusion in *our* minds, I don't think there's much potential for dangerous confusion for API clients. (I agree with Antoine on that point.)

The BASE## decoding APIs in abstract are "text" to bytes. Pedantically in Python that suggests a str -> bytes signature, but RFC 4648 doesn't anywhere require a 1-byte representation of ASCII, only that the representation be interpreted as integers in the ASCII coding. However, an RFC-4648-conforming implementation MUST reject any string containing characters not allowed in the representation, so it's actually stricter than requiring ASCII. I see no problem with allowing str-or-bytes -> bytes polymorphism here.
The remaining issue to my mind is we'd also like bytes -> str-or-bytes polymorphism for symmetry, but this is not Haskell, we can't have it. The same is true for binascii, I suppose -- assuming that the module is specified (as the name suggests) to produce and consume only ASCII text as a representation of bytes.
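To make the full-width concern concrete, here is a hedged sketch assuming Python 3.3 or later. `try_decode` is a hypothetical helper; the `validate=True` flag is the standard library's opt-in for rejecting characters outside the base64 alphabet.

```python
import base64
import binascii
import unicodedata

# Encoding is bytes -> bytes; text requires an explicit decode.
ascii_text = base64.b64encode(b"ABC").decode("ascii")

# Full-width Q, U, J, D: they look like ASCII but are different code points.
fullwidth_text = "\uff31\uff35\uff2a\uff24"


def try_decode(text, **kwargs):
    """Hypothetical helper: return decoded bytes, or the exception class raised."""
    try:
        return base64.b64decode(text, **kwargs)
    except ValueError as exc:
        return type(exc)


plain_outcome = try_decode(ascii_text)
fullwidth_outcome = try_decode(fullwidth_text)

# A space is ASCII but not part of the base64 alphabet.
strict_outcome = try_decode("QU JD", validate=True)

# Compatibility normalization maps the look-alikes back to ASCII.
normalized = unicodedata.normalize("NFKC", fullwidth_text)
```

The look-alike characters are refused rather than silently folded to ASCII, so a caller who wants to accept them has to normalize explicitly first.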
It seems to me that part of the point of the byte/string split (and the lack of automatic coercion) is to make the programmer be explicit about converting between unicode and bytes. Having these functions, which convert between binary formats (ASCII-only representations of binary data and back) accept unicode strings is reintroducing automatic coercions, and I think it will lead to the same kind of bugs that automatic string coercions yielded in Python2: a program works fine until the input turns out to have non-ASCII data in it, and then it blows up with an unexpected UnicodeError.
I agree with the change in principle, but I also agree with you about the choice of error:

py> binascii.a2b_hex("MURRAY")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
binascii.Error: Non-hexadecimal digit found
py> binascii.a2b_hex("VLÖWIS")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: string argument should contain only ASCII characters

I think it should give binascii.Error in both cases: Ö is as much a non-hexadecimal digit as M.

With that changed, I'd have no issues with the patch: these functions are already fairly strict in their input, whether it's bytes or Unicode. So the chances that non-ASCII characters get it to fall over in a way that never causes problems in pure-ASCII communities are very low.
If most people agree with Antoine I won't fight it, but it seems to me that accepting unicode in the binascii and base64 APIs is a bad idea.
No - it's only the choice of error that is a bad idea. Regards, Martin
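The difference in exception types can be checked with a short sketch like the one below. `error_type_for` and `a2b_hex_uniform` are hypothetical helpers; the second shows one way a caller could get binascii.Error in both cases, and is not what the committed patch does.

```python
import binascii


def error_type_for(text):
    """Hypothetical helper: return the exact exception class a2b_hex raises."""
    try:
        binascii.a2b_hex(text)
    except ValueError as exc:
        return type(exc)
    return None


def a2b_hex_uniform(text):
    """Hypothetical wrapper: report every bad input as binascii.Error."""
    try:
        return binascii.a2b_hex(text)
    except binascii.Error:
        raise
    except ValueError as exc:
        raise binascii.Error(str(exc)) from exc


ascii_case = error_type_for("MURRAY")
non_ascii_case = error_type_for("VL\u00d6WIS")
```

Because binascii.Error is a subclass of ValueError in Python 3, code that catches ValueError already handles both cases; the distinction matters only to callers that catch binascii.Error specifically.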
participants (6)
- "Martin v. Löwis"
- Antoine Pitrou
- M.-A. Lemburg
- Nick Coghlan
- R. David Murray
- Stephen J. Turnbull