Mailman 3 PEP 460: allowing %d and %f and mojibake - Python-Dev

PEP 460: allowing %d and %f and mojibake

Victor Stinner

Jan. 11, 2014

5:41 p.m.

Hi,

I'm in favor of adding support for formatting integer and floating point numbers in the PEP 460: %d, %u, %o, %x, %f with padding and precision (%10d, %010d, %1.5f) and sign (%-i, %+i) but without alternate format ("{:#x}"). %s would also accept int and float for convenience.

int and float subclasses would not be handled differently, their __str__ and __format__ would be ignored.

Other int-like and float-like types (ex: defining __int__ or __index__) are not supported. Explicit cast would be required.

For %s, the choice between string and number is made using "(PyLong_Check() || PyFloat_Check())".

If you agree, I will modify the PEP. If Antoine disagrees, I will fork the PEP 460 ;-)

---

%s should not support precision (ex: %.100s), use Unicode for that.

---

The PEP 460 should not reintroduce bytes+unicode, implicit decoding or implement encoding.

b'x=%s' % 10 is well defined, it's pure bytes. If you consider that bytes should not contain text, why does the bytes type have methods like isalpha() or upper()? And why do binary files have a readline() method? A "line" doesn't mean anything in pure bytes.

It's an example of "practicality beats purity". Python 3 should not enforce Unicode if the developers *chose* to use bytes to handle mixed binary/text protocols like HTTP.

But I'm against adding "%r" and "%a" because they use Unicode and would require an implicit encoding. type(ascii(obj)) is str, not bytes. If you really want to use repr() and ascii(), encode the result explicitly.

Victor
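Several of the claims in this message can be checked on any Python 3 interpreter without relying on the proposed bytes formatting. The following is a minimal sketch; the sample values and variable names are illustrative assumptions, not taken from the PEP.

```python
import io

# bytes already carries ASCII-oriented "text" methods.
assert b"abc".isalpha()
assert b"http".upper() == b"HTTP"

# Binary streams offer readline(), which splits on b"\n".
stream = io.BytesIO(b"GET / HTTP/1.1\r\nHost: example.org\r\n")
first_line = stream.readline()

# ascii() (like repr()) returns str, so putting its result into bytes
# needs an explicit encode step.
obj = "caf\u00e9"
as_text = ascii(obj)
as_bytes = as_text.encode("ascii")
```

The explicit `encode("ascii")` at the end is the step the message asks users of repr() and ascii() to write themselves.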


Georg Brandl

January 2014

6:29 p.m.

New subject: PEP 460: allowing %d and %f and NOT ALLOWING mojibake :)

On 11.01.2014 18:41, Victor Stinner wrote:

...

Hi,

I'm in favor of adding support for formatting integer and floating point numbers in the PEP 460: %d, %u, %o, %x, %f with padding and precision (%10d, %010d, %1.5f) and sign (%-i, %+i) but without alternate format ("{:#x}"). %s would also accept int and float for convenience.

int and float subclasses would not be handled differently, their __str__ and __format__ would be ignored.

Other int-like and float-like types (ex: defining __int__ or __index__) are not supported. Explicit cast would be required.

For %s, the choice between string and number is made using "(PyLong_Check() || PyFloat_Check())".

If you agree, I will modify the PEP. If Antoine disagrees, I will fork the PEP 460 ;-)

---

%s should not support precision (ex: %.100s), use Unicode for that.

---

The PEP 460 should not reintroduce bytes+unicode, implicit decoding or implement encoding.

b'x=%s' % 10 is well defined, it's pure bytes. If you consider that bytes should not contain text, why does the bytes type have methods like isalpha() or upper()? And why do binary files have a readline() method? A "line" doesn't mean anything in pure bytes.

It's an example of "practicality beats purity". Python 3 should not enforce Unicode if the developers *chose* to use bytes to handle mixed binary/text protocols like HTTP.

But I'm against adding "%r" and "%a" because they use Unicode and would require an implicit encoding. type(ascii(obj)) is str, not bytes. If you really want to use repr() and ascii(), encode the result explicitly.

I agree. For non-ASCII characters what ascii() gives you is almost always not what you want anyway.

Georg

Antoine Pitrou

6:32 p.m.

On Sat, 11 Jan 2014 18:41:49 +0100 Victor Stinner <victor.stinner@gmail.com> wrote:

...

If you agree, I will modify the PEP. If Antoine disagrees, I will fork the PEP 460 ;-)

Please fork it.

...

b'x=%s' % 10 is well defined, it's pure bytes.

It is well-defined? Then please explain to me what the general case of b'%s' % x is supposed to call:

- does it call x.__bytes__? int.__bytes__ doesn't exist
- does it call bytes(x)? bytes(10) gives b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'
- does it call x.__str__? you've reintroduced the Python 2 behaviour of conflating bytes and unicode

Regards

Antoine.
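The second bullet is easy to reproduce. The sketch below shows what bytes(10) actually returns and, as one explicit alternative, how the ASCII digits can be obtained by going through str; the variable names are illustrative.

```python
# bytes(int) allocates that many zero bytes; it does not render the number.
zero_filled = bytes(10)

# Getting the ASCII digits means converting to str and encoding explicitly.
digits = str(10).encode("ascii")
message = b"x=" + digits
```

Whether b'%s' % 10 should perform that str-and-encode step implicitly is exactly the point under dispute here.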

Ethan Furman

6:38 p.m.

On 01/11/2014 10:32 AM, Antoine Pitrou wrote:

...

On Sat, 11 Jan 2014 18:41:49 +0100 Victor Stinner <victor.stinner@gmail.com> wrote:

...
If you agree, I will modify the PEP. If Antoine disagrees, I will fork the PEP 460 ;-)

Please fork it.

You've already stated you don't care that much and are willing to let the PEP as-is be rejected. Why not remove your name and let Victor have it back? Is he not the original author? (If this is protocol just say so -- remember I'm still new to the ways of PyDev. :). -- ~Ethan~

Barry Warsaw

6:14 p.m.

On Jan 11, 2014, at 10:38 AM, Ethan Furman wrote:

...

You've already stated you don't care that much and are willing to let the PEP as-is be rejected. Why not remove your name and let Victor have it back? Is he not the original author? (If this is protocol just say so -- remember I'm still new to the ways of PyDev. :).

From a procedural point of view, I would say that it's entirely appropriate for a PEP to have open questions, alternatives, and options. Have it lay out the arguments pro and con and let Guido or the appointed PEP czar make the final decision. Then the PEP can be amended with those decisions, and if folks still think more needs to be done, a follow up PEP can be filed. -Barry

Antoine Pitrou

7:22 p.m.

On Sat, 11 Jan 2014 10:38:01 -0800 Ethan Furman <ethan@stoneleaf.us> wrote:

...

On 01/11/2014 10:32 AM, Antoine Pitrou wrote:

...
On Sat, 11 Jan 2014 18:41:49 +0100 Victor Stinner <victor.stinner@gmail.com> wrote:

...
If you agree, I will modify the PEP. If Antoine disagrees, I will fork the PEP 460 ;-)

Please fork it.

You've already stated you don't care that much and are willing to let the PEP as-is be rejected. Why not remove your name and let Victor have it back? Is he not the original author? (If this is protocol just say so -- remember I'm still new to the ways of PyDev. :).

Because the PEP is IMO a much saner compromise than what you're trying to do (and would also stand a better chance of being accepted, if it weren't for your stupid maximalist opposition). Regards Antoine.

Ethan Furman

7:51 p.m.

On 01/11/2014 11:22 AM, Antoine Pitrou wrote:

...

On Sat, 11 Jan 2014 10:38:01 -0800 Ethan Furman <ethan@stoneleaf.us> wrote:

...
On 01/11/2014 10:32 AM, Antoine Pitrou wrote:

...
On Sat, 11 Jan 2014 18:41:49 +0100 Victor Stinner <victor.stinner@gmail.com> wrote:

...
If you agree, I will modify the PEP. If Antoine disagrees, I will fork the PEP 460 ;-)

Please fork it.

You've already stated you don't care that much and are willing to let the PEP as-is be rejected. Why not remove your name and let Victor have it back? Is he not the original author? (If this is protocol just say so -- remember I'm still new to the ways of PyDev. :).

Because the PEP is IMO a much saner compromise than what you're trying to do (and would also stand a better chance of being accepted, if it weren't for your stupid maximalist opposition).

Well, it's good to know you do care. :) -- ~Ethan~

Georg Brandl

8:50 p.m.

On 11.01.2014 20:22, Antoine Pitrou wrote:

...

On Sat, 11 Jan 2014 10:38:01 -0800 Ethan Furman <ethan@stoneleaf.us> wrote:

...
On 01/11/2014 10:32 AM, Antoine Pitrou wrote:

...
On Sat, 11 Jan 2014 18:41:49 +0100 Victor Stinner <victor.stinner@gmail.com> wrote:

...
If you agree, I will modify the PEP. If Antoine disagrees, I will fork the PEP 460 ;-)

Please fork it.

You've already stated you don't care that much and are willing to let the PEP as-is be rejected. Why not remove your name and let Victor have it back? Is he not the original author? (If this is protocol just say so -- remember I'm still new to the ways of PyDev. :).

Because the PEP is IMO a much saner compromise than what you're trying to do (and would also stand a better chance of being accepted, if it weren't for your stupid maximalist opposition).

Can you please stop throwing personal insults around? You don't have to resort to that level. Georg

Stephen J. Turnbull

8:02 p.m.

Georg Brandl writes:

...

...
if it weren't for your stupid maximalist opposition).

Can you please stop throwing personal insults around? You don't have to resort to that level.

Ethan's posts (as an example of one general trend in this thread) are pretty frustrating, you have to admit.

MAL posted straight out the Python 2 model of text makes it easier for him to write some programs, so he's all for reintroducing it. And that is the whole truth of the matter. Although I disagree with him, I appreciate his honesty.

But people keep posting "we don't want Python 2's confounding of text and binary, we just want bytes with (nearly) all the functionality of strings [because they are (partially|really) encoded text]". Some of them actually use the literal word "text" in their justification!

That's, well, what would you call it? Either they know what they're saying, in which case it's disingenuous at best, or they don't know what they're saying, in which case it's a proposal based on a clear misunderstanding of the situation.

The problem is not going to go away just because they *say* they don't want to reintroduce Python 2 text processing. That is precisely what this proposal is *intended* to do, whether in the limited form proposed by Antoine or in the much more extensive form that folks like Ethan want.

What "maximalists" mean is that they promise not to abuse Python 2 text processing when writing Python 3 programs. This promise is highly unlikely to be kept for two reasons. First, they can't make that promise on behalf of third parties, who for various reasons certainly will abuse these features to avoid the encoded-text-to-Unicode-text and vice-versa conversions.

Second, I doubt they themselves will keep the promise to my satisfaction because their definition of "text" is ambiguous. When it's convenient for them to use text-processing operations on bytes, they'll say "oh, yes, these are conventionally considered text-processing features, but that's just an accident of the particular configuration of bytes -- yup, bytes -- I'm processing."

You could argue that this "abuse" isn't *abuse*. That it's covered by "consenting adults". By the same token, so is smoking in a crowded elevator -- if you don't like it, don't use the elevator! Of course in applications used only by the author, there's no abuse (at least not of others! :-/ )

But Nick's important example of web frameworks demonstrates the problem: unless they convert to text where appropriate, they're just pushing the problem off on application writers. Sometimes passing on data as bytes is appropriate, of course, but the framework authors are likely to be biased in favor of doing that, and it's not hard to imagine frameworks ported from Python 2 passing on the problem wholesale on the grounds that "we returned str in Python 2 which is bytes in Python 3, and since we were processing bytes the whole time, we see no reason to change the 'ABI'." Of course the application writers thought they were receiving text "in an inconvenient and ambiguous form". IMO, with the proposed changes, that is likely to continue indefinitely, negating some of the gains I expected to receive from Python 3. :-(

Note: there are a lot of high-level frameworks like Django that even in Python 2 basically went to Unicode everywhere internally. I don't deny that. I think that Python 3 as currently constituted makes it a lot easier to make an appropriate decision of where to convert, and should take some of the burden off the high-level frameworks. Approving this PEP, especially in a maximalist form, will blur the lines.

Ethan Furman

9:28 p.m.

On 01/12/2014 12:02 PM, Stephen J. Turnbull wrote:

...

Georg Brandl writes:

...
Antoine writes:

...
. . . if it weren't for your stupid maximalist opposition. . .

Can you please stop throwing personal insults around? You don't have to resort to that level.

Ethan's posts (as an example of one general trend in this thread) are pretty frustrating, you have to admit.

Two points:

1) Are you saying it's okay to be insulting when frustrated? I also find this mega-thread frustrating, but I'm trying very hard not to be insulting.

2) If you are going to use my name, please be certain of the facts [1]. More below.

...

MAL posted straight out the Python 2 model of text makes it easier for him to write some programs, so he's all for reintroducing it. And that is the whole truth of the matter. Although I disagree with him, I appreciate his honesty.

If you have an example of me lying (even if it's just a possibility), please refer to it directly so I can either try to explain the misunderstanding or apologize.

...

But people keep posting "we don't want Python 2's confounding of text and binary, we just want bytes with (nearly) all the functionality of strings [because they are (partially|really) encoded text]". Some of them actually use the literal word "text" in their justification!

In only one case did I use the word "text" loosely, and that was when I claimed that Py2 had three text types, and Py3 had two. I was wrong, I apologize. Py3 has one definite text type, str, and, I claim, one half text type in bytes, because bytes itself provides ASCII text processing methods. If you have a better term for the notion of b'ethan'.title() --> b'Ethan' than ASCII-text processing, I'll use that instead.

If there are good reasons to not allow further concessions to the ASCII-ness of bytes (and you provide a good one below) then that makes living with the handicap easier.

But don't lie to me (as Nick tried to) and say that "In particular, the bytes type is, and always will be, designed for pure binary manipulation" when it has methods like .center(). If I am wrong, and that was not a lie, please explain it to me.
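The ASCII-only behaviour of these bytes methods can be observed directly. A small sketch, using illustrative values: the methods named in the message work on ASCII letters and leave bytes outside the ASCII range untouched.

```python
name = b"ethan"
titled = name.title()          # ASCII letters are case-mapped
centered = b"ab".center(6)     # padded with ASCII spaces by default

# A byte outside the ASCII range passes through upper() unchanged,
# whereas the corresponding str character is case-mapped.
e_acute_latin1 = "\u00e9".encode("latin-1")
upper_bytes = e_acute_latin1.upper()
upper_text = "\u00e9".upper()
```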

...

That's, well, what would you call it? Either they know what they're saying, in which case it's disingenuous at best, or they don't know what they're saying, in which case it's a proposal based on a clear misunderstanding of the situation.

I think some of the misunderstanding (which you also seem to suffer from) is that we (or at least I) /ever/ want a unicode string back from bytes interpolation. I don't! If I start with bytes, I want bytes back! And I have a very clear grasp on the difference between str and bytes and what ASCII encoding means, it was a hard and painful lesson for me and I'm not likely to forget it.

To summarize, I used the term text when referring to unicode text (str), ASCII or ASCII-encoded text to refer to bytes that are to be used in a place that requires ASCII bytes for communication (such as content length or field type). I do /not/ use ASCII to refer to any ol' collection of bytes that happens to look like it might be ASCII-encoded text.

...

The problem is not going to go away just because they *say* they don't want to reintroduce Python 2 text processing. That is precisely what this proposal is *intended* to do, whether in the limited form proposed by Antoine or in the much more extensive form that folks like Ethan want.

What "maximalists" mean is that they promise not to abuse Python 2 text processing when writing Python 3 programs. This promise is highly unlikely to be kept for two reasons. First, they can't make that promise on behalf of third parties, who for various reasons certainly will abuse these features to avoid the encoded-text-to-Unicode-text and vice-versa conversions.

I concede that this is a good reason to not allow % interpolation. Kinda like not allowing sum on strings. And I don't make promises for other people, and abusing this feature would be a bug.

...

Second, I doubt they themselves will keep the promise to my satisfaction because their definition of "text" is ambiguous.

*My* definition is not ambiguous at all. If this particular part of the byte stream is defined to contain ASCII-encoded text, then I can use the bytes text methods to work with it. The only time I would return a bytes object is if it was supposed to be bytes (an image, for example); otherwise I return a bool, an int, a float, a date, or, even, a str.

...

When it's convenient for them to use text-processing operations on bytes, they'll say "oh, yes, these are conventionally considered text-processing features, but that's just an accident of the particular configuration of bytes -- yup, bytes -- I'm processing."

If that particular configuration of bytes is because it's ASCII-encoded text, then sure. To use, for example, bytes.upper on data that wasn't ASCII-encoded text (even if it happened to look like it was) would be the height of stupidity. Please don't include me in such accusations.

...

But Nick's important example of web frameworks demonstrates the problem: unless they convert to text where appropriate, they're just pushing the problem off on application writers. Sometimes passing on data as bytes is appropriate, of course, but the framework authors are likely to be biased in favor of doing that, and it's not hard to imagine frameworks ported from Python 2 passing on the problem wholesale on the grounds that "we returned str in Python 2 which is bytes in Python 3, and since we were processing bytes the whole time, we see no reason to change the 'ABI'." Of course the application writers thought they were receiving text "in an inconvenient and ambiguous form". IMO, with the proposed changes, that is likely to continue indefinitely, negating some of the gains I expected to receive from Python 3. :-(

This would be a good reason to reject PEP 460, if that danger was deemed more likely than the good it would bring.

...

Note: there are a lot of high-level frameworks like Django that even in Python 2 basically went to Unicode everywhere internally. I don't deny that. I think that Python 3 as currently constituted makes it a lot easier to make an appropriate decision of where to convert, and should take some of the burden off the high-level frameworks. Approving this PEP, especially in a maximalist form, will blur the lines.

I understand your point, but I disagree. When I open a file (in binary mode, obviously, as otherwise I'd get massive corruption) I get back a bunch of bytes. When working with tcp, I get back a bunch of bytes. bytes are /already/ the boundary type. If we have to make a third type for proper boundary processing it's an admission that bytes failed in its role.

-- ~Ethan~

[1] I double-checked all my posts on this topic both here and on Python Ideas to make sure.

Antoine Pitrou

10:52 p.m.

Hi Ethan,

On Sun, 12 Jan 2014 13:28:15 -0800 Ethan Furman <ethan@stoneleaf.us> wrote:

...

On 01/12/2014 12:02 PM, Stephen J. Turnbull wrote:

...
Georg Brandl writes:

...
Antoine writes:

...
. . . if it weren't for your stupid maximalist opposition. . .

Can you please stop throwing personal insults around? You don't have to resort to that level.

Ethan's posts (as an example of one general trend in this thread) are pretty frustrating, you have to admit.

Two points:

1) Are you saying it's okay to be insulting when frustrated? I also find this mega-thread frustrating, but I'm trying very hard not to be insulting.

You are right, it is not ok. The wording wasn't constructive or controlled at all. I'd like to apologize for that.

At the same point, I was expressing a fair amount of frustration. I think the last discussion rounds have largely failed to produce any new meaningful insight (to the point that I've stopped reading several subthreads). IMO the best thing *for now* would be to "agree to disagree", let things bake in everyone's mind for some time, and revisit the subject in some weeks.

Regards

Antoine.

Ethan Furman

11:15 p.m.

On 01/12/2014 02:52 PM, Antoine Pitrou wrote:

...

You are right, it is not ok. The wording wasn't constructive or controlled at all. I'd like to apologize for that.

Thank you. Apology accepted.

...

At the same point, I was expressing a fair amount of frustration. I think the last discussion rounds have largely failed to produce any new meaningful insight (to the point that I've stopped reading several subthreads). IMO the best thing *for now* would be to "agree to disagree", let things bake in everyone's mind for some time, and revisit the subject in some weeks.

For the most part I agree. I did, though, finally figure out what Nick thought I wanted, so there was at least a little progress. But yes, I think tabling the discussion for now, and working on Brett's ideas, is entirely appropriate.

-- ~Ethan~

P.S. Direct reply so you don't miss my response. :)

Stephen J. Turnbull

3:02 a.m.

Ethan Furman writes:

...

1) Are you saying it's okay to be insulting when frustrated? I also find this mega-thread frustrating, but I'm trying very hard not to be insulting.

OK, no. Understandable, yes.

...

2) If you are going to use my name, please be certain of the facts [1]. More below.

...
MAL posted straight out the Python 2 model of text makes it easier for him to write some programs, so he's all for reintroducing it. And that is the whole truth of the matter. Although I disagree with him, I appreciate his honesty.

If you have an example of me lying (even if it's just a possibility), please refer to it directly so I can either try to explain the misunderstanding or apologize.

Praising one person for honesty doesn't imply anybody else is lying. As for the Artist Currently Posting as Ethan Furman, he's not in the "disingenuous" group. I don't think you understand the issues at stake (among other things, as I've discussed elsewhere, I think your use case is different from the use cases of most of those who are asking for bytes formatting). And there's a crucial terminology difference:

...

In only one case did I use the word "text" loosely,

...

From my point of view, you consistently do so. Bytes are *never* Python 3 text in my terminology, and I think that is generally accepted on these channels. "ASCII-encoded text" as you call it (and repeatedly do so), and want to manipulate using str-like methods on bytes, is *exactly* the Python 2 model of text. But you deny that the effect of your proposals (eg, b"%d" % (12,)) is to reintroduce Python 2's bytes/character confusion, don't you?

Yes, I've used "ASCII-compatible text" in some of my posts, but I recognize that as "loose usage", too, and would stop if requested. Note I'm not asking you to stop -- I think we all understand what you mean, even though for some of us it's loose terminology. What I do hope you will recognize is that adding str-like methods to bytes is precisely the Python 2 model of text processing[1], and that like MAL you will say, "OK, I don't see a problem with reintroducing Python 2's byte/character confusion." (Well, I *really* want you to see the light, and retract your proposal for b'%d' format. But that hardly seems likely. :-)

...

But don't lie to me (as Nick tried to) and say that "In particular, the bytes type is, and always will be, designed for pure binary manipulation" when it has methods like .center().

I hardly think Nick is *lying*, any more than you are. AFAICT, you're *both* wrong.

According to PEP 3137[2] by Guido van Rossum, the idea of the immutable bytes type was suggested (in various aspects which combined to overcome Guido's initial opposition) by Gregory P. Smith, Jeffrey Yasskin, and Talin. Guido then chose to implement it by grabbing the Python 2 code, and removing .encode, and removing locale-dependent definitions of character classes. This was with a view to supporting ports of code that implements wire protocols or uses bytes as encoded text:

    It also makes it possible to efficiently create hash tables using bytes for keys; this may be useful when parsing protocols like HTTP or SMTP which are based on bytes representing text.

    Porting code that manipulates binary data (or encoded text) in Python 2.x will be easier using the new design than using the original 3.0 design with mutable bytes; simply replace str with bytes and change '...' literals into b'...' literals.

IIRC, only later was regex support added to bytes (by Nick himself, again IIRC). And despite the quote above, I don't think Guido meant to encourage use of bytes as text in wire protocol development, at least not at that time.

Note that Nick has already admitted that permitting even methods that can be implemented purely as numerical manipulations:

    def is_uppercase(b):
        # Note all comparisons are between integers:
        return ord('A') <= b[0] and b[0] <= ord('Z')

was in retrospect a mistake (in his opinion).

So I don't think it was a lie, merely a difference in your definitions of "pure binary manipulation". (Which isn't surprising, given that ultimately everything in computers as we know them today eventually reduces to "pure binary manipulations".[3] Drawing the line is going to involve personal taste to some extent.) I think his interpretation that bytes were *designed* that way is a bit strained given PEP 3137. I also don't know what was discussed at language summits, and don't recall the python-dev conversations about it at all.

A final remark: Be very careful in interpreting Guido's words in these "practical vs. pure" matters. I've discovered his offhand comments on these matters are often both subtle and deep (that probably doesn't surprise you), and that the idea behind them is usually extremely precise though his expression may be informal or even casual (and here be dragons -- taking the expression too literally may lead you astray).
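Both technical points in this message can be tried out in a few lines: indexing a bytes object in Python 3 yields an int, so the is_uppercase check really is integer arithmetic, and immutable bytes are hashable and therefore usable as dict keys. This is a sketch only; the header names and values are made-up examples.

```python
def is_uppercase(b):
    # b[0] is an int in Python 3, so both comparisons are between integers.
    return ord("A") <= b[0] <= ord("Z")

first_code = b"Zebra"[0]

# Immutable bytes are hashable, so they can key a lookup table,
# e.g. for protocol headers received off the wire.
headers = {b"content-length": b"123", b"host": b"example.org"}
host = headers[b"host"]
```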

...

I think some of the misunderstanding (which you also seem to suffer from) is that we (or at least I) /ever/ want a unicode string back from bytes interpolation. I don't!

Please tell me why you think I suffer from that misunderstanding. I certainly don't think you *want* Unicode strings. You've been quite strident about the fact that you don' need no steekin' yooneekode (for these purposes). What I want to find out is why your use case can't be handled with Python 3 str. That's why I provide examples (mostly parallel to yours) that return str in Python 3 (I can't speak for anyone else).

...

To summarize, I used the term text when referring to unicode text (str), ASCII or ASCII-encoded text to refer to bytes that are to be used in a place that requires ASCII bytes for communication (such as content length or field type).

I've never been confused about that, but your use of the word "text" in a way differently from others in the thread seems to confuse you about what *they* mean. But did you get that I'm worried that programmers in Omaha will use that same functionality to communicate American English (for which it is basically sufficient, and which also requires ASCII when bytes are used for communication)?

...

*My* definition is not ambiguous at all. If this particular part of the byte stream is defined to contain ASCII-encoded text, then I can use the bytes text methods to work with it.

But how is Python supposed to know that? The point of having types in a programming language is so that either the interpreter can just DTRT, or raise an exception if TRT is ambiguous, without explicit specification by the programmer. This is precisely what asciistr is for: it knows that it is both unicode and bytes compatible, and morphs automatically to whichever it is combined with. And does so efficiently (because they're all immutable, any combination of these types in Python involves copying "code units", and for asciistr that copy is always of bytes, thus reducing eventually to memcpy for bytes and latin1-only str). But under your definition, you need to make the decision, or explicitly code the decision, on the basis of context.

...

...
When it's convenient for them to use text-processing operations on bytes, they'll say "oh, yes, these are conventionally considered text-processing features, but that's just an accident of the particular configuration of bytes -- yup, bytes -- I'm processing."

If that particular configuration of bytes is because it's ASCII-encoded text, then sure.

Once again, you are advocating precisely the Python 2 model of text.

...

To use, for example, bytes.upper on data that wasn't ASCII-encoded text (even if it happened to look like it was) would be the height of stupidity. Please don't include me in such accusations.

I have no idea why you think I think anybody would be that stupid. That never occurred to me. It's precisely "magic numbers" that happen to look like English words when interpreted as ASCII coded characters that I don't want manipulated by str-like methods that interpret text (such as full-featured format or %). If b"Content-Length: 123" is (ASCII-encoded) text, then it should be created as, or decoded to, internal text and handled that way. If it's binary, then handle it as binary.
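The "decode to internal text and handle it that way" approach can be sketched as follows. This is one possible reading of the suggestion, not code from the thread; the doubling of the length is an arbitrary stand-in for whatever processing the program does.

```python
raw = b"Content-Length: 123"

# Decode at the boundary, then work with str and int internally.
text = raw.decode("ascii")
name, _, value = text.partition(":")
length = int(value.strip())

# Format as text, and encode only when going back out to the wire.
reply = "{}: {}".format(name, length * 2).encode("ascii")
```

The formatting itself happens entirely on str; bytes appear only at the two ends.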

...

...
ambiguous form". IMO, with the proposed changes, that is likely to continue indefinitely, negating some of the gains I expected to receive from Python 3. :-(

This would be a good reason to reject PEP 460, if that danger was deemed more likely than the good it would bring.

Depends on which version. I earlier opposed PEP 460 in any form, but I'm persuaded by Nick's particular definition of "pure binary manipulation" and agree that PEP 460 as revised by Antoine is harmless to my goals. Although I personally am unlikely to find any great convenience from it (both as a matter of style and to a great extent a lack of use cases, although I'd like to get involved in the email module).

...

...
Note: there are a lot of high-level frameworks like Django that even in Python 2 basically went to Unicode everywhere internally. I don't deny that. I think that Python 3 as currently constituted makes it a lot easier to make an appropriate decision of where to convert, and should take some of the burden off the high-level frameworks. Approving this PEP, especially in a maximalist form, will blur the lines.

I understand your point, but I disagree. When I open a file (in binary mode, obviously, as otherwise I'd get massive corruption)

Obviously, *you* would open the file in binary mode, but by definition of the latin1 codec and the surrogateescape handler, *I* can definitely avoid any corruption when reading such files as text. (This may require painful contortions if one does any nontrivial processing, but then again it may not.)
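The lossless round trip claimed here can be checked on in-memory data. The sketch below decodes every possible byte value with latin-1 and, separately, with UTF-8 plus the surrogateescape error handler, then encodes back. It deliberately avoids file I/O, since text-mode newline translation is a separate concern.

```python
data = bytes(range(256))

# latin-1 maps each byte to the code point with the same number.
as_latin1 = data.decode("latin-1")
back_from_latin1 = as_latin1.encode("latin-1")

# surrogateescape smuggles undecodable bytes through as lone surrogates
# and restores them on encoding.
escaped = data.decode("utf-8", errors="surrogateescape")
back_from_escaped = escaped.encode("utf-8", errors="surrogateescape")
```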

...

I get back a bunch of bytes. When working with tcp, I get back a bunch of bytes. bytes are /already/ the boundary type.

No, they are not. Clearly there are "just bytes" on the "outside" of I/O in each of your examples here, and they are "just copied" to the inside of Python. But in Nick's sense, this is the "outside," *not* the "inside", of your program! On the "inside", *you* want "a bool, an int, a float, a date, or, even, a str" (I'm quoting!). What Nick means by a "boundary type" is a type that works seamlessly with the types on each side of the boundary as a helper in the conversion. So when you use a struct to pack a bool, an int, and a date into a bytes, the struct is the boundary type. And if there's a helper type to work with bytes and/or str simultaneously, that's a boundary type, eg, asciistr. But bytes itself is not a boundary type, it's just a type with no internal structure, not even characters.
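The struct example mentioned above might look like the following sketch. The record layout (a bool, a signed 32-bit count, and a date stored as an unsigned 32-bit ordinal) and the helper names are assumptions made purely for illustration.

```python
import struct
from datetime import date

# Hypothetical layout: little-endian, no padding: bool, int32, uint32.
RECORD = struct.Struct("<?iI")

def pack_record(flag, count, day):
    # Typed values on the "inside" become plain bytes for the "outside".
    return RECORD.pack(flag, count, day.toordinal())

def unpack_record(blob):
    # Plain bytes from the "outside" become typed values again.
    flag, count, ordinal = RECORD.unpack(blob)
    return flag, count, date.fromordinal(ordinal)

blob = pack_record(True, 42, date(2014, 1, 12))
restored = unpack_record(blob)
```

Here the Struct object does the converting in both directions, while the bytes it produces carry no structure of their own.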

...

If we have to make a third type for proper boundary processing it's an admission that bytes failed in its role.

That admission was made in PEP 3100. Or, more precisely, bytes was never considered as a boundary type in Python 3.

Footnotes:

[1] To be precise, one of two models, the other one being the unicode type.

[2] http://www.python.org/dev/peps/pep-3137/

[3] OK, OK, I still have my Daddy's K&E loglog slide rule. Not *everything* is binary!

Ethan Furman

4 a.m.

On 01/12/2014 07:02 PM, Stephen J. Turnbull wrote:

[snip most of very eloquent reply]

Thank you, Stephen, for remaining calm despite my somewhat heated response. A few comments in-line.

I now better understand your viewpoint about text always being unicode strings; I just happen to disagree. Hopefully as some consolation I will be very vocal about using str unless bytes is necessary. Any application that uses text should be using str for it, and only using bytes, if necessary, on the back-end.

...

Ethan Furman writes:

...
In only one case did I use the word "text" loosely,

[...] Bytes are *never* Python 3 text in my terminology [...] "ASCII-encoded text" as you call it [...] and want to manipulate using str-like methods on bytes

The part that you don't seem to acknowledge (sorry if I missed it) is that there are str-like methods already on bytes. While the actual implementation of isupper (your example from below) may be done using integer methods, it only makes semantic sense if interpreted as ASCII-encoded text.

...

is *exactly* the Python 2 model of text. But you deny that the effect of your proposals (eg, b"%d" % (12,)) is to reintroduce Python 2's bytes/character confusion, don't you?

Given that the default (and only) text type in Py3 is str, which is unicode, I don't think any confusion will be as severe, but I acknowledge that there could be some.

...

I hardly think Nick is *lying*, any more than you are. AFAICT, you're *both* wrong.

LOL, well, at least I'm in good company, then! :)

...

...
I think some of the misunderstanding (which you also seem to suffer from) is that we (or at least I) /ever/ want a unicode string back from bytes interpolation. I don't!

Please tell me why you think I suffer from that misunderstanding.

I no longer recall, but whatever misapprehension I was suffering from you have alleviated. (That sentence would make my daughter proud! English major. ;)

...

But did you get that I'm worried that programmers in Omaha will use that same functionality to communicate American English (for which it is basically sufficient, and which also requires ASCII when bytes are used for communication)?

Yes, I get that. Hopefully their friends and neighbors will slap them with fishes if they do.

...

...
*My* definition is not ambiguous at all. If this particular part of the byte stream is defined to contain ASCII-encoded text, then I can use the bytes text methods to work with it.

But how is Python supposed to know that?

Python doesn't need to. bytes is a low-level object -- it could contain music, movies, dbf data, pdf data, or my mother's cheesecake recipe (properly encoded, of course). Python can't protect me from treating a music file as if it were a movie file, or even just writing proper music info at the wrong place in the music file; all that is up to me, as the programmer, to get right, and to understand what is needed.

...

But under your definition, you need to make the decision, or explicitly code the decision, on the basis of context.

Exactly so. I even have to do that in Py2.

...

...
If that particular configuration of bytes is because it's ASCII-encoded text, then sure.

Once again, you are advocating precisely the Python 2 model of text.

Not exactly, because what I get back is bytes, which cannot directly be mixed with unicode (str) as it was in Py2. I think this is a key difference.
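A small sketch of the difference being described, using a made-up header line as sample data: in Python 3 the implicit mixing that Python 2 allowed raises TypeError, so the conversion has to be written out.

```python
header = b"Content-Length: "
length = "42"

try:
    header + length  # bytes + str: no implicit decoding or encoding
except TypeError as exc:
    mixed_error = exc
else:
    mixed_error = None

# The conversion has to be spelled out explicitly.
line = header + length.encode("ascii")
```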

...

...
To use, for example, bytes.__upper__ on data that wasn't ASCII-encoded text (even if it happened to look like it was) would be the height of stupidity. Please don't include me in such accusations.

I have no idea why you think I think anybody would be that stupid. That never occurred to me. It's precisely "magic numbers" that happen to look like English words when interpreted as ASCII coded characters that I don't want manipulated by str-like methods that interpret text (such as full-featured format or %).

This confuses me somewhat. It's okay to use b'ethan'.upper(), which only makes semantic sense as ASCII-encoded text, but b'age: %d' % 43 isn't? (Aside, I'm perfectly comfortable with "ASCII-encoded text" because if you took u'ethan'.encode('ascii') you would get b'ethan'. If it was some other encoding, such as cp1251, I would call that particular byte stream "cp1251-encoded text". And if there were methods that worked directly on a cp1251-encoded byte stream I would not have any problem using them on cp1251-encoded text.)
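For readers who want to try the two operations being compared: the sketch below assumes Python 3.5 or later, where PEP 461 added %-interpolation to bytes. At the time of this thread the second operation was still only a proposal.

```python
# Encoding ASCII text gives the same bytes as the literal.
name = "ethan".encode("ascii")
assert name == b"ethan"

# A str-like method that has always existed on bytes in Python 3.
shouted = name.upper()

# Numeric interpolation into bytes (Python 3.5+, PEP 461).
age_line = b"age: %d" % 43
```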

...

What Nick means by a "boundary type" is a type that works seamlessly with the types on each side of the boundary as a helper in the conversion. So when you use a struct to pack a bool, an int, and a date into a bytes, the struct is the boundary type. And if there's a helper type to work with bytes and/or str simultaneously, that's a boundary type, eg, asciistr. But bytes itself is not a boundary type, it's just a type with no internal structure, not even characters.

Hmmm. I'll have to think about this. Okay, I've thought somewhat. Under the definition above would it be fair to say that Db3Table (a class in my dbf module) is a boundary type? It sits between the actual file and the program, and transforms bytes into actual Python types. -- ~Ethan~

Glenn Linderman

6:06 a.m.

On 1/12/2014 8:00 PM, Ethan Furman wrote:

...

Okay, I've thought somewhat. Under the definition above would it be fair to say that Db3Table (a class in my dbf module) is a boundary type? It sits between the actual file and the program, and transforms bytes into actual Python types.

Yes. That is exactly what a boundary type is. It doesn't matter whether it is a file format or a wire protocol format on the non-Python side, the sequence of bytes is defined, using methods that are not directly corresponding to python data types (if they do correspond, the boundary type is trivial).

Stephen J. Turnbull

10:48 a.m.

Ethan Furman writes:

...

The part that you don't seem to acknowledge (sorry if I missed it) is that there are str-like methods already on bytes.

I haven't expressed myself well, but I don't much care about that. It's what Knuth would classify as a seminumerical method. What I do care about is that the methods that convert other types to text (including format) not work for bytes. That's where I consider text to "start".

...

...
is *exactly* the Python 2 model of text. But you deny that the effect of your proposals (eg, b"%d" % (12,)) is to reintroduce Python 2's bytes/character confusion, don't you?

Given that the default (and only) text type in Py3 is str, which is unicode, I don't think any confusion will be as severe, but I acknowledge that there could be some.

I fear it will be quite severe where I live, in Shift JIS/GB18030 land. (The two most obnoxious encodings known to man, except perhaps the syntax of Brainf!ck.)

...

...
...
*My* definition is not ambiguous at all. If this particular part of the byte stream is defined to contain ASCII-encoded text, then I can use the bytes text methods to work with it.

But how is Python supposed to know that?

Python doesn't need to.

... because you know it. But the ideal of object-oriented programming (and duck-typing) is that you shouldn't need to; the object should know how to produce appropriate behavior itself.

...

...
But under your definition, you need to make the decision, or explicitly code the decision, on the basis of context.

Exactly so. I even have to do that in Py2.

"Even." This is exactly where PBP and EIBTI part company, I think. EIBTI thinks its a bad idea to pass around bytes that are implicitly some other type, and Python 3 *should be good enough to make that unnecessary*. I'm convinced, and Nick is convinced, that we can make that true for 90% of the cases that it isn't now, if we could just figure out what's hard about the use cases where Python 3 isn't up to snuff yet (and figure out which use cases we need to handle to get us up to 90%!) PBP doesn't think it's a great idea to pass around bytes that are implicitly some other type, but didn't mind it (or got used to it) in Python 2, and so they're not looking at that as a problem that Python 3 can solve. They're looking at Python 3 as the problem that prevents them from doing what worked fine in Python 2. I understand that point of view, I just think we should be able to do better in Python 3, and should give it a serious try before giving in. Remember, "Special cases aren't special enough to break the rules" comes *before* "Although practicality beats purity". Not to forget that "Explicit is better than implicit" is second[1] on the list. ;-) After looking at this thread, I feel that (due to misunderstandings on both sides) purity hasn't really been tried yet.

...

...
...
If that particular configuration of bytes is because it's ASCII-encoded text, then sure.

Once again, you are advocating precisely the Python 2 model of text.

Not exactly, because what I get back is bytes, which cannot directly be mixed with unicode (str) as it was in Py2. I think this is a key difference.

You're in good company there; that was Guido's rationale for not worrying, too. I agree it's "key" (and I'm sure Nick will, on reflection if not already). But I worry (a lot) that it's not enough.

...

This confuses me somewhat. It's okay to use b'ethan'.upper(), which only makes semantic sense as ASCII-encoded text,

Not really OK. In theory, because it doesn't require serialization/encoding of a primitive type, it doesn't matter. In practice, without powerful formatting, it isn't even a major attraction. In practice, with powerful formatting, it adds to the attraction. Note that regex doesn't require type conversions (matches have methods to return positions in the target or subsequences of the target, not values of other types), which is why I (and I suspect Nick for the same reason) am comfortable with polymorphic regex but not with bytes formatting.

...

(Aside, I'm perfectly comfortable with "ASCII-encoded text" because if you took u'ethan'.encode('ascii') you would get b'ethan'. If it was some other encoding, such as cp1251, I would call that particular byte stream "cp1251-encoded text".

Even though "ethan" is perfectly good ASCII-encoded text (as well as the integer 435,744,694,638 on a bigendian machine with 5-byte words), and you have no way of knowing whether it was user data (CP1251) or a metadata keyword (ASCII) or the US national debt in 1967 dollars (integer) when b'ethan' shows up in a trace?
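The ambiguity can be checked directly. This is only an illustration of the three readings listed above; the variable names are made up.

```python
data = b"ethan"

# Reading 1: ASCII-encoded text (a metadata keyword, say).
as_text = data.decode("ascii")

# Reading 2: a 5-byte big-endian unsigned integer.
as_int = int.from_bytes(data, "big")

# Reading 3: cp1251-encoded user data. cp1251 agrees with ASCII for
# bytes below 128, so the result is indistinguishable from reading 1.
as_cp1251 = data.decode("cp1251")
```

Nothing in the bytes object records which of the three was intended.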

...

And if there were methods that worked directly on a cp1251-encoded byte stream I would not have any problem using them on cp1251-encoded text.)

I was afraid of that: all of those methods (except the case methods[2]) will work fine on a cp1251-encoded text. And because they only know that the string is bytes, the case methods will silently corrupt your "text" as soon as they get a chance. That bothers me, even if it doesn't bother you. Purity again, if you like. (But you'd take a safe .upper if you got it for free, no?)
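A sketch of what happens to the case methods on cp1251 data, using a made-up mixed Cyrillic/ASCII sample. In current CPython the bytes methods only touch ASCII letters, so the result is less a corruption than a silently half-converted string.

```python
text = "привет ethan"          # sample data, not from the thread
data = text.encode("cp1251")

# bytes.upper() only knows about ASCII letters a-z; the Cyrillic
# bytes (all above 127 in cp1251) pass through unchanged.
upper_bytes = data.upper()

# Going through str applies real case mapping to the whole text.
upper_text = text.upper().encode("cp1251")
```

No exception is raised in either case, which is the worrying part: the wrong answer looks just like a right one unless someone inspects the output.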

...

Okay, I've thought somewhat. Under the definition above would it be fair to say that Db3Table (a class in my dbf module) is a boundary type? It sits between the actual file and the program, and transforms bytes into actual Python types.

Yes, I'd call that a boundary type. Footnotes: [1] Yes, I know what's number 1, but I'm not going to mention it out loud! [2] Arguably those too, since bytes don't have a locale. They're in C locale and the bytes >127 don't have semantics like case.

Ethan Furman

4:30 p.m.

On 01/13/2014 02:48 AM, Stephen J. Turnbull wrote:

...

Ethan Furman writes:

...
The part that you don't seem to acknowledge (sorry if I missed it) is that there are str-like methods already on bytes.

I haven't expressed myself well, but I don't much care about that.

You don't care that there are str-like methods on bytes? Whether you do or not, they are there, and they impact how people think about bytes and what is (and what should be) allowed.

...

It's what Knuth would classify as a seminumerical method.

I do not see how that's relevant. What matters is not how we can manipulate the data (everything is reduced to numbers at some point), but what the data represents. [snip]

...

...
...
...
*My* definition is not ambiguous at all. If this particular part of the byte stream is defined to contain ASCII-encoded text, then I can use the bytes text methods to work with it.

But how is Python supposed to know that?

Python doesn't need to.

... because you know it. But the ideal of object-oriented programming (and duck-typing) is that you shouldn't need to; the object should know how to produce appropriate behavior itself.

The ideal, sure. But if you're stuck with using a list to hold data for your higher-order recursive function are you going to expect the list data type to "know" which pops and inserts are allowed and which are not? Of course not. And you'd probably build a proper class on top of the list so those things could be checked. Now imagine that the list type didn't offer insert and pop, and you had to use slice replacement -- what a pain that would be! [snip]

...

...
...
But under your definition, you need to make the decision, or explicitly code the decision, on the basis of context.

Exactly so. I even have to do that in Py2.

"Even." This is exactly where PBP and EIBTI part company, I think. EIBTI thinks its a bad idea to pass around bytes that are implicitly some other type

bytes are /always/ implicitly some other type. They are basically raw data. They are given meaning by how we interpret them. [snip]

...

Even though "ethan" is perfectly good ASCII-encoded text (as well as the integer 435,744,694,638 on a bigendian machine with 5-byte words), and you have no way of knowing whether it was user data (CP1251) or a metadata keyword (ASCII) or the US national debt in 1967 dollars (integer) when b'ethan' shows up in a trace?

Context is everything. If b'ethan' shows up in a trace I would have to examine the surrounding code to see how those bytes were being used.

...

...
And if there were methods that worked directly on a cp1251-encoded byte stream I would not have any problem using them on cp1251-encoded text.)

I was afraid of that: all of those methods (except the case methods) will work fine on a cp1251-encoded text.

Really? Huh. They wouldn't work fine with the Spanish alphabet. I should've used that for my example. :/

...

And because they only know that the string is bytes, the case methods will silently corrupt your "text" as soon as they get a chance.

Inevitably there are methods that will "work" even if given the wrong data type, while others will either corrupt or blow up if not given exactly what they expect. You tell me that some ASCII methods will work okay on cp1251 text, and others will not. So I'm not going to use any of them on cp1251 as that is not what they are intended for.

...

That bothers me, even if it doesn't bother you. Purity again, if you like. (But you'd take a safe .upper if you got it for free, no?)

Well, there is no such thing as free. ;) And there already is a safe .upper -- str.upper. And if I don't know that my bytes are ASCII, but I did know they were text, I wouldn't use ASCII methods, I'd convert to str and work there. -- ~Ethan~

Greg Ewing

12:06 a.m.

Stephen J. Turnbull wrote:

...

PBP doesn't think it's a great idea to pass around bytes that are implicitly some other type, but didn't mind it (or got used to it) in Python 2, and so they're not looking at that as a problem that Python 3 can solve. They're looking at Python 3 as the problem that prevents them from doing what worked fine in Python 2.

While some people may think that way, I don't think it's fair to characterise *all* proponents of bytes formatting as luddites that refuse to get with the Python 3 way. Some of us *do* understand the principles of text/bytes separation in Python 3 and agree that they're a good idea. We just don't agree that the proposed formatting operations violate those principles to any degree worth worrying about. I don't think of my viewpoint as being PBP. That term assumes there is purity there to be beaten. To my mind, any notion of purity with respect to bytes objects went out the window as soon as it was given a pile of text methods -- together with a text-like literal syntax and default repr(), even though at least half the time they're completely inappropriate! -- Greg

Emile van Sebille

1:09 a.m.

On 1/13/2014 4:06 PM, Greg Ewing wrote: <snip>

...

of text methods -- together with a text-like literal syntax and default repr(), even though at least half the time they're completely inappropriate!

Better said as 'half the time they're coincidentally helpful!' My $.01 :) Emile

Stephen J. Turnbull

5:15 a.m.

Greg Ewing writes:

...

I don't think of my viewpoint as being PBP. That term assumes there is purity there to be beaten. To my mind, any notion of purity with respect to bytes objects went out the window as soon as it was given a pile of text methods -- together with a text-like literal syntax and default repr(), even though at least half the time they're completely inappropriate!

Isn't an analogous statement true of every programming language taken as a whole? Does that mean that, because Python 1 text handling was unavoidably "practical", adding the "purist" unicode type in Python 2 was a mistake? Python 3's sacrifice of Python 2 compatibility seems positively degenerate by your standard! To be less contentious, surely the concept of "purity" includes "purification" (even if that doesn't apply to some subdivisions of purity)? In any case, taking your statement at face value, I consider adding the methods to have been a mistake and the literal syntax and repr to be compact abbreviations that are frequently convenient. Byte sequences that can be considered to be serializations of objects including ASCII text, so that some subsequences of bytes in the range 0-127 can be usefully considered as a text representation, are very common. But I think that they're important enough that their representation in Python deserves a type (maybe more than one) that tries to enforce what regularities there are in such streams. The purity position is probably going to lose in the end, since Guido is clearly in the PBP camp at this point, and that's a strong indicator (especially since Nick has given up on convincing python-dev). But that does not mean it's entirely invalid.

Nick Coghlan

5:34 a.m.

On 14 January 2014 15:15, Stephen J. Turnbull <stephen@xemacs.org> wrote:

...

The purity position is probably going to lose in the end, since Guido is clearly in the PBP camp at this point, and that's a strong indicator (especially since Nick has given up on convincing python-dev). But that does not mean it's entirely invalid.

I didn't give up regarding PEP 460 - Guido pointed out an error in my assumptions that made my position invalid, and his correct. "Give up" makes it sound like I got tired of arguing without being convinced rather than admitting I was just plain wrong. While I'll still work on the asciistr proposal, that's unrelated to PEP 460 - it's about making hybrid APIs less painful to write in Python 3 when you're willing to place the burden of ensuring ASCII compatibility of binary data on the calling code. That kind of thing is likely to be a reasonable approach in specific domains (when writing a web development framework, for example), even though I think it's an *in*appropriate design for the standard library. PEP 460 should actually make asciistr easier in the long run, as I now expect we'll run into some "interesting" issues getting formatting to produce anything other than text (contrary to what I said elsewhere in these threads - I hadn't thought through the full implications at the time). Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

Guido van Rossum

6:04 a.m.

On Mon, Jan 13, 2014 at 9:34 PM, Nick Coghlan <ncoghlan@gmail.com> wrote:

...

On 14 January 2014 15:15, Stephen J. Turnbull <stephen@xemacs.org> wrote:

...
The purity position is probably going to lose in the end, since Guido is clearly in the PBP camp at this point, and that's a strong indicator (especially since Nick has given up on convincing python-dev). But that does not mean it's entirely invalid.

I didn't give up regarding PEP 460 - Guido pointed out an error in my assumptions that made my position invalid, and his correct. "Give up" makes it sound like I got tired of arguing without being convinced rather than admitting I was just plain wrong.

Thanks for that. (I was worried when I saw your first huge post in the reboot thread.)

...

While I'll still work on the asciistr proposal, that's unrelated to PEP 460 - it's about making hybrid APIs less painful to write in Python 3 when you're willing to place the burden of ensuring ASCII compatibility of binary data on the calling code. That kind of thing is likely to be a reasonable approach in specific domains (when writing a web development framework, for example), even though I think it's an *in*appropriate design for the standard library.

I've now looked at asciistr. (Thanks Glenn and Ethan for the link.) Now that I (hopefully) understand it, I'm worried that a text processing algorithm that uses asciistr might, under hard-to-predict circumstances (such as when the arguments contain nothing of interest to the algorithm), return an asciistr instance instead of a str or bytes instance, and this might confuse a caller (e.g. isinstance() checks might fail, dict lookups, or whatever -- it feels like the problem is similar to creating the perfect proxy type).

...

PEP 460 should actually make asciistr easier in the long run, as I now expect we'll run into some "interesting" issues getting formatting to produce anything other than text (contrary to what I said elsewhere in these threads - I hadn't thought through the full implications at the time).

For example? -- --Guido van Rossum (python.org/~guido)

Nick Coghlan

7:44 a.m.

On 14 January 2014 16:04, Guido van Rossum <guido@python.org> wrote:

...

On Mon, Jan 13, 2014 at 9:34 PM, Nick Coghlan <ncoghlan@gmail.com> wrote: I've now looked at asciistr. (Thanks Glenn and Ethan for the link.)

Now that I (hopefully) understand it, I'm worried that a text processing algorithm that uses asciistr might, under hard-to-predict circumstances (such as when the arguments contain nothing of interest to the algorithm), return an asciistr instance instead of a str or bytes instance, and this might confuse a caller (e.g. isinstance() checks might fail, dict lookups, or whatever -- it feels like the problem is similar to creating the perfect proxy type).

Right, asciistr is designed for a specific kind of hybrid API where you want to accept binary input (and produce binary output) *and* you want to accept text input (and produce text output). Porting those from Python 2 to Python 3 is painful not because of any limitations of the str or bytes API but because it's the only use case I have found where I actually *missed* the implicit interoperability offered by the Python 2 str type. It's not an implementation style I would consider appropriate for the standard library - we need to code very defensively in order to aid debugging in arbitrary contexts, so I consider having an API like urllib.parse demand 7-bit ASCII in the binary version, and require text to handle impure input to be a better design choice. However, in an environment where you can place greater preconditions on your inputs (such as "ensure all input data is ASCII compatible") and you're willing to tolerate the occasional obscure traceback for particular kinds of errors, then it should be a convenient way to use common constants (like separators or URL scheme names) in an algorithm that can manipulate either binary or text, but not a combination of the two (the latter is still a nice improvement in correctness over Python 2, which allowed them to be mixed freely rather than requiring consistency across the inputs). It's still slightly different from Python 2, though. 
In Python 2, the interaction model was:

str & str -> str
str & unicode -> unicode

(with the one exception being str.format: that consistently produces str rather than promoting to Unicode)

My goal for asciistr is that it should exhibit the following behaviour:

str & asciistr -> str
asciistr & asciistr -> str (making it asciistr would be a pain and I don't have a use case for that)
bytes & asciistr -> bytes

So in code like that in urllib.parse (but in a more constrained context), you could just switch all your constants to asciistr, change your indexing operations to length 1 slices and then in theory essentially the same code that worked in Python 2 should also work in Python 3. However, Benno is finding that my warning about possible interoperability issues was accurate - we have various places where we do PyUnicode_Check() rather than PyUnicode_CheckExact(), which means we don't always notice a PEP 3118 buffer interface if it is provided by a str subclass. We'll look at those as we find them, and either work around them (if we can), decide not to support that behaviour in asciistr, or else I'll create a patch to resolve the interoperability issue. It's not necessarily a type I'd recommend using in production code, as there *will* always be a more explicit alternative that doesn't rely on a tricksy C extension type that only works in CPython. However, it's a type I think is worth having implemented and available on PyPI, even if it's just to disprove the claim that you *can't* write that kind of code in Python 3.

...

...
PEP 460 should actually make asciistr easier in the long run, as I now expect we'll run into some "interesting" issues getting formatting to produce anything other than text (contrary to what I said elsewhere in these threads - I hadn't thought through the full implications at the time).

For example?

asciistr is a str subclass, so its formatting methods currently operate in the text domain and produce str output. Getting it to do otherwise is actually a task on the scale of implementing ASCII interpolation operations on the native bytes type. This realisation was the *other* factor that made me more comfortable with the idea of adding ASCII interpolation to the core bytes type - I previously thought asciistr could easily handle it, but it doesn't (except in the pure ASCII case where it could theoretically just encode at the end), thus also knocking out my "we can easily do this in an extension type, there's no need to provide it in the builtins" argument. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

Guido van Rossum

6:16 p.m.

[Other readers: asciistr is at https://github.com/jeamland/asciicompat] On Mon, Jan 13, 2014 at 11:44 PM, Nick Coghlan <ncoghlan@gmail.com> wrote:

...

Right, asciistr is designed for a specific kind of hybrid API where you want to accept binary input (and produce binary output) *and* you want to accept text input (and produce text output). Porting those from Python 2 to Python 3 is painful not because of any limitations of the str or bytes API but because it's the only use case I have found where I actually *missed* the implicit interoperability offered by the Python 2 str type.

Yes, the use case is clear.

...

It's not an implementation style I would consider appropriate for the standard library - we need to code very defensively in order to aid debugging in arbitrary contexts, so I consider having an API like urllib.parse demand 7-bit ASCII in the binary version, and require text to handle impure input to be a better design choice.

This surprises me. I think asciistr should strive to be useful for the stdlib as well.

...

However, in an environment where you can place greater preconditions on your inputs (such as "ensure all input data is ASCII compatible")

That gives me the Python 2 willies. :-(

...

and you're willing to tolerate the occasional obscure traceback for particular kinds of errors,

Really? Can you give an example where the traceback using asciistr() would be more obscure than using the technique you used in urllib.parse?

...

then it should be a convenient way to use common constants (like separators or URL scheme names) in an algorithm that can manipulate either binary or text, but not a combination of the two (the latter is still a nice improvement in correctness over Python 2, which allowed them to be mixed freely rather than requiring consistency across the inputs).

Unfortunately I suspect there are still examples where asciistr's "submissive" behavior can produce surprises. E.g. consider a function of two arguments that must either be both bytes or both str. It's easily conceivable that for certain combinations of incorrect arguments (i.e. one bytes and one str) the function doesn't raise an error but returns something of one or the other type. (And this is exactly the Python 2 outcome we're trying to avoid.)

...

It's still slightly different from Python 2, though. In Python 2, the interaction model was:

str & str -> str str & unicode -> unicode

(with the one exception being str.format: that consistently produces str rather than promoting to Unicode)

Or raises good old UnicodeError. :-(

...

My goal for asciistr is that it should exhibit the following behaviour:

str & asciistr -> str asciistr & asciistr -> str (making it asciistr would be a pain and I don't have a use case for that)

I almost had one in the example code I sent in response to Greg.

...

bytes & asciistr -> bytes

I understand that '&' here stands for "any arbitrary combination", but what about searches? Given that asciistr's base class is str, won't it still blow up if you try to use it as an argument to e.g. bytes.startswith()? Equality tests also sound problematic; is b'x' == asciistr('x') == 'x' ???
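For reference, this is how a plain str behaves in the two situations asked about. It is only a baseline under the assumption of a default interpreter (no -b flag); whether asciistr, with its buffer interface, behaves differently is exactly the open question and is not tested here.

```python
data = b"http://example.com/"

# Searching bytes with a plain str argument is rejected outright.
try:
    data.startswith("http")
except TypeError as exc:
    search_error = exc
else:
    search_error = None

# Comparing bytes with str never raises by default; it is just False.
# (Running Python with -b turns this into a BytesWarning.)
equal = (b"x" == "x")
```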

...

So in code like that in urllib.parse (but in a more constrained context), you could just switch all your constants to asciistr, change your indexing operations to length 1 slices and then in theory essentially the same code that worked in Python 2 should also work in Python 3.

The more I think about this, the less I believe it's that easy. I suspect you had the right idea when you mentioned singledispatch. It might be easier to write the bytes version in terms of the string versions wrapped in decode/encode, or vice versa, rather than trying to reason out all the different combinations of str, bytes, asciistr.
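One way the singledispatch idea could look, as a rough sketch only: the function split_scheme and its helpers are hypothetical, and the bytes version simply wraps the str version in decode/encode as suggested above.

```python
from functools import singledispatch

@singledispatch
def split_scheme(url):
    """Hypothetical helper: split 'scheme:rest' into (scheme, rest)."""
    raise TypeError("expected str or bytes, got %s" % type(url).__name__)

@split_scheme.register(str)
def _split_scheme_str(url):
    # The real algorithm is written once, on text.
    scheme, sep, rest = url.partition(":")
    if not sep:
        return "", url
    return scheme.lower(), rest

@split_scheme.register(bytes)
def _split_scheme_bytes(url):
    # The bytes version demands ASCII, delegates to the text version,
    # and encodes the results back, so input and output types match.
    scheme, rest = split_scheme(url.decode("ascii"))
    return scheme.encode("ascii"), rest.encode("ascii")
```

With this shape there is no third type to reason about: split_scheme("HTTP://x") returns str, split_scheme(b"HTTP://x") returns bytes, and non-ASCII bytes fail at the decode step.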

...

However, Benno is finding that my warning about possible interoperability issues was accurate - we have various places where we do PyUnicode_Check() rather than PyUnicode_CheckExact(), which means we don't always notice a PEP 3118 buffer interface if it is provided by a str subclass.

Not sure I understand this, but I believe him when he says this won't be easy.

...

We'll look at those as we find them, and either work around them (if we can), decide not to support that behaviour in asciistr, or else I'll create a patch to resolve the interoperability issue.

It's not necessarily a type I'd recommend using in production code, as there *will* always be a more explicit alternative that doesn't rely on a tricksy C extension type that only works in CPython. However, it's a type I think is worth having implemented and available on PyPI, even if it's just to disprove the claim that you *can't* write that kind of code in Python 3.

Hm. It is beginning to sound more and more flawed. I also worry that it will bring back the nightmare of data-dependent UnicodeError. E.g. this (from tests/basic.py):

    def test_asciistr_will_not_accept_codepoints_above_127(self):
        self.assertRaises(ValueError, asciistr, 'Schrödinger')

looks reasonable enough when you assume asciistr() is always used with a literal as argument -- but I suspect that plenty of people would misunderstand its purpose and write asciistr(s) as a "clever" way to turn a string into something that's compatible with both bytes and strings... :-( -- --Guido van Rossum (python.org/~guido)
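The data-dependent failure mode can be shown without asciistr at all. The helper to_ascii_bytes below is hypothetical and uses plain str.encode as a stand-in for "convert whatever string I was handed"; it is not the asciicompat API.

```python
def to_ascii_bytes(s):
    """Hypothetical stand-in for calling a converter on arbitrary input."""
    return s.encode("ascii")

# Works during development, when the test data happens to be ASCII...
ok = to_ascii_bytes("Schrodinger")

# ...and fails later, when real data shows up.
try:
    to_ascii_bytes("Schrödinger")
except UnicodeEncodeError as exc:
    failure = exc
else:
    failure = None
```

UnicodeEncodeError is a subclass of ValueError, so this is the same family of error the quoted test expects.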

Nick Coghlan

9:37 p.m.

On 15 Jan 2014 04:16, "Guido van Rossum" <guido@python.org> wrote:

...

[Other readers: asciistr is at https://github.com/jeamland/asciicompat]

On Mon, Jan 13, 2014 at 11:44 PM, Nick Coghlan <ncoghlan@gmail.com> wrote:

...
Right, asciistr is designed for a specific kind of hybrid API where you want to accept binary input (and produce binary output) *and* you want to accept text input (and produce text output). Porting those from Python 2 to Python 3 is painful not because of any limitations of the str or bytes API but because it's the only use case I have found where I actually *missed* the implicit interoperability offered by the Python 2 str type.

Yes, the use case is clear.

...
It's not an implementation style I would consider appropriate for the standard library - we need to code very defensively in order to aid debugging in arbitrary contexts, so I consider having an API like urllib.parse demand 7-bit ASCII in the binary version, and require text to handle impure input to be a better design choice.

This surprises me. I think asciistr should strive to be useful for the stdlib as well.

The concerns you raise are the reason I'm not sure that's possible - just as in the Python 2 text model, I suspect actually *using* asciistr will trade ease of development against robust detection of input errors. I'm OK with that in a PyPI module, I'd be dubious about including it in the standard library and making it a builtin is right out.

...

...
However, in an environment where you can place greater preconditions on your inputs (such as "ensure all input data is ASCII compatible")

That gives me the Python 2 willies. :-(

Yep - from a formal correctness point of view, asciistr is a terrible idea. That's not the only consideration in coding though, or we'd all be using statically typed languages :)

...

...
and you're willing to tolerate the occasional obscure traceback for particular kinds of errors,

Really? Can you give an example where the traceback using asciistr() would be more obscure than using the technique you used in urllib.parse?

In urllib.parse I do an up front check that everything is consistently bytes or str. With asciistr it becomes tempting to skip that up front check, so you instead get a TypeError about not being able to add str and bytes. Technically you could keep that up front check and only use asciistr as an internal implementation detail, but at that point you may as well do things properly and write the algorithm to operate solely on bytes or str and convert the other inputs appropriately (which is the actual approach we use in the standard library).
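For readers who want to see that up-front check in action, here is a small sketch against the standard library's urllib.parse as it behaves in current Python 3 releases. The URLs are made up for illustration.

```python
from urllib.parse import urljoin

# All-text input gives text output.
text_result = urljoin("http://example.com/a/", "b")

# All-bytes input gives bytes output.
bytes_result = urljoin(b"http://example.com/a/", b"b")

# Mixing the two is rejected up front with a TypeError, rather than
# failing somewhere deep inside the joining algorithm.
try:
    urljoin(b"http://example.com/a/", "b")
except TypeError as exc:
    mixed_error = exc
else:
    mixed_error = None

print(text_result, bytes_result, type(mixed_error).__name__)
```

The point of the design is that the caller learns about the inconsistency at the API boundary, not from a str/bytes concatenation error several frames down.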

...

...
then it should be a convenient way to use common constants (like separators or URL scheme names) in an algorithm that can manipulate either binary or text, but not a combination of the two (the latter is still a nice improvement in correctness over Python 2, which allowed them to be mixed freely rather than requiring consistency across the inputs).

Unfortunately I suspect there are still examples where asciistr's "submissive" behavior can produce surprises. E.g. consider a function of two arguments that must either be both bytes or both str. It's easily conceivable that for certain combinations of incorrect arguments (i.e. one bytes and one str) the function doesn't raise an error but returns something of one or the other type. (And this is exactly the Python 2 outcome we're trying to avoid.)

Yep - that's why I consider asciistr to be firmly in the "power tool" category. If you know what you're doing, it should let you write hybrid API code that is just as concise as Python 2, but it's also far more error prone than the core Python 3 text model. I admit that's a key part of my motivation in trying to help Benno to create it - I want to show that it's not that you *can't* write code that way in Python 3, it's that there are good reasons why you *shouldn't*. And in cases where those reasons don't apply... well, the aim in that case is "pip install asciicompat" and away you go :)

...

...
It's still slightly different from Python 2, though. In Python 2, the interaction model was:

str & str -> str
str & unicode -> unicode

(with the one exception being str.format: that consistently produces str rather than promoting to Unicode)

Or raises good old UnicodeError. :-(

Unless Benno fixed it in the last couple of days (which seems unlikely given the complexity of the problem), asciistr currently has the Python 3 behaviour of interpolating the bytes repr() into the string rather than trying to decode it. That's a key reason why it likely *won't* be a substitute for PEP 460.

...

...
My goal for asciistr is that it should exhibit the following behaviour:

str & asciistr -> str
asciistr & asciistr -> str (making it asciistr would be a pain and I don't have a use case for that)

I almost had one in the example code I sent in response to Greg.

...
bytes & asciistr -> bytes

I understand that '&' here stands for "any arbitrary combination", but what about searches? Given that asciistr's base class is str, won't it still blow up if you try to use it as an argument to e.g. bytes.startswith()? Equality tests also sound problematic; is b'x' == asciistr('x') == 'x' ???

Yes, the aim is to take advantage of the fact that bytes generally interoperates with anything that publishes a PEP 3118 buffer - the key feature of asciistr is that it publishes the 8-bit segment from PEP 393 as that buffer (the constructor checks that the max code point is 127 or less). It's very CPython specific due to the tinkering with str internals, but the idea is mostly to show that the semantics of such a type *can* still be expressed relatively sensibly in Python 3, it's just not an approach that's going to be applicable very often (most Python 3 native code will be able to choose to be a binary or text API, so the need for this kind of hybrid API design mostly affects APIs that started life in Python 2 and hence still need to support both use cases).
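The buffer-based interoperability described here can be observed without asciistr at all. The sketch below uses only built-in buffer exporters (bytearray and memoryview) as stand-ins; it does not reproduce asciistr itself, and the values are illustrative.

```python
# bytes operations accept other buffer exporters, not just bytes.
joined = b"key=" + bytearray(b"value")
from_view = b"key=" + memoryview(b"value")
has_prefix = b"key=value".startswith(bytearray(b"key"))

# A plain str does not export a buffer, so the same operation fails.
try:
    b"key=" + "value"
except TypeError:
    str_rejected = True
else:
    str_rejected = False

print(joined, from_view, has_prefix, str_rejected)
```

asciistr's trick, as described above, is to be a str subclass that also exports such a buffer, which is why the places that test for "is this a str?" first get in its way.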

...

...
So in code like that in urllib.parse (but in a more constrained context), you could just switch all your constants to asciistr, change your indexing operations to length 1 slices and then in theory essentially the same code that worked in Python 2 should also work in Python 3.

The more I think about this, the less I believe it's that easy. I suspect you had the right idea when you mentioned singledispatch. It might be easier to write the bytes version in terms of the string versions wrapped in decode/encode, or vice versa, rather than trying to reason out all the different combinations of str, bytes, asciistr.

Yes - while I don't plan to *actually* switch the way urllib.parse works away from the current higher order function approach (it ain't broke, so there's nothing to fix), I do have a patch in progress that shows how it would look using single dispatch instead. Once I have that done, I'll post it somewhere as a demonstration and update my binary protocol essay to suggest the additional option of using single dispatch to process in the binary or text domain, with optional encoding and decoding steps controlled by the type of the first input. Also: after converting a function that takes a tuple where I wanted to dispatch on the type of the first element, I suspect supporting a "key=lambda args, kwds: type(args[0][0])" argument to singledispatch in Python 3.5 might be a reasonable idea. On the other hand, I haven't explored the possibility of a custom decorator yet, either, so we don't need to do anything hasty :)
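As a rough illustration of the single dispatch idea (this is not the patch mentioned above, and strip_scheme is a hypothetical function invented for the example), the algorithm can live in the text domain while the bytes overload wraps it in a strict ASCII decode and encode:

```python
from functools import singledispatch


@singledispatch
def strip_scheme(url):
    # Fallback for anything that is neither str nor bytes.
    raise TypeError(f"expected str or bytes, got {type(url).__name__}")


@strip_scheme.register(str)
def _(url):
    # The actual algorithm is written once, for text only.
    scheme, sep, rest = url.partition("://")
    return rest if sep else url


@strip_scheme.register(bytes)
def _(url):
    # Strict ASCII decode, reuse the text implementation, encode back.
    return strip_scheme(url.decode("ascii")).encode("ascii")


print(strip_scheme("http://example.com/x"))
print(strip_scheme(b"http://example.com/x"))
```

Dispatch happens on the type of the first argument only, which is the limitation that motivates the "key=" idea floated above.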

...

...
However, Benno is finding that my warning about possible interoperability issues was accurate - we have various places where we do PyUnicode_Check() rather than PyUnicode_CheckExact(), which means we don't always notice a PEP 3118 buffer interface if it is provided by a str subclass.

Not sure I understand this, but I believe him when he says this won't be easy.

Essentially, we *want* bytes to see asciistr as a buffer exporter, but in a few places it goes "ah, a str subclass!" instead (which usually isn't what we want).

...

...
We'll look at those as we find them, and either work around them (if we can), decide not to support that behaviour in asciistr, or else I'll create a patch to resolve the interoperability issue.

It's not necessarily a type I'd recommend using in production code, as there *will* always be a more explicit alternative that doesn't rely on a tricksy C extension type that only works in CPython. However, it's a type I think is worth having implemented and available on PyPI, even if it's just to disprove the claim that you *can't* write that kind of code in Python 3.

Hm. It is beginning to sound more and more flawed. I also worry that it will bring back the nightmare of data-dependent UnicodeError. E.g. this (from tests/basic.py):

    def test_asciistr_will_not_accept_codepoints_above_127(self):
        self.assertRaises(ValueError, asciistr, 'Schrödinger')

looks reasonable enough when you assume asciistr() is always used with a literal as argument -- but I suspect that plenty of people would misunderstand its purpose and write asciistr(s) as a "clever" way to turn a string into something that's compatible with both bytes and strings... :-(

Yep - while I did plan to publish it on PyPI (with a big "actually using this type may eat your data if you're not careful" warning), I'm also open to the idea of just leaving it as a proof of concept on GitHub. I don't see a lot of actual risk in publishing it though, and I think the demonstrable risks encountered when attempting to use it do a reasonable job of showing *why* we changed away from having a core 8-bit string type that behaved that way.

Cheers, Nick.

...

-- --Guido van Rossum (python.org/~guido)

Guido van Rossum

10:46 p.m.

On Tue, Jan 14, 2014 at 1:37 PM, Nick Coghlan <ncoghlan@gmail.com> wrote:

...

Yep - that's why I consider asciistr to be firmly in the "power tool" category. If you know what you're doing, it should let you write hybrid API code that is just as concise as Python 2, but it's also far more error prone than the core Python 3 text model.

Hm. It sounds like the kind of power tool that only candidates for the Darwin award would use. The more I hear you defend it, the less I think it's a good idea for *anything*. And limiting it to PyPI doesn't make it less dangerous. -- --Guido van Rossum (python.org/~guido)

Greg Ewing

10:53 p.m.

New subject: The asciistr problem

Guido van Rossum wrote:

...

I understand that '&' here stands for "any arbitrary combination", but what about searches? Given that asciistr's base class is str, won't it still blow up if you try to use it as an argument to e.g. bytes.startswith()? Equality tests also sound problematic; is b'x' == asciistr('x') == 'x' ???

I'm wondering whether asciistr shouldn't be a *type* at all, but just a function that constructs a string with the same type as another string. All of these problems then go away. Instead of

    foo.startswith(asciistr("prefix"))

you would write

    foo.startswith(asciistr("prefix", foo))

There's also no chance of an asciistr escaping into the wild, because there's no such thing.

We probably want a more compact way of writing it, though. Ideally it would support currying. If we have a number of string literals in our function, we'd like to be able to write something like this at the top:

    def myfunc(a):
        s = stringtype(a)
        ...

and then use s('foo') to construct all our string literals inside the function.

We could go further. If the function has more than one string argument, they're probably constrained to be of the same type, so in the interests of symmetry it would be nice if we could write

    def myfunc(a, b):
        s = stringtype(a, b)
        ...

and have it raise a TypeError if a and b are not of the same string type.

-- Greg
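One possible reading of the stringtype() idea, as a minimal pure-Python sketch. Nothing here comes from an existing library: the exact-type check and the wrap() function are assumptions made for the example (str or bytes subclasses are deliberately not handled).

```python
def stringtype(*args):
    """Return a converter that builds literals of the same type as args."""
    kinds = {type(a) for a in args}
    if kinds == {str}:
        return lambda literal: literal
    if kinds == {bytes}:
        # Literals are written as text and must be pure ASCII.
        return lambda literal: literal.encode("ascii")
    raise TypeError("arguments must be all str or all bytes")


def wrap(a):
    # Hypothetical polymorphic function: same type out as in.
    s = stringtype(a)
    return s("(") + a.strip() + s(")")


print(wrap(" x "), wrap(b" x "))
```

Because the literal type is derived from the argument rather than carried by a hybrid object, the result has the right type even for empty input such as wrap(b"").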

Steven D'Aprano

1:08 a.m.

On Tue, Jan 14, 2014 at 10:16:17AM -0800, Guido van Rossum wrote:

...

Hm. It is beginning to sound more and more flawed. I also worry that it will bring back the nightmare of data-dependent UnicodeError. E.g. this (from tests/basic.py):

    def test_asciistr_will_not_accept_codepoints_above_127(self):
        self.assertRaises(ValueError, asciistr, 'Schrödinger')

looks reasonable enough when you assume asciistr() is always used with a literal as argument -- but I suspect that plenty of people would misunderstand its purpose and write asciistr(s) as a "clever" way to turn a string into something that's compatible with both bytes and strings... :-(

I am one of those people. I've been trying to keep on top of this enormous multiple-thread discussion, and although I haven't read every single post in its entirety, I thought I understood that the purpose of asciistr was exactly that, to produce something that was compatible with both bytes and strings. -- Steven

Stephen J. Turnbull

5:57 a.m.

Steven D'Aprano writes:

...

I thought I understood that the purpose of asciistr was exactly that, to produce something that was compatible with both bytes and strings.

asciistr *canonizes* something as an ASCII string, and therefore compatible with both bytes and str. It can't *create* such a thing ex nihilo.

Tres Seaver

6:07 a.m.

On 01/15/2014 12:57 AM, Stephen J. Turnbull wrote:

...

asciistr *canonizes* something as an ASCII string, and therefore compatible with both bytes and str. It can't *create* such a thing ex nihilo.

How many miracles must be attested? Tres.

Stephen J. Turnbull

10:46 a.m.

Tres Seaver writes:

...

On 01/15/2014 12:57 AM, Stephen J. Turnbull wrote:

...
asciistr *canonizes* something as an ASCII string, and therefore compatible with both bytes and str. It can't *create* such a thing ex nihilo.

How many miracles must be attested?

You'll have to ask Pope Benno I.

Greg Ewing

8:20 a.m.

Guido van Rossum wrote:

...

I've now looked at asciistr. (Thanks Glenn and Ethan for the link.)

Now that I (hopefully) understand it, I'm worried that a text processing algorithm that uses asciistr might under hard-to-predict circumstances (such as when the arguments contain nothing of interest to the algorithm) might return an asciistr instance instead of a str or bytes instance,

It seems to me that any algorithm with that property has a genuine ambiguity as to what it should return in that case. Arguably, returning an asciistr would be the *right* thing to do, because that would allow it to be used as a component of a larger algorithm that was polymorphic with respect to text/bytes. -- Greg

Guido van Rossum

3:59 p.m.

On Tue, Jan 14, 2014 at 12:20 AM, Greg Ewing <greg.ewing@canterbury.ac.nz> wrote:

...

Guido van Rossum wrote:

...
I've now looked at asciistr. (Thanks Glenn and Ethan for the link.)

Now that I (hopefully) understand it, I'm worried that a text processing algorithm that uses asciistr might under hard-to-predict circumstances (such as when the arguments contain nothing of interest to the algorithm) might return an asciistr instance instead of a str or bytes instance,

It seems to me that any algorithm with that property has a genuine ambiguity as to what it should return in that case. Arguably, returning an asciistr would be the *right* thing to do, because that would allow it to be used as a component of a larger algorithm that was polymorphic with respect to text/bytes.

Here's an example of what I mean:

    def spam(a):
        r = asciistr('(')
        if a:
            r += a.strip()
        r += asciistr(')')
        return r

The argument must be a string. If I call spam(''), a's type is never concatenated with r, so the return value is an asciistr. To fix this particular case, we could drop the "if a:" part. But it could be more significant, e.g. it could be something like "if a contains any digits". The general fix would be to add

        else:
            r += a[:0]

but that's still an example of the awkwardness that asciistr() is trying to avoid.

-- --Guido van Rossum (python.org/~guido)

Guido van Rossum

5:58 p.m.

On Tue, Jan 14, 2014 at 7:59 AM, Guido van Rossum <guido@python.org> wrote:

...

Here's an example of what I mean:

I sent that off without proofreading, and I also got one detail about asciistr() wrong. Here are some corrections.

...

    def spam(a):
        r = asciistr('(')
        if a:
            r += a.strip()
        r += asciistr(')')
        return r

The argument must be a string.

Or a bytes object. And the point is that the return type should be the same as the argument type.

...

If I call spam(''),

or spam(b'')

...

a's type is never concatenated with r, so the return value is an asciistr.

Actually, Nick explained that asciistr() + asciistr() returns str, so this would be accidentally correct if called with '', but wrong (returning a str instead of a bytes) if called with b''.

...

To fix this particular case, we could drop the "if a:" part. But it could be more significant, e.g. it could be something like "if a contains any digits". The general fix would be to add

        else:
            r += a[:0]

but that's still an example of the awkwardness that asciistr() is trying to avoid.

This is still valid. -- --Guido van Rossum (python.org/~guido)

Greg Ewing

10:12 p.m.

Guido van Rossum wrote:

...

Actually, Nick explained that asciistr() + asciistr() returns str,

That part seems wrong to me, because it means that you can't write polymorphic byte/string functions that are composable. I would be -1 on that, and prefer that asciistr + asciistr --> asciistr. -- Greg

Nick Coghlan

10:21 p.m.

On 15 Jan 2014 08:14, "Greg Ewing" <greg.ewing@canterbury.ac.nz> wrote:

...

Guido van Rossum wrote:

...
Actually, Nick explained that asciistr() + asciistr() returns str,

That part seems wrong to me, because it means that you can't write polymorphic byte/string functions that are composable.

I would be -1 on that, and prefer that asciistr + asciistr --> asciistr.

You have to pretty much reimplement str to do that. I wouldn't say no to a patch that implemented it, but we're unlikely to do that much work ourselves for something which is primarily intended as a proof of concept. Cheers, Nick.

...

-- Greg

_______________________________________________ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe:

https://mail.python.org/mailman/options/python-dev/ncoghlan%40gmail.com

Greg Ewing

9:59 p.m.

Guido van Rossum wrote:

...

    def spam(a):
        r = asciistr('(')
        if a:
            r += a.strip()
        r += asciistr(')')
        return r

The general fix would be to add

        else:
            r += a[:0]

The awkwardness might be reducible if asciistr let you write something like

    r = asciistr('(', a)

meaning "give me either a string or bytes containing the value '(', depending on the type of a".

But taking a step back, how bad would it really be if an asciistr were returned in this case? Is it just that asciistr doesn't behave exactly like a str in all situations, so it might break something?

If so, would it help if asciistr were a built-in type, so that other things could be made aware of it?

-- Greg

Nick Coghlan

10:07 p.m.

On 15 Jan 2014 08:00, "Greg Ewing" <greg.ewing@canterbury.ac.nz> wrote:

...

Guido van Rossum wrote:

...
    def spam(a):
        r = asciistr('(')
        if a:
            r += a.strip()
        r += asciistr(')')
        return r

The general fix would be to add

        else:
            r += a[:0]

The awkwardness might be reducible if asciistr let you write something like

r = asciistr('(', a)

meaning "give me either a string or bytes containing the value '(', depending on the type of a".

But taking a step back, how bad would it really be if an asciistr were returned in this case? Is it just that asciistr doesn't behave exactly like a str in all situations, so it might break something?

If so, would it help if asciistr were a built-in type, so that other things could be made aware of it?

That way lies the Python 2 text model, and we're not going there. It's probably best to think of asciistr as a way of demonstrating a rhetorical point about the superiority of the Python 3 text model rather than something that anyone should actually use in production Python 3 code (although, depending on how rough the edges turn out to be, it *might* eventually find a place in some single source 2/3 code bases, as well as in prototype code and personal scripts). Cheers, Nick.

...

-- Greg


Greg Ewing

12:03 a.m.

Nick Coghlan wrote:

...

On 15 Jan 2014 08:00, "Greg Ewing" <greg.ewing@canterbury.ac.nz> wrote:

...
If so, would it help if asciistr were a built-in type, so that other things could be made aware of it?

That way lies the Python 2 text model, and we're not going there. It's probably best to think of asciistr as a way of demonstrating a rhetorical point about the superiority of the Python 3 text model

Hmmm... something like "The Python 3 text model is so superior that we have to use this weird hack to write something that makes perfectly good semantic sense but is very awkward to write otherwise" ?-) Anyhow, I've now convinced myself that asciistr as a type is completely unnecessary -- see earlier post. -- Greg

Steven D'Aprano

2:07 a.m.

On Wed, Jan 15, 2014 at 01:03:13PM +1300, Greg Ewing wrote:

...

Nick Coghlan wrote:

...

...
That way lies the Python 2 text model, and we're not going there. It's probably best to think of asciistr as a way of demonstrating a rhetorical point about the superiority of the Python 3 text model

Hmmm... something like "The Python 3 text model is so superior that we have to use this weird hack to write something that makes perfectly good semantic sense but is very awkward to write otherwise" ?-)

I don't think mixing bytes and strings makes good semantic sense. If this discussion has taught me anything, it is that mixing the two is "Here Be Dragons" territory, fraught with danger. It may be that there are applications where mixing them is *unavoidable*, but I think that it's never *sensible*. -- Steven

Greg Ewing

4:18 a.m.

Steven D'Aprano wrote:

...

I don't think mixing bytes and strings makes good semantic sense.

It's not about mixing bytes and text -- it's about writing polymorphic code that will work on either bytes *or* text. Not both at the same time. If we had quantum computers, this would be easy to solve: asciistr would be of type str/sqrt(2) + bytes/sqrt(2), and everything would work out fine. :-) -- Greg

Stephen J. Turnbull

9:11 a.m.

Nick Coghlan writes:

...

"Give up" makes it sound like I got tired of arguing without being convinced rather than admitting I was just plain wrong.

I thought it was something in between (you explicitly said "lenient PEP 460" doesn't hurt you, but my understanding was you still believe that there's a safer way, and it's the latter you aren't going to try to convince folks of).

...

While I'll still work on the asciistr proposal,

Thank you for that. I really wish I had time to, myself, but not for several weeks... :-(

...

that's unrelated to PEP 460 - it's about making hybrid APIs less

"It" refers to asciistr or to PEP 460?

...

painful to write in Python 3 when you're willing to place the burden of ensuring ASCII compatibility of binary data on the calling code.

Versus what?

Nick Coghlan

9:39 a.m.

On 14 Jan 2014 19:11, "Stephen J. Turnbull" <stephen@xemacs.org> wrote:

...

Nick Coghlan writes:

...
"Give up" makes it sound like I got tired of arguing without being convinced rather than admitting I was just plain wrong.

I thought it was something in between (you explicitly said "lenient PEP 460" doesn't hurt you, but my understanding was you still believe that there's a safer way, and it's the latter you aren't going to try to convince folks of).

I did say that at one point (when Guido first objected to the formatb idea), but I switched to complete agreement after he pointed out the ASCII assumption embedded in the formatting syntax itself.

...

...
While I'll still work on the asciistr proposal,

Thank you for that. I really wish I had time to, myself, but not for several weeks... :-(

Heh, depending on how many quirky edge cases we find, we may still be working on it by then, especially since there are still a few docs updates and other fixes I want to get into Python 3.4.

...

...
that's unrelated to PEP 460 - it's about making hybrid APIs less

"It" refers to asciistr or to PEP 460?

asciistr

...

...
painful to write in Python 3 when you're willing to place the burden of ensuring ASCII compatibility of binary data on the calling code.

Versus what?

Versus doing explicit decoding the way urllib.parse does - it only accepts strict 7-bit ASCII as binary input by default, so you have to decode to text externally in order to handle arbitrary input that may contain other bytes. Cheers, Nick.
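A short sketch of what "strict 7-bit ASCII as binary input" means in practice, using urllib.parse.urlsplit from current Python 3. The URL and the choice of UTF-8 for the external decode are assumptions made for the example; the real encoding is whatever the caller knows it to be.

```python
from urllib.parse import urlsplit

# Pure ASCII bytes are accepted and the result stays in bytes.
ascii_parts = urlsplit(b"http://example.com/path")

# Bytes outside 7-bit ASCII are rejected by the binary API.
raw = b"http://example.com/caf\xc3\xa9"
try:
    urlsplit(raw)
except UnicodeDecodeError:
    non_ascii_rejected = True
else:
    non_ascii_rejected = False

# The caller decodes externally, then uses the text API.
text_parts = urlsplit(raw.decode("utf-8"))

print(ascii_parts.path, non_ascii_rejected, text_parts.path)
```

The burden of knowing the encoding stays with the code that has that knowledge, which is the trade-off being contrasted with asciistr here.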

...

Ethan Furman

9:50 p.m.

On 01/11/2014 10:32 AM, Antoine Pitrou wrote:

...

On Sat, 11 Jan 2014 18:41:49 +0100 Victor Stinner <victor.stinner@gmail.com> wrote:

...
b'x=%s' % 10 is well defined, it's pure bytes.

It is well-defined? Then please explain me what the general case of b'%s' % x is supposed to call:

This is the key question, isn't it?

...

- does it call x.__bytes__? int.__bytes__ doesn't exist

Perhaps that's the problem. According to the docs:

========================================================================
object.__bytes__(self)

Called by bytes() to compute a byte-string representation of an object. This should return a bytes object.
========================================================================

Obviously, with the plethora of different binary possibilities for representing a number (how many bytes? endianness? which complement?), we would be well within our rights to decide that the "byte-string representation" of the numeric types is the ASCII equivalent of their __repr__ or __str__, and implement __bytes__ appropriately for them.

Any other object that wants to be represented easily in a byte stream would also have to implement __bytes__.

If necessary we could add __bytes__ to str for /strict/ ASCII conversion (even latin-1 would have to be explicitly encoded)[1].

-- ~Ethan~

[1] I'm iffy on this point as I'm not at all sure it's needed.
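For reference, this is how the existing __bytes__ protocol behaves for a user-defined class. The Header class is hypothetical and only there to illustrate the hook; the last line shows why bytes() on an int is not a formatting operation.

```python
class Header:
    """Hypothetical object that knows its own wire representation."""

    def __init__(self, name, value):
        self.name = name
        self.value = value

    def __bytes__(self):
        # Called by bytes(obj); must return a bytes object.
        return f"{self.name}: {self.value}\r\n".encode("ascii")


header_bytes = bytes(Header("Content-Length", 123))

# bytes(n) treats an int as a length and returns n zero bytes.
zeros = bytes(3)

print(header_bytes, zeros)
```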

Victor Stinner

11:11 p.m.

2014/1/11 Ethan Furman <ethan@stoneleaf.us>:

...

...
...
b'x=%s' % 10 is well defined, it's pure bytes.

It is well-defined? Then please explain me what the general case of b'%s' % x is supposed to call:

This is the key question, isn't it?

Python 2 and Python 3 are very different here. In Python 2, the "s" format of PyArg_Parse may call the __str__() method of an object. In Python 3, the "y*" format of PyArg_Parse uses the Py_buffer API, which has no slot (there is no protocol like a __getbuffer__() method). The Py_buffer API can only be implemented in C. For example, bytes, bytearray and memoryview implement it. PyArg_Parse also requires the buffer to be C-contiguous and to have a single segment (it uses the PyBUF_SIMPLE flag).

Said differently, bytes%args and bytes.format() would *not* call any method.

Victor

Glenn Linderman

1:10 a.m.

On 1/11/2014 1:50 PM, Ethan Furman wrote:

...

Perhaps that's the problem. According to the docs:

========================================================================
object.__bytes__(self)

Called by bytes() to compute a byte-string representation of an object. This should return a bytes object.
========================================================================

Obviously, with the plethora of different binary possibilities for representing a number (how many bytes? endianness? which complement?), we would be well within our rights to decide that the "byte-string representation" of the numeric types is the ASCII equivalent of their __repr__ or __str__, and implement __bytes__ appropriately for them. Any other object that wants to be represented easily in a byte stream would also have to implement __bytes__. If necessary we could add __bytes__ to str for /strict/ ASCII conversion (even latin-1 would have to be explicitly encoded)[1].

In spite of Victor's explanation of internals, which I didn't understand, this sounds like a very interesting idea, conceptually, that any object could implement its __bytes__ representation.

On the other hand, it would probably have to be parameterized in the general case: for binary data values, one protocol or format may wish the data to be big-endian, and another may wish the data to be little-endian; for str, one protocol or format may require one encoding and another may require a different encoding, even (as for email) for different parts of the message. So it could be somewhat complex, yet would be very powerful in allowing complex objects, made up of other objects, some of which might have a variety of potential bytes formats (think TIFF files, for example) to convert themselves into a stream of bytes that fits the standard.

On the flip side, one would want to convert the stream of bytes into the set of objects, which is a parsing problem. This is a bit beyond what can be done automatically, just by calling __bytes__ with no parameters, though. What it may be, though, is a meta-operation from which the needed bytes operations can be determined. It may also not be an easy "compatible with existing Python 2 code with minor tweaks" solution, either. It would be more like a pickle protocol, but pickle defines its own formats, and thus is useless for creating standard formats. I guess it would belong on python-ideas.
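To make the parameterization point concrete, here is a small sketch using existing standard-library calls: the same integer and the same text each have several valid byte representations, and the caller has to say which one is wanted. The specific values are arbitrary.

```python
import struct

value = 1024

# Same integer, two byte orders.
big = value.to_bytes(4, "big")
little = value.to_bytes(4, "little")

# struct expresses the same choice in its format string.
packed_big = struct.pack(">I", value)
packed_little = struct.pack("<I", value)

# Same text, two encodings.
text_utf8 = "\u00e9".encode("utf-8")
text_latin1 = "\u00e9".encode("latin-1")

print(big, little, text_utf8, text_latin1)
```

A parameterless __bytes__ cannot carry any of these choices, which is the limitation being pointed out.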

Victor Stinner

1:01 a.m.

Hi, 2014/1/11 Antoine Pitrou <solipsis@pitrou.net>:

...

...
b'x=%s' % 10 is well defined, it's pure bytes.

It is well-defined? Then please explain me what the general case of b'%s' % x is supposed to call:

- does it call x.__bytes__? int.__bytes__ doesn't exist
- does it call bytes(x)? bytes(10) gives b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'
- does it call x.__str__? you've reintroduced the Python 2 behaviour of conflating bytes and unicode

I don't want to call any method from bytes%args, only the Py_buffer API would be used. So the pseudo-code becomes:

- try to get Py_buffer
- on failure, check if it's an int: yes? ok, format it as decimal
- otherwise, raise an error

Or:

- is the object an int? yes, format it as decimal. no, use Py_buffer

--

I discussed with Antoine to try to understand how and why we disagree. Antoine prefers a pure API, whereas I'm trying to figure out if it would be possible to write code compatible with Python 2 and Python 3.

Using Antoine's PEP, it's possible to write code working on Python 2 and Python 3 which only manipulates bytes strings. The problem is that it's a pain to write code working on both Python versions when an argument is an integer. For example, the Python 2 code "Content-Length: %s\r\n" % 123 is written ("Content-Length: %s\r\n" % 123).encode('ascii') in Python 3. So the Python 2 and Python 3 code are different.

Supporting formatting integers would allow writing b"Content-Length: %s\r\n" % 123, which would work on Python 2 and Python 3.

(u'Content-Length: %s\r\n' % 123).encode('ascii') works on both Python versions, but it may require more work to port Python 2 code to Python 3.

--

Now I'm trying to find use cases in Mercurial and Twisted source code to see which features are required. First, I'm looking for a function that needs to format a number in decimal in a bytes string.

In issue #3982, I saw:

"""
HTTP chunking uses ASCII mixed with binary (octets).

With 2.6 you could write:

    def chunk(block):
        return b'{0:x}\r\n{1}\r\n'.format(len(block), block)
"""

and

"""
'Content-length: {}\r\n'.format(length)
"""

But are the examples real use cases, or artificial examples?

--

Augie Fackler gave an example from Mercurial:

"""
sys.stdout.write('%(state)s %(path)s\n' % {'state': 'M', 'path': 'some/filesystem/path'})

except we don't know the encoding of the filesystem path (Hi unix!) so we have to treat the whole thing as opaque bytes.

It's even more fun for 'log', because then it's got localized strings in it as well.
"""

But here I disagree with the design of Mercurial, filenames should be treated as text. If a filename would be pure binary, you should not write it in a terminal. Displaying binary data usually leads to displaying random characters and changing terminal options (ex: text starts blinking or is displayed in bold!?) :-)

For the localized string: again, it's also a design issue in my opinion. A localized string is text, not binary data :-)

--

Another option is that I cannot find use cases because there are no use cases for the PEP 460 and the PEP is useless :-)

Victor
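The second variant of the pseudo-code in this message ("is the object an int? ... no, use Py_buffer") can be approximated in pure Python, with memoryview standing in for the C-level Py_buffer request. This is only a sketch of the dispatch order; format_arg is a hypothetical helper, not anything proposed for the PEP, and using int.__repr__ to ignore subclass overrides is an assumption about how "ignore __str__" might be spelled.

```python
def format_arg(obj):
    """Render one %s argument as bytes without calling user methods."""
    if isinstance(obj, int):
        # int.__repr__ bypasses any __str__/__repr__ override in subclasses.
        return int.__repr__(obj).encode("ascii")
    try:
        # memoryview() succeeds only for objects exporting a buffer.
        view = memoryview(obj)
    except TypeError:
        raise TypeError(
            f"cannot format {type(obj).__name__} into bytes"
        ) from None
    return view.tobytes()


content_length = b"Content-Length: " + format_arg(123) + b"\r\n"
print(content_length)
```

Note that a str argument falls through to the buffer check and is rejected, so no implicit encoding ever happens.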

Paul Moore

8:57 a.m.

On 12 January 2014 01:01, Victor Stinner <victor.stinner@gmail.com> wrote:

...

Supporting integer formatting would make it possible to write b"Content-Length: %s\r\n" % 123, which would work on Python 2 and Python 3.

I'm surprised that no-one is mentioning b"Content-Length: %s\r\n" % str(123) which works on Python 2 and 3, is explicit, and needs no special-casing of int in the format code. Paul

Georg Brandl

9:23 a.m.

Am 12.01.2014 09:57, schrieb Paul Moore:

...

On 12 January 2014 01:01, Victor Stinner <victor.stinner@gmail.com> wrote:

...
Supporting integer formatting would make it possible to write b"Content-Length: %s\r\n" % 123, which would work on Python 2 and Python 3.

I'm surprised that no-one is mentioning b"Content-Length: %s\r\n" % str(123) which works on Python 2 and 3, is explicit, and needs no special-casing of int in the format code.

Certainly doesn't work on Python 3 right now, and never should :) Georg

Paul Moore

12:08 p.m.

On 12 January 2014 09:23, Georg Brandl <g.brandl@gmx.net> wrote:

...

...
On 12 January 2014 01:01, Victor Stinner <victor.stinner@gmail.com> wrote:

...
Supporting integer formatting would make it possible to write b"Content-Length: %s\r\n" % 123, which would work on Python 2 and Python 3.

I'm surprised that no-one is mentioning b"Content-Length: %s\r\n" % str(123) which works on Python 2 and 3, is explicit, and needs no special-casing of int in the format code.

Certainly doesn't work on Python 3 right now, and never should :)

Sorry, I meant str(123).encode("ascii"), and I'd probably use a helper function for it. I could easily argue at this point that this is the type of bug that having %-formatting operations on bytes would encourage - %s means "format a string" (from years of C and Python (text) experience) so I automatically supply a string argument when using %s in a bytes formatting context. The reality is that I was probably just being sloppy, though :-) Paul
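The helper Paul mentions could look roughly like the sketch below. The name `ascii_bytes` is hypothetical, and plain concatenation is used so that the example does not depend on %-formatting being available for bytes.

```python
def ascii_bytes(value):
    """Hypothetical helper: text form of value, explicitly encoded as ASCII."""
    return str(value).encode("ascii")


# The encoding step is spelled out at the call site instead of being
# hidden inside a format code.
header = b"Content-Length: " + ascii_bytes(123) + b"\r\n"
```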

Nick Coghlan

1:23 p.m.

On 12 Jan 2014 22:10, "Paul Moore" <p.f.moore@gmail.com> wrote:

...

On 12 January 2014 09:23, Georg Brandl <g.brandl@gmx.net> wrote:

...
...
On 12 January 2014 01:01, Victor Stinner <victor.stinner@gmail.com> wrote:

...

...
...
...
Supporting integer formatting would make it possible to write b"Content-Length: %s\r\n" % 123, which would work on Python 2 and Python 3.

I'm surprised that no-one is mentioning b"Content-Length: %s\r\n" % str(123) which works on Python 2 and 3, is explicit, and needs no special-casing of int in the format code.

Certainly doesn't work on Python 3 right now, and never should :)

Sorry, I meant str(123).encode("ascii"), and I'd probably use a helper function for it.

I could easily argue at this point that this is the type of bug that having %-formatting operations on bytes would encourage - %s means "format a string" (from years of C and Python (text) experience) so I automatically supply a string argument when using %s in a bytes formatting context.

The reality is that I was probably just being sloppy, though :-)

It's also something asciistr will help with once it is working - asciistr(123) on the RHS will work in both versions. Cheers, Nick.

...

Paul

Greg Ewing

9:06 p.m.

Paul Moore wrote:

...

I could easily argue at this point that this is the type of bug that having %-formatting operations on bytes would encourage - %s means "format a string" (from years of C and Python (text) experience) so I automatically supply a string argument when using %s in a bytes formatting context.

So don't call it %s -- call it something else such as %b. -- Greg

Mark Lawrence

9:16 p.m.

On 12/01/2014 21:06, Greg Ewing wrote:

...

Paul Moore wrote:

...
I could easily argue at this point that this is the type of bug that having %-formatting operations on bytes would encourage - %s means "format a string" (from years of C and Python (text) experience) so I automatically supply a string argument when using %s in a bytes formatting context.

So don't call it %s -- call it something else such as %b.

Sorry but you can't use %b as that'll confuse people who're used to it meaning "Month as locale’s abbreviated name." :) -- My fellow Pythonistas, ask not what our language can do for you, ask what you can do for our language. Mark Lawrence

Ethan Furman

9:26 p.m.

On 01/12/2014 01:06 PM, Greg Ewing wrote:

...

Paul Moore wrote:

...
I could easily argue at this point that this is the type of bug that having %-formatting operations on bytes would encourage - %s means "format a string" (from years of C and Python (text) experience) so I automatically supply a string argument when using %s in a bytes formatting context.

So don't call it %s -- call it something else such as %b.

Which is fine for 3.5+ code, but not at all helpful for a 2/3 code base. -- ~Ethan~

Kristján Valur Jónsson

2:50 p.m.

Well, my suggestion would be that we _should_ make it work, by having the %s format specifier on bytes objects mean: str(arg).encode('ascii', 'strict')

It would be an explicit encoding operator with a known, fixed, and well specified encoder. This would cover most of the use cases seen in this threadnought. Others could be handled with explicit str formatting and encoding.

Imho, this is not equivalent to re-introducing automatic type conversion between binary/unicode, it is adding a specific convenience function for explicitly asking for ASCII encoding.

K

________________________________________
From: Python-Dev [python-dev-bounces+kristjan=ccpgames.com@python.org] on behalf of Georg Brandl [g.brandl@gmx.net]
Sent: Sunday, January 12, 2014 09:23
To: python-dev@python.org
Subject: Re: [Python-Dev] PEP 460: allowing %d and %f and mojibake

Am 12.01.2014 09:57, schrieb Paul Moore:

...

On 12 January 2014 01:01, Victor Stinner <victor.stinner@gmail.com> wrote:

...
Supporting integer formatting would make it possible to write b"Content-Length: %s\r\n" % 123, which would work on Python 2 and Python 3.

I'm surprised that no-one is mentioning b"Content-Length: %s\r\n" % str(123) which works on Python 2 and 3, is explicit, and needs no special-casing of int in the format code.

Certainly doesn't work on Python 3 right now, and never should :) Georg
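Kristján's suggested meaning for %s on bytes, str(arg).encode('ascii', 'strict'), can be sketched as a standalone function. The name `percent_s` is hypothetical, and the special case for bytes arguments is an assumption added here: without it, str() of a bytes object on Python 3 would produce the text of its repr rather than its contents.

```python
def percent_s(arg):
    """Hypothetical sketch of the proposed %s behaviour for bytes formatting."""
    # Assumption of this sketch: binary arguments pass through unchanged.
    if isinstance(arg, (bytes, bytearray)):
        return bytes(arg)
    # Everything else goes through str() and a strict ASCII encode, so any
    # non-ASCII character raises UnicodeEncodeError instead of being guessed.
    return str(arg).encode("ascii", "strict")


fields = [percent_s(10), percent_s("text"), percent_s(b"\xff\x00")]
```

Note that the outcome depends on the type and content of the argument: a str holding only ASCII succeeds, while a str holding any other character fails at formatting time.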

Nick Coghlan

4:09 p.m.

On 13 Jan 2014 01:22, "Kristján Valur Jónsson" <kristjan@ccpgames.com> wrote:

...

Well, my suggestion would be that we _should_ make it work, by having the %s format specifier on bytes objects mean: str(arg).encode('ascii', 'strict')

...

It would be an explicit encoding operator with a known, fixed, and well specified encoder. This would cover most of the use cases seen in this threadnought. Others could be handled with explicit str formatting and encoding.

Imho, this is not equivalent to re-introducing automatic type conversion between binary/unicode, it is adding a specific convenience function for explicitly asking for ASCII encoding.

It is not explicit, it is implicit - whether or not the resulting string assumes ASCII compatibility depends on whether you pass a binary value (no assumption) or a string value (assumes ASCII compatibility). This kind of data driven change in assumptions about correctness is utterly unacceptable in the core text and binary types in Python 3.

It's also completely unnecessary - asciistr will be a third party extension type that allows those users pining for the halcyon days of the Python 2 str type to stop harassing the core devs with requests to compromise the core Python 3 text model with implicit encoding operations. I'll ensure any interoperability bugs between asciistr and the core types that can't be worked around get fixed.

A separate type is genuinely explicit (since the ASCII assumption is no longer hidden from the type system), and allows much simpler interoperability for code that wants it (indexing asciistr will eventually produce length 1 asciistr instances instead of str instances, it will avoid the bytes(intval) discrepancy, it will avoid the str(bytesval) problem, etc).

I've been suggesting for years that Python 3 might need a third type (not required to be a builtin, since it's so specialised), but folks migrating from Python 2 have been so focused on making the core binary type a hybrid type again, the notion of taking advantage of PEP 393 to create a dedicated extension type specifically for working with ASCII compatible binary protocols has failed to compute. I'm hoping a test suite and preliminary implementation will help more people to finally get the point.

Regards, Nick.
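The discrepancies Nick lists refer to behaviour of the built-in Python 3 types. A short illustration, using only the core bytes and str types (the variable names are arbitrary):

```python
# bytes(intval): an int argument is a length, not a value to format.
zero_filled = bytes(3)        # b'\x00\x00\x00', not b'3'

# str(bytesval): the result is the text of the repr, not a decoded string.
repr_text = str(b"abc")       # "b'abc'"

# Indexing: bytes yields an int, str yields a length-1 str.
first_byte = b"abc"[0]        # 97
first_char = "abc"[0]         # 'a'
```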

...

K

________________________________________
From: Python-Dev [python-dev-bounces+kristjan=ccpgames.com@python.org] on behalf of Georg Brandl [g.brandl@gmx.net]
Sent: Sunday, January 12, 2014 09:23
To: python-dev@python.org
Subject: Re: [Python-Dev] PEP 460: allowing %d and %f and mojibake

Am 12.01.2014 09:57, schrieb Paul Moore:

...
On 12 January 2014 01:01, Victor Stinner <victor.stinner@gmail.com> wrote:

...
Supporting integer formatting would make it possible to write b"Content-Length: %s\r\n" % 123, which would work on Python 2 and Python 3.

I'm surprised that no-one is mentioning b"Content-Length: %s\r\n" % str(123) which works on Python 2 and 3, is explicit, and needs no special-casing of int in the format code.

Certainly doesn't work on Python 3 right now, and never should :)

Georg

Ethan Furman

4:21 p.m.

On 01/12/2014 08:09 AM, Nick Coghlan wrote:

...

On 13 Jan 2014 01:22, "Kristján Valur Jónsson" wrote:

...
Imho, this is not equivalent to re-introducing automatic type conversion between binary/unicode, it is adding a specific convenience function for explicitly asking for ASCII encoding.

It is not explicit, it is implicit - whether or not the resulting string assumes ASCII compatibility depends on whether you pass a binary value (no assumption) or a string value (assumes ASCII compatibility).

Nick, I don't understand what you are saying here. Are you saying that the result of b'%s' % var may be either a bytes object or a str object? Because that would be wrong -- it would always be a bytes object. -- ~Ethan~

Ethan Furman

5:03 p.m.

On 01/12/2014 08:21 AM, Ethan Furman wrote:

...

On 01/12/2014 08:09 AM, Nick Coghlan wrote:

...
On 13 Jan 2014 01:22, "Kristján Valur Jónsson" wrote:

...
Imho, this is not equivalent to re-introducing automatic type conversion between binary/unicode, it is adding a specific convenience function for explicitly asking for ASCII encoding.

It is not explicit, it is implicit - whether or not the resulting string assumes ASCII compatibility depends on whether you pass a binary value (no assumption) or a string value (assumes ASCII compatibility).

Nick, I don't understand what you are saying here. Are you saying that the result of b'%s' % var may be either a bytes object or a str object? Because that would be wrong -- it would always be a bytes object.

Okay, I just went and took a closer look at the asciistr type [1]. For what it's worth I don't think this is Antoine's understanding of what we [2] are asking for, nor is it what we are asking for (I'm sure Antoine will correct me if I'm wrong. ;)

We know full well the difference between unicode and bytes, and we know full well that numbers and much of the text we need has an ASCII (bytes!) representation. When we do a b'Content Length: %d' % len(binary_data) we are expecting to get back a bytes object, /not/ a unicode object.

Your asciistr, which sometimes returns bytes and sometimes returns text, is absolutely *not* what we want.

-- ~Ethan~

[1] https://github.com/jeamland/asciicompat
[2] the dbf and pdf folks, at least

Paul Moore

5:26 p.m.

On 12 January 2014 17:03, Ethan Furman <ethan@stoneleaf.us> wrote:

...

We know full well the difference between unicode and bytes, and we know full well that numbers and much of the text we need has an ASCII (bytes!) representation. When we do a b'Content Length: %d' % len(binary_data) we are expecting to get back a bytes object, /not/ a unicode object.

What I am struggling to understand here is what room for compromise there is. Clearly, for whatever reason,

b'Content Length: ' + str(len(binary_data)).encode('ascii')

is not acceptable for you. OK, fair enough. Also, apparently, writing a helper

def int_to_bytes(n): return str(n).encode('ascii')

b'Content Length: ' + int_to_bytes(len(binary_data))

is unacceptable. But I'm not clear why it's unacceptable. Maybe I missed the explanation - God knows, the thread is long enough :-)

On the other hand, Nick has explained why b'Content Length: %d' % len(binary_data) is unacceptable to him (you don't have to agree with his opinion, just concede that he has explained his position in a way that you understand).

I'm not trying to argue you're wrong - I don't know your codebase, nor do I know your application area. But surely somewhere between "we must have % formatting including %d for bytes" and the above, there's a middle ground that you *are* willing to accept? Can you give any indications of what that might be? What, specifically, about the helper function is the problem? I don't think it is any less space efficient, it doesn't double-encode, and I don't think it's more difficult to understand (although it is a little longer, it trades that off against being a bit more explicit as to what's going on).

Surely you're not arguing that your code must work unchanged (not "there's a way of writing the code so it works on Python 2 and 3", but "the code you currently have for Python 2 must work with no changes at all")?

Can you give an example of code that is *nearly* acceptable to you, which works in Python 2 and 3 today, and explain what improvements you would like to see to it in order to use it instead of waiting for a core change?

Paul

Ethan Furman

6:26 p.m.

On 01/12/2014 09:26 AM, Paul Moore wrote:

...

On 12 January 2014 17:03, Ethan Furman <ethan@stoneleaf.us> wrote:

...
We know full well the difference between unicode and bytes, and we know full well that numbers and much of the text we need has an ASCII (bytes!) representation. When we do a b'Content Length: %d' % len(binary_data) we are expecting to get back a bytes object, /not/ a unicode object.

What I am struggling to understand here is what room for compromise there is. Clearly, for whatever reason,

b'Content Length: ' + str(len(binary_data)).encode('ascii')

is not acceptable for you. OK, fair enough. Also, apparently, writing a helper

def int_to_bytes(n): return str(n).encode('ascii')

b'Content Length: ' + int_to_bytes(len(binary_data))

is unacceptable. But I'm not clear why it's unacceptable. Maybe I missed the explanation - God knows, the thread is long enough :-)

True enough! ;) It's unacceptable in the sense that the bytes type is /almost/ there, it's /almost/ what is needed to handle the boundary conditions. We have a __bytes__ method (how is it supposed to be used?) that could be made to fit the interpolation bill.

It seems to me the core of Nick's refusal is the (and I agree!) rejection of bytes interpolation returning unicode -- but that's not what I'm asking for! I'm asking for it to return bytes, with the interpolated data (in the case of %d, %s, etc) being strictly-ASCII encoded.

...

On the other hand, Nick has explained why b'Content Length: %d' % len(binary_data) is unacceptable to him (you don't have to agree with his opinion, just concede that he has explained his position in a way that you understand).

Only because he (or Benno) finally wrote some tests and I was able to see what he thought I was wanting. Which does seem to leave a *tiny* bit of wiggle room if bytes interpolation always returns bytes, and never unicode (yeah, I know, snowball's chance and all that).

...

I'm not trying to argue you're wrong - I don't know your codebase, nor do I know your application area. But surely somewhere between "we must have % formatting including %d for bytes" and the above, there's a middle ground that you *are* willing to accept? Can you give any indications of what that might be? What, specifically, about the helper function is the problem? I don't think it is any less space efficient, it doesn't double-encode, and I don't think it's more difficult to understand (although it is a little longer, it trades that off against being a bit more explicit as to what's going on). Surely you're not arguing that your code must work unchanged (not "there's a way of writing the code so it works on Python 2 and 3", but "the code you currently have for Python 2 must work with no changes at all")?

I'm arguing from three PoVs:

1) 2 & 3 compatible code base
2) having the bytes type /be/ the boundary type
3) readable code

...

Can you give an example of code that is *nearly* acceptable to you, which works in Python 2 and 3 today, and explain what improvements you would like to see to it in order to use it instead of waiting for a core change?

I'm not trying to be difficult (just naturally good at it, I guess ;) , but I don't see a lot of room for compromises -- I would like % interpolation, I'm told I have to use a helper function. I will if I have to, but first I have to try and make myself understood, and I'm not sure that has happened yet.

Following Nick's example I'm writing up some tests that clearly show what I would like to see. Then at least we can debate what I'm actually asking for, and not what the (understandably) unicode-what-a-mess-we-had-in-py2k-don't-want-again that some think I am asking for.

-- ~Ethan~
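On the question of how __bytes__ is supposed to be used: in Python 3 it is the hook that the bytes() constructor calls. A minimal sketch, with a hypothetical `ContentLength` class invented for illustration:

```python
class ContentLength:
    """Hypothetical header object that knows its own wire representation."""

    def __init__(self, length):
        self.length = length

    def __bytes__(self):
        # The object decides how it is serialised; the ASCII encoding of
        # the number is written out explicitly here.
        return b"Content-Length: " + str(self.length).encode("ascii") + b"\r\n"


header = bytes(ContentLength(123))  # bytes() invokes __bytes__
```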

Paul Moore

7 p.m.

On 12 January 2014 18:26, Ethan Furman <ethan@stoneleaf.us> wrote:

...

True enough! ;) It's unacceptable in the sense that the bytes type is /almost/ there, it's /almost/ what is needed to handle the boundary conditions. We have a __bytes__ method (how is it supposed to be used?) that could be made to fit the interpolation bill.

And yet I still don't follow what you *want*. Unless it's that b'%d' % (12,) must work and give b'12', and nothing else is acceptable. Maybe more accurately, I don't see what you want to do that can't be done in another way. All I'm seeing in your rejection of alternative suggestions is "it's not %-interpolation using %d".

...

I'm arguing from three PoVs:

1) 2 & 3 compatible code base
2) having the bytes type /be/ the boundary type
3) readable code

The only one of these that I can see being in any way an argument against

def int_to_bytes(n): return str(n).encode('ascii')

b'Content Length: ' + int_to_bytes(len(binary_data))

is (3), and that's largely subjective. Personally, I see very little difference between the above and %d-interpolation in terms of *readability*. Brevity, certainly %d wins. But that's not important on its own, and I'd argue that my version is more clear in terms of describing the intent (and would be even better if I wasn't rubbish at thinking of function names, or if this wasn't in isolation, and more application-focused functions were used).

...

It seems to me the core of Nick's refusal is the (and I agree!) rejection of bytes interpolation returning unicode -- but that's not what I'm asking for! I'm asking for it to return bytes, with the interpolated data (in the case of %d, %s, etc) being strictly-ASCII encoded.

My reading of Nick's refusal is that %d takes a value which is semantically a number, converts it into a base-10 representation (which is semantically a *string*, not a sequence of bytes[1]) and then *encodes* that string into a series of bytes using the ASCII encoding. That is *two* semantic transformations, and one (the ASCII encoding) is *implicit*. Specifically, it's implicit because (a) the normal reading of %d is "produce the base-10 representation of a number", and a base-10 representation is a *string*, and (b) because nowhere has ASCII been mentioned (why not UTF16? that would be entirely plausible for a wchar-based environment like Windows). And a core principle of the bytes/text separation in Python 3 is that encoding should never happen implicitly.

By the way, I should point out that I would never have understood *any* of the ideas involved in this thread before Python 3 forced me to think about Unicode and the distinction between text and bytes. And yet, I now find myself, in my (non-Python) work environment, being the local expert whenever applications screw up text encodings. So I, for one, am very grateful for Python 3's clear separation of bytes and text. (And if I sometimes come across as over-dogmatic, I apologise - put it down to the enthusiasm of the recent convert :-))

Paul

[1] If you cannot see that there's no essential reason why the base-10 representation '123' should correspond to the bytes b'\x31\x32\x33' then you are probably not old enough to have started programming on EBCDIC-based computers :-)
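The two transformations Paul describes can be written out as separate steps. The sketch below uses the standard codecs "ascii", "cp500" (an EBCDIC code page) and "utf-16-le" purely as examples; the choice of alternatives is an assumption made for illustration.

```python
number = 123

# Step 1: number -> text (base-10 representation, a str).
text = "%d" % number

# Step 2: text -> bytes. The result depends entirely on the encoding chosen.
ascii_bytes = text.encode("ascii")       # b'\x31\x32\x33'
ebcdic_bytes = text.encode("cp500")      # b'\xf1\xf2\xf3'
utf16_bytes = text.encode("utf-16-le")   # b'1\x002\x003\x00'
```

The same three digits produce three different byte sequences, which is why the choice of ASCII in step 2 is a decision rather than a given.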

Ethan Furman

7:14 p.m.

On 01/12/2014 11:00 AM, Paul Moore wrote:

...

And yet I still don't follow what you *want*. Unless it's that b'%d' % (12,) must work and give b'12', and nothing else is acceptable.

Nothing else is ideal. I'll go that route if I have to. I understand that in the real world you go with what works, but in the development stage you fight for the ideal. :)

...

My reading of Nick's refusal is that %d takes a value which is semantically a number, converts it into a base-10 representation (which is semantically a *string*, not a sequence of bytes[1]) and then *encodes* that string into a series of bytes using the ASCII encoding. That is *two* semantic transformations, and one (the ASCII encoding) is *implicit*. Specifically, it's implicit because (a) the normal reading of %d is "produce the base-10 representation of a number", and a base-10 representation is a *string*, and (b) because nowhere has ASCII been mentioned (why not UTF16? that would be entirely plausible for a wchar-based environment like Windows). And a core principle of the bytes/text separation in Python 3 is that encoding should never happen implicitly.

That could be. And yet the bytes type already has several concessions to ASCII encoding.

...

By the way, I should point out that I would never have understood *any* of the ideas involved in this thread before Python 3 forced me to think about Unicode and the distinction between text and bytes. And yet, I now find myself, in my (non-Python) work environment, being the local expert whenever applications screw up text encodings. So I, for one, am very grateful for Python 3's clear separation of bytes and text. (And if I sometimes come across as over-dogmatic, I apologise - put it down to the enthusiasm of the recent convert :-))

No worries. I was forced to learn the difference when I wrote my dbf module for 2.5. Took longer than I'd like to admit to realize that ASCII was an encoding. :/

...

[1] If you cannot see that there's no essential reason why the base-10 representation '123' should correspond to the bytes b'\x31\x32\x33' then you are probably not old enough to have started programming on EBCDIC-based computers :-)

I can see it. :) But bytes already acknowledges an ASCII bias. ;) And even EBCDIC machines speak ASCII when talking telnet. -- ~Ethan~
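The ASCII bias Ethan refers to is visible in several existing operations on the Python 3 bytes type. A few examples (variable names are arbitrary):

```python
upper = b"content-length".upper()    # ASCII letters are case-mapped
accented = b"\xe9".upper()           # a non-ASCII byte is left untouched
is_alpha = b"abc".isalpha()          # True for ASCII letters
is_alpha_high = b"\xe9".isalpha()    # False: only ASCII letters count
number = int(b"42")                  # ASCII digits are parsed directly
lines = b"a\r\nb\n".splitlines()     # ASCII line endings are recognised
```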

Glenn Linderman

9:59 p.m.

On 1/12/2014 11:14 AM, Ethan Furman wrote:

...

...
And a core principle of the bytes/text separation in Python 3 is that encoding should never happen implicitly.

That could be. And yet the bytes type already has several concessions to ASCII encoding.

"%d" % 26 => an explicit request to convert binary integer to a base-10 Unicode/text representation of the integer

b"%d" % 26 => an explicit request to convert binary integer to a base-10 ASCII bytes representation of the integer

The leading "b" seems to be a very explicit request for bytes rather than characters to me, and seems much more attractive than the proposals to embed binary in Unicode by abusing Latin-1 encoding.

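Glenn's two expressions can be compared side by side. One caveat: at the time of this thread the bytes form did not work on Python 3; %-formatting for bytes was added later, in Python 3.5 (PEP 461), so the sketch below assumes Python 3.5 or newer.

```python
text_result = "%d" % 26     # str in, str out
bytes_result = b"%d" % 26   # bytes in, bytes out (Python 3.5+)

# The two results are different types and never compare equal in Python 3.
same = (text_result == bytes_result)
```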
Stephen J. Turnbull

12:08 a.m.

Glenn Linderman writes:

...

the proposals to embed binary in Unicode by abusing Latin-1 encoding.

Those aren't "proposals", they are currently feasible techniques in Python 3 for *some* use cases. The question is why infecting Python 3 with the byte/character confoundance virus is preferable to such techniques, especially if their (serious!) deficiencies are removed by creating a new type such as asciistr.

Glenn Linderman

6:46 a.m.

On 1/12/2014 4:08 PM, Stephen J. Turnbull wrote:

...

Glenn Linderman writes:

...
the proposals to embed binary in Unicode by abusing Latin-1 encoding.

Those aren't "proposals", they are currently feasible techniques in Python 3 for *some* use cases.

The question is why infecting Python 3 with the byte/character confoundance virus is preferable to such techniques, especially if their (serious!) deficiencies are removed by creating a new type such as asciistr.

"smuggled binary" (great term borrowed from a different subthread) muddies the waters of what you are dealing with. As long as the actual data is only Latin-1 and smuggled binary, the technique probably isn't too bad... you can define the "smuggled binary" as a "decoding" of binary to text, sort of like base64 "decodes" binary to ASCII. And it can be a useful technique.
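The "smuggled binary" technique relies on Latin-1 mapping each of the 256 byte values to the code point with the same number, so the round trip is lossless. A minimal illustration (variable names are arbitrary):

```python
blob = bytes(range(256))                # every possible byte value
smuggled = blob.decode("latin-1")       # bytes -> str, one code point per byte
restored = smuggled.encode("latin-1")   # str -> bytes, original bytes recovered

# Each character's code point equals the byte it stands for.
aligned = all(ord(ch) == byte for ch, byte in zip(smuggled, blob))
```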

As soon as you introduce "smuggled non-ASCII, non-Latin-1 text" encodings into the mix, it gets thoroughly confusing... just as confusing as the Python 2 text model. It takes decode+encode to do the smuggled text, plus encode to push it to the boundary, plus you have text that you know is text, but because of the required techniques for smuggling it, you can't operate on it or view it properly as the text that it should be.

The "byte/character confoundance virus" is a hobgoblin of paranoid perception. In another post, I pointed out that ''' b"%d" % 25 ''' is not equivalent to ''' "%d" % 25 ''' because of the "b" in the first case. So the "implicit" encoding that everyone on that side of the fence was talking about was not at all implicit, but explicit. The numeric characters produced by %d are clearly in the ASCII subset of text, so having b"%d" % 25 produce pre-encoded ASCII text is explicit and practical.

My only concern was what b"%s" % 'abc' should do, because in general, str may not contain only ASCII. (generalize to b"%s" % str(...) ). Guido solved that one nicely.

Of course, at this point, I could punt the whole argument off to "Guido said so", but since you asked me, I felt it appropriate to respond from my perspective... and I'm not sure Guido specifically addressed your smuggled binary proposal.

When the mixture of text and binary is done as encoded text in binary, then it is obvious that only limited text processing can be performed, and getting the text there requires that it was encoded (hopefully properly encoded per the binary specification being created) to become binary. And there are no extra, confusing Latin-1 encode/decode operations required.

From a higher-level perspective, I think it would be great to have a module, perhaps called "boundary" (let's call it that for now), that allows some definition syntax (augmented BNF? augmented ABNF?) to explain the format of a binary blob. And then provide methods for generating and parsing it to/from Python objects. Obviously, the ABNF couldn't understand Python objects; instead, Python objects might define the ABNF to which they correspond, and methods for accepting binary and producing the object (factory method?) and methods for generating the binary. As objects build upon other objects, the ABNF to which they correspond could be constructed, and perhaps even proven to be capable of parsing all valid blobs corresponding to the specification, and perhaps even proven to be capable of generating only valid blobs (although I'm not a software proof guru; last I heard there were definite limits on the ability to do proofs, but maybe this is a limited enough domain it could work).

Then all blobs could be operated on sort of like web browsers operate on the DOM, or some XML parsing libraries, by defining each blob as a collection of objects for the pieces. XML is far too wordy for practical use (but hey! it is readable) but perhaps it could be practical if tokenized, and then the tokenized representation could be converted to a DOM just like XML and HTML are. (this is mostly to draw the parallel in the parsing and processing techniques; I'm not seriously suggesting a binary version of XML, but there is a strong parallel, and it could be done).

Given a DOM-like structure, a validator could be written to operate on it, though, to provide, if not a proof, at least a sanity check. And, given the DOM-like structure, one call to the top-level object to generate the blob format would walk over all of them, generating the whole blob.

Off I go, drifting into Python ideas.... but I have a program I want to rewrite that could surely use some of these techniques (and probably will), because it wants to read several legacy formats, and produce several legacy formats, as well as a new, more comprehensive format. So the objects will be required to parse/generate 4 different blob structures, one of which has its own set of several legacy variations.

Stephen J. Turnbull

2:43 p.m.

Glenn Linderman writes:

...

On 1/12/2014 4:08 PM, Stephen J. Turnbull wrote:

...
Glenn Linderman writes:

...
the proposals to embed binary in Unicode by abusing Latin-1 encoding.

...

...
Those aren't "proposals", they are currently feasible techniques in Python 3 for *some* use cases. The question is why infecting Python 3 with the byte/character confoundance virus is preferable to such techniques, especially if their (serious!) deficiencies are removed by creating a new type such as asciistr.

...

"smuggled binary" (great term borrowed from a different subthread) muddies the waters of what you are dealing with.

Not really. The "mud" is one or more of the serious deficiencies. It can be removed, I believe (and Nick apparently does, too). "asciistr" is one way to try that.

...

When the mixture of text and binary is done as encoded text in binary, then it is obvious that only limited text processing can be performed,

Hardly. After all, that's how all text processing was done for decades. Still is, in some programs, especially C programs.

...

And there are no extra, confusing Latin-1 encode/decode operations required.

The "extra" encode/decode operations are mostly (perhaps all) due to examples that started from bytes and end with bytes. Of course if you assume that API and propose to do the operations using Unicode, you'll get "extra" decode/encode operations.

...

From a higher-level perspective, I think it would be great to have a module, perhaps called "boundary" (let's call it that for now), that allows some definition syntax (augmented BNF? augmented ABNF?) to explain the format of a binary blob.

We have struct, for one. I'm not sure why you want more than that. I suppose you could go all the way to ASN.1.
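As a reminder of what the struct module already covers, here is a small sketch that generates and parses a binary record. The record layout (a 2-byte type followed by a 4-byte payload length, both big-endian) and the function names are hypothetical, invented only for this example.

```python
import struct

# Hypothetical layout: unsigned 16-bit type, unsigned 32-bit length, big-endian.
HEADER = struct.Struct(">HI")


def build_record(kind, payload):
    """Prefix the payload with its packed header."""
    return HEADER.pack(kind, len(payload)) + payload


def parse_record(blob):
    """Return (kind, payload) from a blob produced by build_record."""
    kind, length = HEADER.unpack_from(blob)
    start = HEADER.size
    return kind, blob[start:start + length]


record = build_record(1, b"abc")
parsed = parse_record(record)
```

struct handles fixed-layout fields like these; it does not describe grammars with nested or variable structure, which is closer to what the "boundary" idea asks for.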

Glenn Linderman

8:44 p.m.

On 1/13/2014 6:43 AM, Stephen J. Turnbull wrote:

...

Glenn Linderman writes:

...
On 1/12/2014 4:08 PM, Stephen J. Turnbull wrote:

...
Glenn Linderman writes:

...
the proposals to embed binary in Unicode by abusing Latin-1 encoding.

...
...
Those aren't "proposals", they are currently feasible techniques in Python 3 for *some* use cases. The question is why infecting Python 3 with the byte/character confoundance virus is preferable to such techniques, especially if their (serious!) deficiencies are removed by creating a new type such as asciistr.

...
"smuggled binary" (great term borrowed from a different subthread) muddies the waters of what you are dealing with.

Not really. The "mud" is one or more of the serious deficiencies. It can be removed, I believe (and Nick apparently does, too). "asciistr" is one way to try that.

Yes really. Use of smuggled binary means the str containing it can no longer be treated completely as a str. That is "muddier" than having a str that is only a str.

...

...
When the mixture of text and binary is done as encoded text in binary, then it is obvious that only limited text processing can be performed,

Hardly. After all, that's how all text processing was done for decades. Still is, in some programs, especially C programs.

I disagree, and so do you... text processing must be limited to the text subsets of the text that includes smuggled binary... that is limited... you can't just apply text searches, scans, and transformations over the complete str, when it contains smuggled binary. You know that, but must have not considered it a limitation, because you know you can do any text processing on the text parts. But it is a limitation to have to keep track of it, and apply the text processing only to the parts that are text. Yes, it has been done that way, and the limitations of doing it that way led to the plethora of encodings each of which was intended to be sufficient for some problem domain, but most of which were only sufficient for a smaller problem domain than intended, especially as communications became more global in nature.

...

...
And there are no extra, confusing Latin-1 encode/decode operations required.

The "extra" encode/decode operations are mostly (perhaps all) due to examples that started from bytes and end with bytes. Of course if you assume that API and propose to do the operations using Unicode, you'll get "extra" decode/encode operations.

No, the "extra" encode/decode are from the requirement that smuggled binary use latin-1, and other binary flavors are not always latin-1.

...

...
From a higher-level perspective, I think it would be great to have a module, perhaps called "boundary" (let's call it that for now), that allows some definition syntax (augmented BNF? augmented ABNF?) to explain the format of a binary blob.

We have struct, for one. I'm not sure why you want more than that. I suppose you could go all the way to ASN.1.

struct is insufficient to capture a whole file format, with optional parts, although it suffices for fragments.
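To illustrate the point about optional parts, here is a minimal sketch. The record layout, the flag value, and the name parse_record are all made up for this example and do not describe any real file format. struct handles the fixed-size fragments, but the "present only if a flag is set" rule has to be written by hand:

```python
import struct

# Hypothetical layout: 2-byte magic, 1-byte flags, 2-byte big-endian length,
# then `length` payload bytes, then a 4-byte checksum only if flag bit 0 is set.
HEADER = struct.Struct(">2sBH")
CHECKSUM = struct.Struct(">I")
HAS_CHECKSUM = 0x01

def parse_record(blob):
    magic, flags, length = HEADER.unpack_from(blob, 0)
    offset = HEADER.size
    payload = blob[offset:offset + length]
    offset += length
    checksum = None
    if flags & HAS_CHECKSUM:
        # This conditional is the part a struct format string cannot express.
        (checksum,) = CHECKSUM.unpack_from(blob, offset)
    return magic, payload, checksum

with_sum = HEADER.pack(b"XY", HAS_CHECKSUM, 3) + b"abc" + CHECKSUM.pack(42)
without_sum = HEADER.pack(b"XY", 0, 2) + b"hi"
print(parse_record(with_sum))
print(parse_record(without_sum))
```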

Stephen J. Turnbull

4:58 a.m.

Glenn Linderman writes:

...

On 1/13/2014 6:43 AM, Stephen J. Turnbull wrote:

...
Glenn Linderman writes:

...

...
...
"smuggled binary" (great term borrowed from a different subthread) muddies the waters of what you are dealing with.

...

...
Not really. The "mud" is one or more of the serious deficiencies. It can be removed, I believe (and Nick apparently does, too). "asciistr" is one way to try that.

...

Yes really. Use of smuggled binary means the str containing it can no longer be treated completely as a str. That is "muddier" than having a str that is only a str.

You don't seem to understand what *asciistr* is: it's a *different type* that is simultaneously compatible in operation with bytes and str, by automatically converting to whichever it is used with. If we used asciistr, str would no longer be muddy (except in cases where we would have used surrogateescape anyway).

You also don't seem to understand that bytes are conceptually pure mud. Anything that is pushed to bytes because you don't know what type it is (or because at the time the program is written, the type can't be known) is no longer subject to duck-typing.

So the question is "how is mud best handled?" Obviously, incorporating it in str with .decode('latin1') is inappropriate.

However, if you use .decode('ascii') you have your choice of error handlers. If you use errors='strict' then no mud can get in. Use of any other error handler is obviously a "consenting adults" behavior; it should only be done when you expect that you can keep the muddy str from leaking into places where it might be passed to an I/O function. (Note that the internal processing of an application that never outputs such a str is completely conformant to the Unicode Standard. That's not a goal of Python, since surrogateescape is designed to be used on output too. But if the developer applies that standard to each *program component*, he's going to be in pretty good shape.)

If you use asciistr, then you're pretty much in complete control. The exception is operations that munge individual characters (case conversion). If you have a protocol with ASCII keywords but their case is specified, you'll need to define another type to remove the case-munging methods if you want that level of safety.

If, as in your proposal, bytes are tagged with descriptions, you are effectively creating types on the fly. But if the program doesn't anticipate that, they're mud. If the program doesn't anticipate all of them, those descriptions that are unhandled become mud, too.

ISTM that the "syntax descriptor" feature is already present in Python, and it's called "class". So, IMHO, simply converting to an appropriate Python type on input is what should be done, but in any case, I don't see how adding a "syntax descriptor" attribute to bytes is going to improve the situation significantly.

Note that such a class can postpone parsing for efficiency or lack of information reasons, and store the object as bytes until needed. But this is not the same as passing around naked bytes, because the class can ensure that bytes can't get out, only parsed objects.
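For readers following along, the difference between the error handlers can be sketched like this. The sample bytes are invented for illustration; the codec names and error-handler names are standard Python 3:

```python
# Hypothetical wire data: ASCII keyword followed by one non-ASCII byte.
raw = b"Name: caf\xe9"

# errors='strict' (the default): no "mud" gets into the str.
try:
    raw.decode("ascii")
except UnicodeDecodeError as exc:
    print("strict refused:", exc.reason)

# errors='surrogateescape': the undecodable byte is smuggled in as a
# lone surrogate (U+DCE9) and can be restored on output.
muddy = raw.decode("ascii", errors="surrogateescape")
assert muddy[-1] == "\udce9"
assert muddy.encode("ascii", errors="surrogateescape") == raw

# The muddy str cannot leak out through an ordinary strict codec.
try:
    muddy.encode("utf-8")
except UnicodeEncodeError:
    print("utf-8 refused the smuggled byte")
```

The last step is the "leak" check: the smuggled byte only survives if the output side also opts in to surrogateescape.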

Guido van Rossum

5:06 a.m.

Sorry to butt in, but can you post a link to the asciistr code? Google has too many hits for other things to be useful to find it, it seems. -- --Guido van Rossum (python.org/~guido)

Ethan Furman

5:10 a.m.

On 01/13/2014 09:06 PM, Guido van Rossum wrote:

...

Sorry to butt in, but can you post a link to the asciistr code? Google has too many hits for other things to be useful to find it, it seems.

https://github.com/jeamland/asciicompat -- ~Ethan~

Ethan Furman

5:12 a.m.

On 01/13/2014 09:06 PM, Guido van Rossum wrote:

...

In contrast, here's the tests I drew up for what I thought bytes should do for us (no code, just tests): https://bitbucket.org/stoneleaf/bytestring -- ~Ethan~

Ethan Furman

5:49 a.m.

On 01/13/2014 09:12 PM, Ethan Furman wrote:

...

On 01/13/2014 09:06 PM, Guido van Rossum wrote:

...
In contrast, here's the tests I drew up for what I thought bytes should do for us (no code, just tests):

https://bitbucket.org/stoneleaf/bytestring

Ugh. Ignore for now, I need to update them to reflect the recent developments. :/ -- ~Ethan~

Glenn Linderman

5:22 a.m.

On 1/13/2014 9:06 PM, Guido van Rossum wrote:

...

Sorry to butt in, but can you post a link to the asciistr code? Google has too many hits for other things to be useful to find it, it seems.

https://github.com/jeamland/asciicompat

Glenn Linderman

6:01 a.m.

On 1/13/2014 8:58 PM, Stephen J. Turnbull wrote:

...

Glenn Linderman writes:

...
On 1/13/2014 6:43 AM, Stephen J. Turnbull wrote:

...
Glenn Linderman writes:

...
...
...
"smuggled binary" (great term borrowed from a different subthread) muddies the waters of what you are dealing with.

...
...
Not really. The "mud" is one or more of the serious deficiencies. It can be removed, I believe (and Nick apparently does, too). "asciistr" is one way to try that.

...
Yes really. Use of smuggled binary means the str containing it can no longer be treated completely as a str. That is "muddier" than having a str that is only a str.

You don't seem to understand what *asciistr* is: it's a *different type* that is simultaneously compatible in operation with bytes and str, by automatically converting to whichever it is used with. If we used asciistr, str would no longer be muddy (except in cases where we would have used surrogateescape anyway).

No, I haven't fully understood what asciistr is, only Nick's several descriptions of it. I do understand it is a different type, and can interact with both bytes and str. If it automatically converts, then it sounds terribly inefficient with long data, but I didn't hear Nick say that, but maybe I missed it. You mentioned asciistr in the snippet above, but most of what you have been writing about smuggled binary was using str... I hadn't grokked that you were now a full-fledged proponent of asciistr, and were now proposing to put your smuggled binary into asciistr.

...

You also don't seem to understand that bytes are conceptually pure mud. Anything that is pushed to bytes because you don't know what type it is (or because at the time the program is written, the type can't be known) is no longer subject to duck-typing.

If you are talking str, then bytes are mud. If you are talking bytes, then str is mud. I wouldn't think of "pushing something to bytes" (whatever that means) because I don't know what it is... I may manipulate bytes because I know what they are, and that is the most appropriate form for that piece of data for the present manipulations; if something is text, I want to transform the bytes to str if I need to manipulate it, parse it, or present it. If I don't know what something is, it is because it didn't meet my expectations of what it should be, and I want to present an error, which may include some representation (probably hex) of some of the bytes that cannot be understood. But if I'm "pushing to bytes", which I would interpret as creating a byte stream, then I know what I have, and I need to convert it to bytes either to store it in a file, or communicate it to another process. That's far from not knowing what it is.

...

So the question is "how is mud best handled?" Obviously, incorporating it in str with .decode('latin1') is inappropriate.

Glad to hear you say that; I thought that was what you were promoting, when you said, in an earlier message:

On 1/12/2014 4:08 PM, Stephen J. Turnbull wrote:

...

Glenn Linderman writes:

...
the proposals to embed binary in Unicode by abusing Latin-1 encoding.

Those aren't "proposals", they are currently feasible techniques in Python 3 for *some* use cases.

Back to this one, though.

...

However, if you use .decode('ascii') you have your choice of error handlers. If you use errors='strict' then no mud can get in. Use of any other error handler is obviously a "consenting adults" behavior; it should only be done when you expect that you can keep the muddy str from leaking into places where it might be passed to an I/O function. (Note that the internal processing of an application that never outputs such a str is completely conformant to the Unicode Standard. That's not a goal of Python, since surrogateescape is designed to be used on output too. But if the developer applies that standard to each *program component*, he's going to be in pretty good shape.)

If you use asciistr, then you're pretty much in complete control. The exception is operations that munge individual characters (case conversion). If you have a protocol with ASCII keywords but their case is specified, you'll need to define another type to remove the case-munging methods if you want that level of safety.

The above doesn't sound like a use case I care about, much. If I get a garbled file without an accurate definition of what it contains, then I probably want to stick it in the trash. The only "processing" that can be done is to pass on the garbage to someone else, and stink up their system, and that can be done purely as bytes.

...

If, as in your proposal, bytes are tagged with descriptions, you are effectively creating types on the fly. But if the program doesn't anticipate that, they're mud.

Interpreting a file format or wire protocol requires parsing and manipulating an incoming byte stream, and converting it to useful types in the program... if it can't be converted to useful types, then why bother parsing it? So the rest of my discussion was not talking about creating types on the fly, but on a systematic way of converting a well-specified byte stream (file format, or wire protocol) to a collection of useful types, in an organized manner, that might be verifiable, rather than with ad-hoc coding. And similarly in reverse... after manipulating the objects to perform useful transformations, possibly based on user input (that's what a program does), then to write them back out to a byte stream in modified form, in an organized manner, that might be verifiable, rather than with ad-hoc coding.

...

If the program doesn't anticipate all of them, those descriptions that are unhandled become mud, too. ISTM that the "syntax descriptor" feature is already present in Python, and it's called "class". So, IMHO, simply converting to an appropriate Python type on input is what should be done, but in any case, I don't see how adding a "syntax descriptor" attribute to bytes is going to improve the situation significantly.

Syntax descriptors would be a description of the substructures of a file format (think TIFF files) or wire protocol, and might allow parsing of binary files similarly to the way computer languages are parsed, producing errors when encountering mud. What you dismiss as "converting to an appropriate Python type on input" can be quite complex for complex file formats, but it is the process of converting to such a hierarchy of Python objects that was to be described by the syntax descriptors.

...

Note that such a class can postpone parsing for efficiency or lack of information reasons, and store the object as bytes until needed. But this is not the same as passing around naked bytes, because the class can ensure that bytes can't get out, only parsed objects.

Sure, it could. My proposal is suggesting that the distribution of bytes to objects in a hierarchy might be automated in the sense of parsing the binary format, so that instead of writing "a class" for the whole, that class would be pre-written, based on the syntax description of the file, and matching that with the syntax descriptions of the component types. It is really a topic for python ideas, to flesh it out further, but it seemed related, as a use case, a class that would live on the bytes processing boundary, producing other objects, some of which may be text strings, in an organized, probably hierarchical, collection of objects.
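As a very rough sketch of what "a pre-written class driven by a syntax description" might look like: everything here (DESCRIPTION, build_parser, the "text:N" mini-syntax) is hypothetical and only meant to show the shape of the idea, not an existing or proposed module:

```python
import struct
from collections import namedtuple

# Hypothetical description: (field name, struct code or "text:N").
# "text:N" means N bytes of strictly ASCII text.
DESCRIPTION = [
    ("version", ">H"),
    ("width", ">I"),
    ("height", ">I"),
    ("label", "text:4"),
]

def build_parser(description, type_name="Parsed"):
    record = namedtuple(type_name, [name for name, _ in description])

    def parse(blob):
        offset = 0
        values = []
        for name, kind in description:
            if kind.startswith("text:"):
                size = int(kind.split(":", 1)[1])
                chunk = blob[offset:offset + size]
                if len(chunk) != size:
                    raise ValueError("truncated field %r" % name)
                values.append(chunk.decode("ascii"))  # strict: mud is an error
            else:
                size = struct.calcsize(kind)
                (value,) = struct.unpack_from(kind, blob, offset)
                values.append(value)
            offset += size
        return record(*values)

    return parse

parse_image_header = build_parser(DESCRIPTION, "ImageHeader")
header = parse_image_header(struct.pack(">HII", 1, 640, 480) + b"demo")
print(header)
```

The parser hands back typed objects (ints and str), and non-ASCII bytes in the text field raise an error instead of passing through.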

Stephen J. Turnbull

10:57 p.m.

Ethan Furman writes:

...

Nothing else is ideal. I'll go that route if I have to. I understand that in the real world you go with what works, but in the development stage you fight for the ideal. :)

You're going to lose, because Python 3 chose a different ideal that conflicts with yours.

...

...
My reading of Nick's refusal is that %d takes a value which is semantically a number, converts it into a base-10 representation (which is semantically a *string*, not a sequence of bytes[1]) and then *encodes* that string into a series of bytes using the ASCII encoding.

That could be. And yet the bytes type already has several concessions to ASCII encoding.

No, Nick's point is that there's no encoding needed there at all, just a bunch of methods that handle numbers in the range 0-255. You can rationalize the particular choice of numbers by referring to the ASCII coded character set, and that's very useful to users. But knowledge of ASCII isn't necessary to specify these methods; they can be defined in an encoding/decoding-free way.

...

But bytes already acknowledges an ASCII bias.

True, but that bias is implemented without use of encoding or decoding. b'%d' % (123,) -> b'123' does require encoding, at the very least in the sense of type change and serialization.
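A minimal sketch of what an "encoding-free" definition could look like. This is not the code referred to elsewhere in the thread (it is not reproduced in this part of the archive), and upper_bytes is a made-up name. The function is specified purely on integers in range(256), with no .encode or .decode call:

```python
def upper_bytes(data):
    """Map the values 97..122 to 65..90; leave every other value alone."""
    return bytes(b - 32 if 97 <= b <= 122 else b for b in data)

sample = bytes([104, 101, 108, 108, 111, 32, 233])   # same as b"hello \xe9"
assert upper_bytes(sample) == sample.upper()
assert upper_bytes(sample) == bytes([72, 69, 76, 76, 79, 32, 233])
```

Whether choosing 97..122 already counts as "knowing ASCII" is exactly the disagreement in this subthread; the sketch only shows that no codec machinery is involved.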

Glenn Linderman

11:04 p.m.

On 1/12/2014 2:57 PM, Stephen J. Turnbull wrote:

...

...
But bytes already acknowledges an ASCII bias.

True, but that bias is implemented without use of encoding or decoding. b'%d' % (123,) -> b'123' does require encoding, at the very least in the sense of type change and serialization.

b'%d' all by itself, even before using the % operator, does require encoding, at the very least in the sense of type change and serialization.

Ethan Furman

11:46 p.m.

On 01/12/2014 02:57 PM, Stephen J. Turnbull wrote:

...

Ethan Furman writes:

...
Nothing else is ideal. I'll go that route if I have to. I understand that in the real world you go with what works, but in the development stage you fight for the ideal. :)

You're going to lose, because Python 3 chose a different ideal that conflicts with yours.

Entirely possible. I didn't set out to waste anyone's time, but I wasn't around for the initial discussions so don't know the reasons behind the result, only that the result is not an appropriate boundary type despite it being what is handed around at the boundaries.

...

...
...
My reading of Nick's refusal is that %d takes a value which is semantically a number, converts it into a base-10 representation (which is semantically a *string*, not a sequence of bytes[1]) and then *encodes* that string into a series of bytes using the ASCII encoding.

That could be. And yet the bytes type already has several concessions to ASCII encoding.

No, Nick's point is that there's no encoding needed there at all, just a bunch of methods that handle numbers in the range 0-255. You can rationalize the particular choice of numbers by referring to the ASCII coded character set, and that's very useful to users. But knowledge of ASCII isn't necessary to specify these methods; they can be defined in an encoding/decoding-free way.

How can you say that with a straight face? [1] Do you really think that .title, .isalnum, and .center (to name only a few) would work the same if the assumed encoding was EBCDIC? Do you think they would do the proper transformations, or return the proper result, if the bytes they were used on were encoded Japanese?

...

...
But bytes already acknowledges an ASCII bias.

True, but that bias is implemented without use of encoding or decoding. b'%d' % (123,) -> b'123' does require encoding, at the very least in the sense of type change and serialization.

You mean like changing a number into text does? Really, this is no different. -- ~Ethan~ [1] I'm sorry to be offensive, but I have no idea how to respond to that that acknowledges my complete astonishment that you would say such a thing.

Stephen J. Turnbull

4:27 a.m.

Ethan Furman writes:

...

On 01/12/2014 02:57 PM, Stephen J. Turnbull wrote:

...

...
No, Nick's point is that there's no encoding needed there at all, just a bunch of methods that handle numbers in the range 0-255. You can rationalize the particular choice of numbers by referring to the ASCII coded character set, and that's very useful to users. But knowledge of ASCII isn't necessary to specify these methods; they can be defined in an encoding/decoding-free way.

How can you say that with a straight face? [1]

Because I showed you code that does it. Did you see an .encode or a .decode in there?

...

Do you really think that .title, .isalnum, and .center (to name only a few) would work the same if the assumed encoding was EBCDIC?

Yes, yes, and yes. The numbers involved would change, and the test for finding letters would be different (and more complicated IIRC). The only one to worry about is .title, but neither ASCII nor EBCDIC has confused or multiple letter titlecase.

...

Do you think they would do the proper transformations, or return the proper result, if the bytes they were used on were encoded Japanese?

That depends on which Japanese encoding. It would work correctly on UTF-8 and on EUC-JP (packed), and not on any of the others. But you wouldn't consider that "ASCII-encoded text", would you?

...

...
...
But bytes already acknowledges an ASCII bias.

True, but that bias is implemented without use of encoding or decoding. b'%d' % (123,) -> b'123' does require encoding, at the very least in the sense of type change and serialization.

You mean like changing a number into text does? Really, this is no different.

Precisely. "There should be one- and preferably only one -way to do it." The one way uses text, so preferably bytes shouldn't.

Ethan Furman

5:22 a.m.

On 01/12/2014 08:27 PM, Stephen J. Turnbull wrote:

...

Ethan Furman writes:

...
On 01/12/2014 02:57 PM, Stephen J. Turnbull wrote:

I didn't trim enough to make my point clear. My apologies.

...

...
...
But knowledge of ASCII isn't necessary to specify these methods; they can be defined in an encoding/decoding-free way.

Perhaps you meant "use the methods". I meant "write the methods". You cannot write .upper for the bytes type without knowing what encoding has been used / is represented by those bytes. And quite frankly, if you use those methods on bytes without knowing (1) which encoding is represented by the bytes and (2) that the function you are calling is meant to work with that encoding... well, you deserve what you get.

...

...
How can you say that with a straight face?

Because I showed you code that does it. Did you see an .encode or a .decode in there?

No, I didn't. I saw numbers representing bytes representing text that has been encoded in the ASCII codec. If you didn't know it was ASCII, you couldn't write that function. Even though you don't have to call encode or decode if working directly with encoded bytes, you still have to know what the encoding is to do it correctly.

...

...
Do you really think that .title, .isalnum, and .center (to name only a few) would work the same if the assumed encoding was EBCDIC?

I phrased that poorly. If the byte stream was EBCDIC-encoded, and we called the current .method_which_assumes_ASCII on it, would we get the proper results?

...

The numbers involved would change, and the test for finding letters would be different (and more complicated IIRC).

And you have actually just made my point. If the bytes in question were EBCDIC-encoded, we could write a function for it because we know what it looks like as encoded bytes. Then we could be debating the merits of working directly with EBCDIC-encoded text instead of ASCII-encoded text. ;)
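The question can be checked directly. A small sketch, using cp037 as a stand-in for "EBCDIC" (it is one EBCDIC code page that ships with Python; the sample text is arbitrary):

```python
text = "abc/"
ebcdic = text.encode("cp037")

# The letters a, b, c are 0x81, 0x82, 0x83 in cp037, outside the ASCII a-z
# range, so bytes.upper() leaves them alone.
assert ebcdic[:3] == b"\x81\x82\x83"
assert ebcdic[:3].upper() == ebcdic[:3]

# "/" is 0x61 in cp037, which bytes.upper() treats as ASCII "a" and changes.
assert ebcdic[3:] == b"\x61"
assert ebcdic.upper().decode("cp037") != text.upper()

# Getting the right answer requires knowing the encoding:
assert text.upper().encode("cp037") == b"\xc1\xc2\xc3\x61"
```

So the existing ASCII-assuming method neither uppercases the letters nor leaves the punctuation intact on this input.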

...

"There should be one- and preferably only one -way to do it." The one way uses text, so preferably bytes shouldn't.

You forgot the word "obvious". -- ~Ethan~

INADA Naoki

7:21 p.m.

I want to add one more PoV: keeping the performance regression small, especially on Python 2. Programs that need byte formatting may be low level and heavily used by applications. Many programs use a single-source approach to support Python 3, and supporting Python 3 should not mean a large performance regression on Python 2.

In Python 2:

In [1]: def int_to_bytes(n):
   ...:     return unicode(n).encode('ascii')
   ...:
In [2]: %timeit int_to_bytes(42)
1000000 loops, best of 3: 691 ns per loop
In [3]: %timeit b'Content-Type: ' + int_to_bytes(42)
1000000 loops, best of 3: 737 ns per loop
In [4]: %timeit b'Content-Type: %d' % 42
10000000 loops, best of 3: 20.2 ns per loop
In [5]: %timeit (u'Content-Type: %d' % 42).encode('ascii')
1000000 loops, best of 3: 381 ns per loop

In Python 3:

In [1]: def int_to_bytes(n):
   ...:     return str(n).encode('ascii')
   ...:
In [2]: %timeit int_to_bytes(42)
1000000 loops, best of 3: 612 ns per loop
In [3]: %timeit b'Content-Type: ' + int_to_bytes(42)
1000000 loops, best of 3: 668 ns per loop
In [4]: %timeit ('Content-Type: %d' % 42).encode('ascii')
1000000 loops, best of 3: 326 ns per loop
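For anyone who wants to repeat the Python 3 measurements without IPython, a sketch using the standard timeit module follows. The labels and loop counts are arbitrary choices, and the numbers will differ by machine and Python version, so none are shown:

```python
import timeit

def int_to_bytes(n):
    return str(n).encode("ascii")

n = 42  # a variable, so the formatting is not done at compile time
candidates = {
    "concat + helper": lambda: b"Content-Type: " + int_to_bytes(n),
    "str % then encode": lambda: ("Content-Type: %d" % n).encode("ascii"),
}

for label, func in candidates.items():
    # Best of 3 runs of 100000 calls each, reported per call in nanoseconds.
    best = min(timeit.repeat(func, number=100000, repeat=3))
    print("%-20s %.0f ns per loop" % (label, best / 100000 * 1e9))
```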

...

I'm arguing from three PoVs:

...
1) 2 & 3 compatible code base 2) having the bytes type /be/ the boundary type 3) readable code

The only one of these that I can see being in any way an argument against

def int_to_bytes(n): return str(n).encode('ascii')

b'Content Length: ' + int_to_bytes(len(binary_data))

is (3), and that's largely subjective. Personally, I see very little difference between the above and %d-interpolation in terms of *readability*. Brevity, certainly %d wins. But that's not important on its own, and I'd argue that my version is more clear in terms of describing the intent (and would be even better if I wasn't rubbish at thinking of function names, or if this wasn't in isolation, and more application-focused functions were used).

...
It seems to me the core of Nick's refusal is the (and I agree!) rejection of bytes interpolation returning unicode -- but that's not what I'm asking for! I'm asking for it to return bytes, with the interpolated data (in the case of %d, %s, etc) being strictly-ASCII encoded.

My reading of Nick's refusal is that %d takes a value which is semantically a number, converts it into a base-10 representation (which is semantically a *string*, not a sequence of bytes[1]) and then *encodes* that string into a series of bytes using the ASCII encoding. That is *two* semantic transformations, and one (the ASCII encoding) is *implicit*. Specifically, it's implicit because (a) the normal reading of %d is "produce the base-10 representation of a number", and a base-10 representation is a *string*, and (b) because nowhere has ASCII been mentioned (why not UTF16? that would be entirely plausible for a wchar-based environment like Windows). And a core principle of the bytes/text separation in Python 3 is that encoding should never happen implicitly.
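The two transformations can be written out separately. In this sketch the choice of 123 and of the three codecs is only for illustration; cp037 stands in for an EBCDIC environment:

```python
number = 123
text = str(number)                       # step 1: number -> text "123"

ascii_bytes = text.encode("ascii")       # step 2: text -> bytes, encoding named
utf16_bytes = text.encode("utf-16-le")
ebcdic_bytes = text.encode("cp037")

assert ascii_bytes == b"\x31\x32\x33"
assert utf16_bytes == b"1\x002\x003\x00"
assert ebcdic_bytes == b"\xf1\xf2\xf3"
```

The same text yields three different byte sequences, which is why the second step depends on a choice that %d in a bytes context would make silently.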

By the way, I should point out that I would never have understood *any* of the ideas involved in this thread before Python 3 forced me to think about Unicode and the distinction between text and bytes. And yet, I now find myself, in my (non-Python) work environment, being the local expert whenever applications screw up text encodings. So I, for one, am very grateful for Python 3's clear separation of bytes and text. (And if I sometimes come across as over-dogmatic, I apologise - put it down to the enthusiasm of the recent convert :-))

Paul

[1] If you cannot see that there's no essential reason why the base-10 representation '123' should correspond to the bytes b'\x31\x32\x33' then you are probably not old enough to have started programming on EBCDIC-based computers :-)

-- INADA Naoki <songofacandy@gmail.com>

Greg Ewing

10:10 p.m.

Paul Moore wrote:

...

On 12 January 2014 18:26, Ethan Furman <ethan@stoneleaf.us> wrote:

...
I'm arguing from three PoVs: 1) 2 & 3 compatible code base 2) having the bytes type /be/ the boundary type 3) readable code

The only one of these that I can see being in any way an argument against

def int_to_bytes(n): return str(n).encode('ascii')

b'Content Length: ' + int_to_bytes(len(binary_data))

is (3),

I think the readability argument becomes a bit sharper when you consider more complex examples, e.g. if I have a tuple of 3 floats that I want to put into a PDF file, then

b"%f %f %f" % my_floats

is considerably clearer than

b" ".join((float_to_bytes(f) for f in my_floats))
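For reference, the longer spelling can be made runnable as follows. float_to_bytes is not defined anywhere in the thread, so the definition here is an assumption modelled on the int_to_bytes helper quoted above, and the sample values are arbitrary:

```python
def float_to_bytes(value):
    # Hypothetical helper: format as text, then encode explicitly.
    return ("%f" % value).encode("ascii")

my_floats = (1.0, 2.5, 0.125)

joined = b" ".join(float_to_bytes(f) for f in my_floats)
assert joined == b"1.000000 2.500000 0.125000"

# Formatting once as text and encoding at the end gives the same bytes.
assert ("%f %f %f" % my_floats).encode("ascii") == joined
```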

...

My reading of Nick's refusal is that %d takes a value which is semantically a number, converts it into a base-10 representation (which is semantically a *string*, not a sequence of bytes[1]) and then *encodes* that string into a series of bytes using the ASCII encoding. That is *two* semantic transformations, and one (the ASCII encoding) is *implicit*. Specifically, it's implicit because (a) the normal reading of %d is "produce the base-10 representation of a number", and a base-10 representation is a *string*, and (b) because nowhere has ASCII been mentioned

It's indicated (I won't say "implied", see below) by the fact that we're interpolating it into a bytes object rather than a string.

This is no more or less implicit than the fact that when we write b"ABC" then we're saying that those characters are to be encoded in ASCII, and not EBCDIC or UTF-16 or...

BTW, there's a problem with bandying around the words "implicit" and "explicit", because they depend on your frame of reference. For example, one person might say that the fact that b"%s" encodes into ASCII is implicit, because ASCII isn't written down in the code anywhere. But another person might say it's explicit, because the manual explicitly says that stuff interpolated into a bytes object is encoded as ASCII.

So arguments of the form "X is bad because it's not explicit" are prone to getting people talking past each other.

-- Greg

Paul Moore

10:29 p.m.

On 12 January 2014 22:10, Greg Ewing <greg.ewing@canterbury.ac.nz> wrote:

...

I think the readability argument becomes a bit sharper when you consider more complex examples, e.g. if I have a tuple of 3 floats that I want to put into a PDF file, then

b"%f %f %f" % my_floats

is considerably clearer than

b" ".join((float_to_bytes(f) for f in my_floats))

Hmm, I'm not sure I'd agree. I'd quote "explicit is better than implicit", but given comments below, that would be a mistake :-) Let's just leave it that I'd probably wrap the whole thing in a float_list(floats) function in my application, and not *care* how it was implemented. One thing that this does bring up, though, is that all the talk is about %-formatting. Do the people who are arguing for numeric formatting have views on what (if any) features will be included in bytes.format()? It seems to me that recasting many of the discussions using format() makes it much less "obvious" that adding the features to bytes formatting is a reasonable thing to do. I won't give specific examples, because I would be putting words into people's mouths. But I *would* say that any genuine proposal for numeric formatting in bytes should be cast as a formal PEP and explicitly document both % and format() behaviours.

...

It's indicated (I won't say "implied", see below) by the fact that we're interpolating it into a bytes object rather than a string.

This is no more or less implicit than the fact that when we write

b"ABC"

then we're saying that those characters are to be encoded in ASCII, and not EBCDIC or UTF-16 or...

That's a fair point, and one I had not taken into consideration.

...

BTW, there's a problem with bandying around the words "implicit" and "explicit", because they depend on your frame of reference. For example, one person might say that the fact that b"%s" encodes into ASCII is implicit, because ASCII isn't written down in the code anywhere. But another person might say it's explicit, because the manual explicitly says that stuff interpolated into a bytes object is encoded as ASCII.

In my defense, I would say that I was trying to clarify Nick's objections, and it's entirely possible I misrepresented this aspect of them. Personally, I agree that it's not as black and white as simply saying "numeric formatting is wrong", but I think that the fact that %d et al represent a "double transformation" (from number to string representation to encoded bytes) is the differentiating factor here. Proposals that do nothing but interpolation are essentially convenience wrappers for various combinations of concatenation and join. Adding "double transformation" formatting codes is a step change, and needs to be explicitly acknowledged and justified. (If you *do* manage to justify such codes, there's a secondary question of precisely what codes should be supported, but we can start by getting agreement that the *class* of codes is allowed). PEP 460 explicitly excludes anything but pure interpolation.

...

So arguments of the form "X is bad because it's not explicit" are prone to getting people talking past each other.

Fair point. I hope my above paragraph clarifies my position somewhat better. Paul

Emile van Sebille

7:30 p.m.

On 01/12/2014 09:26 AM, Paul Moore wrote:

...

Can you give an example of code that is *nearly* acceptable to you, which works in Python 2 and 3 today, and explain what improvements you would like to see to it in order to use it instead of waiting for a core change?

I'm not a developer, but I'm trying to understand how in v3 I accomplish what in v2 is easy:

len(open('chars','wb').write("".join(map (chr,range(256)))).read())

What's the v3 equivalent?

Emile

Paul Moore

7:46 p.m.

On 12 January 2014 19:30, Emile van Sebille <emile@fenx.com> wrote:

...

len(open('chars','wb').write("".join(map (chr,range(256)))).read())

Python 2:

...

...
...
len(open('chars','wb').write("".join(map (chr,range(256)))).read())
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'NoneType' object has no attribute 'read'

I could be facetious and say "None.read", but more seriously, what are you trying to say here?

How do I write a 256-byte file with one byte for each value? bytes(range(256)) gives you the bytestring you want. I simply don't see your point here.

...

...
And yet I still don't follow what you *want*. Unless it's that b'%d' % (12,) must work and give b'12', and nothing else is acceptable.

Nothing else is ideal. I'll go that route if I have to. I understand that in the real world you go with what works, but in the development stage you fight for the ideal. :)

OK, but can you fight by giving arguments as to why it's better than the plethora of alternatives that have been suggested? Or counter-arguments to the objections that have been raised to the proposal? Paul

Emile van Sebille

7:47 p.m.

On 01/12/2014 11:30 AM, Emile van Sebille wrote:

...

On 01/12/2014 09:26 AM, Paul Moore wrote:

...
Can you give an example of code that is *nearly* acceptable to you, which works in Python 2 and 3 today, and explain what improvements you would like to see to it in order to use it instead of waiting for a core change?

I'm not a developer, but I'm trying to understand how in v3 I accomplish what in v2 is easy:

len(open('chars','wb').write("".join(map (chr,range(256)))).read())

my bad :

...

...
...
open('chars','wb').write("".join(map (chr,range(256))))
len(open('chars','rb').read())
256

...

What's the v3 equivalent?

Emile

Georg Brandl

8:05 p.m.

Am 12.01.2014 20:30, schrieb Emile van Sebille:

...

On 01/12/2014 09:26 AM, Paul Moore wrote:

...
Can you give an example of code that is *nearly* acceptable to you, which works in Python 2 and 3 today, and explain what improvements you would like to see to it in order to use it instead of waiting for a core change?

I'm not a developer, but I'm trying to understand how in v3 I accomplish what in v2 is easy:

len(open('chars','wb').write("".join(map (chr,range(256)))).read())

What's the v3 equivalent?

That's actually very easy and shows a strength of the bytes type, since there's no text involved: open('chars', 'wb').write(bytes(range(256))) Georg
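Georg's one-liner can be expanded into a round trip that mirrors Emile's corrected Python 2 session. This is only a sketch: the temporary directory and the variable names are assumptions added for illustration, not part of the thread.

```python
import os
import tempfile

payload = bytes(range(256))  # one byte for each value 0..255, no text involved

# Use a temporary directory so the example leaves no 'chars' file behind.
with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "chars")
    with open(path, "wb") as f:
        written = f.write(payload)  # write() returns the number of bytes written
    with open(path, "rb") as f:
        read_back = f.read()

print(written, len(read_back))
```

In Python 3, write() returns a byte count rather than None, which is also why chaining .read() onto it cannot work in either version.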

Greg Ewing

10:12 p.m.

Ethan Furman wrote:

...

Your asciistr, which sometimes returns bytes and sometimes returns text, is absolutely *not* what we want.

The kind of third-party thing that *might* fill the bill would be a *function*: bytesformat(b"Content-Length: %d", length) that implements all the %-specifiers we're asking for. -- Greg
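A rough sketch of what such a function might look like follows. The name bytesformat comes from Greg's message; everything else (the restriction to int and float arguments, the ASCII round trip through str) is an assumption made for illustration and not an agreed design.

```python
def bytesformat(fmt, *args):
    """Hypothetical helper: %-format numbers into an ASCII bytes template.

    Only int and float arguments are accepted, so no text or bytes value is
    ever converted implicitly.
    """
    for arg in args:
        if not isinstance(arg, (int, float)):
            raise TypeError("bytesformat only accepts int and float arguments")
    # The template must itself be pure ASCII; decode() raises otherwise.
    text = fmt.decode("ascii") % args
    return text.encode("ascii")


length = 71
header = bytesformat(b"Content-Length: %d", length)
padded = bytesformat(b"%05d|%1.3f", 42, 2.5)
```

Because the work is delegated to str formatting, padding and precision come for free, at the cost of a decode and an encode on every call.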

Mark Lawrence

10:32 p.m.

On 12/01/2014 17:03, Ethan Furman wrote:

...

On 01/12/2014 08:21 AM, Ethan Furman wrote:

...
On 01/12/2014 08:09 AM, Nick Coghlan wrote:

...
On 13 Jan 2014 01:22, "Kristján Valur Jónsson" wrote:

...
Imho, this is not equivalent to re-introducing automatic type conversion between binary/unicode, it is adding a specific convenience function for explicitly asking for ASCII encoding.

It is not explicit, it is implicit - whether or not the resulting string assumes ASCII compatibility or not depends on whether you pass a binary value (no assumption) or a string value (assumes ASCII compatibility).

Nick, I don't understand what you are saying here. Are you saying that the result of b'%s' % var may be either a bytes object or a str object? Because that would be wrong -- it would always be a bytes object.

Okay, I just went and took a closer look at the asciistr type [1]. For what it's worth I don't think this is Antoine's understanding of what we [2] are asking for, nor is it what we are asking for (I'm sure Antoine will correct me if I'm wrong. ;)

We know full well the difference between unicode and bytes, and we know full well that numbers and much of the text we need has an ASCII (bytes!) representation. When we do a b'Content Length: %d' % len(binary_data) we are expecting to get back a bytes object, /not/ a unicode object.

Your asciistr, which sometimes returns bytes and sometimes returns text, is absolutely *not* what we want.

I've just tried asciistr using your test code (having corrected the typo, it's assertIsInstance, not assertIsinstance :) and it looks like a very good starting point. Have you, or anyone else for that matter, actually tried asciistr out?

...

-- ~Ethan~

[1] https://github.com/jeamland/asciicompat [2] the dbf and pdf folks, at least

-- My fellow Pythonistas, ask not what our language can do for you, ask what you can do for our language. Mark Lawrence

Ethan Furman

5:33 a.m.

On 01/12/2014 02:32 PM, Mark Lawrence wrote:

...

I've just tried asciistr using your test code (having corrected the typo, it's assertIsInstance, not assertIsinstance :) and it looks like a very good starting point. Have you, or anyone else for that matter, actually tried asciistr out?

Ah, thanks for that fix, and thanks for trying it out. Um, how exactly did you try it out? This is what I did:

bytestring_test.py
==================
from asciicompat import asciistr as bytestring
...
==================

ethan@media:~/source/bytestring$ python3.4 bytestring_test.py
.F.FFF
======================================================================
FAIL: test_bytestring_will_accept_codepoints_in_latin1 (__main__.TestByteString)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "bytestring_test.py", line 30, in test_bytestring_will_accept_codepoints_in_latin1
    self.assertEqual(bytestring(char), bytes([ch]))
AssertionError: '\x00' != b'\x00'

======================================================================
FAIL: test_from_str_plus_str (__main__.TestByteString)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "bytestring_test.py", line 9, in test_from_str_plus_str
    self.assertEqual(result, b'hello world')
AssertionError: 'hello world' != b'hello world'

======================================================================
FAIL: test_interpolation (__main__.TestByteString)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "bytestring_test.py", line 33, in test_interpolation
    self.assertEqual(bytestring('Content-Length: %d') % 71, b'Content-Length: 71')
AssertionError: 'Content-Length: 71' != b'Content-Length: 71'

======================================================================
FAIL: test_str_plus_from_str (__main__.TestByteString)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "bytestring_test.py", line 14, in test_str_plus_from_str
    result = 'hello' + bytestring('world')
AssertionError: TypeError not raised

----------------------------------------------------------------------
Ran 6 tests in 0.002s

FAILED (failures=4)

Four out of six failed is not a good beginning. :(

-- ~Ethan~

Kristján Valur Jónsson

4:52 p.m.

Now you're just splitting hairs, Nick. An explicit operator, %s, _defined_ to be "encode a string object using strict ascii", how is that any less explicit than the .encode('ascii', 'strict') spelt out in full? The language is full of constructs that are shorthands for others, more lengthy but equivalent things.

I mean, basically what I am suggesting is that in addition to %b with

def helper(o):
    return str(o).encode('ascii', 'strict')

b'foo%bbar'%(helper(myobj), )

you have

b'foo%sbar'%(myobj, )

There is no "data driven change in assumptions." Just an interpolation operator with a clearly defined meaning.

I don't think anyone is trying to compromise the text model. All people are asking for is that the _boundary_ is made a little easier to deal with.

K

________________________________
From: Nick Coghlan [ncoghlan@gmail.com]
Sent: Sunday, January 12, 2014 16:09
To: Kristján Valur Jónsson
Cc: python-dev@python.org; Georg Brandl
Subject: Re: [Python-Dev] PEP 460: allowing %d and %f and mojibake

It is not explicit, it is implicit - whether or not the resulting string assumes ASCII compatibility or not depends on whether you pass a binary value (no assumption) or a string value (assumes ASCII compatibility). This kind of data driven change in assumptions about correctness is utterly unacceptable in the core text and binary types in Python 3.

Paul Moore

5:04 p.m.

On 12 January 2014 16:52, Kristján Valur Jónsson <kristjan@ccpgames.com> wrote:

...

I mean, basically what I am suggesting is that in addition to %b with

def helper(o): return str(o).encode('ascii', 'strict')

b'foo%bbar'%(helper(myobj), )

you have

b'foo%sbar'%(myobj, )

But that's not what the current PEP says. It uses %s for interpolating bytes values. It looks like you're saying that b'abc %s' % (b'def') will *not* produce b'abc def', but rather will produce b'abc b\'def\'' (because str(b'def') is "b'def'").

If that's what you're saying, then fine, but it's a different PEP and I for one am -1 specifically because of the behaviour I show above.

Paul
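The behaviour Paul objects to can be checked in any Python 3 interpreter without any bytes formatting support. The snippet below is a small illustration; the variable names are assumptions, and the "proposed" result is computed by hand from Kristján's definition rather than by a real %s operator.

```python
value = b"def"

# str() of a bytes object in Python 3 is its repr, including the b prefix
# and the quotes (this emits a BytesWarning when Python runs with -b).
as_text = str(value)

# What Kristjan's definition of %s would insert for a bytes argument:
proposed = b"abc " + as_text.encode("ascii", "strict")

# What PEP 460's pure interpolation would produce instead:
interpolated = b"abc " + value
```

The two results differ by the embedded b'...' wrapper, which is the whole of Paul's -1.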

Kristján Valur Jónsson

9:37 p.m.

Right. I'm saying, let's support two interpolators only:

%b interpolates a bytes object (or one supporting the charbuffer interface) into a bytes object.
%s interpolates a str object by first converting to a bytes object using strict ascii conversion.

This makes it very explicit what we are trying to do. I think that using %s to interpolate a bytes object like the current PEP does is a bad idea, because %s already means 'str' elsewhere in the language, both in 2.7 and 3.x.

As for the case you mention:

b"abc %s" % (b"def",) -> b"abc def"
b"abc %s" % (b"def",) -> b"abc b\"def\"" # because str(bytesobject) == repr(bytesobject)

This is perfectly fine, imho. Let's not overload %s to mean "bytes" in format strings if those format strings are in fact not strings but bytes. That way madness lies.

K

________________________________________
From: Paul Moore [p.f.moore@gmail.com]
Sent: Sunday, January 12, 2014 17:04
To: Kristján Valur Jónsson
Cc: Nick Coghlan; Georg Brandl; python-dev@python.org
Subject: Re: [Python-Dev] PEP 460: allowing %d and %f and mojibake

On 12 January 2014 16:52, Kristján Valur Jónsson <kristjan@ccpgames.com> wrote:

But that's not what the current PEP says. It uses %s for interpolating bytes values. It looks like you're saying that b'abc %s' % (b'def') will *not* produce b'abc def', but rather will produce b'abc b\'def\'' (because str(b'def') is "b'def'").

Ethan Furman

10:01 p.m.

On 01/12/2014 01:37 PM, Kristján Valur Jónsson wrote:

...

Right. I'm saying, let's support two interpolators only: %b interpolates a bytes object (or one supporting the charbuffer interface) into a bytes object. %s interpolates a str object by first converting to a bytes object using strict ascii conversion.

This makes it very explicit what we are trying to do. I think that using %s to interpolate a bytes object like the current PEP does is a bad idea, because %s already means 'str' elsewhere in the language, both in 2.7 and 3.x

As for the case you mention:

b"abc %s" % (b"def",) -> b"abc def"
b"abc %s" % (b"def",) -> b"abc b\"def\"" # because str(bytesobject) == repr(bytesobject)

This is perfectly fine, imho. Let's not overload %s to mean "bytes" in format strings if those format strings are in fact not strings but bytes. That way madness lies.

You didn't say, but I'm guessing you mean the second one is fine? If 2/3 compatible code is the goal, the first should be what we get.

-- ~Ethan~

Mark Shannon

5:06 p.m.

On 12/01/14 16:52, Kristján Valur Jónsson wrote:

...

Now you're just splitting hairs, Nick.

An explicit operator, %s, _defined_ to be "encode a string object using strict ascii",

I don't like this because '%s' reads to me as "insert *string* here". I think '%a' which reads as "encode as ASCII and insert here" would be better.

...

how is that any less explicit than the .encode('ascii', 'strict') spelt out in full? The language is full of constructs that are shorthands for others, more lengthy but equivalent things.

I mean, basically what I am suggesting is that in addition to %b with

def helper(o):

return str(o).encode('ascii', 'strict')

b'foo*%b*bar'%(helper(myobj), )

you have

b'foo*%s*bar'%(myobj, )

There is no "data driven change in assumptions." Just an interpolation operator with a clearly defined meaning.

I don't think anyone is trying to compromise the text model. All people are asking for is that the _boundary_ is made a little easier to deal with.

K

------------------------------------------------------------------------ *From:* Nick Coghlan [ncoghlan@gmail.com] *Sent:* Sunday, January 12, 2014 16:09 *To:* Kristján Valur Jónsson *Cc:* python-dev@python.org; Georg Brandl *Subject:* Re: [Python-Dev] PEP 460: allowing %d and %f and mojibake

It is not explicit, it is implicit - whether or not the resulting string assumes ASCII compatibility or not depends on whether you pass a binary value (no assumption) or a string value (assumes ASCII compatibility). This kind of data driven change in assumptions about correctness is utterly unacceptable in the core text and binary types in Python 3.

_______________________________________________ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/mark%40hotpy.org

Mark Lawrence

5:23 p.m.

On 12/01/2014 17:06, Mark Shannon wrote:

...

On 12/01/14 16:52, Kristján Valur Jónsson wrote:

...
Now you're just splitting hairs, Nick.

An explicit operator, %s, _defined_ to be "encode a string object using strict ascii",

I don't like this because '%s' reads to me as "insert *string* here". I think '%a' which reads as "encode as ASCII and insert here" would be better.

I entirely agree. This would also parallel the conversion flags given here http://docs.python.org/3/library/string.html#format-string-syntax, I quote "Three conversion flags are currently supported: '!s' which calls str() on the value, '!r' which calls repr() and '!a' which calls ascii()". -- My fellow Pythonistas, ask not what our language can do for you, ask what you can do for our language. Mark Lawrence

Greg Ewing

10:12 p.m.

Mark Lawrence wrote:

...

I entirely agree. This would also parallel the conversion flags given here http://docs.python.org/3/library/string.html#format-string-syntax, I quote "Three conversion flags are currently supported: '!s' which calls str() on the value, '!r' which calls repr() and '!a' which calls ascii()".

Except that ascii() does something rather different -- it's a variation on repr() rather than str(), and it doesn't imply any encoding operation. I think this parallel would be more confusing than helpful. -- Greg
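Greg's distinction is easy to see side by side. The snippet below is a small illustration with made-up variable names: ascii() yields a quoted, escaped str, whereas encoding to ASCII yields bytes or fails.

```python
word = "caf\u00e9"  # 'cafe' with an acute accent on the last letter

# ascii() is a repr() variant: it returns a str, wraps it in quotes, and
# escapes non-ASCII characters instead of rejecting them.
described = ascii(word)
plain = ascii("abc")

# Strict ASCII encoding returns bytes and refuses non-ASCII input.
encoded = "abc".encode("ascii", "strict")
try:
    word.encode("ascii", "strict")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True
```

So a %a that meant "encode as ASCII and insert" would behave quite differently from the existing '!a' conversion flag.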

Kristján Valur Jónsson

9:37 p.m.

+1, even better. ________________________________________ From: Python-Dev [python-dev-bounces+kristjan=ccpgames.com@python.org] on behalf of Mark Shannon [mark@hotpy.org] Sent: Sunday, January 12, 2014 17:06 To: python-dev@python.org Subject: Re: [Python-Dev] PEP 460: allowing %d and %f and mojibake On 12/01/14 16:52, Kristján Valur Jónsson wrote:

...

Now you're just splitting hairs, Nick.

An explicit operator, %s, _defined_ to be "encode a string object using strict ascii",

I don't like this because '%s' reads to me as "insert *string* here". I think '%a' which reads as "encode as ASCII and insert here" would be better.

Greg Ewing

9:25 p.m.

Nick Coghlan wrote:

...

On 13 Jan 2014 01:22, "Kristján Valur Jónsson" <kristjan@ccpgames.com <mailto:kristjan@ccpgames.com>> wrote:

...
Well, my suggestion would that we _should_ make it work, by having the %s format specifyer on bytes objects mean: str(arg).encode('ascii', 'strict')

It is not explicit, it is implicit - whether or not the resulting string assumes ASCII compatibility or not depends on whether you pass a binary value (no assumption) or a string value (assumes ASCII compatibility).

How do you make that out? As far as I can see, Kristjan's proposal will *always* call str() on the argument of a %s format, regardless of its type. The *result* of that str() is then *required* (not assumed) to be encodable as ascii. I don't see any type-dependent changes in behaviour here. Interpolating a bytes object as-is, without a conversion to text, should be done by a different format specifier, such as %b. All text/bytes conversions are then explicit: if you write %s, then you're encoding something as ascii, but if you write %b, you're just inserting something that's already binary. -- Greg
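The two specifiers as Greg reads them can be emulated with a pair of ordinary functions. This is a sketch of the proposed semantics only: the names interpolate_s and interpolate_b are hypothetical, and no released bytes type is claimed to behave this way.

```python
def interpolate_s(arg):
    """Hypothetical %s: always call str(), then require the result to be ASCII."""
    return str(arg).encode("ascii", "strict")


def interpolate_b(arg):
    """Hypothetical %b: insert data that is already binary, with no conversion."""
    # memoryview() raises TypeError unless arg supports the buffer protocol.
    return bytes(memoryview(arg))


number = interpolate_s(71)            # numbers go through str() like anything else
text = interpolate_s("abc")           # ASCII text is encoded
raw = interpolate_b(bytearray(b"xy")) # buffer objects are copied unchanged
```

Neither function inspects the type of its argument to pick a strategy, which is Greg's point: the choice of conversion is made by the specifier the programmer writes, not by the data.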

Kristján Valur Jónsson

7:40 p.m.

Hi there. How about a compromise? Personally, I think adding the full complement of integer/float formatting to bytes is a bit over the top. How about just supporting two format specifiers?

%b : interpolate a bytes object. If it doesn't have the buffer interface, error.
%s : interpolate a str object, encoded to ASCII using 'strict' conversion.

This should cover the most common use cases. In particular, you could do this:

Headers.append(b'Content-Length: %s'%(len(data),))

And then subsequently:

Packet = b'%b%b'%(b"".join(headers), data)

For more complex formatting, you delegate to the more capable string class, but benefit from automatic ASCII conversion:

Data = b"percentage = %s" % ("%4.2f" % (value,))

I think interpolating bytes objects is very important. And support for automatic ASCII conversion in the process will help us cover all of the numeric use cases.

K

-----Original Message-----
From: Python-Dev [mailto:python-dev-bounces+kristjan=ccpgames.com@python.org] On Behalf Of Victor Stinner
Sent: 11. janúar 2014 17:42
To: Python Dev
Subject: [Python-Dev] PEP 460: allowing %d and %f and mojibake

Hi, I'm in favor of adding support of formatting integer and floatting point numbers in the PEP 460: %d, %u, %o, %x, %f with padding and precision (%10d, %010d, %1.5f) and sign (%-i, %+i) but without alternate format ("{:#x}"). %s would also accept int and float for convenience.
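For comparison, the same workflow can be written with explicit encoding and no bytes formatting at all, which runs on any Python 3. This is a sketch: the payload, the CRLF separators, and the variable names are assumptions added for illustration.

```python
data = b"\x00\x01binary payload"
value = 12.3456

headers = []
# Format the number as text, then encode it explicitly as strict ASCII.
headers.append(("Content-Length: %s\r\n" % (len(data),)).encode("ascii", "strict"))

# Pure bytes operations: join the header lines and append the binary body.
packet = b"".join(headers) + b"\r\n" + data

# Delegate the float formatting to str, then cross the boundary once.
line = b"percentage = " + ("%4.2f" % (value,)).encode("ascii", "strict")
```

The proposal would replace each explicit .encode('ascii', 'strict') call here with a %s inside a bytes template; the resulting bytes would be the same.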

Serhiy Storchaka

9:01 p.m.

11.01.14 21:40, Kristján Valur Jónsson написав(ла):

...

How about a compromise? Personally, I think adding the full complement of integer/float formatting to bytes is a bit over the top. How about just supporting two format specifiers? %b : interpolate a bytes object. If it doesn't have the buffer interface, error. %s : interpolate a str object, encoded to ASCII using 'strict' conversion.

%b is not supported in Python 2.7. And compatibility with Python 2.7 is the only purpose of this feature.

Georg Brandl

9:10 p.m.

Am 11.01.2014 22:01, schrieb Serhiy Storchaka:

...

11.01.14 21:40, Kristján Valur Jónsson написав(ла):

...
How about a compromise? Personally, I think adding the full complement of integer/float formatting to bytes is a bit over the top. How about just supporting two format specifiers? %b : interpolate a bytes object. If it doesn't have the buffer interface, error. %s : interpolate a str object, encoded to ASCII using 'strict' conversion.

%b is not supported in Python 2.7. And compatibility with Python 2.7 is only the purpose of this feature.

Not "only", but it is certainly an important one. Georg

Kristján Valur Jónsson

2:11 a.m.

No, I don't think it is. The purpose is to make it easier to work with bytes objects. There can be no python 2 compatibility when it comes to bytes/unicode conversion. ________________________________________ From: Python-Dev [python-dev-bounces+kristjan=ccpgames.com@python.org] on behalf of Serhiy Storchaka [storchaka@gmail.com] Sent: Saturday, January 11, 2014 21:01 To: python-dev@python.org Subject: Re: [Python-Dev] PEP 460: allowing %d and %f and mojibake 11.01.14 21:40, Kristján Valur Jónsson написав(ла):

...

How about a compromise? Personally, I think adding the full complement of integer/float formatting to bytes is a bit over the top. How about just supporting two format specifiers? %b : interpolate a bytes object. If it doesn't have the buffer interface, error. %s : interpolate a str object, encoded to ASCII using 'strict' conversion.

%b is not supported in Python 2.7. And compatibility with Python 2.7 is only the purpose of this feature.

Lennart Regebro

3:48 p.m.

On Sat, Jan 11, 2014 at 8:40 PM, Kristján Valur Jónsson <kristjan@ccpgames.com> wrote:

...

Hi there. How about a compromise? Personally, I think adding the full complement of integer/float formatting to bytes is a bit over the top. How about just supporting two format specifiers? %b : interpolate a bytes object. If it doesn't have the buffer interface, error. %s : interpolate a str object, encoded to ASCII using 'strict' conversion.

This should cover the most common use cases. In particular, you could do this:

Headers.append(b'Content-Length: %s'%(len(data),))

And then subsequently:

Packet = b'%b%b'%(b"".join(headers), data)

For more complex formatting, you delegate to the more capable string class, but benefit from automatic ASCII conversion:

Data = b"percentage = %s" % ("%4.2f" % (value,))

Although nice and clean as principle, I think it makes for somewhat messy code. I'm in favor of having float and integer specifiers as well. I'm also for including %s, because it makes moving from Python 2 easier. But it should definitely error out if you try to feed it a non-ascii string. //Lennart

Nick Coghlan

3:09 a.m.

On 12 Jan 2014 03:44, "Victor Stinner" <victor.stinner@gmail.com> wrote:

...

Hi,

I'm in favor of adding support of formatting integer and floatting point numbers in the PEP 460: %d, %u, %o, %x, %f with padding and precision (%10d, %010d, %1.5f) and sign (%-i, %+i) but without alternate format ("{:#x}"). %s would also accept int and float for convenience.

int and float subclasses would not be handled differently, their __str__ and __format__ would be ignored.

Other int-like and float-like types (ex: defining __int__ or __index__) are not supported. Explicit cast would be required.

asciistr will support the *full* text formatting API, so I don't see any reason to add this complexity to the core bytes type. However, I like the basic binary interpolation feature proposed by the current version of the PEP - it's a nice convenience method that doesn't compromise the text model by introducing implicit serialisation of other types (whether text or numbers).

For Python 2 folks trying to grok where the "bright line" is in terms of the Python 3 text model: if your proposal includes *any* kind of implicit serialisation of non binary data to binary, it is going to be rejected as an addition to the core bytes type. If it avoids crossing that line (as the buffer-API-only version of PEP 460 does), then we can talk.

Folks that want implicit serialisation (and I agree it has its uses) should go help Benno get asciistr up to speed.

Cheers, Nick.

...

For %s, the choice between string and number is made using "(PyLong_Check() || PyFloat_Check())".

If you agree, I will modify the PEP. If Antoine disagree, I will fork the PEP 460 ;-)

---

%s should not support precision (ex: %.100s), use Unicode for that.

---

The PEP 460 should not reintroduce bytes+unicode, implicit decoding or implement encoding.

b'x=%s' % 10 is well defined, it's pure bytes. If you consider that bytes should not contain text, why does the bytes type have methods like isalpha() or upper()? And why binary files have a readline() method? A "line" doesn't mean anything in pure bytes.

It's an example of "practicality beats purity". Python 3 should not enforce Unicode if the developers *chose* to use bytes to handle mixed binary/text protocols like HTTP.

But I'm against of adding "%r" and "%a" because they use Unicode and would require an implicit encoding. type(ascii(obj)) is str, not bytes. If you really want to use repr() and ascii(), encode the result explicitly.

Victor

Ethan Furman

6:40 p.m.

On 01/11/2014 07:09 PM, Nick Coghlan wrote:

...

Folks that want implicit serialisation (and I agree it has its uses) should go help Benno get asciistr up to speed.

asciistr is not what I'm looking for in the way of a boundary type. I have created a 'bytestring'[1] repository which has the tests for what I am looking for. Hopefully that will get rid of some confusion, at least. -- ~Ethan~ [1] https://bitbucket.org/stoneleaf/bytestring

Scott Dial

3:26 a.m.

On 2014-01-11 22:09, Nick Coghlan wrote:

...

For Python 2 folks trying to grok where the "bright line" is in terms of the Python 3 text model: if your proposal includes *any* kind of implicit serialisation of non binary data to binary, it is going to be rejected as an addition to the core bytes type. If it avoids crossing that line (as the buffer-API-only version of PEP 460 does), then we can talk.

To take such a hard-line stance, I would expect you to author a PEP to strip the ASCII conveniences from the bytes and bytearray types. Otherwise, I find it a bit schizophrenic to argue that methods like lower, upper, title, and etc. don't implicitly assume encoding:

...

...
...
a = "scott".encode('utf-16')
b = a.title()
c = b.decode('utf-16')
'SCOTT'

So, clearly title() not only depends on the bytes characters encoded in a superset of ASCII characters, it depends on the bytes being a sequence of ASCII characters, which looks an awful lot like an operation on an implicit encoded string.

...

...
...
b"文字化け"
  File "<stdin>", line 1
SyntaxError: bytes can only contain ASCII literal characters.

There is an implicit serialization right there. My terminal is utf8 (or even if my source encoding is utf8), so why would that not be:

b'\xe6\x96\x87\xe5\xad\x97\xe5\x8c\x96\xe3\x81\x91'

I sympathize with Ethan that the bytes and bytearray types already seem to concede that bytes is the type you want to use for 7-bit ASCII manipulations. If that is not what we want, then we are not doing a good job communicating that to developers with the API. At the onset, the bytes literal itself seems to be an attractive nuisance as it gives a nod to using bytes for ASCII character sequences (a.k.a ASCII strings).

Regards, -Scott

-- Scott Dial scott@scottdial.com
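Both of Scott's observations can be reproduced in a current Python 3. The sketch below uses illustrative variable names and writes the four characters of his literal as Unicode escapes; the explicit encode() call is the spelling Python 3 requires in place of a non-ASCII bytes literal.

```python
# title() on bytes applies ASCII case rules to raw bytes. In UTF-16 every
# ASCII letter is separated by a zero byte, so each letter starts a "word".
encoded = "scott".encode("utf-16")
titled = encoded.title()
decoded = titled.decode("utf-16")

# A non-ASCII bytes literal is a SyntaxError, so the encoding has to be
# named explicitly to obtain the bytes Scott expected.
text = "\u6587\u5b57\u5316\u3051"
utf8_bytes = text.encode("utf-8")
```

The first half shows an ASCII assumption baked into a bytes method; the second shows the one place where the bytes type refuses to guess an encoding.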

Guido van Rossum

3:49 a.m.

Those still arguing on this thread might want to look at the thread "PEP 460 reboot". -- --Guido van Rossum (python.org/~guido)

Jim J. Jewett

10:49 p.m.

New subject: PEP 460 -- adding explicit assumptions

As best I can tell, some people (apparently including Guido and PEP author Antoine) are taking some assumptions almost for granted, while other people (including me, before Nick's messages) were not assuming them at all. Since these assumptions (or, possibly, rejections of them?) are likely to decide the outcome, the assumptions should be explicit in the PEP.

(1) The bytes-related classes do include methods that are only useful when the already-contained data is encoded ASCII. They do not (and will not) include any operations that *require* an encoding assumption. This implies that no non-bytes data can be added without an explicit encoding.

(1a) Not even by assuming ASCII with strict error handling.

(1b) Not even for numbers, where ASCII/strict really is sufficient.

Note that this doesn't rule out a solution where objects (or maybe just numbers and ASCII-kind text) provide their own encoding to bytes -- but that has to be done by the objects themselves, not by the bytes container or by the interpreter.

(2) Most python programmers are still in the future. So an API that confuses people who are still learning about Unicode and the text model is bad -- even if it would work fine for those who do already understand it.

-jJ

-- If there are still threading problems with my replies, please email me with details, so that I can try to resolve them. -jJ

107 comments

22 participants

participants (22)

Antoine Pitrou
Barry Warsaw
Emile van Sebille
Ethan Furman
Georg Brandl
Glenn Linderman
Greg Ewing
Guido van Rossum
INADA Naoki
Jim J. Jewett
Kristján Valur Jónsson
Lennart Regebro
Mark Lawrence
Mark Shannon
Nick Coghlan
Paul Moore
Scott Dial
Serhiy Storchaka
Stephen J. Turnbull
Steven D'Aprano
Tres Seaver
Victor Stinner
