Migration from Python 2.7 and bytes formatting
As I see it, there are two separate goals in adding formatting methods to bytes. One is to make it easier to write new programs that manipulate byte data. Another is to make it easier to upgrade Python 2.x programs to Python 3.x. Here is an idea to better address these separate goals.
Introduce %-interpolation for bytes. Support the following format codes to aid in writing new code:
%b: insert arbitrary bytes (via __bytes__ or Py_buffer)
%[dox]: insert an integer, encoded as ASCII
%[eEfFgG]: insert a float, encoded as ASCII
%a: call ascii(), insert result
Add a command-line option, disabled by default, that enables the following format codes:
%s: if the object has __bytes__ or Py_buffer then insert it. Otherwise, call str() and encode with the 'ascii' codec
%r: call repr(), encode with the 'ascii' codec
%[iuX]: as per Python 2.x, for backwards compatibility
Introducing these extra codes and the command-line option will provide a more gradual upgrade path. The next step in porting could be to examine each %s inside bytes literals and decide if they should either be converted to %b or if the literal should be converted to a unicode literal. Any %r codes could likely be safely changed to %a.
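The format codes proposed above can be sketched in pure Python. This is only an illustration of the described semantics, not the patch itself: `format_field`, `_bytes_or_none`, and the `compat` argument (standing in for the proposed command-line option) are hypothetical names, and `memoryview` is used here as a stand-in for the Py_buffer check.

```python
def _bytes_or_none(obj):
    """Return obj as bytes via __bytes__ or the buffer protocol, else None."""
    if hasattr(type(obj), "__bytes__"):
        return bytes(obj)
    try:
        # memoryview() accepts any object that exports a buffer.
        return memoryview(obj).tobytes()
    except TypeError:
        return None


def format_field(code, obj, compat=False):
    """Return the bytes one format code would insert (hypothetical helper).

    compat=True stands in for the proposed command-line option.
    """
    if code == "b":
        data = _bytes_or_none(obj)
        if data is None:
            raise TypeError("%b requires __bytes__ or a buffer")
        return data
    if code in ("d", "o", "x", "e", "E", "f", "F", "g", "G"):
        # Numbers are formatted as text, then encoded as ASCII.
        return (("%" + code) % obj).encode("ascii")
    if code == "a":
        # ascii() output is always pure ASCII, so this cannot fail.
        return ascii(obj).encode("ascii")
    if compat:
        if code == "s":
            data = _bytes_or_none(obj)
            if data is None:
                # Fall back to str() and a strict ASCII encode.
                data = str(obj).encode("ascii")
            return data
        if code == "r":
            return repr(obj).encode("ascii")
        if code in ("i", "u", "X"):
            return (("%" + code) % obj).encode("ascii")
    raise ValueError("unsupported format code %r" % code)
```

With this sketch, `format_field("s", 42)` is rejected unless `compat=True` is passed, which mirrors the idea that the compatibility codes are off by default.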
On 1/17/2014 2:49 AM, Neil Schemenauer wrote:
As I see it, there are two separate goals in adding formatting methods to bytes. One is to make it easier to write new programs that manipulate byte data. Another is to make it easier to upgrade Python 2.x programs to Python 3.x. Here is an idea to better address these separate goals.
Introduce %-interpolation for bytes. Support the following format codes to aid in writing new code:
%b: insert arbitrary bytes (via __bytes__ or Py_buffer)
%[dox]: insert an integer, encoded as ASCII
%[eEfFgG]: insert a float, encoded as ASCII
%a: call ascii(), insert result
Add a command-line option, disabled by default, that enables the following format codes:
%s: if the object has __bytes__ or Py_buffer then insert it. Otherwise, call str() and encode with the 'ascii' codec
%r: call repr(), encode with the 'ascii' codec
%[iuX]: as per Python 2.x, for backwards compatibility
Introducing these extra codes and the command-line option will provide a more gradual upgrade path. The next step in porting could be to examine each %s inside bytes literals and decide if they should either be converted to %b or if the literal should be converted to a unicode literal. Any %r codes could likely be safely changed to %a.
-1 overall.
Not worth the extra complexity in documentation and command line parameters.
%s, since it cannot be used for strings of characters (str) anyway, might as well be used for strings of bytes, and of necessity for single-code-base porting, must be usable in that manner.
I would give +.5 to the idea of supporting %a in Python 3.
I would give +.2 for %r as a synonym for %a in Python 3. %r and %a don't produce fixed-width fields, so they are likely used in places where the exact length in bytes is flexible, and in ASCII segments of the byte stream... supporting them both with the semantics of %a might be useful.
Glenn Linderman <v+python@g.nevcal.com> wrote:
-1 overall.
Not worth the extra complexity in documentation and command line parameters.
Really? It's less than 20 lines of code to implement, and probably similar to document. With millions, maybe billions, of lines of existing Python 2.x code to port, I'm dismayed to hear this objection. Time to take a break from python-dev; I've got paying work to do, programming in Python 2.x.
Neil
I've refined this idea a little in my latest PEP 461 patch (issue 20284). Continuing to use %s instead of introducing %b seems better. I've called the command-line option -2; it could be used to enable other similar porting aids.
I'd like to try porting code making use of the -2 feature to see how helpful it is. The behavior is partway between Python 2.x laziness and Python 3.x strictness in terms of specifying encodings.
Python 2.x:
- coerce byte strings to unicode strings to avoid making a decision about encoding
- when writing a unicode string to a bytes stream without a specified encoding, encode with ASCII. Blow up with an exception if a non-ASCII character is encountered, often far from where the real bug is.
Python 3.x:
- refuse to accept unicode strings where bytes are expected, require explicit encoding to be performed
Python 3.x with -2 command-line option:
- when objects are formatted into bytes, immediately encode them using strict ASCII encoding.
No code would be considered fully ported to Python 3 unless it can run without the -2 command line option.
Neil
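A rough sketch of the "encode immediately, strictly as ASCII" behavior described for -2, next to plain Python 3 strictness. The helper `to_ascii_bytes` is a hypothetical stand-in for what the option would do at the point of formatting; it is not part of the patch.

```python
def to_ascii_bytes(obj):
    """Hypothetical stand-in: str() the object, then encode as strict ASCII."""
    return str(obj).encode("ascii", "strict")


# Plain Python 3: mixing str into bytes is refused outright.
try:
    b"User: " + "alice"
except TypeError as exc:
    strict_error = exc

# With the sketched -2 behavior, ASCII-only text passes through...
header = b"User: " + to_ascii_bytes("alice")

# ...while non-ASCII text fails at the formatting site itself,
# not later when the data reaches a stream.
try:
    to_ascii_bytes("Zo\u00eb")
except UnicodeEncodeError as exc:
    encode_error = exc
```

The point of the sketch is the location of the failure: the exception is raised where the object is formatted, which is closer to the real bug than a failure at write time.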
Neil Schemenauer writes:
I'd like to try porting code making use of the -2 feature to see how helpful it is. The behavior is partway between Python 2.x laziness and Python 3.x strictness in terms of specifying encodings.
Python 2.x: [...] Python 3.x: [...]
The above are descriptions of current behavior (ie, unchanged by PEPs 460, 461), and this:
Python 3.x with -2 command-line option:
- when objects are formatted into bytes, immediately encode them using strict ASCII encoding.
is the content of this proposal, is that right?
On 2014-01-18, Stephen J. Turnbull wrote:
The above are descriptions of current behavior (ie, unchanged by PEPs 460, 461), and this: [..] is the content of this proposal, is that right?
The proposal is that -2 enables the following:
- %r as an alias for %a (i.e. calls ascii())
- %s will fall back to calling PyObject_Str() and then call _PyUnicode_AsASCIIString(obj, "strict") to convert to bytes
That's it. After sleeping on it, I'm not sure that's enough Python 2.x compatibility to help a lot. I haven't ported much code to 3.x yet but I imagine the following are major challenges:
- comparisons between str and bytes always return unequal
- indexing/iterating bytes returns integers, not bytes objects
- concatenation of str and bytes fails (not so bad since a TypeError is generated right away).
Maybe the -2 command line option could revert to Python 2.x behavior for the above but I'm worried it might break working 3.x library code (the %r/%s change is very safe). I think I'll play with the idea and see which unit tests get broken.
Ideally, there would be warnings generated when each backwards compatible behavior kicks in; that would greatly help when fixing up code.
Neil
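The three porting challenges listed above are easy to reproduce in Python 3. The variable names below are only for illustration.

```python
text = "abc"
data = b"abc"

# 1. str and bytes never compare equal, even with identical ASCII content.
same = (text == data)

# 2. Indexing or iterating bytes yields integers; a slice yields bytes.
first_item = data[0]
first_slice = data[0:1]
items = list(data)

# 3. Concatenating str and bytes fails immediately with TypeError.
try:
    text + data
except TypeError as exc:
    concat_error = exc
```

Here `same` is `False`, `first_item` is `97`, `first_slice` is `b"a"`, and `items` is `[97, 98, 99]`. Only the third case announces itself with an exception; the first two silently produce values that Python 2.x code does not expect.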
Neil Schemenauer writes:
That's it. After sleeping on it, I'm not sure that's enough Python 2.x compatibility to help a lot. I haven't ported much code to 3.x yet but I imagine the following are major challenges:
- comparisons between str and bytes always return unequal
- indexing/iterating bytes returns integers, not bytes objects
- concatenation of str and bytes fails (not so bad since a TypeError is generated right away).
Experience shows these are rarely major challenges. The reason we are having this discussion is that if you are the kind of programmer who runs into challenges once, you are likely to run into all of the above and more, repeatedly, and addressing them using features available in Python up to v3.3 makes your code unreadable. In other words, it's like unemployment at 5%. It would be bearable (just) if the pain were shared by 100% of the people being 5% unemployed, but rather the burden falls on the 5% who are 100% unemployed.
Now, the problem that many existing libraries face is that they were designed for monolingual environments where text encodings are more or less ASCII compatible[1]. If you stay in the Python 2 world, you can "internationalize" with the existing design, more or less limp along, fixing encoding bugs as they arise (not "if" but "when", and it can take a decade to find them all). But Python 3 *strongly* discourages that policy.
From the point of view of design for the modern environment, such libraries really should have their I/O modules rewritten from scratch (not a huge job), and the necessary adjustments made in processing code (few but randomly dispersed through the code, and each a ticking time bomb for your users). But I stress that the problem here is that the design of such libraries is at fault, not Python 3. The world has changed.[2]
And then there are the remaining 5% or so that really need to work mostly in bytes, but want to use string formatting to format their byte streams. I used to think that this was just a porting convenience, but I was wrong. Code written this way is often more concise and more readable than code written using .join() or the struct module. It *should* be written using string formatting. And that's what PEPs 460 and 461 are intended to address.
We'll see what happens as these PEPs are implemented, but I suspect that we'll find that there are very few bandaids left that are of much use. That is, as I claimed above, for the remaining problematic libraries a redesign will be needed.
Footnotes:
[1] In the technical sense that you can rely on ASCII bytes to mean ASCII characters, not part of a non-ASCII character.
[2] And if the world *hasn't* changed for your application, what's wrong with staying with Python 2?
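To illustrate the readability point about .join() versus string formatting, here is a small comparison. It assumes Python 3.5 or later, where %-formatting on bytes is available through PEP 461; the function names and the header example are hypothetical.

```python
def header_join(name, length):
    """Build a header line from pieces with bytes.join()."""
    return b"".join([name, b": ", str(length).encode("ascii"), b"\r\n"])


def header_format(name, length):
    """Build the same header line with bytes %-formatting (Python 3.5+)."""
    return b"%b: %d\r\n" % (name, length)


line_a = header_join(b"Content-Length", 42)
line_b = header_format(b"Content-Length", 42)
```

Both calls produce `b"Content-Length: 42\r\n"`, but the formatted version shows the shape of the output in one literal, while the join version spreads it across a list and an explicit encode.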
A command line parameter??
The annoying part would be telling every single user to call Python with a certain argument and hope they read the README.
If it's a library, out of the question.
If it's a program, well, I hope your users read READMEs.
On Fri, Jan 17, 2014 at 4:49 AM, Neil Schemenauer <nas@arctrix.com> wrote:
As I see it, there are two separate goals in adding formatting methods to bytes. One is to make it easier to write new programs that manipulate byte data. Another is to make it easier to upgrade Python 2.x programs to Python 3.x. Here is an idea to better address these separate goals.
[...]
-- Ryan When your hammer is C++, everything begins to look like a thumb.
On 2014-01-17, Ryan Gonzalez wrote:
A command line parameter??
I believe it has to be a global flag. A __future__ statement will not work. Probably we should allow the flag to be set with an environment variable as well.
The annoying part would be telling every single user to call Python with a certain argument and hope they read the README.
If it's a library, out of the question.
If it's a program, well, I hope your users read READMEs.
The purpose of the command line parameter is not for end users. It is intended to help developers port millions of lines of existing Python 2.x code. I'm very sad if Python core developers don't realize the enormity of the task and don't continue to make efforts to make it easier. Regards, Neil
Regardless, I still feel the introduction of a switch and all that stuff is too complicated. I understand your position, since all my applications are written in Python 2 (except one). However, I don't think this is the best solution.
On Fri, Jan 17, 2014 at 2:19 PM, Neil Schemenauer <nas@arctrix.com> wrote:
On 2014-01-17, Ryan Gonzalez wrote:
A command line parameter??
[...]