PEP460 thoughts from a Mercurial dev

(sorry for not piling on any existing threads - I don't subscribe to python-dev due to lack of time)
Brett Cannon asked me to chime in - I haven't actually read the very long thread at this point, I'm just providing responses to things Brett mentioned:
1) What do we need in terms of functionality
Best guess, %s, %d, and %f. I've not done a full audit of the code, but some limited looking over the grep hits for % in .py files suggests I'm right, and we could even do without %f (we only use that for 'hg --time' output, which we could do in unicode).
We also need some way to emit raw bytes (in potentially mixed encodings, yes I know this is "doing it wrong") to stdout/stderr (example: someone changes a file from latin1 to utf8, and then wants to see the resulting diff).
2) Would having it as an external library that worked with Python 2 help?
Probably, IF it came with 2.4 support (RHEL support, basically), and we could bundle it in our source tree. It's been extremely valuable to have the install only depend on a working C compiler and Python.
3) If this does go in, how long would it take us to port Mercurial to py3? Would it being in 3.5 hold us up?
I'm honestly not sure. I'm still in the outermost layers of this yak shave: fixing cyclic imports. I'll know more when I can at least get 'hg version' to print its own version, because at that point the testsuite failures might be informative. I'd honestly _rather_ this went into 3.5 *and* got lots of validation by both us and twisted (the other folks that care?) before becoming set in stone by a release. Does that make sense?
4) Do we care if it's .format()/%, or could it be in the stdlib?
It'd be really nice to not have to boil the oceans as far as editing everyplace in the codebase that does % today. If we do have to do that, it's not going to be much more helpful than something like:
    def maybestr(a):
        # latin1 maps bytes 0-255 to code points 0-255, so it round-trips
        if isinstance(a, bytes):
            return a.decode('latin1')
        return a
    def sprintf(fmt, *args):
        # % needs a tuple on the right-hand side, not a list
        return (fmt.decode('latin1') % tuple(maybestr(a) for a in args)).encode('latin1')
or similar. That was (roughly) what I was figuring I'd do today without any formal bytes-string-formatting support.
He also mentioned that some are calling for a shortened 3.5 release cycle - I'd rather not see that happen, for the aforementioned reason of wanting time to make sure this is Right - it'd be a shame to do the work and rush it out only to find something missing in an important way.
Feel free to ask further questions - I'll try to respond promptly.
AF
(For those curious: my hg-on-py3 repo isn't published at the moment because I rebuilt the server it lived on and I forgot to publish it. I'll rectify that sometime this week, I hope, but it's really totally nonfunctional due to cyclic imports.)

On 13 January 2014 23:57, Augie Fackler <raf@durin42.com> wrote:
(sorry for not piling on any existing threads - I don't subscribe to python-dev due to lack of time)
Brett Cannon asked me to chime in - I haven't actually read the very long thread at this point, I'm just providing responses to things Brett mentioned:
1) What do we need in terms of functionality
Best guess, %s, %d, and %f. I've not done a full audit of the code, but some limited looking over the grep hits for % in .py files suggests I'm right, and we could even do without %f (we only use that for 'hg --time' output, which we could do in unicode).
I think PEP 460 will have you covered there, or hopefully asciistr on 3.3+
We also need some way to emit raw bytes (in potentially mixed encodings, yes I know this is "doing it wrong") to stdout/stderr (example: someone changes a file from latin1 to utf8, and then wants to see the resulting diff).
Writing to sys.stdout.buffer may work for that, or else being able to change the encoding of an existing stream. For the latter, Victor had a working patch to _pyio at http://bugs.python.org/issue15216 and general consensus that the semantics were sensible, but it needs to be worked up into a full patch that covers the C version as well (I tried to muster some helpers for that in the leadup to 3.4 feature freeze, but unfortunately without any luck)
2) Would having it as an external library that worked with Python 2 help?
Probably, IF it came with 2.4 support (RHEL support, basically), and we could bundle it in our source tree. It's been extremely valuable to have the install only depend on a working C compiler and Python.
asciicompat.asciistr is just an alias for str on Python 2.x, so if we get that working, it may be something you could vendor into Mercurial for Python 3.3+ support. (There will likely be gaps in what asciistr can do due to interoperability issues in the core types, but the PEP 393 changes to the internal representation mean it should be able to get us pretty close)
3) If this does go in, how long would it take us to port Mercurial to py3? Would it being in 3.5 hold us up?
I'm honestly not sure. I'm still in the outermost layers of this yak shave: fixing cyclic imports. I'll know more when I can at least get 'hg version' to print its own version, because at that point the testsuite failures might be informative. I'd honestly _rather_ this went into 3.5 *and* got lots of validation by both us and twisted (the other folks that care?) before becoming set in stone by a release. Does that make sense?
Yes, that actually makes a lot of sense to me - there's no point in us rushing to get this into 3.4 and then you folks discovering in 6 months it doesn't quite work for you, and then having to wait for 3.5 anyway (or, worse, Python 3 being locked into a solution that doesn't work for you by its own internal backwards compatibility requirements).
4) Do we care if it's .format()/%, or could it be in the stdlib?
It'd be really nice to not have to boil the oceans as far as editing everyplace in the codebase that does % today. If we do have to do that, it's not going to be much more helpful than something like:
    def maybestr(a):
        if isinstance(a, bytes):
            return a.decode('latin1')
        return a
    def sprintf(fmt, *args):
        return (fmt.decode('latin1') % tuple(maybestr(a) for a in args)).encode('latin1')
or similar. That was (roughly) what I was figuring I'd do today without any formal bytes-string-formatting support.
Agreed - I think the two solutions that potentially make the most sense are PEP 460 and an interoperable third party type like asciistr. They each have different pros and cons, so I'm actually currently a fan of doing both (if Guido is amenable to my suggestion of providing both ASCII compatible and binary interpolation).
He also mentioned that some are calling for a shortened 3.5 release cycle - I'd rather not see that happen, for the aforementioned reason of wanting time to make sure this is Right - it'd be a shame to do the work and rush it out only to find something missing in an important way.
By shortened, we're mostly talking about ensuring 3.5 is published before the 2.7.9 maintenance release. So early-to-mid 2015 rather than the more typical late 2015.
Feel free to ask further questions - I'll try to respond promptly.
Thanks for the contribution! I found it very helpful :)
Cheers,
Nick.
--
Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
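The sys.stdout.buffer suggestion in the reply above can be sketched roughly as follows on Python 3. The helper name write_raw and the sample diff lines are hypothetical, made up for illustration; only sys.stdout.buffer is the real interface, and it may be missing when sys.stdout has been replaced by an object with no binary layer.

```python
import sys

def write_raw(chunks, stream=None):
    """Write byte strings to a text stream's binary layer without decoding."""
    stream = sys.stdout if stream is None else stream
    # Flush pending text first so text and raw bytes keep their order.
    stream.flush()
    # Assumption: the stream exposes .buffer (true for the default
    # sys.stdout, not for replacements such as io.StringIO).
    out = stream.buffer
    for chunk in chunks:
        out.write(chunk)
    out.flush()

# Hypothetical diff of a file converted from latin1 to utf8: the two
# lines use different encodings and must reach the terminal unchanged.
diff = [
    "-caf\u00e9\n".encode("latin1"),
    "+caf\u00e9\n".encode("utf-8"),
]

if __name__ == "__main__" and hasattr(sys.stdout, "buffer"):
    write_raw(diff)
```

Because the bytes bypass the text layer entirely, no single output encoding has to be valid for both lines, which is the mixed-encoding case described in the original message.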

On Mon, Jan 13, 2014 at 8:51 AM, Nick Coghlan <ncoghlan@gmail.com> wrote:
On 13 January 2014 23:57, Augie Fackler <raf@durin42.com> wrote:
1) What do we need in terms of functionality
Best guess, %s, %d, and %f. I've not done a full audit of the code, but some limited looking over the grep hits for % in .py files suggests I'm right, and we could even do without %f (we only use that for 'hg --time' output, which we could do in unicode).
I think PEP 460 will have you covered there, or hopefully asciistr on 3.3+
I'm confused on how PEP 460 would help -- Augie mentioned %d, which it excludes.
--
--Guido van Rossum (python.org/~guido)

On Mon, Jan 13, 2014 at 12:34 PM, Guido van Rossum <guido@python.org> wrote:
On Mon, Jan 13, 2014 at 8:51 AM, Nick Coghlan <ncoghlan@gmail.com> wrote:
On 13 January 2014 23:57, Augie Fackler <raf@durin42.com> wrote:
1) What do we need in terms of functionality
Best guess, %s, %d, and %f. I've not done a full audit of the code, but some limited looking over the grep hits for % in .py files suggests I'm right, and we could even do without %f (we only use that for 'hg --time' output, which we could do in unicode).
I think PEP 460 will have you covered there, or hopefully asciistr on 3.3+
I'm confused on how PEP 460 would help -- Augie mentioned %d, which it excludes.
Yes - not having %d makes this much much less useful to me. For my part, it'd probably be fine if we could do %s (which would handle an RHS that was bytes, and only bytes, no handling of str or __bytes__-type stuff at all) and %d (with all the usual format modifiers, and would result in an ascii-compatible sequence of bytes all the time).

On Mon, Jan 13, 2014 at 9:37 AM, Augie Fackler <raf@durin42.com> wrote:
On Mon, Jan 13, 2014 at 12:34 PM, Guido van Rossum <guido@python.org> wrote:
On Mon, Jan 13, 2014 at 8:51 AM, Nick Coghlan <ncoghlan@gmail.com> wrote:
On 13 January 2014 23:57, Augie Fackler <raf@durin42.com> wrote:
1) What do we need in terms of functionality
Best guess, %s, %d, and %f. I've not done a full audit of the code, but some limited looking over the grep hits for % in .py files suggests I'm right, and we could even do without %f (we only use that for 'hg --time' output, which we could do in unicode).
I think PEP 460 will have you covered there, or hopefully asciistr on 3.3+
I'm confused on how PEP 460 would help -- Augie mentioned %d, which it excludes.
Yes - not having %d makes this much much less useful to me.
For my part, it'd probably be fine if we could do %s (which would handle an RHS that was bytes, and only bytes, no handling of str or __bytes__-type stuff at all) and %d (with all the usual format modifiers, and would result in an ascii-compatible sequence of bytes all the time).
Would it be okay if instead of %s you had to use %b for those semantics? (%d would still exist)
--
--Guido van Rossum (python.org/~guido)

On Mon, Jan 13, 2014 at 12:39 PM, Guido van Rossum <guido@python.org> wrote:
On Mon, Jan 13, 2014 at 9:37 AM, Augie Fackler <raf@durin42.com> wrote:
On Mon, Jan 13, 2014 at 12:34 PM, Guido van Rossum <guido@python.org> wrote:
On Mon, Jan 13, 2014 at 8:51 AM, Nick Coghlan <ncoghlan@gmail.com> wrote:
On 13 January 2014 23:57, Augie Fackler <raf@durin42.com> wrote:
1) What do we need in terms of functionality
Best guess, %s, %d, and %f. I've not done a full audit of the code, but some limited looking over the grep hits for % in .py files suggests I'm right, and we could even do without %f (we only use that for 'hg --time' output, which we could do in unicode).
I think PEP 460 will have you covered there, or hopefully asciistr on 3.3+
I'm confused on how PEP 460 would help -- Augie mentioned %d, which it excludes.
Yes - not having %d makes this much much less useful to me.
For my part, it'd probably be fine if we could do %s (which would handle an RHS that was bytes, and only bytes, no handling of str or __bytes__-type stuff at all) and %d (with all the usual format modifiers, and would result in an ascii-compatible sequence of bytes all the time).
Would it be okay if instead of %s you had to use %b for those semantics? (%d would still exist)
Probably, but it'd be quite painful, since we'd have to do some kind of .sub() call all over the place to remain compatible with 2.4 and 2.6. Dropping 2.4 might be possible in the 3.5 timeframe - 2.6 almost certainly not.
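One possible shape for the .sub() call mentioned above, assuming the format strings stay written with %s in the source and are rewritten to %b only where the interpreter requires it. The name s_to_b and the regular expression are hypothetical illustrations, not Mercurial code, and actually formatting with the result assumes an interpreter whose bytes type accepts %b (Python 3.5 and later do).

```python
import re

# Matches one printf-style spec: "%%", or "%" + optional "(key)" +
# flags/width/precision + optional length modifier + conversion letter.
_SPEC = re.compile(rb"%(?:%|(?:\([^)]*\))?[^a-zA-Z%]*[hlL]?[a-zA-Z])")

def s_to_b(fmt):
    """Return the bytes format string fmt with every %s rewritten to %b."""
    def swap(match):
        spec = match.group(0)
        # Only the conversion letter changes; key, flags and width are kept.
        if spec.endswith(b"s"):
            return spec[:-1] + b"b"
        return spec
    return _SPEC.sub(swap, fmt)

print(s_to_b(b"%s: %d files, 100%% of %(name)s"))
```

The literal %% and the %d spec pass through untouched. Having to route every format string through a wrapper like this is the "all over the place" cost described above.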

On Mon, 13 Jan 2014 09:34:39 -0800 Guido van Rossum <guido@python.org> wrote:
On Mon, Jan 13, 2014 at 8:51 AM, Nick Coghlan <ncoghlan@gmail.com> wrote:
On 13 January 2014 23:57, Augie Fackler <raf@durin42.com> wrote:
1) What do we need in terms of functionality
Best guess, %s, %d, and %f. I've not done a full audit of the code, but some limited looking over the grep hits for % in .py files suggests I'm right, and we could even do without %f (we only use that for 'hg --time' output, which we could do in unicode).
I think PEP 460 will have you covered there, or hopefully asciistr on 3.3+
I'm confused on how PEP 460 would help -- Augie mentioned %d, which it excludes.
Serhiy did a survey of formatting codes in the Mercurial sources: https://mail.python.org/pipermail/python-dev/2014-January/130969.html
Regards
Antoine.

Antoine Pitrou <solipsis <at> pitrou.net> writes:
On Mon, 13 Jan 2014 09:34:39 -0800 Guido van Rossum <guido <at> python.org> wrote:
On Mon, Jan 13, 2014 at 8:51 AM, Nick Coghlan <ncoghlan <at> gmail.com> wrote:
On 13 January 2014 23:57, Augie Fackler <raf <at> durin42.com> wrote:
1) What do we need in terms of functionality
Best guess, %s, %d, and %f. I've not done a full audit of the code, but some limited looking over the grep hits for % in .py files suggests I'm right, and we could even do without %f (we only use that for 'hg --time' output, which we could do in unicode).
I think PEP 460 will have you covered there, or hopefully asciistr on 3.3+
I'm confused on how PEP 460 would help -- Augie mentioned %d, which it excludes.
Serhiy did a survey of formatting codes in the Mercurial sources: https://mail.python.org/pipermail/python-dev/2014-January/130969.html
Note that a lot of those are in debug code (eg the only %f I've spotted is), or are time format specifiers (which can be unicode just fine). A few others (eg %ln) are for our internal revset format-string language, so this overstates what we'd need in bytes by a little. %f would probably be good too, as I look a little more. (Please don't remove me from the CC list - I could only respond via gmane because I'm not subscribed to python-dev.)
Regards
Antoine.

On Mon, 13 Jan 2014 18:51:32 +0000 (UTC) Augie Fackler <raf@durin42.com> wrote:
(Please don't remove me from the CC list - I could only respond via gmane because I'm not subscribed to python-dev.)
Responding via gmane is what I do, too :-) My NNTP client doesn't allow SMTP / NNTP mixed postings, so I'm forced to remove you from CC.
Regards
Antoine.

On 14 Jan 2014 03:34, "Guido van Rossum" <guido@python.org> wrote:
On Mon, Jan 13, 2014 at 8:51 AM, Nick Coghlan <ncoghlan@gmail.com> wrote:
On 13 January 2014 23:57, Augie Fackler <raf@durin42.com> wrote:
1) What do we need in terms of functionality
Best guess, %s, %d, and %f. I've not done a full audit of the code, but some limited looking over the grep hits for % in .py files suggests I'm right, and we could even do without %f (we only use that for 'hg --time' output, which we could do in unicode).
I think PEP 460 will have you covered there, or hopefully asciistr on 3.3+
I'm confused on how PEP 460 would help -- Augie mentioned %d, which it excludes.
I meant your proposed more lenient version (since there's no need for the binary only version to be in the common 2/3 subset). Cheers, Nick.
-- --Guido van Rossum (python.org/~guido)

On 13.01.14 15:57, Augie Fackler wrote:
1) What do we need in terms of functionality
Best guess, %s, %d, and %f. I've not done a full audit of the code, but some limited looking over the grep hits for % in .py files suggests I'm right, and we could even do without %f (we only use that for 'hg --time' output, which we could do in unicode).
Most popular formatting codes in Mercurial sources (excluding %Y, %M, etc):
   2519 %s
    493 %d
    102 %r
     33 %i
     23 %ld
     19 %ln
     12 %.3f
     10 %.1f
      9 %(val)r
      9 %p
      9 %.2f
%s covers almost 80% of use cases and %d covers almost 20%. %r covers about 3%, %f covers less than 1%. So I think anything except %s and %d can be ignored.
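As a rough check of the shares quoted above, the listed counts can be tallied directly. This is only a sketch: the counts are copied from the message, but the grouping of %i and %ld with %d, and of %(val)r with %r, is an assumption about how the figures were meant, and the total covers only the codes listed here.

```python
# Counts copied from the survey in the message above.
counts = {
    "%s": 2519, "%d": 493, "%r": 102, "%i": 33, "%ld": 23, "%ln": 19,
    "%.3f": 12, "%.1f": 10, "%(val)r": 9, "%p": 9, "%.2f": 9,
}

# Assumed grouping of related codes (not stated in the message).
families = {
    "string (%s)": ["%s"],
    "integer (%d, %i, %ld)": ["%d", "%i", "%ld"],
    "repr (%r, %(val)r)": ["%r", "%(val)r"],
    "float (%.Nf)": ["%.3f", "%.1f", "%.2f"],
}

total = sum(counts.values())

def share(codes):
    """Percentage of all listed occurrences taken by the given codes."""
    return 100.0 * sum(counts[c] for c in codes) / total

for name, codes in families.items():
    print("%-24s %5.1f%%" % (name, share(codes)))
```

With this grouping, %s accounts for roughly 78% of the listed occurrences and the integer codes for roughly 17%, so the two together cover about 95% of them.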
participants (5)
- Antoine Pitrou
- Augie Fackler
- Guido van Rossum
- Nick Coghlan
- Serhiy Storchaka