Status of json (simplejson) in cpython
Hi all, it all started with issue10019. The version we have in cpython of json is simplejson 2.0.9 highly patched (either because it was converted to py3k, and because of the normal flow of issues/bugfixes) while upstream have already released 2.1.13 . Their 2 roads had diverged a lot, and since this blocks any further update of cpython's json from upstream, I'd like to close this gap. This isn't exactly an easy task, and this email is more about a brainstorming on the ways we have to achieve the goal: being able to upgrade json to 2.1.13. Luckily, upstream is receptive for patches, so part of the job is to forward patches written for cpython not already in the upstream code. But how am I going to do this? let's do a brain-dump: - the history goes back at changeset f686aced02a3 (May 2009, wow) when 2.0.9 was merged on trunk - I can navigate from that CS up to tip, and examine the diffs and see if they apply to 2.1.3 and prepare a set of patches to be forwarded - part of those diffs is about py3k conversion, that probably needs to be extended to other part of the upstream code not currently in cpython. For those "new" code parts, do you have some guides about porting a project to py3k? it would be my first time and other than building it and running it with python3 i don't know what to do :) - once (and if :) I reach the point where I've all the relevant patches applied on 2.1.3 what's the next step? -- take 2.1.3 + patches, copy it on Lib/json + test + Modules and see what breaks? -- what about the doc? (uh luckily I just noticed it's already in the upstream repo, so another thing to sync) - what are we going to do in the long run? how can we assure we'll be having a healthy collaboration with upsteam? f.e. in case a bug is reported (and later on fixed) in cpython? is there a policy for projects present in cpython and also maintained elsewhere? At the end: do you have some suggestions that might this task be successful? 
Advice on the steps above, tips about the merge, something like this. Thanks a lot for your time, -- Sandro Tosi (aka morph, morpheus, matrixhasu) My website: http://matrixhasu.altervista.org/ Me at Debian: http://wiki.debian.org/SandroTosi
On Thu, 14 Apr 2011 21:22:27 +0200
Sandro Tosi
But how am I going to do this? let's do a brain-dump:
IMHO, you should compute the diff between 2.0.9 and 2.1.3 and try to apply it to the CPython source tree (you'll probably have to change the file paths).
- what are we going to do in the long run? How can we ensure a healthy collaboration with upstream?
Tricky question... :/ Regards Antoine.
- what are we going to do in the long run? How can we ensure a healthy collaboration with upstream, e.g. in case a bug is reported (and later fixed) in cpython? Is there a policy for projects present in cpython and also maintained elsewhere?
At the end: do you have some suggestions that might help make this task successful? Advice on the steps above, tips about the merge, something like this.
I think it would be useful if the porting were done afresh, in a way that allows upstream to provide 2.x and 3.x out of a single code base, and to get this port merged into upstream. If there are bug fixes that we made to the json algorithms proper, these would have to be identified and redone, or simply ignored (hoping that somebody will re-report them if the issue persists). A necessary prerequisite is that we have a dedicated maintainer for the json package. Regards, Martin
On Apr 14, 2011, at 12:22 PM, Sandro Tosi wrote:
The version of json we have in cpython is simplejson 2.0.9, highly patched (both because it was converted to py3k and because of the normal flow of issues/bugfixes), while upstream has already released 2.1.13.
Their two roads have diverged a lot, and since this blocks any further update of cpython's json from upstream, I'd like to close this gap.
Are you proposing updates to the Python 3.3 json module to include newer features like use_decimal and changing the indent argument from an integer to a string?
- what are we going to do in the long run?
If Bob shows no interest in Python 3, then the code bases will probably continue to diverge. Since the JSON spec is set in stone, the changes will mostly be about API (indentation, object conversion, etc) and optimization. I presume the core parsing logic won't be changing much. Raymond
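For readers unfamiliar with the two features Raymond names, here is a rough sketch of how their stdlib-side counterparts look, as far as I can tell: Python 3.2's json.dumps accepts indent as a string, while use_decimal itself is simplejson-only, so Decimal handling is approximated below with parse_float and a default hook.

```python
import json
from decimal import Decimal

# indent as a string instead of an integer (accepted by stdlib json
# since Python 3.2)
print(json.dumps({"a": 1}, indent="\t"))

# decoding floats as Decimal -- the decoding half of what
# simplejson's use_decimal flag does in one switch
doc = json.loads('{"price": 19.99}', parse_float=Decimal)
assert doc["price"] == Decimal("19.99")

# encoding a Decimal back out needs an explicit default hook in the
# stdlib (simplejson's use_decimal handles this automatically)
assert json.dumps(doc, default=str) == '{"price": "19.99"}'
```

Note that the default=str hook turns the Decimal into a JSON string rather than a bare number, which is one of the places where the stdlib approximation falls short of simplejson's behaviour.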
On Thu, Apr 14, 2011 at 2:29 PM, Raymond Hettinger
On Apr 14, 2011, at 12:22 PM, Sandro Tosi wrote:
The version of json we have in cpython is simplejson 2.0.9, highly patched (both because it was converted to py3k and because of the normal flow of issues/bugfixes), while upstream has already released 2.1.13.
Their two roads have diverged a lot, and since this blocks any further update of cpython's json from upstream, I'd like to close this gap.
Are you proposing updates to the Python 3.3 json module to include newer features like use_decimal and changing the indent argument from an integer to a string?
https://github.com/simplejson/simplejson/blob/master/CHANGES.txt
- what are we going to do in the long run?
If Bob shows no interest in Python 3, then the code bases will probably continue to diverge.
I don't have any real interest in Python 3, but if someone contributes the code to make simplejson work in Python 3, I'm willing to apply the patches and run the tests against any future changes. The porting work to make it suitable for the standard library at that point should be something that can be automated, since it will be moving some files around and changing the string simplejson to json in a whole bunch of places.
Since the JSON spec is set in stone, the changes will mostly be about API (indentation, object conversion, etc) and optimization. I presume the core parsing logic won't be changing much.
Actually the core parsing logic is very different (and MUCH faster), which is why the merge is tricky. There's the potential for it to change more in the future, there's definitely more room for optimization. Probably not in the pure python parser, but the C one. -bob
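The rename half of the automation Bob describes (changing the string simplejson to json) could be sketched roughly as follows; stdlibify is a hypothetical helper for illustration, not part of either project, and a real tool would also relocate files and rename the C extension module.

```python
import re

def stdlibify(source):
    """Mechanically rewrite 'simplejson' references to 'json' so the
    package can live in the stdlib as 'json'. Rough sketch only."""
    # word-boundary match so identifiers merely containing
    # 'simplejson' (e.g. 'mysimplejsonlib') are left alone
    return re.sub(r'\bsimplejson\b', 'json', source)

assert stdlibify("from simplejson import dumps") == "from json import dumps"
assert stdlibify("x = simplejson.loads(s)") == "x = json.loads(s)"
```

A textual rename like this is crude; running it over docstrings and comments is usually fine, but anything subtler (e.g. version strings or URLs mentioning simplejson) would need case-by-case review.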
Since the JSON spec is set in stone, the changes will mostly be about API (indentation, object conversion, etc) and optimization. I presume the core parsing logic won't be changing much.
Actually the core parsing logic is very different (and MUCH faster),
Are you talking about the Python logic or the C logic?
On Friday, April 15, 2011, Antoine Pitrou
Since the JSON spec is set in stone, the changes will mostly be about API (indentation, object conversion, etc) and optimization. I presume the core parsing logic won't be changing much.
Actually the core parsing logic is very different (and MUCH faster),
Are you talking about the Python logic or the C logic?
Both, actually. IIRC simplejson in pure python typically beats json with its C extension. -bob
Le vendredi 15 avril 2011 à 14:18 -0700, Bob Ippolito a écrit :
On Friday, April 15, 2011, Antoine Pitrou
wrote: Since the JSON spec is set in stone, the changes will mostly be about API (indentation, object conversion, etc) and optimization. I presume the core parsing logic won't be changing much.
Actually the core parsing logic is very different (and MUCH faster),
Are you talking about the Python logic or the C logic?
Both, actually. IIRC simplejson in pure python typically beats json with its C extension.
Really? It would be nice to see some concrete benchmarks against both repo tips. Regards Antoine.
On Fri, Apr 15, 2011 at 2:20 PM, Antoine Pitrou
Le vendredi 15 avril 2011 à 14:18 -0700, Bob Ippolito a écrit :
On Friday, April 15, 2011, Antoine Pitrou
wrote: Since the JSON spec is set in stone, the changes will mostly be about API (indentation, object conversion, etc) and optimization. I presume the core parsing logic won't be changing much.
Actually the core parsing logic is very different (and MUCH faster),
Are you talking about the Python logic or the C logic?
Both, actually. IIRC simplejson in pure python typically beats json with its C extension.
Really? It would be nice to see some concrete benchmarks against both repo tips.
Maybe in a few weeks or months when I have time to finish up the benchmarks that I was working on... but it should be pretty easy for anyone to show that the version in CPython is very slow (and uses a lot more memory) in comparison to simplejson. -bob
On Fri, 15 Apr 2011 14:27:04 -0700
Bob Ippolito
On Fri, Apr 15, 2011 at 2:20 PM, Antoine Pitrou
wrote: Le vendredi 15 avril 2011 à 14:18 -0700, Bob Ippolito a écrit :
On Friday, April 15, 2011, Antoine Pitrou
wrote: Since the JSON spec is set in stone, the changes will mostly be about API (indentation, object conversion, etc) and optimization. I presume the core parsing logic won't be changing much.
Actually the core parsing logic is very different (and MUCH faster),
Are you talking about the Python logic or the C logic?
Both, actually. IIRC simplejson in pure python typically beats json with its C extension.
Really? It would be nice to see some concrete benchmarks against both repo tips.
Maybe in a few weeks or months when I have time to finish up the benchmarks that I was working on... but it should be pretty easy for anyone to show that the version in CPython is very slow (and uses a lot more memory) in comparison to simplejson.
Well, here's a crude microbenchmark. I'm comparing 2.6+simplejson 2.1.3 to 3.2+json, so I'm avoiding integers:

* json.dumps:

$ python -m timeit -s "from simplejson import dumps, loads; \
    d = dict((str(i), str(i)) for i in range(1000))" \
    "dumps(d)"

- 2.6+simplejson: 372 usec per loop
- 3.2+json: 352 usec per loop

* json.loads:

$ python -m timeit -s "from simplejson import dumps, loads; \
    d = dict((str(i), str(i)) for i in range(1000)); s = dumps(d)" \
    "loads(s)"

- 2.6+simplejson: 224 usec per loop
- 3.2+json: 233 usec per loop

The runtimes look quite similar. Antoine.
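For anyone who wants to rerun the comparison, here is an in-process equivalent of the shell commands above, using the same data; absolute numbers will of course vary by machine and interpreter version.

```python
import json
import timeit

# same workload as the shell microbenchmark: 1000 short string pairs
d = dict((str(i), str(i)) for i in range(1000))
s = json.dumps(d)

n = 200
dumps_t = timeit.timeit(lambda: json.dumps(d), number=n) / n * 1e6
loads_t = timeit.timeit(lambda: json.loads(s), number=n) / n * 1e6
print("dumps: %.0f usec per loop" % dumps_t)
print("loads: %.0f usec per loop" % loads_t)
```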
On Fri, Apr 15, 2011 at 4:12 PM, Antoine Pitrou
On Fri, 15 Apr 2011 14:27:04 -0700 Bob Ippolito
wrote: On Fri, Apr 15, 2011 at 2:20 PM, Antoine Pitrou
wrote: Le vendredi 15 avril 2011 à 14:18 -0700, Bob Ippolito a écrit :
On Friday, April 15, 2011, Antoine Pitrou
wrote: Since the JSON spec is set in stone, the changes will mostly be about API (indentation, object conversion, etc) and optimization. I presume the core parsing logic won't be changing much.
Actually the core parsing logic is very different (and MUCH faster),
Are you talking about the Python logic or the C logic?
Both, actually. IIRC simplejson in pure python typically beats json with its C extension.
Really? It would be nice to see some concrete benchmarks against both repo tips.
Maybe in a few weeks or months when I have time to finish up the benchmarks that I was working on... but it should be pretty easy for anyone to show that the version in CPython is very slow (and uses a lot more memory) in comparison to simplejson.
Well, here's a crude microbenchmark. I'm comparing 2.6+simplejson 2.1.3 to 3.2+json, so I'm avoiding integers:
* json.dumps:
$ python -m timeit -s "from simplejson import dumps, loads; \
    d = dict((str(i), str(i)) for i in range(1000))" \
    "dumps(d)"
- 2.6+simplejson: 372 usec per loop
- 3.2+json: 352 usec per loop
* json.loads:
$ python -m timeit -s "from simplejson import dumps, loads; \
    d = dict((str(i), str(i)) for i in range(1000)); s = dumps(d)" \
    "loads(s)"
- 2.6+simplejson: 224 usec per loop
- 3.2+json: 233 usec per loop
The runtimes look quite similar.
That's the problem with trivial benchmarks. With more typical data (for us, anyway) you should see very different results. -bob
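To illustrate Bob's point: the flat dict of short strings in the benchmark above exercises very little of the parser. A payload closer to "typical" API data (nested objects, mixed scalar types, longer strings) stresses different code paths; the shape below is invented purely for illustration.

```python
import json

# an invented, API-like record: nesting, mixed types, longer strings
record = {
    "id": 0,
    "name": "example user",
    "active": True,
    "score": 97.5,
    "tags": ["alpha", "beta", "gamma"],
    "links": [{"url": "http://example.com/item/%d" % i} for i in range(10)],
    "bio": "lorem ipsum " * 20,
}
payload = [dict(record, id=i) for i in range(100)]

encoded = json.dumps(payload)
assert json.loads(encoded) == payload  # round-trips exactly
print("%d bytes of JSON" % len(encoded))
```

Timing dumps/loads over a structure like this, rather than a flat dict, is more likely to surface the differences Bob is alluding to.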
Sandro Tosi
The version of json we have in cpython is simplejson 2.0.9, highly patched (both because it was converted to py3k and because of the normal flow of issues/bugfixes), while upstream has already released 2.1.13.
I think you mean 2.1.3?
Their two roads have diverged a lot, and since this blocks any further update of cpython's json from upstream, I'd like to close this gap.
This isn't exactly an easy task, and this email is more of a brainstorm on the ways we have to achieve the goal: being able to upgrade json to 2.1.13.
Luckily, upstream is receptive to patches, so part of the job is to forward patches written for cpython that are not already in the upstream code.
But how am I going to do this? let's do a brain-dump:
- the history goes back to changeset f686aced02a3 (May 2009, wow) when 2.0.9 was merged on trunk
- I can navigate from that changeset up to tip, examine the diffs, see whether they apply to 2.1.3, and prepare a set of patches to be forwarded
- part of those diffs is about the py3k conversion, which probably needs to be extended to the parts of the upstream code not currently in cpython. For those "new" code parts, do you have some guides about porting a project to py3k? It would be my first time, and other than building it and running it with python3 I don't know what to do :)
- once (and if :) I reach the point where I have all the relevant patches applied on 2.1.3, what's the next step?
If it is generally considered desirable to maintain some synchrony between simplejson and stdlib json, then since Bob has stated that he has no interest in Python 3, it may be better to:

1. Convert the simplejson codebase so that it runs on both Python 2 and 3 (without running 2to3 on it). Once this is done, if upstream accepts these changes, ongoing maintenance will be fairly simple for upstream, and changes only really need to consider exception and string/byte literal syntax, for the most part.
2. Merge this new simplejson with stdlib json for 3.3.

I looked at step 1 a few weeks ago and have made some progress with it. I've just forked simplejson on GitHub and posted my changes to my fork: https://github.com/vsajip/simplejson

All 136 tests pass on Python 2.7 (just as a control/sanity check); on Python 3.2 there are 4 failures and 12 errors - see the complete results at https://gist.github.com/923019

I haven't looked at the C extension yet, just the Python code. I believe most of the test failures come down to string literals in the tests which should be bytes, e.g. test_unicode.py:test_ensure_ascii_false_bytestring_encoding.

So it looks quite encouraging, and if you think my approach has merit, please take a look at my fork and give feedback/join in! Note that I used the same approach when porting pip/virtualenv to Python 3, which seems to have gone quite smoothly :-) Regards, Vinay Sajip
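The kinds of changes step 1 involves can be sketched with a few generic single-codebase 2.x/3.x idioms; these are illustrative of the general pattern, not lines taken from the actual fork.

```python
# Generic idioms for code that runs unmodified on Python 2.6+ and 3.x.
import sys

PY3 = sys.version_info[0] >= 3

if PY3:
    text_type = str
    binary_type = bytes
else:
    text_type = unicode   # noqa: F821 -- only reached on Python 2
    binary_type = str

# Exception syntax: 'except E as e' is valid on 2.6+ and 3.x alike,
# unlike the old 'except E, e' form.
try:
    int("not a number")
except ValueError as exc:
    message = str(exc)

# Byte/string literals: b'...' works on 2.6+ and 3.x, but u'...' does
# not exist on 3.0-3.2, so a helper wraps unicode literals instead.
def u(s):
    return s if PY3 else s.decode('unicode_escape')

assert isinstance(b'{"a": 1}', binary_type)
assert isinstance(u('text'), text_type)
```

The u'...' literal was later restored in Python 3.3 (PEP 414), but a port that must also support 3.2 needs a helper like u() above.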
Sandro Tosi
Luckily, upstream is receptive to patches, so part of the job is to forward patches written for cpython that are not already in the upstream code.
Further to my earlier response to your post, I should mention that my fork of simplejson at https://github.com/vsajip/simplejson/ now passes all 136 tests on both Python 2.7 and 3.2 (I've not been able to test with 3.3a0 yet). No tests were skipped, though adjustments were made for binary/string literals and for one case where sorting was applied to incompatible types in the tests. Test output is at https://gist.github.com/923019

Bob - if you're reading this, what would you say to having a look at my fork and commenting on the feasibility of merging my changes back into your master? The changes are fairly easy to understand, all tests pass, and it's a 2.x/3.x single codebase, so maintenance should be easier than with multiple codebases. Admittedly I haven't looked at the C code yet, but that's next on my list. Regards, Vinay Sajip
Hello Vinay,
On Sat, 16 Apr 2011 09:50:25 +0000 (UTC)
Vinay Sajip
If it is generally considered desirable to maintain some synchrony between simplejson and stdlib json, then since Bob has stated that he has no interest in Python 3, it may be better to:
1. Convert the simplejson codebase so that it runs on both Python 2 and 3 (without running 2to3 on it). Once this is done, if upstream accepts these changes, ongoing maintenance will be fairly simple for upstream, and changes only really need to consider exception and string/byte literal syntax, for the most part. 2. Merge this new simplejson with stdlib json for 3.3.
What you're proposing doesn't address the question of who is going to do the ongoing maintenance. Bob apparently isn't interested in maintaining stdlib code, and python-dev members aren't interested in maintaining simplejson (assuming it would be at all possible). Since both groups of people want to work on separate codebases, I don't see how sharing a single codebase would be possible. Regards Antoine.
On Sat, Apr 16, 2011 at 16:19, Antoine Pitrou
What you're proposing doesn't address the question of who is going to do the ongoing maintenance. Bob apparently isn't interested in maintaining stdlib code, and python-dev members aren't interested in maintaining simplejson (assuming it would be at all possible). Since both groups of people want to work on separate codebases, I don't see how sharing a single codebase would be possible.
From reading this thread, it seems to me like the proposal is that Bob maintains a simplejson for both 2.x and 3.x and that the current stdlib json is replaced by a (trivially changed) version of simplejson.
Cheers, Dirkjan
Le samedi 16 avril 2011 à 16:42 +0200, Dirkjan Ochtman a écrit :
On Sat, Apr 16, 2011 at 16:19, Antoine Pitrou
wrote: What you're proposing doesn't address the question of who is going to do the ongoing maintenance. Bob apparently isn't interested in maintaining stdlib code, and python-dev members aren't interested in maintaining simplejson (assuming it would be at all possible). Since both groups of people want to work on separate codebases, I don't see how sharing a single codebase would be possible.
From reading this thread, it seems to me like the proposal is that Bob maintains a simplejson for both 2.x and 3.x and that the current stdlib json is replaced by a (trivially changed) version of simplejson.
The thing is, we want to bring our own changes to the json module and its tests (and have already done so, although some have been backported to simplejson). Regards Antoine.
On 2011-04-16, at 16:52 , Antoine Pitrou wrote:
Le samedi 16 avril 2011 à 16:42 +0200, Dirkjan Ochtman a écrit :
On Sat, Apr 16, 2011 at 16:19, Antoine Pitrou
wrote: What you're proposing doesn't address the question of who is going to do the ongoing maintenance. Bob apparently isn't interested in maintaining stdlib code, and python-dev members aren't interested in maintaining simplejson (assuming it would be at all possible). Since both groups of people want to work on separate codebases, I don't see how sharing a single codebase would be possible.
From reading this thread, it seems to me like the proposal is that Bob maintains a simplejson for both 2.x and 3.x and that the current stdlib json is replaced by a (trivially changed) version of simplejson.
The thing is, we want to bring our own changes to the json module and its tests (and have already done so, although some have been backported to simplejson).
Depending on what those changes are, would it not be possible to apply the vast majority of them to simplejson itself? Furthermore, now that python uses Mercurial, it should be possible (or even easy) to use a versioned queue (via MQ) for the trivial adaptation, and the temporary alterations (things which will likely be merged back into simplejson but are not yet, stuff like that) should it not?
Le samedi 16 avril 2011 à 17:07 +0200, Xavier Morel a écrit :
On 2011-04-16, at 16:52 , Antoine Pitrou wrote:
Le samedi 16 avril 2011 à 16:42 +0200, Dirkjan Ochtman a écrit :
On Sat, Apr 16, 2011 at 16:19, Antoine Pitrou
wrote: What you're proposing doesn't address the question of who is going to do the ongoing maintenance. Bob apparently isn't interested in maintaining stdlib code, and python-dev members aren't interested in maintaining simplejson (assuming it would be at all possible). Since both groups of people want to work on separate codebases, I don't see how sharing a single codebase would be possible.
From reading this thread, it seems to me like the proposal is that Bob maintains a simplejson for both 2.x and 3.x and that the current stdlib json is replaced by a (trivially changed) version of simplejson.
The thing is, we want to bring our own changes to the json module and its tests (and have already done so, although some have been backported to simplejson).
Depending on what those changes are, would it not be possible to apply the vast majority of them to simplejson itself?
Sure, but the thing is, I don't *think* we are interested in backporting stuff to simplejson much more than Bob is interested in porting stuff to the json module. I've contributed a couple of patches myself after they were integrated into CPython (they are part of the performance improvements Bob is talking about), but that was exceptional. Backporting a patch to another project with a different directory structure, slightly different code, etc. is tedious and not very rewarding for us Python core developers, when we could be doing other work in our limited free time. Also, some types of work would be tedious to backport, for example if we refactor the tests to test both the C and Python implementations.
Furthermore, now that python uses Mercurial, it should be possible (or even easy) to use a versioned queue (via MQ) for the trivial adaptation, and the temporary alterations (things which will likely be merged back into simplejson but are not yet, stuff like that) should it not?
Perhaps, perhaps not. That would require someone motivated to put it in place, ensure that it doesn't get in the way, document it, etc. Honestly, I don't think maintaining a single stdlib module should require such an amount of logistics. Regards Antoine.
Antoine Pitrou, 16.04.2011 16:19:
On Sat, 16 Apr 2011 09:50:25 +0000 (UTC) Vinay Sajip wrote:
If it is generally considered desirable to maintain some synchrony between simplejson and stdlib json, then since Bob has stated that he has no interest in Python 3, it may be better to:
1. Convert the simplejson codebase so that it runs on both Python 2 and 3 (without running 2to3 on it). Once this is done, if upstream accepts these changes, ongoing maintenance will be fairly simple for upstream, and changes only really need to consider exception and string/byte literal syntax, for the most part. 2. Merge this new simplejson with stdlib json for 3.3.
What you're proposing doesn't address the question of who is going to do the ongoing maintenance. Bob apparently isn't interested in maintaining stdlib code, and python-dev members aren't interested in maintaining simplejson (assuming it would be at all possible). Since both groups of people want to work on separate codebases, I don't see how sharing a single codebase would be possible.
Well, if that is not possible, then the CPython devs will have a hard time maintaining the json accelerator module in the long run. I quickly skimmed through the GitHub version in simplejson, and it truly is a complicated piece of code. Not in the sense that the code is incomprehensible - it's actually fairly straightforward string processing code - but it's so extremely optimised and tailored, and has so much code duplicated for the bytes and unicode types (apparently following the copy+paste+adapt pattern), that it will be pretty hard to adapt to future changes of CPython, especially the upcoming PEP 393 implementation. Maintaining this is clearly no fun. Stefan
On Saturday, April 16, 2011, Antoine Pitrou
Le samedi 16 avril 2011 à 17:07 +0200, Xavier Morel a écrit :
On 2011-04-16, at 16:52 , Antoine Pitrou wrote:
Le samedi 16 avril 2011 à 16:42 +0200, Dirkjan Ochtman a écrit :
On Sat, Apr 16, 2011 at 16:19, Antoine Pitrou
wrote: What you're proposing doesn't address the question of who is going to do the ongoing maintenance. Bob apparently isn't interested in maintaining stdlib code, and python-dev members aren't interested in maintaining simplejson (assuming it would be at all possible). Since both groups of people want to work on separate codebases, I don't see how sharing a single codebase would be possible.
From reading this thread, it seems to me like the proposal is that Bob maintains a simplejson for both 2.x and 3.x and that the current stdlib json is replaced by a (trivially changed) version of simplejson.
The thing is, we want to bring our own changes to the json module and its tests (and have already done so, although some have been backported to simplejson).
Depending on what those changes are, would it not be possible to apply the vast majority of them to simplejson itself?
Sure, but the thing is, I don't *think* we are interested in backporting stuff to simplejson much more than Bob is interested in porting stuff to the json module.
I've backported every useful patch (for 2.x) I noticed from json to simplejson. Would be happy to apply any that I missed if anyone can point these out.
I've contributed a couple of patches myself after they were integrated into CPython (they are part of the performance improvements Bob is talking about), but that was exceptional. Backporting a patch to another project with a different directory structure, slightly different code, etc. is tedious and not very rewarding for us Python core developers, when we could be doing other work in our limited free time.
That's exactly why I am not interested in stdlib maintenance myself, I only use 2.x and that's frozen... so I can't maintain the version we would actually use.
Also, some types of work would be tedious to backport, for example if we refactor the tests to test both the C and Python implementations.
simplejson's test suite has tested both for quite some time.
Furthermore, now that python uses Mercurial, it should be possible (or even easy) to use a versioned queue (via MQ) for the trivial adaptation, and the temporary alterations (things which will likely be merged back into simplejson but are not yet, stuff like that) should it not?
Perhaps, perhaps not. That would require someone motivated to put it in place, ensure that it doesn't get in the way, document it, etc. Honestly, I don't think maintaining a single stdlib module should require such an amount of logistics.
It certainly shouldn't, especially because neither of them changes very fast. -bob
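As an aside on Bob's earlier point that simplejson's suite has long tested both implementations: the general pattern is to run one set of checks against both the C-accelerated and the pure-Python code paths. The sketch below forces the stdlib decoder onto the pure-Python scanner; py_loads is an illustrative helper, not an actual simplejson or stdlib API.

```python
import json
import json.decoder
import json.scanner

def py_loads(s):
    # Build a decoder whose scanner is forced to the pure-Python
    # implementation, bypassing the C accelerator if it is present.
    # (Only the scanner loop is swapped; string scanning may still
    # use the C scanstring under the hood.)
    decoder = json.decoder.JSONDecoder()
    decoder.scan_once = json.scanner.py_make_scanner(decoder)
    return decoder.decode(s)

doc = '{"a": [1, 2.5, "three", null, true]}'
expected = {"a": [1, 2.5, "three", None, True]}

# run the same check against both implementations
for loads in (json.loads, py_loads):
    assert loads(doc) == expected
```

CPython's own test suite later adopted a similar scheme, importing the json package twice with the C extension alternately blocked and available.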
On Sat, 16 Apr 2011 18:04:53 +0200
Stefan Behnel
Well, if that is not possible, then the CPython devs will have a hard time maintaining the json accelerator module in the long run. I quickly skimmed through the GitHub version in simplejson, and it truly is a complicated piece of code. Not in the sense that the code is incomprehensible - it's actually fairly straightforward string processing code - but it's so extremely optimised and tailored, and has so much code duplicated for the bytes and unicode types (apparently following the copy+paste+adapt pattern), that it will be pretty hard to adapt to future changes of CPython, especially the upcoming PEP 393 implementation.
Well, first, the Python 3 version doesn't have the duplicated code since it doesn't accept bytes input. Second, it's not that complicated, and we have already brought improvements to it, meaning we know the code ("we" is at least Raymond and I). For example, see http://bugs.python.org/issue11856 for a pending patch.
Maintaining this is clearly no fun.
No more than any optimized piece of C code, but no less either. It's actually quite straightforward compared to other classes such as TextIOWrapper. PEP 393 will be a challenge for significant chunks of the interpreter and extension modules; it's not a json-specific issue. Regards Antoine.
On 2011-04-16, at 17:25 , Antoine Pitrou wrote:
Le samedi 16 avril 2011 à 17:07 +0200, Xavier Morel a écrit :
On 2011-04-16, at 16:52 , Antoine Pitrou wrote:
Le samedi 16 avril 2011 à 16:42 +0200, Dirkjan Ochtman a écrit :
On Sat, Apr 16, 2011 at 16:19, Antoine Pitrou
wrote: What you're proposing doesn't address the question of who is going to do the ongoing maintenance. Bob apparently isn't interested in maintaining stdlib code, and python-dev members aren't interested in maintaining simplejson (assuming it would be at all possible). Since both groups of people want to work on separate codebases, I don't see how sharing a single codebase would be possible.
From reading this thread, it seems to me like the proposal is that Bob maintains a simplejson for both 2.x and 3.x and that the current stdlib json is replaced by a (trivially changed) version of simplejson.
The thing is, we want to bring our own changes to the json module and its tests (and have already done so, although some have been backported to simplejson).
Depending on what those changes are, would it not be possible to apply the vast majority of them to simplejson itself?
Sure, but the thing is, I don't *think* we are interested in backporting stuff to simplejson much more than Bob is interested in porting stuff to the json module.

I was mostly thinking it could work the other way around, really: simplejson seems to move slightly faster than the stdlib's json (though it's not a high-churn module either these days), so improvements (from Python and third parties alike) could be applied there first and then forward-ported, rather than the other way around.
I've contributed a couple of patches myself after they were integrated into CPython (they are part of the performance improvements Bob is talking about), but that was exceptional. Backporting a patch to another project with a different directory structure, slightly different code, etc. is tedious and not very rewarding for us Python core developers, when we could be doing other work in our limited free time.

Sure, I can understand that, but wouldn't it be easier if the two versions were kept in better sync (mostly removing the "slightly different code" part)?
Furthermore, now that python uses Mercurial, it should be possible (or even easy) to use a versioned queue (via MQ) for the trivial adaptation, and the temporary alterations (things which will likely be merged back into simplejson but are not yet, stuff like that), should it not?

Perhaps, perhaps not. That would require someone motivated to put it in place, ensure that it doesn't get in the way, document it, etc. Honestly, I don't think maintaining a single stdlib module should require such an amount of logistics.
I don't think mercurial queues really amount to logistics: it takes a bit of learning, but fundamentally they're not much work, and they make synchronization with upstream packages much easier. That would (I believe) benefit both projects and, ultimately, language users, by avoiding too-extreme differences (on both API/features and performance). I'm thinking of a relation along the lines of Michael Foord's unittest2 (except maybe inverted, in that unittest2 is a backport of the next version's unittest).
Hi Antoine,
Antoine Pitrou
What you're proposing doesn't address the question of who is going to do the ongoing maintenance.
I agree, my suggestion is orthogonal to the question of who maintains stdlib json. But if the json module is languishing in comparison to simplejson, then bringing the code bases closer together may be worthwhile.

I've just been experimenting with the feasibility of getting simplejson running on Python 3.x, and at present I have it working in the sense of all tests passing on 3.2. Bob has said he isn't interested in Python 3, but he has also said that "if someone contributes the code to make simplejson work in Python 3 I'm willing to apply the patches and run the tests against any future changes." I take this to mean that Bob is undertaking to keep the codebase working on both 2.x and 3.x in the future (though I'm sure he'll correct me if I've got it wrong). I'm also assuming Bob will be receptive to patches for functional improvements added to stdlib json in 3.x, as his comments seem to indicate that this is the case.

ISTM that for some library maintainers who are invested in 2.x and who don't have the time or inclination to manage separate 2.x and 3.x codebases, a common codebase is the way to go. This certainly seems to be the case for pip and virtualenv, which we recently got running under Python 3 using a common-codebase approach. Certainly, the amount of work required for ongoing maintenance can be much less, and only a little discipline is needed when adding new code.

Bob made a comment in passing that simplejson (Python) is about as fast as stdlib json (C extension) on 2.x. That may or may not prove to be the case on 3.x, but at least it is now possible to run simplejson on 3.x (Python only, so far) to make a comparison.

It may be that no-one is willing or able to serve as an effective maintainer of stdlib json, but assuming that Bob will continue to maintain and improve simplejson, and if an automatic mechanism for converting from a 3.x-compatible simplejson to json can be made to work, that could be a way forward.
It's obviously early days to see how things will pan out, but it seems worth exploring the avenue a little further, if Bob is amenable to this approach. Regards, Vinay
I've contributed a couple of patches myself after they were integrated into CPython (they are part of the performance improvements Bob is talking about), but that was exceptional. Backporting a patch to another project with a different directory structure, slightly different code, etc. is tedious and not very rewarding for us Python core developers, when we could be doing other work in our limited free time. Sure, I can understand that, but wouldn't it be easier if the two versions were kept in better sync (mostly removing the "slightly different code" part)?
You are assuming that we intend to backport all our json patches to simplejson. I can't speak for other people, but I'm personally not interested in doing that work (even if you find an "easier" scheme than the current one). Also, as Raymond said, it's not much of an issue if json and simplejson diverge. Bob said he had no interest in porting simplejson to 3.x, while we don't have any interest in making non-bugfix changes to 2.x json. As long as basic functionality is identical and compliance to the spec is ensured, I think most common uses are covered by both libraries. So, unless you manage to find a scheme where porting patches is almost zero-cost (for either us or Bob), I don't think it will happen.
I'm thinking of a relation along the lines of Michael Foord's unittest2 (except maybe inverted, in that unittest2 is a backport of a next version's unittest)
Well, the big difference here is that Michael maintains both the stdlib version and the standalone project, meaning he's committed to avoid any divergence between the two codebases. Regards Antoine.
On Sat, 16 Apr 2011 16:47:49 +0000 (UTC)
Vinay Sajip
What you're proposing doesn't address the question of who is going to do the ongoing maintenance.
I agree, my suggestion is orthogonal to the question of who maintains stdlib json.
No, that's not what I'm talking about. The json module *is* maintained (take a look at "hg log"), even though it may be less active than simplejson (but simplejson doesn't receive many changes either). I am talking about maintenance of the "shared codebase" you are talking about. Mandating a single codebase between two different languages (Python 2 and Python 3) and two different libraries (json and simplejson) comes at a high maintenance cost, and it's not obvious in your proposal who will bear that cost in the long run (you?). It is not a one-time cost, but an ongoing one.
Bob has said he isn't interested in Python 3, but he has said that "if someone contributes the code to make simplejson work in Python 3 I'm willing to apply the patches run the tests against any future changes."
I can't speak for Bob, but this assumes the patches are not invasive and don't degrade performance. It's not obvious that will be the case.
Bob made a comment in passing that simplejson (Python) is about as fast as stdlib json (C extension), on 2.x.
I think Bob tested with an outdated version of the stdlib json module (2.6 or 2.7, perhaps). In my latest measurements, the 3.2 json C module is as fast as the C simplejson module, the only difference being in parsing of numbers, which is addressed in http://bugs.python.org/issue11856
That may or may not prove to be the case on 3.x, but at least it is now possible to run simplejson on 3.x (Python only, so far) to make a comparison.
Feel free to share your numbers. Regards Antoine.
I agree, my suggestion is orthogonal to the question of who maintains stdlib json. But if the json module is languishing in comparison to simplejson, then bringing the code bases closer together may be worthwhile.
Right: *if* the module is languishing. But it's not. It just diverges.
It may be that no-one is willing or able to serve as an effective maintainer of stdlib json, but assuming that Bob will continue to maintain and improve simplejson
Does it actually need improvement? Regards, Martin
Am 16.04.2011 21:13, schrieb Vinay Sajip:
Martin v. Löwis
writes: Does it actually need improvement?
I can't actually say, but I assume it keeps changing for the better - albeit slowly. I wasn't thinking of specific improvements, just the idea of continuous improvement in general...
Hmm. I cannot believe in the notion of "continuous improvement"; I'd guess that it is rather "continuous change". I can see three possible areas of improvement:

1. Bugs: if there are any, they should clearly be fixed. However, JSON is a simple format, so the implementation should be able to converge to something fairly correct quickly.

2. Performance: there is always room for performance improvements. However, I strongly recommend to not bother unless a severe bottleneck can be demonstrated.

3. API changes: people apparently want JSON to be more flexible wrt. Python types that are not directly supported in JSON. I'd rather take a conservative approach here, involving a lot of people before adding an API feature or even an incompatibility.

Regards, Martin
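For context on point 3: the extension point that both json and simplejson already provide for unsupported types is the `default` hook. A minimal sketch (the converter function and sample data here are invented for illustration):

```python
import datetime
import json

def to_jsonable(obj):
    # Called by the encoder only for objects it cannot serialise natively.
    if isinstance(obj, datetime.date):
        return obj.isoformat()
    raise TypeError("not JSON serializable: %r" % (obj,))

print(json.dumps({"released": datetime.date(2011, 2, 20)}, default=to_jsonable))
# -> {"released": "2011-02-20"}
```

Requests for "more flexibility" generally amount to building such conversions into the library itself, which is exactly where the conservative approach applies.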
Martin v. Löwis
I can see three possible areas of improvement: 1. Bugs: if there are any, they should clearly be fixed. However, JSON is a simple format, so the implementation should be able to converge to something fairly correct quickly. 2. Performance: there is always room for performance improvements. However, I strongly recommend to not bother unless a severe bottleneck can be demonstrated. 3. API changes: people apparently want JSON to be more flexible wrt. Python types that are not directly supported in JSON. I'd rather take a conservative approach here, involving a lot of people before adding an API feature or even an incompatibility.
I agree with all these points, though I was only thinking of Nos. 1 and 2. Over a longer timeframe, improvements may also come with changes in the spec (unlikely in the short and medium term, but you never know in the long term). Regards, Vinay Sajip
On 16/04/2011 22:28, "Martin v. Löwis" wrote:
Am 16.04.2011 21:13, schrieb Vinay Sajip:
Martin v. Löwis
writes: Does it actually need improvement? I can't actually say, but I assume it keeps changing for the better - albeit slowly. I wasn't thinking of specific improvements, just the idea of continuous improvement in general... Hmm. I cannot believe in the notion of "continuous improvement"; I'd guess that it is rather "continuous change".
I can see three possible areas of improvement: 1. Bugs: if there are any, they should clearly be fixed. However, JSON is a simple format, so the implementation should be able to converge to something fairly correct quickly. 2. Performance: there is always room for performance improvements. However, I strongly recommend to not bother unless a severe bottleneck can be demonstrated. Well, there was a 5x speedup demonstrated comparing simplejson to the standard library json module. That sounds *very* worth pursuing (and crazy not to pursue). I've had json serialisation be the bottleneck in web applications generating several megabytes of json for some requests.
All the best, Michael Foord
3. API changes: people apparently want JSON to be more flexible wrt. Python types that are not directly supported in JSON. I'd rather take a conservative approach here, involving a lot of people before adding an API feature or even an incompatibility.
Regards, Martin _______________________________________________ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/fuzzyman%40voidspace.org.u...
-- http://www.voidspace.org.uk/ May you do good and not evil May you find forgiveness for yourself and forgive others May you share freely, never taking more than you give. -- the sqlite blessing http://www.sqlite.org/different.html
On Sat, 16 Apr 2011 23:48:45 +0100
Michael Foord
On 16/04/2011 22:28, "Martin v. Löwis" wrote:
Am 16.04.2011 21:13, schrieb Vinay Sajip:
Martin v. Löwis
writes: Does it actually need improvement? I can't actually say, but I assume it keeps changing for the better - albeit slowly. I wasn't thinking of specific improvements, just the idea of continuous improvement in general... Hmm. I cannot believe in the notion of "continuous improvement"; I'd guess that it is rather "continuous change".
I can see three possible areas of improvement: 1. Bugs: if there are any, they should clearly be fixed. However, JSON is a simple format, so the implementation should be able to converge to something fairly correct quickly. 2. Performance: there is always room for performance improvements. However, I strongly recommend to not bother unless a severe bottleneck can be demonstrated. Well, there was a 5x speedup demonstrated comparing simplejson to the standard library json module.
No.
Well, there was a 5x speedup demonstrated comparing simplejson to the standard library json module.
Can you kindly point to that demonstration?
That sounds *very* worth pursuing (and crazy not to pursue). I've had json serialisation be the bottleneck in web applications generating several megabytes of json for some requests.
Hmm. I'd claim that the web application that needs to generate several megabytes of json for something should be redesigned. I also wonder whether the bottleneck was the *generation*, the transmission, or the processing of the data on the receiving end. Regards, Martin
Antoine Pitrou, 16.04.2011 19:27:
On Sat, 16 Apr 2011 16:47:49 +0000 (UTC) Vinay Sajip wrote:
Bob made a comment in passing that simplejson (Python) is about as fast as stdlib json (C extension), on 2.x.
I think Bob tested with an outdated version of the stdlib json module (2.6 or 2.7, perhaps). In my latest measurements, the 3.2 json C module is as fast as the C simplejson module, the only difference being in parsing of numbers, which is addressed in http://bugs.python.org/issue11856
Ok, but then, what's the purpose of having the old Python implementation in the stdlib? The other Python implementations certainly won't be happy with something that is way slower (algorithmically!) than the current version of the non-stdlib implementation. The fact that the CPython json maintainers are happy with the performance of the C implementation does not mean that the performance of the pure Python implementation can be ignored now. Note: I don't personally care about this question since Cython does not suffer from this issue anyway. This is just the general question about the relation of the C module and the Python module in the stdlib. Functional compatibility is not necessarily enough. Stefan
Antoine Pitrou
Feel free to share your numbers.
I've now got my fork working on Python 3.2 with speedups. According to a non-scientific simple test:

Python 2.7
==========
Python version: 2.7.1+ (r271:86832, Apr 11 2011, 18:05:24) [GCC 4.5.2]
11.21484375 KiB read
Timing simplejson: 0.271898984909
Timing stdlib json: 0.338716030121

Python 3.2
==========
Python version: 3.2 (r32:88445, Mar 25 2011, 19:28:28) [GCC 4.5.2]
11.21484375 KiB read
Timing simplejson: 0.3150200843811035
Timing stdlib json: 0.32146596908569336

Based on this test script: https://gist.github.com/923927 and the simplejson version here: https://github.com/vsajip/simplejson/ Regards, Vinay Sajip
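The gist linked above has the actual script; for reference, a roughly equivalent micro-benchmark can be assembled from the stdlib alone (the payload below is invented, standing in for the ~11 KiB sample used above):

```python
import json
import sys
import timeit

# Invented payload, roughly in the size range of the gist's sample.
data = {"records": [{"id": i, "name": "item-%d" % i, "tags": ["a", "b"]}
                    for i in range(200)]}
payload = json.dumps(data)

print("Python version: %s" % sys.version.split()[0])
print("%.2f KiB read" % (len(payload) / 1024.0))
print("Timing dumps: %.4f" % timeit.timeit(lambda: json.dumps(data), number=500))
print("Timing loads: %.4f" % timeit.timeit(lambda: json.loads(payload), number=500))
```

As with any such micro-benchmark, the absolute numbers depend heavily on the payload shape (string-heavy vs number-heavy, nesting depth), which is one reason results in this thread differ.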
Stefan Behnel
Well, if that is not possible, then the CPython devs will have a hard time maintaining the json accelerator module in the long run. I quickly skimmed through the github version in simplejson, and it truly is some complicated piece of code. Not in the sense that the code is incomprehensible, it's actually fairly straightforward string processing code, but it's so extremely optimised and tailored and has so much code duplicated for the bytes and unicode types (apparently following the copy+paste+adapt pattern) that it will be pretty hard to adapt to future changes of CPython, especially the upcoming PEP 393 implementation. Maintaining this is clearly no fun.
Do we even need this complexity in Python 3.x? The speedup code for 2.x is taking different, parallel paths for str and unicode types, either of which might be legitimately passed into JSON APIs in 2.x code. However, in Python 3.x, ISTM we should not be passing in bytes to JSON APIs. So there'd be no equivalent parallel paths for bytes for 3.x speedup code to worry about. Anyway, some simple numbers posted by me elsewhere on this thread show simplejson to be only around 2% faster. Talk of a 5x speedup appears to be comparing non-speeded up vs. speeded up code, in which case the comparison isn't valid. Of course, people might find other workloads which show bigger disparity in performance, or might find something in my 3.x port of simplejson which invalidates my finding of a 2% difference. Regards, Vinay Sajip
On Sun, 17 Apr 2011 09:21:32 +0200
Stefan Behnel
Antoine Pitrou, 16.04.2011 19:27:
On Sat, 16 Apr 2011 16:47:49 +0000 (UTC) Vinay Sajip wrote:
Bob made a comment in passing that simplejson (Python) is about as fast as stdlib json (C extension), on 2.x.
I think Bob tested with an outdated version of the stdlib json module (2.6 or 2.7, perhaps). In my latest measurements, the 3.2 json C module is as fast as the C simplejson module, the only difference being in parsing of numbers, which is addressed in http://bugs.python.org/issue11856
Ok, but then, what's the purpose of having the old Python implementation in the stdlib? The other Python implementations certainly won't be happy with something that is way slower (algorithmically!) than the current version of the non-stdlib implementation.
Again, I don't think it's "way slower" since the code should be almost identical (simplejson hasn't changed much in the last year). That's assuming you measure performance on 3.2 or 3.3, not something older. Besides, the primary selling point of the stdlib implementation is that... it's the stdlib implementation. You have a json serializer/deserializer by default without having to install any third-party package. For most people that's probably sufficient; people with specific needs *may* benefit from installing simplejson. Also, the pure Python paths are still used if you customize some parameters (I don't remember which ones exactly, you could take a look at e.g. Lib/json/encoder.py if you are interested). Regards Antoine.
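One such parameter, judging from the dispatch condition in Lib/json/encoder.py, is `indent`: the one-shot C encoder is only used when `indent is None`, so a pretty-printed dump goes through the pure-Python `iterencode` path. A small illustration:

```python
import json

# Eligible for the C one-shot encoder: no indent, default hooks.
compact = json.dumps({"a": 1})

# indent is not None, so this is encoded by the pure-Python path.
pretty = json.dumps({"a": 1}, indent=2)

print(compact)  # -> {"a": 1}
print(pretty)
```

The output is identical apart from whitespace; only the code path (and hence the speed) differs.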
Vinay Sajip, 17.04.2011 12:33:
Antoine Pitrou writes:
Feel free to share your numbers.
I've now got my fork working on Python 3.2 with speedups. According to a non-scientific simple test:
Python 2.7
==========
Python version: 2.7.1+ (r271:86832, Apr 11 2011, 18:05:24) [GCC 4.5.2]
11.21484375 KiB read
Timing simplejson: 0.271898984909
Timing stdlib json: 0.338716030121

Python 3.2
==========
Python version: 3.2 (r32:88445, Mar 25 2011, 19:28:28) [GCC 4.5.2]
11.21484375 KiB read
Timing simplejson: 0.3150200843811035
Timing stdlib json: 0.32146596908569336
Based on this test script:
https://gist.github.com/923927
and the simplejson version here:
Is this using the C accelerated version in both cases? What about the pure Python versions? Could you provide numbers for both? Stefan
On 17/04/2011 00:16, Antoine Pitrou wrote:
On Sat, 16 Apr 2011 23:48:45 +0100 Michael Foord
wrote: On 16/04/2011 22:28, "Martin v. Löwis" wrote:
Am 16.04.2011 21:13, schrieb Vinay Sajip:
Martin v. Löwis
writes: Does it actually need improvement? I can't actually say, but I assume it keeps changing for the better - albeit slowly. I wasn't thinking of specific improvements, just the idea of continuous improvement in general... Hmm. I cannot believe in the notion of "continuous improvement"; I'd guess that it is rather "continuous change".
I can see three possible areas of improvement: 1. Bugs: if there are any, they should clearly be fixed. However, JSON is a simple format, so the implementation should be able to converge to something fairly correct quickly. 2. Performance: there is always room for performance improvements. However, I strongly recommend to not bother unless a severe bottleneck can be demonstrated. Well, there was a 5x speedup demonstrated comparing simplejson to the standard library json module. No.
Yes.
On 17/04/2011 07:28, "Martin v. Löwis" wrote:
Well, there was a 5x speedup demonstrated comparing simplejson to the standard library json module. Can you kindly point to that demonstration?
Hmm... according to a later email in this thread it is 350ms vs 250ms for an 11kb sample. That's a nice speedup but not a 5x one. Bob Ippolito did claim that simplejson was faster than json for real world workloads and I see no reason not to believe him. :-)
That sounds *very* worth pursuing (and crazy not to pursue). I've had json serialisation be the bottleneck in web applications generating several megabytes of json for some requests. Hmm. I'd claim that the web application that needs to generate several megabytes of json for something should be redesigned.
It was displaying (including sorting) large amounts of information in tables through a web UI. The customer wanted all the information available in the tables, so all the data needed to be sent. We did filtering on the server side where possible to minimize the data sent, but it was ~10 MB for many of the queries. We also cached the data on the client and only updated as needed. We could have "redesigned" the customer requirements I suppose...
I also wonder whether the bottleneck was the *generation*,
The bottleneck was generation. I benchmarked and optimised. (We were using simplejson but I trimmed down the data sent to the absolute minimum needed by the client app rather than merely serialising all the source data from the django model objects - I didn't optimise within simplejson itself...)
the transmission, or the processing of the data on the receiving end.
Processing was done in IronPython in Silverlight using the .NET de-serialization APIs which were dramatically faster than the Python handling on the other side. All the best, Michael
Regards, Martin
On 17/04/2011 17:05, Michael Foord wrote:
On 17/04/2011 00:16, Antoine Pitrou wrote:
On Sat, 16 Apr 2011 23:48:45 +0100 Michael Foord
wrote: On 16/04/2011 22:28, "Martin v. Löwis" wrote:
Am 16.04.2011 21:13, schrieb Vinay Sajip:
Martin v. Löwis
writes: Does it actually need improvement? I can't actually say, but I assume it keeps changing for the better - albeit slowly. I wasn't thinking of specific improvements, just the idea of continuous improvement in general... Hmm. I cannot believe in the notion of "continuous improvement"; I'd guess that it is rather "continuous change".
I can see three possible areas of improvement: 1. Bugs: if there are any, they should clearly be fixed. However, JSON is a simple format, so the implementation should be able to converge to something fairly correct quickly. 2. Performance: there is always room for performance improvements. However, I strongly recommend to not bother unless a severe bottleneck can be demonstrated. Well, there was a 5x speedup demonstrated comparing simplejson to the standard library json module. No.
Yes.
Well, maybe not. :-)
On Sun, 17 Apr 2011 17:09:17 +0100
Michael Foord
On 17/04/2011 07:28, "Martin v. Löwis" wrote:
Well, there was a 5x speedup demonstrated comparing simplejson to the standard library json module. Can you kindly point to that demonstration?
Hmm... according to a later email in this thread it is 350ms vs 250ms for an 11kb sample. That's a nice speedup but not a 5x one.
That speedup is actually because of a slowdown in py3k, which should be solved with http://bugs.python.org/issue11856 Regards Antoine.
Stefan Behnel
Is this using the C accelerated version in both cases? What about the pure Python versions? Could you provide numbers for both?
What I posted earlier were C-accelerated timings. I'm not sure exactly how to turn off the speedups for stdlib json. With some assumptions, as listed in this script: https://gist.github.com/924626 I get timings like this:

Python version: 3.2 (r32:88445, Mar 25 2011, 19:28:28) [GCC 4.5.2]
11.21484375 KiB read
Timing simplejson (with speedups): 0.31562185287475586
Timing stdlib json (with speedups): 0.31923389434814453
Timing simplejson (without speedups): 4.586531162261963
Timing stdlib json (without speedups): 2.5293829441070557

It's quite likely that I've failed to turn off the stdlib json speedups (though I attempted to turn them off for both encoding and decoding), which would explain the big disparity in the non-speedup case. Perhaps someone with more familiarity with stdlib json speedup internals could take a look to see what I've missed? I perhaps can't see the forest for the trees. Regards, Vinay Sajip
On Mon, Apr 18, 2011 at 10:19 AM, Vinay Sajip
It's quite likely that I've failed to turn off the stdlib json speedups (though I attempted to turn them off for both encoding and decoding), which would explain the big disparity in the non-speedup case. Perhaps someone with more familiarity with stdlib json speedup internals could take a look to see what I've missed? I perhaps can't see the forest for the trees.
Consider trying:

    import sys
    sys.modules["_json"] = 0  # Block the C extension
    import json

in a fresh interpreter. (This is the same dance test.support.import_fresh_module() uses internally to get unaccelerated modules for testing purposes) Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
Nick Coghlan
Consider trying:
import sys
sys.modules["_json"] = 0  # Block the C extension
import json
in a fresh interpreter.
Thanks for the tip. The revised script at https://gist.github.com/924626 shows more believable numbers vis-à-vis the no-speedups case. Interestingly this morning, stdlib json wins in both cases, though undoubtedly YMMV.

---------------------------------------------------------------------------
(jst3)vinay@eta-natty:~/projects/scratch$ python time_json.py --no-speedups
Python version: 3.2 (r32:88445, Mar 25 2011, 19:28:28) [GCC 4.5.2]
11.21484375 KiB read
Timing simplejson (without speedups): 4.585145950317383
Timing stdlib json (without speedups): 3.9949100017547607
(jst3)vinay@eta-natty:~/projects/scratch$ python time_json.py
Python version: 3.2 (r32:88445, Mar 25 2011, 19:28:28) [GCC 4.5.2]
11.21484375 KiB read
Timing simplejson (with speedups): 0.3202629089355469
Timing stdlib json (with speedups): 0.3200039863586426
---------------------------------------------------------------------------

Regards, Vinay Sajip
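Nick's blocking trick can also serve as a quick self-check that the pure-Python path really is in use. A sketch (this must run in a fresh interpreter, before json has been imported; None is used here instead of 0, with the same effect of making the import fail):

```python
import sys

# Block the C accelerator before json is first imported; a non-module
# value in sys.modules makes "from _json import ..." raise ImportError,
# so json falls back to its pure-Python scanner and encoder.
sys.modules["_json"] = None
import json

# The C hook is absent when _json is blocked (in a fresh interpreter).
print(json.encoder.c_make_encoder)
print(json.loads('{"a": 1}'))  # pure-Python decoding still works
```

Checking `json.encoder.c_make_encoder` (it is set to None when the C import fails) confirms which implementation a benchmark is actually exercising.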
participants (12)
- "Martin v. Löwis"
- Antoine Pitrou
- Bob Ippolito
- Dirkjan Ochtman
- Michael Foord
- Nick Coghlan
- Raymond Hettinger
- Sandro Tosi
- Stefan Behnel
- Vinay Sajip
- Xavier Morel
- Xavier Morel