I had originally planned to post the proposal on bpo, but things turned out unexpectedly (for me), so I am returning here.

I wrote the patch (for the Python part). If anyone is interested it is here:

The patch follows the original idea of serializing the custom type to JSON, and I believe it is "as simple as it gets" except for the JSON number validity check, which turned out to be problematic.
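For context, here is a small sketch of what the unpatched stdlib json does with a Decimal today, which is why the custom serialization is needed in the first place. It uses only existing stdlib calls; the variable names are just for illustration.

```python
import json
from decimal import Decimal

d = Decimal('1.000000000000000001')

# Without help, the stdlib encoder rejects Decimal outright.
try:
    json.dumps([d])
    error = None
except TypeError as exc:
    error = str(exc)

# Workaround 1: default=str keeps every digit, but emits a JSON string,
# not a JSON number.
as_string = json.dumps([d], default=str)

# Workaround 2: default=float emits a JSON number, but the value is
# rounded to the nearest binary double first.
as_float = json.dumps([d], default=float)

print(error)
print(as_string)
print(as_float)
```

Neither workaround gives a JSON number with the full precision, which is the gap the patch is trying to close.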

I ran some timeit benchmarks on my code and compared it to simplejson. The test I ran was:

py -m timeit -s "import simplejson as sjson; from decimal import Decimal; d=[Decimal('1.000000000000000001')]*10000" "sjson.dumps(d)"

(my code)
py -m timeit -s "import json; from decimal import Decimal; d=[Decimal('1.000000000000000001')]*10000" "json.dumps(d, dump_as_number=Decimal)"

Since my code runs in pure Python only, I disabled the C lib in simplejson too. Here are the results:
simplejson - with C code: 50 loops, best of 5: 5.89 msec per loop
simplejson - pure Python: 20 loops, best of 5: 10.5 msec per loop
json_patch (regex check): 10 loops, best of 5: 21.3 msec per loop
json_patch (float check): 20 loops, best of 5: 15.1 msec per loop
json_patch (no check): 50 loops, best of 5: 9.75 msec per loop

The different "checks" refer to different _check_json_num implementations (included in the code). "float check" is used just as an example of something readily available (and possibly faster), but I guess there could be inputs which float accepts but which are not valid JSON numbers.
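To illustrate the difference between the two kinds of check, here is a minimal sketch. It is not the _check_json_num code from the patch: the names regex_check and float_check are hypothetical, and the regex is my transcription of the number grammar in RFC 8259.

```python
import re

# JSON number per RFC 8259: optional minus, integer part without leading
# zeros, optional fraction, optional exponent.
_JSON_NUMBER = re.compile(r'-?(?:0|[1-9][0-9]*)(?:\.[0-9]+)?(?:[eE][+-]?[0-9]+)?')


def regex_check(text):
    """Hypothetical strict check: the whole string must match the JSON grammar."""
    return _JSON_NUMBER.fullmatch(text) is not None


def float_check(text):
    """Hypothetical loose check: anything float() can parse is accepted."""
    try:
        float(text)
    except ValueError:
        return False
    return True


# str(Decimal('NaN')) and str(Decimal('Infinity')) produce the second and
# third candidates, so these are realistic outputs of a Decimal.
candidates = ["1.000000000000000001", "NaN", "Infinity", "1_000", " 1.5 ", "+1", ".5", "1E+3"]
for text in candidates:
    print(repr(text), "regex:", regex_check(text), "float:", float_check(text))
```

In this sketch float() accepts every candidate, while the regex accepts only the first and the last, so a float-based check would let NaN, Infinity, underscores, surrounding whitespace, a leading plus sign and a bare leading dot through.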

The JSON validity check turned out to be the cause of the performance hit. simplejson does not do any validity check on Decimal output, so its performance is on par with "no check" (I guess it is a tad slower because it implements and handles more features in the encoder loop).

I previously argued with Paul that making an assumption about the validity of an object's output based on its type is not safe (which I still hold). However, making it safe in this particular case incurs a performance hit I cannot accept. To put it differently: if I had to choose between stdlib json and simplejson, knowing that the stdlib runs 50-100% slower (but safe), I would choose simplejson.

From the previous discussion here I also understood that letting the custom type serialize without the validity check is unacceptable for some. Since I am basically indifferent in this matter, I would not argue about it either.

Which leaves me with only one possible outcome (which seems to be acceptable): porting the Decimal handling from simplejson to the stdlib. The first point against it is that simplejson already has it, so if I need it, I can use simplejson. Beyond that, whoever pulled the simplejson code into the stdlib either made a deliberate effort to remove this particular functionality (if it was present at the time) or never considered it worth adding (once it was added to simplejson).

The second point is that, looking at the code in the stdlib and in simplejson, it is clear that simplejson has more features (and also seems to be more actively maintained) than the stdlib code, so importing one particular feature into the stdlib just to make it "less inferior", without any additional benefit, seems like a waste of time.

Why simplejson remained separate from the main CPython is also a question (I guess there was/is a reason), because including the code completely and maintaining it inside CPython seems like it could be a better use of resources.