I've done this experiment.  The cost was about 12% on my box.  Later, once I
had everything else set up so I could run two threads simultaneously, I
found much worse costs.  All those literals become shared objects that
create contention.

I'm now working on an approach that writes out refcounts in batches to
reduce contention.  The initial cost is much higher, but it scales
better too.  I've currently got it down to just under 50% cost, meaning
two threads are a slight net gain.
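
To make the batching idea concrete, here is a minimal sketch.  It is
written in Python for readability, not the C the interpreter itself
would need, and it is not the actual patch.  SharedObject,
RefcountBatch and the limit of 64 are made-up names and values for
illustration.  Each thread buffers its own refcount deltas and only
touches the shared count when the buffer fills, so paired
increments and decrements cancel out locally.

```python
import threading
from collections import defaultdict


class SharedObject:
    """Stand-in for an object with a shared refcount (hypothetical)."""

    def __init__(self):
        self.refcount = 1
        # The lock models the cost of a contended atomic update.
        self.lock = threading.Lock()


class RefcountBatch:
    """Per-thread buffer of pending refcount deltas (hypothetical)."""

    def __init__(self, limit=64):
        self.limit = limit               # assumed batch size
        self.pending = defaultdict(int)  # object -> net unwritten delta
        self.ops = 0

    def incref(self, obj):
        self._record(obj, +1)

    def decref(self, obj):
        self._record(obj, -1)

    def _record(self, obj, delta):
        # Thread-local bookkeeping only: no shared memory is touched here.
        self.pending[obj] += delta
        self.ops += 1
        if self.ops >= self.limit:
            self.flush()

    def flush(self):
        # Write out one net delta per object; skip objects whose
        # increments and decrements cancelled within the batch.
        for obj, delta in self.pending.items():
            if delta:
                with obj.lock:
                    obj.refcount += delta
        self.pending.clear()
        self.ops = 0


def worker(obj, n):
    batch = RefcountBatch()
    for _ in range(n):
        batch.incref(obj)   # e.g. pushing a literal on the stack
        batch.decref(obj)   # and popping it again
    batch.incref(obj)       # this thread keeps one real reference
    batch.flush()


def run_demo(n=1000):
    literal = SharedObject()
    threads = [threading.Thread(target=worker, args=(literal, n))
               for _ in range(2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return literal.refcount
```

In this sketch the bookkeeping on every incref/decref is the higher
initial cost, and the rare shared write is what lets it scale.  A
real version would also have to cope with the shared count being
stale between flushes, for example before deciding an object can be
freed.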

