I'm definitely interested and willing to clean up + contribute our benchmarks.
On a side note, I'm a bit skeptical that there can be a single benchmark suite that satisfies everyone. I would imagine that there will still be projects with specific use-cases they prioritize (such as Pyston with webserver workloads), or that have some idea that their users will be "non-representative" in some way. One example of that is the emphasis on warmup vs steady-state performance, which can be reflected in different measurement methodologies -- I don't think there's a single right answer to the question "how much does warmup matter".
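To make the warmup point a bit more concrete, here's a rough sketch of how the same set of timings can be reported two ways. The function names, the toy workload, and the cutoff of 5 warmup iterations are all made up for illustration; they aren't taken from any existing benchmark suite.

```python
import time


def time_iterations(workload, iterations=20):
    """Run the workload repeatedly; return per-iteration wall-clock times in seconds."""
    timings = []
    for _ in range(iterations):
        start = time.perf_counter()
        workload()
        timings.append(time.perf_counter() - start)
    return timings


def summarize(timings, warmup_iterations=5):
    """Report the same timings two ways: including and excluding warmup.

    warmup_iterations is an arbitrary, hypothetical cutoff; picking it is
    exactly the methodology question that different projects answer differently.
    """
    steady = timings[warmup_iterations:]
    return {
        "mean_including_warmup": sum(timings) / len(timings),
        "mean_steady_state": sum(steady) / len(steady),
    }


def example_workload():
    # Toy stand-in for a real benchmark body.
    sum(i * i for i in range(10_000))
```

Something like `summarize(time_iterations(example_workload))` gives both numbers from one run, and a project that cares about short-lived processes would look at the first while a long-running server would look at the second.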
But anyway, I'm still definitely +1 on the idea of merging all the benchmarks together, and I think that will be better than the current situation. I'm imagining that we can at least have a common language for discussing these things ("Pyston prefers to use the flags `--webserver --include-warmup`"). I also see quite a few blog posts / academic papers on Python performance that seem to get led astray by the confusing benchmark situation, and I think having a blessed set of benchmarks (even if different people use them in different ways) would still be a huge step forward.
kmod