On Sep 28, 2014, at 13:17, Sturla Molden <sturla.molden@gmail.com> wrote:

Andrew Barnert <abarnert@yahoo.com.dmarc.invalid> wrote:

On what modern CPU does unlikely have any effect at all? x86 has an
opcode to provide static branch prediction hints, but it's been a no-op
since Core 2; ARM doesn't have one; I don't know about other instruction
sets but I'd be surprised if they did.

http://madalanarayana.wordpress.com/2013/08/29/__builtin_expect-a-must-for-stack-developers/

The example in this post shows the exact opposite of what it purports to: the generated code puts the unlikely i++ operation immediately after the conditional branch. Because Haswell processors assume, in the absence of any information, that forward branches are unlikely, this will cause the wrong branch to be speculatively executed. In other words, gcc has completely ignored the __builtin_expect here--as it often does.
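For anyone who hasn't used it, here is a minimal sketch of how the hint is usually written. The likely/unlikely macro names follow common convention, and count_negative is a made-up function for illustration; neither comes from the linked post. Whether gcc actually lays out the code differently because of the hint is exactly what's in question, so treat this as a way to check rather than a guarantee.

```c
#include <stddef.h>

/* Conventional wrappers around GCC/Clang's __builtin_expect.
 * !!(x) normalizes any truthy value to 1 so it can equal the expected value. */
#if defined(__GNUC__)
#  define likely(x)   __builtin_expect(!!(x), 1)
#  define unlikely(x) __builtin_expect(!!(x), 0)
#else
#  define likely(x)   (x)
#  define unlikely(x) (x)
#endif

/* Hypothetical example: count negative entries, which we *guess* are rare.
 * The hint only expresses that guess; the compiler is free to ignore it. */
size_t count_negative(const int *values, size_t n)
{
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        if (unlikely(values[i] < 0)) {
            count++;   /* the path we claim is cold */
        }
    }
    return count;
}
```

Compiling this with and without the macro using gcc -O2 -S and diffing the assembly shows whether the hint moved the count++ path at all for a given compiler version and target.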

Also note the comment in the quoted source:

In general, you should prefer to use
     actual profile feedback for this (`-fprofile-arcs'), as
     programmers are notoriously bad at predicting how their programs
     actually perform

http://benyossef.com/helping-the-compiler-help-you/

This one vaguely waves its hands at the idea without providing any examples, before concluding:

It should be noted that GCC also provide a run time parameter -fprofile-arcs, which can profile the code for the actual statistics for each branch and the use of it should be prefered above guessing.

Meanwhile, this whole thing started with you saying that branch prediction means we can add conditional checks "with impunity". The exact opposite is true. On older processors, we _could_ issue checks with impunity; branch prediction means a mispredicted check is now an order of magnitude more expensive than it used to be unless we're very careful. The ability to hint the CPU by rearranging code (whether manually, with __builtin_expect, or using PGO) partly mitigated this effect, but it doesn't reverse it.

Which means, exactly as I said at the start, that the check for non-heap is not free. Unnecessary refcounts are also not free. Which one is more costly? Is either one costly enough to matter? Hell if I know; that's the kind of thing you pretty much have to test. Trying to reason it from first principles is hard enough even if you get all the principles right, but even harder if you're thinking in terms of P4 chips.
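To be concrete about what "test" might look like, here is a rough sketch of a timing harness. It does not measure the non-heap check or the refcount traffic in CPython; it's a generic stand-in that runs the same data-dependent check over shuffled data and over sorted data, so the only thing that changes is how predictable the branch is. All of the function names are hypothetical, and an optimizer may well turn the branch into a conditional move or vector code, in which case the two timings converge and you have to look at the generated assembly.

```c
#include <stddef.h>
#include <stdlib.h>
#include <time.h>

/* Sum the elements that pass a data-dependent check (the branch under test). */
long sum_at_least(const unsigned char *data, size_t n, unsigned char threshold)
{
    long total = 0;
    for (size_t i = 0; i < n; i++) {
        if (data[i] >= threshold) {
            total += data[i];
        }
    }
    return total;
}

static int cmp_uchar(const void *a, const void *b)
{
    return (int)*(const unsigned char *)a - (int)*(const unsigned char *)b;
}

/* Fill with pseudo-random bytes; a fixed seed keeps runs comparable. */
void fill_random(unsigned char *data, size_t n, unsigned seed)
{
    srand(seed);
    for (size_t i = 0; i < n; i++) {
        data[i] = (unsigned char)(rand() % 256);
    }
}

/* Sort in place so the check's outcome becomes one long predictable run. */
void make_predictable(unsigned char *data, size_t n)
{
    qsort(data, n, sizeof data[0], cmp_uchar);
}

/* CPU seconds spent on `reps` passes over the data. The accumulated sum is
 * written to *out so the work is observable and harder to optimize away. */
double time_passes(const unsigned char *data, size_t n, int reps, long *out)
{
    clock_t start = clock();
    long total = 0;
    for (int r = 0; r < reps; r++) {
        total += sum_at_least(data, n, 128);
    }
    *out = total;
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

The idea is to call fill_random on a large buffer, time it with time_passes, call make_predictable, and time it again. The sums must come out identical, so any difference in time is down to the branch and not the work. The numbers only mean something for the specific CPU, compiler, and flags you ran with.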