[YLUG] Why is this so much faster on an Opteron?

Mon Sep 24 16:43:37 BST 2007

On Mon, 2007-09-24 at 15:20 +0100, Robert Hulme wrote:
> I have this code:
> 
> #include <stdio.h>
> 
> int main(char **argv,int argc) {
>  double f=0;
>  int i=0;
>  int j=0;
>  int c=1;
>  for(i=0;i<10000;i++)
>  for(j=0;j<10000;j++,c++)
>    f=f+0.505/(j+i/(1.0+j+i));
>  return (int)f%10;
> }
> 
> If I compile it with -O2 it takes about 40 seconds to run on my
> Pentium D (dual core pentium 4) desktop, but only about 2 seconds
> (when using time) to run on a 2.2Ghz Opteron.
> 
> Why is there such an enormous difference?

You sort of touched on the problem when you discovered that using
SSE/SIMD instructions masks the issue.

P4 CPUs are notoriously bad at handling infinities, as well as NaNs and
denormals.  Usually this isn't a huge problem, but on the first
iteration i == j == 0, so the division becomes "0.505/(0 + 0/1.0)", i.e.
0.505/0, which is infinity.  Once f == infinity, every later
"f = f + ..." has an infinite operand, so all the remaining iterations
pay the penalty as well.
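
If you want to check the infinity for yourself, here is a minimal
sketch.  The helper names term() and stays_infinite() are mine, not
from your program; term() is just the expression from your inner loop.

```c
#include <math.h>
#include <stdio.h>

/* One term of the sum, exactly as written in the original inner loop. */
double term(int i, int j) {
    return 0.505 / (j + i / (1.0 + j + i));
}

/* Hypothetical helper: start from the i == j == 0 term and add n more
   terms of the i == 0 row.  Returns 1 if f is still infinite afterwards. */
int stays_infinite(int n) {
    int j;
    double f = term(0, 0);          /* 0.505 / 0.0, i.e. +infinity */
    for (j = 1; j <= n; j++)
        f = f + term(0, j);         /* infinity + finite stays infinity */
    return isinf(f) != 0;
}
```

Calling stays_infinite(100) from main and printing the result should
give 1 on any machine with IEEE 754 doubles.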

Try starting i and j at 1, and see how much faster it runs.

SSE instructions, on the other hand, incur almost no penalty for
handling these values.
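
To compare the two cases, something along these lines should do.  It
is only a sketch: run_sum() and time_sum() are names I made up, and the
loop is yours with the start index and size turned into parameters.

```c
#include <stdio.h>
#include <time.h>

/* Same sum as the original program, with a configurable start index
   and loop bound (hypothetical helper). */
double run_sum(int start, int n) {
    int i, j;
    double f = 0;
    for (i = start; i < n; i++)
        for (j = start; j < n; j++)
            f = f + 0.505 / (j + i / (1.0 + j + i));
    return f;
}

/* CPU seconds spent in run_sum(start, n).  The sum is stored in *out
   so the compiler cannot throw the loop away. */
double time_sum(int start, int n, double *out) {
    clock_t t0 = clock();
    *out = run_sum(start, n);
    return (double)(clock() - t0) / CLOCKS_PER_SEC;
}
```

Call time_sum(0, 10000, &f) and time_sum(1, 10000, &f) from main and
print both times.  Assuming GCC on 32-bit x86, building once with
"-O2 -mfpmath=387" and once with "-O2 -msse2 -mfpmath=sse" lets you
compare the x87 and SSE code paths on the same machine.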

See http://www.cygnus-software.com/papers/x86andinfinity.html for more
details.

Gavin