Some old performance tricks

The documentation for the iPhone's VFP coprocessor says that all floating-point instructions take 1 cycle, apart from divide and square root, which take 15 cycles.

This leads to a very obvious optimisation: cache the result of a divide and multiply by the reciprocal instead.

In other words, if you have code like this:

float fl = 1.0007f;
for (int i = 0; i < 16; i++)
    matrix.Set(i, matrix[i] / fl);   // one divide per element

you can increase performance by doing the divide once, outside the loop, and multiplying instead:

float fl = 1.0f / 1.0007f;
for (int i = 0; i < 16; i++)
    matrix.Set(i, matrix[i] * fl);   // multiply by the cached reciprocal

I measured this and found that it does indeed make a big improvement.
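
For what it's worth, the kind of measurement I mean can be sketched with a plain float array and clock() – the array, the iteration count and the names here are only illustrative, not the exact code I timed:

#include <stdio.h>
#include <time.h>

#define N 16
#define ITERATIONS 1000000

int main(void)
{
    float matrix[N];
    float fl = 1.0007f;
    float inv = 1.0f / fl;
    float sum = 0.0f;              /* printed at the end so the loops aren't optimised away */
    int i, j;

    for (i = 0; i < N; i++)
        matrix[i] = (float)(i + 1);

    /* one divide per element, every iteration */
    clock_t t0 = clock();
    for (j = 0; j < ITERATIONS; j++)
        for (i = 0; i < N; i++)
            sum += matrix[i] / fl;
    clock_t t1 = clock();

    /* multiply by the cached reciprocal instead */
    for (j = 0; j < ITERATIONS; j++)
        for (i = 0; i < N; i++)
            sum += matrix[i] * inv;
    clock_t t2 = clock();

    printf("divide:   %ld ticks\n", (long)(t1 - t0));
    printf("multiply: %ld ticks\n", (long)(t2 - t1));
    printf("checksum: %f\n", sum);
    return 0;
}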

Also, I thought it would be interesting to try out the old Quake inverse square root trick.

float InvSqrt(float x){
   float xhalf = 0.5f * x;
   int i = *(int*)&x; // store floating-point bits in integer
   i = 0x5f3759d5 - (i >> 1); // initial guess for Newton's method
   x = *(float*)&i; // convert new bits into float
   x = x*(1.5f - xhalf*x*x); // One round of Newton's method
   return x;
}

This code works on the iPhone, but the results are less accurate.

float fl = 2.0f;
result = 1.f / sqrt(fl);   // gives 0.707106769
result = InvSqrt(fl);      // gives 0.706930041

The actual value should be 0.70710678.

I measured the performance and found that it’s a touch faster, but not much.
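
One thing to note about InvSqrt: the pointer casts technically break C's strict-aliasing rules. A memcpy-based version of the same bit trick (a sketch, using the same magic constant as above) is well-defined and usually compiles to identical code, and a second Newton step recovers most of the lost accuracy at the cost of a few extra multiplies:

#include <string.h>

float InvSqrt2(float x)
{
    float xhalf = 0.5f * x;
    unsigned int i;
    memcpy(&i, &x, sizeof i);          /* grab the float's bit pattern */
    i = 0x5f3759d5 - (i >> 1);         /* initial guess, same constant as above */
    memcpy(&x, &i, sizeof x);          /* bits back into a float */
    x = x * (1.5f - xhalf * x * x);    /* first Newton step */
    x = x * (1.5f - xhalf * x * x);    /* optional second step for better accuracy */
    return x;
}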


7 Responses to Some old performance tricks

  1. Andy says:

    This is not quite true… It may take 1 cycle to issue a floating-point operation, but there is a latency of several cycles before the result is ready, so the next instruction stalls if it relies on the previous operation. For example:

    fadds s0,s0,s1
    fadds s0,s0,s2

    The pipeline will stall for 8 cycles, less one for forwarding, so 7 cycles are wasted. The trick is to put useful, independent instructions between the two.

    This is just one example; I recommend reading the VFP manual from the ARM site, because there are even more rules for getting the best from the pipeline.
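
    As a rough C-level sketch of the same idea (the compiler is free to reschedule this, so it is only illustrative – in hand-written assembly you place the independent instructions yourself):

    float SumSix(float a, float b, float c, float d, float e, float f)
    {
        float s0 = a + b;        // first add issues
        float s1 = c + d;        // independent add can issue while s0 is still in flight
        float s2 = e + f;        // more independent work hides the latency
        return (s0 + s1) + s2;   // dependent adds happen once the inputs are ready
    }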

  2. Andy says:

    Most of the time. The trick is to use ideas like splitting the register bank in two for loops – so you load one half of the registers while you calculate with the other half.

    If you are calculating something like dot products and have a few of them, interleave them like this:

    fmuls s0,s8,s8
    fmuls s1,s12,s12
    fmuls s2,s16,s16
    fmuls s3,s20,s20
    // Stalls here for 7-3 = 4 cycles… Put some other code here
    fmacs s0,s9,s9
    fmacs s1,s13,s13
    fmacs s2,s17,s17
    fmacs s3,s21,s21
    // Stalls here for 7-3 = 4 cycles… Put some other code here
    fmacs s0,s10,s10
    fmacs s1,s14,s14
    fmacs s2,s18,s18
    fmacs s3,s22,s22
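
    In C the same interleaving looks roughly like this – four independent accumulators, so no instruction needs the result of the one immediately before it (whether the compiler keeps this schedule is another matter):

    void SquaredLengths(const float v[4][3], float out[4])
    {
        float a0 = v[0][0] * v[0][0];   // like the four fmuls above
        float a1 = v[1][0] * v[1][0];
        float a2 = v[2][0] * v[2][0];
        float a3 = v[3][0] * v[3][0];

        a0 += v[0][1] * v[0][1];        // like the first block of fmacs
        a1 += v[1][1] * v[1][1];
        a2 += v[2][1] * v[2][1];
        a3 += v[3][1] * v[3][1];

        a0 += v[0][2] * v[0][2];        // like the second block of fmacs
        a1 += v[1][2] * v[1][2];
        a2 += v[2][2] * v[2][2];
        a3 += v[3][2] * v[3][2];

        out[0] = a0; out[1] = a1; out[2] = a2; out[3] = a3;
    }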

    All the rules are in the ARM FPU manual on the ARM site; there are loads more things to consider.

    • lefty3 says:

      Is it possible that Apple's C++ compiler could be re-ordering the instructions automatically to take advantage of the FPU?

  3. Andy says:

    Unfortunately, no… If you look at the assembly generated by the GCCE compiler you will find that it produces pretty poor code. It does not use the vector operations of the VFP. The only way to get the best performance is to write the VFP assembly code yourself.

    Also, if you are writing C code for the VFP, make sure you turn off the Thumb option in the project settings; otherwise every floating-point operation will call out to an ARM function.

    I also tried the Carmack method for InvSqrt, but realized that if you pipeline the square root and divide you get quicker results from the VFP's own square root and divide instructions. Like this:

    fsqrts s0,s0
    // Do something for 19 cycles
    fdivs s0,s11,s0
    // Do something for 19 cycles then collect the data

    You can do other floating-point work (as long as it isn't another divide or square root) or integer work to burn off the cycles in those "do something" gaps, as integer instructions run in a different pipeline.
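
    At the C level the idea is roughly this (Normalise and the counter are made-up names for illustration; in practice you would do this in assembly so the scheduling is under your control):

    #include <math.h>

    void Normalise(float v[3], int *counter)
    {
        float len = sqrtf(v[0]*v[0] + v[1]*v[1] + v[2]*v[2]);  // fsqrts issued here
        *counter += 1;              // independent integer work can overlap the sqrt latency
        float inv = 1.0f / len;     // fdivs; more independent work could overlap this too
        v[0] *= inv;
        v[1] *= inv;
        v[2] *= inv;
    }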

  4. jcak77 says:

    If you remove the refinement step from the above formula, you still get a maximum relative error of 3.4%, which is good enough for normalizing a vector, and gcc generates the following code:

    ldr r3, x
    sub r0, r3, r0, asr #1

    x:
    .long 1597463007

    That’s a total of 4 cycles if I’m not mistaken.
    Now, I can hardly see what I could do during 38 cycles when I'm normalizing my vectors :-/
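
    In C, the estimate-only version is just the bit trick with no Newton step (a sketch – 1597463007 is 0x5F3759DF, and with -fstrict-aliasing the memcpy form is the safe way to move the bits around):

    #include <string.h>

    float InvSqrtEstimate(float x)
    {
        unsigned int i;
        memcpy(&i, &x, sizeof i);      /* float bits; with softfp they are already in r0 */
        i = 0x5f3759df - (i >> 1);     /* the .long constant above */
        memcpy(&x, &i, sizeof x);
        return x;                      /* estimate only, ~3.4% max relative error */
    }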

  5. jcak77 says:

    Oh, and I measured in actual code that the version above is only 2.2 times faster than the code generated from a plain 1.f/sqrtf(f), which tells me GCC does a pretty decent job.

    Options used: -O3 -fomit-frame-pointer -fstrict-aliasing -marm -march=armv6 -mcpu=arm1176jzf-s -mfloat-abi=softfp -mfpu=vfp
