The documentation for the iphone’s VFP processor says that all floating point instructions take 1 cycle, apart from divide and square root, which take 15 cycles.
This leads to a very obvious optimisation, to cache your divide operations.
In other words if you have code like this:
float fl = 1.0007f;
for (i = 0; i < 16; i++)
matrix.Set(i, matrix[i]/fl);
you can increase performance by doing the divide outside the loop:
float fl = 1.0f/1.0007f;
for (i = 0; i < 16; i++)
matrix.Set(i, matrix[i]*fl);
I measured this and found that it does indeed make a big improvement.
Also, I thought it would be interesting to try out the old Quake inverse square root trick.
float InvSqrt(float x){
float xhalf = 0.5f * x;
int i = *(int*)&x; // store floating-point bits in integer
i = 0x5f3759d5 - (i >> 1); // initial guess for Newton's method
x = *(float*)&i; // convert new bits into float
x = x*(1.5f - xhalf*x*x); // One round of Newton's method
return x;
}
This code works on the iPhone, but the results are less accurate.
float fl = 2.0f;
result = 1.f / sqrt ( fl ); // gives 0.707106769
result = InvSqrt(fl); // gives 0.706930041
the actual value should be 0.70710678
I measured the performance and found that it’s a touch faster, but not much.
February 18, 2009 at 1:05 pm |
This is not quite true… Yes it may take 1 cycle to issue floating point operations but there is a throughput of cycles to wait if the next instruction relies on the previous operation. For example:
fadds s0,s0,s1
fadds s0,s0,s2
The pipeline states that it will stall for 8 cycles less one for forwarding; so 7 cycles will be wasted. The trick is to put useful instructions between the above example.
This is just an example but I recommend reading the VFP manual from the arm site because there are even more rules to get the best from the pipeline.
February 19, 2009 at 9:36 am |
Thnaks for the info Andy. So, do you know how often it stalls in practice?
February 20, 2009 at 5:22 am |
Most of the time. The trick is to do smart ideas like split the reg banks into two for loops – so that you load one half of the regs and calculate with the other half.
If you calculate something like dot products and have a few, do it like this:
fmuls s0,s8,s8
fmuls s1,s12,s12
fmuls s2,s16,s16
fmuls s3,s20,s20
// Stalls here for 7-3 = 4 cycles… Put some other code here
fmacs s0,s9,s9
fmacs s1,s13,s13
fmacs s2,s17,s17
fmacs s3,s21,s21
// Stalls here for 7-3 = 4 cycles… Put some other code here
fmacs s0,s10,s10
fmacs s1,s14,s14
fmacs s2,s18,s18
fmacs s3,s22,s22
All the rules are in the Arm FPU manual on the arm site, as there are loads more things to consider.
February 20, 2009 at 12:54 pm |
Is it possible that apple’s c++ compiler could be re-ordering the instructions automatically to take advantage of the FPU?
February 20, 2009 at 3:00 pm |
Unfortunately, no… If you look at the asm source code generated by GCCE compiler you would find that it generates pretty poor code. It does not use the Vector operations of the VFP. The only way to get the best performance is to go to write assembly code for VFP.
Also, if you are writing C code for the VFP make sure you turn off the thumb option in the project settings, as any floating point operation will call an Arm function.
I also tried the carmack method for the InvSqrt but realized if you pipeline the square root and division you get quicker results using the square root and division of the VFP. Like:
fsqrts s0,s0
// Do something for 19 cycles
fdivs s0,s11,s0
// Do something for 19 cycles then collect the data
You can do other floating point stuff (as long as you don’t do a division or square root) or integer stuff to do burn off cycles between the Do Something, as these work from a different pipeline.
April 7, 2009 at 1:54 am |
When removing the refinement step from the above formula, you still get a maximum relative error of 3.4% which is good enough for normalizing a vector, and gcc generates the following code:
ldr r3, x
sub r0, r3, r0, asr #1
…
x:
.long 1597463007
That’s a total of 4 cycles if I’m not mistaken.
Now, I can hardly see what I could do during 38 cycles when i’m normalizing my vectors :-/
April 7, 2009 at 1:57 am |
Oh and I measured in actual code that the above code is only 2.2 times faster than the one generated using a plain 1.f/sqrtf(f), which tells me GCC does a pretty decent job.
Options used: -O3 -fomit-frame-pointer -fstrict-aliasing -marm -march=armv6 -mcpu=arm1176jzf-s -mfloat-abi=softfp -mfpu=vfp