Optimizing meshes for the iPhone

The PowerVR guide says that if you order triangle indices as if they were triangle strips, you will get a speed boost, because the PowerVR chip uses triangle strips internally. The PowerVR SDK has an example that shows this, using a model of a sphere. I assumed that the example was an extreme case and you wouldn’t see such a big improvement for real models. However, I was pleasantly surprised to see that it actually did give a measurable improvement in a real-world example – cutting render time from 38ms to 35.5ms in a scene with 18 skinned meshes.

I used tootle to re-order the indices:

int result = TootleOptimizeVCache(pIndices,
      numTriIndices/3, m_listVertexArray[0]->GetNumVertices(),
      TOOTLE_DEFAULT_VCACHE_SIZE, pIndices, NULL, TOOTLE_VCACHE_LSTRIPS);
if (result != TOOTLE_OK)
    cout << "could not optimise!" << endl;

It’s important to use TOOTLE_VCACHE_LSTRIPS, because the default ordering is designed for PC GPUs and won’t work well on the iPhone.
Also, you have to reorder the vertex data to match the new order of the triangle index array. Tootle can be found here.
Unfortunately, Tootle crashes for certain meshes. If there were source code, I could probably have fixed that – but there isn’t :(.


Accelerating software skinning with VFP assembler

I was trying to get my engine to perform better on older iDevices. I need to be able to render 18 characters on screen simultaneously; however, on the 1st gen iPod touch it takes 63ms to render the scene. I thought I’d try VFP assembly to speed it up, using code from this site: http://code.google.com/p/vfpmathlibrary/

Initially, it didn’t make any difference at all, because the scene was GPU bound. So I reduced the scene to 8 skinned meshes, which would show up CPU optimisation improvements better.

The assembler code still didn’t speed things up that much. I ran the code analyzer tool and found that the piece of code taking most of the time was the code that transforms the vertices with the current matrix of the joint:

for (n = 0; n < (int)m_listVertex.size(); n++)
{
    weight = m_listWeight[n];
    index = m_listVertex[n]*3;
    matrix.TransformPoint(&pOrigData[index], weight, &pCurrData[index]);
}

void Matrix::TransformPoint(const float* pInVertex, float weight, float* pOutVertex) const
{
pOutVertex[0] += weight*(pInVertex[0]*m[0] + pInVertex[1]*m[4] + pInVertex[2]*m[8] + m[12]);
pOutVertex[1] += weight*(pInVertex[0]*m[1] + pInVertex[1]*m[5] + pInVertex[2]*m[9] + m[13]);
pOutVertex[2] += weight*(pInVertex[0]*m[2] + pInVertex[1]*m[6] + pInVertex[2]*m[10] + m[14]);
}
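To make the scalar version above concrete, here is a self-contained sketch of how the loop and TransformPoint fit together across joints: the output buffer is zeroed first, then each joint accumulates its weighted contribution. (The `Joint` type and the function names here are hypothetical, not the engine’s actual classes.)

```cpp
#include <cstring>
#include <vector>

// Hypothetical joint: a column-major 4x4 matrix plus the vertices
// it influences and their weights.
struct Joint {
    float m[16];
    std::vector<int>   vertexIndices;
    std::vector<float> weights;
};

// Skins numVerts positions: out = sum over joints of w * (M * in).
void SkinVertices(const std::vector<Joint>& joints,
                  const float* pOrigData, float* pCurrData, int numVerts)
{
    std::memset(pCurrData, 0, numVerts * 3 * sizeof(float)); // start from zero
    for (size_t j = 0; j < joints.size(); j++) {
        const Joint& jt = joints[j];
        for (size_t n = 0; n < jt.vertexIndices.size(); n++) {
            float w = jt.weights[n];
            const float* in  = &pOrigData[jt.vertexIndices[n] * 3];
            float*       out = &pCurrData[jt.vertexIndices[n] * 3];
            out[0] += w * (in[0]*jt.m[0] + in[1]*jt.m[4] + in[2]*jt.m[8]  + jt.m[12]);
            out[1] += w * (in[0]*jt.m[1] + in[1]*jt.m[5] + in[2]*jt.m[9]  + jt.m[13]);
            out[2] += w * (in[0]*jt.m[2] + in[1]*jt.m[6] + in[2]*jt.m[10] + jt.m[14]);
        }
    }
}
```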

There was a function similar to this in the vfpmathlibrary, so I modified it. This is the result:

// Sets length and stride to 0.
#define VFP_VECTOR_LENGTH_ZERO "fmrx    r0, fpscr            \n\t" \
"bic     r0, r0, #0x00370000  \n\t" \
"fmxr    fpscr, r0            \n\t"

// Set vector length. VEC_LENGTH has to be between 0 for length 1 and 3 for length 4.
#define VFP_VECTOR_LENGTH(VEC_LENGTH) "fmrx    r0, fpscr                         \n\t" \
"bic     r0, r0, #0x00370000               \n\t" \
"orr     r0, r0, #0x000" #VEC_LENGTH "0000 \n\t" \
"fmxr    fpscr, r0                         \n\t"

void Matrix::TransformPoint(const float* pInVertex, float weight, float* pOutVertex) const
{
    asm volatile (
        // Load the whole matrix.
        "fldmias  %[matrix], {s8-s23}     \n\t"
        // Load vector to scalar bank.
        "fldmias  %[pInVertex], {s0-s2}   \n\t"
        // Load weight to scalar bank.
        "fldmias  %[weight], {s3}         \n\t"
        // Load the current output vertex.
        "fldmias  %[out], {s28-s30}       \n\t"

        VFP_VECTOR_LENGTH(2)

        "fmuls s24, s8, s0        \n\t"
        "fmacs s24, s12, s1       \n\t"
        "fmacs s24, s16, s2       \n\t"
        "fadds s24, s24, s20      \n\t"
        "fmuls s24, s24, s3       \n\t"
        "fadds s24, s24, s28      \n\t"

        // Save vector.
        "fstmias  %[out], {s24-s26}  \n\t"

        VFP_VECTOR_LENGTH_ZERO
        :
        : [matrix] "r" (m),
          [pInVertex] "r" (pInVertex),
          [weight] "r" (&weight),
          [out] "r" (pOutVertex)
        : "r0", "cc",
          "s0",  "s1",  "s2",  "s3",
          "s8",  "s9",  "s10", "s11", "s12", "s13", "s14", "s15",
          "s16", "s17", "s18", "s19", "s20", "s21", "s22", "s23",
          "s24", "s25", "s26", "s28", "s29", "s30"
    );
}

It took me quite a while to figure out the assembler, because you need to cross-reference several very technical documents. I’d like to make the job easier for any interested programmers out there, so let me explain it line by line.

On the first line you have asm volatile(…);. This instructs gcc that the text inside the round brackets is assembler code. The volatile keyword tells gcc not to try to “optimize” the code away or reorder it.

Then you have a number of strings; each string is an ARM VFP instruction.

The VFP has 4 banks of 8 single precision floating point registers (s0–s7, s8–s15, s16–s23 and s24–s31).

The idea is that you can do up to 8 similar floating point operations at the same time. If you look again at the formula that we’re trying to implement:

pOutVertex[0] += weight*(pInVertex[0]*m[0] + pInVertex[1]*m[4] + pInVertex[2]*m[8] + m[12]);
pOutVertex[1] += weight*(pInVertex[0]*m[1] + pInVertex[1]*m[5] + pInVertex[2]*m[9] + m[13]);
pOutVertex[2] += weight*(pInVertex[0]*m[2] + pInVertex[1]*m[6] + pInVertex[2]*m[10] + m[14]);

You can see that we could do pInVertex[0]*m[0], pInVertex[0]*m[1] and pInVertex[0]*m[2] all in one instruction, and the rest of the formula is done the same way – three operations in one go.

So, let’s go through the code line by line.

First you have: “fldmias  %[matrix], {s8-s23}     \n\t”

fldmias loads memory contents into several registers. Here, it’s loading the entire matrix (16 floats) into s8–s23. (We don’t actually use all of the matrix, but it’s easier to load it all in one instruction.)

The “matrix” operand is an assembler name defined in the section at the bottom, but we’ll cover that later.

Notice there is a \n\t at the end of the line. That’s just to format the assembler code; it’s something you have to add to each assembler line.

Next, we have: “fldmias  %[pInVertex], {s0-s2}      \n\t”

This loads the 3 vertex co-ords into s0–s2, i.e. bank 0. Bank zero is different from the other banks, but I’ll go into that later.

Then, we load the weight and the output vertex co-ords into other registers:

“fldmias  %[weight], {s3}      \n\t”
“fldmias  %[out], {s28-s30}    \n\t”

So, now we have everything loaded.

Next we have to tell the VFP how many operations we want to do at the same time. We have a macro:

VFP_VECTOR_LENGTH(2)

This sets the vector length to 3 (it’s always one more than the specified parameter).
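In other words, the macro clears and then sets the LEN field of the FPSCR register (the mask 0x00370000 covers both the LEN bits at 16–18 and the STRIDE bits at 20–21). Expressed as plain C++ bit manipulation, what the three instructions do is roughly this (a portable sketch, not real hardware access):

```cpp
#include <cstdint>

// Model of the VFP_VECTOR_LENGTH macro: clear the LEN/STRIDE bits of
// FPSCR, then OR in the new length-minus-one at bit 16.
uint32_t SetFpscrLen(uint32_t fpscr, uint32_t lenMinusOne)
{
    fpscr &= ~0x00370000u;          // bic r0, r0, #0x00370000
    fpscr |= (lenMinusOne << 16);   // orr r0, r0, #0x000N0000
    return fpscr;
}
```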

So, now it’s time to do the fun part: the math ops!

The first op is: “fmuls s24, s8, s0        \n\t”

This is equivalent to three scalar ops:

fmuls s24, s8, s0
fmuls s25, s9, s0
fmuls s26, s10, s0

s0 is in bank 0, and this bank has a special function: its registers do not increment during a vector operation (they act as scalars). Now, if you remember, we had the matrix data in s8–s23 and the vertex data in s0–s2. So this instruction does the following calculation:

s24 = pInVertex[0]*m[0]
s25 = pInVertex[0]*m[1]
s26 = pInVertex[0]*m[2]

We are always dumping the results into s24-s26, which we use as temp registers.

The next instruction is:

“fmacs s24, s12, s1       \n\t”

fmacs multiplies, then adds. So this instruction is equivalent to:

s24 += pInVertex[1]*m[4]
s25 += pInVertex[1]*m[5]
s26 += pInVertex[1]*m[6]

Then

“fmacs s24, s16, s2       \n\t”

As you can probably guess, this is equivalent to:

s24 += pInVertex[2]*m[8]
s25 += pInVertex[2]*m[9]
s26 += pInVertex[2]*m[10]

Then:

“fadds s24, s24, s20        \n\t”

As you might guess this is addition:

s24 += m[12]
s25 += m[13]
s26 += m[14]

Then multiply by the weight which is stored in s3:

“fmuls s24, s24, s3        \n\t”

s24 *= weight
s25 *= weight 
s26 *= weight

Finally, add to the current vertex data (which we stored in s28-s30):

“fadds s24, s24, s28        \n\t”

s24 += pOutVertex[0]
s25 += pOutVertex[1]
s26 += pOutVertex[2]

Then we store the result back into the current vertex data:

“fstmias  %[out], {s24-s26}  \n\t”

And the VFP_VECTOR_LENGTH_ZERO macro restores the vector length back to the default value of 1 (otherwise all hell would break loose).
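The whole instruction sequence can be modelled in portable C++, which is handy for checking the math against the scalar version. This is just a simulation of the VFP semantics described above (vector length 3, bank-0 operands acting as scalars), not code from the engine:

```cpp
// Portable model of the VFP sequence: for each of the three result
// registers s24-s26, multiply-accumulate down one matrix row, add the
// translation, scale by the weight, and accumulate into the output.
void TransformPointModel(const float* m, const float* in, float weight,
                         float* out)
{
    for (int r = 0; r < 3; r++) {
        float acc;
        acc  = m[r]     * in[0];   // fmuls s24, s8,  s0
        acc += m[4 + r] * in[1];   // fmacs s24, s12, s1
        acc += m[8 + r] * in[2];   // fmacs s24, s16, s2
        acc += m[12 + r];          // fadds s24, s24, s20
        acc *= weight;             // fmuls s24, s24, s3
        out[r] += acc;             // fadds s24, s24, s28 + fstmias
    }
}
```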

The stuff at the end tells gcc the inputs and outputs of the code. There always have to be three sections separated by colons:

 : // output parameters
 : [matrix] "r" (m),
   [pInVertex] "r" (pInVertex),
   [weight] "r" (&weight),
   [out] "r" (pOutVertex)            // input parameters
 : "r0", "cc",  "s0",  "s1",  "s2",  "s3",
   "s8",  "s9",  "s10", "s11", "s12", "s13", "s14", "s15",
   "s16", "s17", "s18", "s19", "s20", "s21", "s22", "s23",
   "s24", "s25", "s26", "s28", "s29", "s30"  // clobber list

The first section, the output parameters, is blank. You might expect pOutVertex to go there, but it is passed as an input instead: the operand is the pointer itself, which is only read – the writes happen through it to memory. (Strictly speaking, gcc should also be told that memory is modified, e.g. with a “memory” clobber.)

The next section is the input parameters. First you have the operand name used in the assembler code surrounded by square brackets [], then an “r” (meaning a general-purpose register), then the variable as used in the C++ part of the code in round brackets (). Note: the operand has to be an address, *not* a value; that’s why the weight has a & in front of it.

The next section is what is affectionately known as “the clobber list”. This tells gcc which registers the code modifies. If you accidentally forget to include a register in the clobber list, you can get crashes or corrupted data, so this is important.

I found that the program could be sped up even more by moving the VFP_VECTOR_LENGTH macros out of TransformPoint and outside the main loop:

SetVectorLen2();
for (n = 0; n < (int)m_listVertex.size(); n++)
{
    weight = m_listWeight[n];
    index = m_listVertex[n]*3;
    matrix.TransformPoint(&pOrigData[index], weight, &pCurrData[index]);
}
SetVectorLen0();

All in all, the assembler code reduces the total render time from 34ms to 30.5ms (when rendering 8 skinned meshes), which is not bad.

If you try to run this code on a newer device like the iPhone 3GS, you’re in for a surprise: the 3GS’s Cortex-A8 core runs VFP code on a slow, non-pipelined “VFP Lite” unit (its fast path is NEON), so this code actually reduces performance by a large amount :-D.

But don’t worry about this because the 3GS goes so fast it doesn’t really need assembler.

Reducing game start up time

A lot of game developers consider the start-up time of a game to be of little importance. Unfortunately, the truth is that users do not like staring at the hourglass. So, here are two tips to reduce start-up time:

1. Use Targa instead of PNG.

PNGs take several times longer to decode than compressed Targa – around 5 times longer, I think. (I once made some measurements on the iPhone, but I lost the figures <doh>.) On the other hand, compressed Targa files are about 40% bigger, but disk space usually isn’t that critical. You can use the code here to decode Targa files: http://dmr.ath.cx/gfx/targa/.

An interesting side effect of this optimization is that it also reduces development time: every time you start up the debugger, it has to decode all those 1024×1024 textures before you can begin.

2. Read files by blocks, not word by word

Preferably, read the whole file in one go. This is how you do it with C code:

FILE* pFile = fopen(fileName.c_str(), "rb");
if (pFile == NULL) return;
fseek(pFile, 0, SEEK_END);
int fileSize = (int)ftell(pFile);
fseek(pFile, 0, SEEK_SET);
char* pBuffer = new char[fileSize];
// element size 1, count fileSize: fread then returns the byte count
int bytesRead = (int)fread(pBuffer, 1, fileSize, pFile);
fclose(pFile);
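If you prefer C++ streams, the same whole-file read looks like this; a minimal sketch (opening at the end with ios::ate gives the size for free):

```cpp
#include <fstream>
#include <string>
#include <vector>

// Slurps an entire file into memory in a single read.
// Returns an empty vector if the file cannot be opened.
std::vector<char> ReadWholeFile(const std::string& fileName)
{
    std::ifstream file(fileName.c_str(), std::ios::binary | std::ios::ate);
    if (!file) return std::vector<char>();
    std::streamsize size = file.tellg();        // opened at end: position = size
    file.seekg(0, std::ios::beg);
    std::vector<char> buffer((size_t)size);
    if (size > 0) file.read(&buffer[0], size);  // one read for the whole file
    return buffer;
}
```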

Some old performance tricks

The documentation for the iPhone’s VFP processor says that all floating point instructions take 1 cycle, apart from divide and square root, which take 15 cycles.

This leads to a very obvious optimisation: replace repeated divides with a single reciprocal multiply.

In other words if you have code like this:

float fl = 1.0007f;
for (i = 0; i < 16; i++)
    matrix.Set(i, matrix[i]/fl);

you can increase performance by doing the divide outside the loop:

float fl = 1.0f/1.0007f;
for (i = 0; i < 16; i++)
    matrix.Set(i, matrix[i]*fl);

I measured this and found that it does indeed make a big improvement.

Also, I thought it would be interesting to try out the old Quake inverse square root trick.

float InvSqrt(float x){
   float xhalf = 0.5f * x;
   int i = *(int*)&x; // store floating-point bits in integer
   i = 0x5f3759d5 - (i >> 1); // initial guess for Newton's method
   x = *(float*)&i; // convert new bits into float
   x = x*(1.5f - xhalf*x*x); // One round of Newton's method
   return x;
}

This code works on the iPhone, but the results are less accurate.

float fl = 2.0f;
result = 1.f / sqrt ( fl );   // gives 0.707106769
result = InvSqrt(fl);   // gives 0.706930041

The actual value should be 0.70710678.

I measured the performance and found that it’s a touch faster, but not much.

OpenGL Features Removed from iPhone

Full screen anti-aliasing is a feature that greatly improves the graphic quality of any game, and it is supported by the PowerVR chip. Unfortunately, you don’t have access to it, because the iPhone does not use EGL: it uses the framebuffer extension to bind to a surface. To enable anti-aliasing, you would have to call eglChooseConfig with the EGL_SAMPLE_BUFFERS parameter, and that is not possible.

Funnily enough, I found a project that appears to use EGL with the iPhone: http://code.google.com/p/iphone-dj/

However, looking a bit closer, I see that the egl.h file does not even exist in the SDK, so the code in that project should not even compile (pretty weird, that).

This also means that pbuffers aren’t supported, as they are enabled via eglChooseConfig as well. And to top it off, I’ve also heard that the GL_IMG_vertex_program extension is disabled (I think this prevents you from using bump mapping and cartoon rendering – as demoed in the PowerVR SDK).

Edit:

I recently discovered that there is an unofficial iPhone SDK which includes EGL. This is probably what iphone-dj was using. So you probably can get anti-aliasing working if you don’t mind using an unofficial SDK.

Edit, 7 Oct 2010

Apple have seemingly resolved this in iOS 4. They’ve implemented MSAA using an extension – glResolveMultisampleFramebufferAPPLE.

However, there is a catch: it’s implemented in software on older devices (read: slow) and with shaders on newer devices (but using more memory). The PowerVR MBX manual says the following:

…Because POWERVR is a tile-based renderer, no extra memory is required for the frame and depth buffers, i.e. they will keep their original size during the whole process. What happens in the hardware is that more tiles are rendered to match the super-sampled size, but the scaling-down process is transparent. This is a clear advantage compared to immediate mode renderers, which have to allocate precious video memory to cater for the super-sampled frame and depth buffers.

So basically, on older devices – where we are stretching the limits of the GPU – we have to make do with software MSAA, even though the hardware normally provides it for free, without requiring extra memory.

Developing iPhone apps with Visual Studio

Originally I thought XCode was pretty good, and indeed it does do some things better than Visual Studio. However, the support for C++ is lacking, especially if you use templates: XCode can’t display template variables in the debugger window, nor does IntelliSense-style completion work. One of my projects has a tonne of template-based classes, so that’s what pushed me to see if I could do the work on Windows instead.

The PowerVR SDK comes with a Windows OpenGL ES emulator. It didn’t take me long to get my code working with that, because all my code is portable: C++, the STL and OpenGL. Straight away my productivity doubled. So now I do the implementation and debugging in Visual Studio and periodically switch to the Mac to make device builds.

Image handling on iPhone

At first it looks good… but (like most things) when you get into the details, you realize there are some clunky bits. Opening a PNG file is dead easy, and you can do all sorts of sophisticated transformations, but if, for any reason, you need a simple pointer to the pixel data, you’re stuck jumping through hoops. And to load a texture you need the pixel data. So how does Apple do it in their examples? Have a look:

http://read.pudn.com/downloads120/sourcecode/macos/510268/CrashLanding/Classes/Texture2D.m__.htm

Yep, they’re pretty much forced to jump through hoops themselves. This must be pretty inefficient code, though, because the data gets copied twice: once from the file, then from the UIImage into a memory bitmap, before finally being uploaded with glTexImage2D.

I already have a TGA class, so I’ll just store my images as TGA for the time being. It looks like it’ll work a lot better than using the Cocoa classes.

Anyways, In the final game I should use powervr compressed images. I have to look into that more, but I assume the data is just dumped into glTexImage2D as it’s read from the file. Powervr compressed images are a dark art. They can only be created via a windows tool and can’t be displayed in the emulator. Quite, ironic that Apple says in their guidlines  “use PVRTC”, and yet there’s no mac tool to create it.