From >200ms down to a usable 30ms.

OpenPandora forum user sebt3 started testing scaling filter shaders on the SGX530 for use on the OpenPandora console. The first working version ended up far too slow to be of any use.
Another forum user, FSO, improved the speed a bit, bringing it to between 160ms and 180ms per frame.

I asked sebt3 if he’d let me have a go at it.

This was the second iteration, with the improvements from forum user FSO (180ms per frame):

precision mediump float;
varying vec2 v_texCoord[5];
varying vec2 pos;
uniform sampler2D s_texture0;
uniform vec4 u_param;
void main()
{
	vec4 E = texture2D(s_texture0, v_texCoord[0]);
	vec4 D = texture2D(s_texture0, v_texCoord[1]);
	vec4 F = texture2D(s_texture0, v_texCoord[2]);
	vec4 H = texture2D(s_texture0, v_texCoord[3]);
	vec4 B = texture2D(s_texture0, v_texCoord[4]);
	vec2 p = fract(pos);
	vec4 tmp1 = p.x < 0.5 ? D : F;
	vec4 tmp2 = p.y < 0.5 ? H : B;
	vec4 tmp3 = D == F || H == B ? E : tmp1;
	gl_FragColor = tmp1 == tmp2 ? tmp3 : E;
}

For the next iteration, sebt3 changed the vec4 variables to vec3, which reduced the time to roughly 130ms per frame.

Next I changed all the vectors I could to low precision (lowp), which brought it down to 95ms per frame: a good improvement, but still not nearly enough.

precision mediump float;
varying vec2 v_texCoord[5];
varying vec2 pos;
uniform lowp sampler2D s_texture0;
uniform vec4 u_param;

void main()
{
	lowp vec3 E = texture2D(s_texture0, v_texCoord[0]).xyz;
	lowp vec3 D = texture2D(s_texture0, v_texCoord[1]).xyz;
	lowp vec3 F = texture2D(s_texture0, v_texCoord[2]).xyz;
	lowp vec3 H = texture2D(s_texture0, v_texCoord[3]).xyz;
	lowp vec3 B = texture2D(s_texture0, v_texCoord[4]).xyz;
	lowp vec2 p = fract(pos);

	lowp vec3 tmp1 = p.x < 0.5 ? D : F;
	lowp vec3 tmp2 = p.y < 0.5 ? H : B;
	lowp vec3 tmp3 = D == F || H == B ? E : tmp1;
	gl_FragColor.xyz = tmp1 == tmp2 ? tmp3 : E;
}

Then I went after the boolean conditional operators, replacing them with vector math operations that produce a similar truth table. Changing the last line to

	gl_FragColor.xyz = ((tmp1 - tmp2) != vec3(0.0)) || ((D - F) * (H - B) == vec3(0.0)) ? E : tmp1;

further reduced the time to 80ms per frame, at the expense of not behaving exactly the same in some rare cases (while remaining visually pleasing).
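
To make those rare cases concrete: `*` on vectors multiplies channel by channel, so `(D - F) * (H - B) == vec3(0.0)` is true as soon as every channel matches in at least one of the two pairs, which is a weaker condition than `D == F || H == B`. The debug shader below is only a sketch and was not part of the original experiments; the names `exact` and `approx` are made up for illustration. It should paint red every pixel where the two tests disagree.

```glsl
// Debug sketch: highlight pixels where the vector-math test differs
// from the exact boolean test. Same varyings/sampler as the shaders above.
precision mediump float;
varying vec2 v_texCoord[5];
uniform lowp sampler2D s_texture0;

void main()
{
	lowp vec3 E = texture2D(s_texture0, v_texCoord[0]).xyz;
	lowp vec3 D = texture2D(s_texture0, v_texCoord[1]).xyz;
	lowp vec3 F = texture2D(s_texture0, v_texCoord[2]).xyz;
	lowp vec3 H = texture2D(s_texture0, v_texCoord[3]).xyz;
	lowp vec3 B = texture2D(s_texture0, v_texCoord[4]).xyz;

	// Exact test from the earlier versions: whole-vector equality.
	bool exact = (D == F) || (H == B);

	// Vector-math test: the product is zero when, in each channel,
	// D matches F or H matches B.
	bool approx = ((D - F) * (H - B) == vec3(0.0));

	// Worked case: D = (1,0,0), F = (0,0,0), H = (0,1,0), B = (0,0,0)
	//   D - F = (1,0,0), H - B = (0,1,0), product = (0,0,0)
	//   approx is true while exact is false.
	gl_FragColor = (exact != approx) ? vec4(1.0, 0.0, 0.0, 1.0)
	                                 : vec4(E, 1.0);
}
```

The disagreement only goes one way: whenever `exact` is true the product is zero as well, so the vector-math version can only pick E more often, never less.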

Then I reordered the code to favour the most common case and removed the last boolean OR operator:

precision mediump float;
varying vec2 v_texCoord[5];
varying vec2 pos;
uniform lowp sampler2D s_texture0;
uniform vec4 u_param;

void main()
{
        lowp vec3 E = texture2D(s_texture0, v_texCoord[0]).xyz;
        lowp vec3 D = texture2D(s_texture0, v_texCoord[1]).xyz;
        lowp vec3 F = texture2D(s_texture0, v_texCoord[2]).xyz;
        lowp vec3 H = texture2D(s_texture0, v_texCoord[3]).xyz;
        lowp vec3 B = texture2D(s_texture0, v_texCoord[4]).xyz;

        if ((D - F) * (H - B) == vec3(0.0)) {
                gl_FragColor.xyz = E;
        } else {
                lowp vec2 p = fract(pos);
                lowp vec3 tmp1 = p.x < 0.5 ? D : F;
                lowp vec3 tmp2 = p.y < 0.5 ? H : B;
                gl_FragColor.xyz = ((tmp1 - tmp2) != vec3(0.0)) ? E : tmp1;
        }
}

Tadaaaaaaaah! From 180ms down to a very playable 30ms per frame.

This demonstrates the need to avoid boolean operators such as || and && in shaders as much as possible.
They have a particularity that makes them very bad in shader code: they conditionally evaluate the right-hand expression depending on the result of the left-hand one.
This short-circuit evaluation is an old C language feature that suited programs on old CPUs, but it causes code branching, which is atrocious on GPUs.
Using vector math to evaluate everything ends up being faster, because branches are so costly on modern processors and GPUs.
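
Taking that idea one step further, here is a sketch of the same filter with no `||`, `if` or `?:` at all: the selections are done with `step()` and `mix()`, and the comparison results are turned into 0.0/1.0 weights. This variant was not timed on the SGX530, so whether it beats 30ms is an open question; `side` and `useTmp1` are names invented for this example.

```glsl
// Branch-free sketch of the final shader (untested on real hardware).
precision mediump float;
varying vec2 v_texCoord[5];
varying vec2 pos;
uniform lowp sampler2D s_texture0;
uniform vec4 u_param; // unused, kept so the host code needs no change

void main()
{
	lowp vec3 E = texture2D(s_texture0, v_texCoord[0]).xyz;
	lowp vec3 D = texture2D(s_texture0, v_texCoord[1]).xyz;
	lowp vec3 F = texture2D(s_texture0, v_texCoord[2]).xyz;
	lowp vec3 H = texture2D(s_texture0, v_texCoord[3]).xyz;
	lowp vec3 B = texture2D(s_texture0, v_texCoord[4]).xyz;

	// step(0.5, x) is 0.0 when x < 0.5 and 1.0 otherwise, so mix()
	// reproduces "p.x < 0.5 ? D : F" and "p.y < 0.5 ? H : B".
	lowp vec2 side = step(0.5, fract(pos));
	lowp vec3 tmp1 = mix(D, F, side.x);
	lowp vec3 tmp2 = mix(H, B, side.y);

	// 1.0 only when the product test fails and tmp1 equals tmp2,
	// i.e. exactly the case where the final shader outputs tmp1.
	lowp float useTmp1 = float((D - F) * (H - B) != vec3(0.0))
	                   * float(tmp1 == tmp2);

	gl_FragColor.xyz = mix(E, tmp1, useTmp1);
}
```

The trade-off is that `fract()`, both `mix()` selections and both comparisons now run for every pixel, whereas the version above skips them in the common case, so this is worth measuring rather than assuming.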
