RE-POST February: SPO600 Week 5 Part 2

This post continues what I learned during my fifth week of the Software Portability and Optimization (SPO600) class.

Single Instruction Multiple Data (SIMD)

SIMD is an important performance-enhancing capability in modern processors. It is very useful for graphics, sound, and other multimedia work such as 3D. It is also useful for AI work.

Using SIMD is also referred to as vectorization.

A vector is an array of values.

Full register: 128 bits wide (16 bytes)

Lanes x bits per lane	Name
2 x 64	V0.2D
4 x 32	V0.4S
8 x 16	V0.8H
16 x 8	V0.16B

S - single word = 32 bits per lane

D - doubleword = 64 bits per lane

H - halfword = 16 bits per lane

B - byte = 8 bits per lane
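To make the lane layouts a bit more concrete, here is a minimal sketch in plain C. The `vreg128` union and the `lane_count` helper are names I made up for illustration (they are not part of any real API); the union just shows that the same 16 bytes can be viewed as 2, 4, 8, or 16 lanes.

```c
#include <stdint.h>

/* Hypothetical model of one 128-bit vector register.
   Every member covers the same 16 bytes, split into different lanes. */
typedef union {
    uint64_t d[2];   /* 2 x 64-bit lanes, like V0.2D  */
    uint32_t s[4];   /* 4 x 32-bit lanes, like V0.4S  */
    uint16_t h[8];   /* 8 x 16-bit lanes, like V0.8H  */
    uint8_t  b[16];  /* 16 x 8-bit lanes, like V0.16B */
} vreg128;

/* Number of lanes in a 128-bit register for a lane width of 8, 16, 32 or 64 bits. */
unsigned lane_count(unsigned lane_bits) {
    return 128u / lane_bits;
}
```

For example, `lane_count(32)` gives 4, which matches the `4S` suffix.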

To refer to a single value:

Size (bits)	Name
128	Q0
64	D0
32	S0
16	H0
8	B0

These scalar names are mostly used when we want to reduce our lanes to a single value, for example by summing them.

Example:

In a simple instruction:

add r0,r1,r2 /* r0 = r1 + r2 */

The vector version of that instruction, using 4 x 32-bit values:

add v0.4s,v1.4s,v2.4s /* v0.4s = v1.4s + v2.4s */

v0.4s - vector register 0 divided into 4 lanes of 32 bits each

Notice that in the simple instruction, we are only adding 1 value, while in the vector version, we are adding 4 separate lanes of 32-bit values in parallel.

In this case, the vector version can do up to 4 times as much work per instruction as the simple one. However, the vector version leaves us with 4 separate numbers, one per lane. If we want a single total, we need to add all 4 lanes together:

addv s0, v0.4s /* sum of lanes of v0.4s */
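As a rough sketch of what these two instructions do, here is a plain C model. The function names `add_4s`, `addv_4s`, and `sum_array` are my own, and the loops only imitate the lanes; real SIMD hardware would process the 4 lanes in parallel.

```c
#include <stdint.h>

/* Models: add v0.4s, v1.4s, v2.4s
   Each of the 4 lanes is added independently of the others. */
void add_4s(int32_t out[4], const int32_t a[4], const int32_t b[4]) {
    for (int lane = 0; lane < 4; lane++)
        out[lane] = a[lane] + b[lane];
}

/* Models: addv s0, v0.4s
   The 4 lanes are summed into a single scalar value. */
int32_t addv_4s(const int32_t v[4]) {
    return v[0] + v[1] + v[2] + v[3];
}

/* Sums n values, 4 at a time (this sketch assumes n is a multiple of 4).
   The 4 partial sums stay in separate lanes until the final addv step. */
int32_t sum_array(const int32_t *data, int n) {
    int32_t acc[4] = {0, 0, 0, 0};
    for (int i = 0; i + 4 <= n; i += 4)
        add_4s(acc, acc, &data[i]);
    return addv_4s(acc);
}
```

The point of the design is that the lane-wise add happens once per 4 elements, while the across-lanes sum happens only once at the very end.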

There are 3 ways to use vector instructions:

Autovectorization - vectorization by the compiler
Inline assembler - inserting assembly into a high-level language
Compiler intrinsics - using function-like language extensions built into the compiler

Autovectorization

Here we get the compiler to identify where SIMD can be used advantageously.

The benefit of this approach is that it takes the least effort from the programmer.

For the compiler to vectorize, we need to pass an option telling it to do so, and the code has to give it enough information to judge that vectorizing is safe. If it cannot tell that it is safe, it will not vectorize.

In GCC

To turn on vectorization:

-ftree-vectorize /* Now included with -O2 (formerly -O3) */

To turn off:

-fno-tree-vectorize

For diagnostics:


    -fopt-info-vec-all        /* Information about all vectorization decisions */
    -fopt-info-vec-missed        /* Information about missed vectorization */
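Here is a small loop that is a typical candidate for autovectorization. This is only a sketch: the function name `add_arrays` and the file name `vec.c` are assumptions of mine, and whether the loop is actually vectorized depends on the GCC version, the target, and the options used. The `restrict` qualifiers are one way of giving the compiler the safety information mentioned above, since they promise that the arrays do not overlap.

```c
#include <stddef.h>
#include <stdint.h>

/* Element-wise addition of two arrays.
   restrict promises the compiler that dst, a and b do not overlap,
   which removes one common reason for it to refuse to vectorize. */
void add_arrays(int32_t *restrict dst,
                const int32_t *restrict a,
                const int32_t *restrict b,
                size_t n) {
    for (size_t i = 0; i < n; i++)
        dst[i] = a[i] + b[i];
}
```

Compiling with something like `gcc -O3 -fopt-info-vec-all -c vec.c` should print a note for each loop saying whether it was vectorized or why it was not.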

Inline Assembler

In GCC

__asm__ (template : outputs : inputs : clobbers)
    OR
    asm(template : outputs : inputs : clobbers)

template - the actual assembly-language code, with register placeholders (%0, %1, %2).

outputs - output variables in a comma-separated list ["=r"(Cvariable)].

inputs - input variables in a comma-separated list ["r"(Cvariable)].

clobbers - a comma-separated list of the registers affected by the assembly-language code, excluding those already listed in outputs and inputs, plus the keyword "memory" if the code reads or writes memory that is not covered by the operands.

Example

    #include <stdio.h>

    int main(void) {
        int a = 3;
        int b = 12;
        int c;

        // Format of inline assembler:
        // __asm__ ("asm code template" : outputs : inputs : clobbers)

        // The next line is equivalent to: c = a + b
        // (the add instruction here uses AArch64 syntax)
        __asm__ ("add %0, %1, %2" : "=r"(c) : "r"(a), "r"(b) : );

        printf("%d\n", c);
        return 0;
    }

We can compile it with: cc -O3 add.c -o add
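Inline assembler can also reach the vector registers. The sketch below sums the 4 lanes of a vector with `addv`. The function name `sum_lanes` is my own, and the plain C fallback is only there so the file still builds on machines that are not AArch64. I am treating the register choice (`v0`) and the `"memory"` clobber as reasonable assumptions rather than the only correct way to write this.

```c
#include <stdint.h>

/* Returns p[0] + p[1] + p[2] + p[3]. */
int32_t sum_lanes(const int32_t *p) {
#if defined(__aarch64__)
    int32_t result;
    __asm__ ("ldr q0, [%1]\n\t"      /* load 4 x 32-bit values into v0      */
             "addv s0, v0.4s\n\t"    /* sum the 4 lanes into s0             */
             "fmov %w0, s0"          /* copy s0 to a general 32-bit register */
             : "=r"(result)
             : "r"(p)
             : "v0", "memory");      /* v0 is overwritten, memory is read   */
    return result;
#else
    /* Fallback for other platforms: same result without SIMD. */
    return p[0] + p[1] + p[2] + p[3];
#endif
}
```

Note that `v0` goes in the clobber list because the assembly code overwrites it without it being an output or input operand.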

Compiler Intrinsic

Intrinsics are compiler extensions that look like functions. They are often named with a double underscore at the start (__name).

The idea behind intrinsics is to avoid writing assembler directly. However, there is no real advantage to using them instead of assembler, since we still need to know the assembly language instructions and the code is still platform-specific.
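For comparison, here is a sketch of the earlier vector add and lane sum written with the Arm NEON intrinsics from `<arm_neon.h>` (these particular intrinsics do not use the double-underscore prefix). The function name `add_and_sum` is my own, and the fallback branch is only there so the code builds on platforms other than AArch64.

```c
#include <stdint.h>
#if defined(__aarch64__)
#include <arm_neon.h>
#endif

/* Adds two 4-element vectors lane by lane, then sums the 4 results. */
int32_t add_and_sum(const int32_t a[4], const int32_t b[4]) {
#if defined(__aarch64__)
    int32x4_t va = vld1q_s32(a);          /* load 4 x 32-bit lanes          */
    int32x4_t vb = vld1q_s32(b);
    int32x4_t vsum = vaddq_s32(va, vb);   /* like: add v0.4s, v1.4s, v2.4s  */
    return vaddvq_s32(vsum);              /* like: addv s0, v0.4s           */
#else
    /* Fallback for other platforms: same result without SIMD. */
    int32_t total = 0;
    for (int lane = 0; lane < 4; lane++)
        total += a[lane] + b[lane];
    return total;
#endif
}
```

Each intrinsic maps closely to one instruction, which is why knowing the instruction set is still needed to use them.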
