RE-POST February: SPO600 Week 5 Part 2
This post continues my notes on what I learned during the fifth week of my Software Portability and Optimization (SPO600) class.
Single Instruction Multiple Data (SIMD)
SIMD is an important performance-enhancing capability in modern processors. It is very useful for graphics, sound, and other multimedia work such as 3D, and it is also useful for AI work.
SIMD is also referred to as vectorization.
A vector is an array of values.
A full vector register is 128 bits (16 bytes) wide and can be divided into lanes:
Lanes x bits per lane | Name |
---|---|
2 x 64 | V0.2D |
4 x 32 | V0.4S |
8 x 16 | V0.8H |
16 x 8 | V0.16B |
S - single width = 32 bits at a time
D - double width = 64 bits at a time
H - half word = 16 bits at a time
B - byte = 8 bits at a time
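To make the lane arrangements above more concrete, here is a small C sketch that models one 128-bit register as a union. The names `vreg128` and `lanes_per_register` are my own, made up for illustration; this only models the layout in memory and is not how the hardware or the compiler represents a register.

```c
#include <stdint.h>

/* Hypothetical model of one 128-bit vector register.
   Every member views the same 16 bytes with a different lane width. */
typedef union {
    uint64_t d[2];   /* 2 x 64 bits, like V0.2D  */
    uint32_t s[4];   /* 4 x 32 bits, like V0.4S  */
    uint16_t h[8];   /* 8 x 16 bits, like V0.8H  */
    uint8_t  b[16];  /* 16 x 8 bits, like V0.16B */
} vreg128;

/* Hypothetical helper: how many lanes of lane_bits fit in 128 bits. */
unsigned lanes_per_register(unsigned lane_bits) {
    return 128u / lane_bits;
}
```

For example, `lanes_per_register(32)` gives 4, which matches the `4S` arrangement in the table.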
To refer to a single value:
Size (bits) | Name |
---|---|
128 | Q0 |
64 | D0 |
32 | S0 |
16 | H0 |
8 | B0 |
Example:
A simple (scalar) instruction, written with generic register names:
add r0,r1,r2 /* r0 = r1 + r2 */
The vector version of that instruction, using 4 x 32-bit values:
add v0.4s,v1.4s,v2.4s /* v0.4s = v1.4s + v2.4s */
v0.4s - vector register 0 divided into 4 lanes of 32 bits each
Notice that the simple instruction adds only 1 pair of values, while the vector version adds 4 separate lanes of 32-bit values in parallel.
In this case, the vector version does 4 additions in a single instruction, so it can be up to 4 times faster than the simple instruction. However, the result is 4 separate numbers, one per lane. If we want a single total, we need to add all 4 lanes together:
addv s0, v0.4s /* sum of lanes of v0.4s */
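To check my understanding of the two instructions above, I find it helpful to write out what they compute in plain C. This is only a scalar model, not real SIMD code: the function names `add_4s` and `addv_4s` are hypothetical, and I use unsigned lanes so that overflow wraps around instead of being undefined behaviour in C.

```c
#include <stdint.h>

#define LANES 4  /* the .4s arrangement: 4 lanes of 32 bits */

/* Scalar model of: add v0.4s, v1.4s, v2.4s
   Each lane is added independently of the others. */
void add_4s(uint32_t v0[LANES], const uint32_t v1[LANES], const uint32_t v2[LANES]) {
    for (int lane = 0; lane < LANES; lane++) {
        v0[lane] = v1[lane] + v2[lane];
    }
}

/* Scalar model of: addv s0, v0.4s
   All lanes are summed into one 32-bit result. */
uint32_t addv_4s(const uint32_t v0[LANES]) {
    uint32_t s0 = 0;
    for (int lane = 0; lane < LANES; lane++) {
        s0 += v0[lane];
    }
    return s0;
}
```

With v1 = {1, 2, 3, 4} and v2 = {10, 20, 30, 40}, `add_4s` produces {11, 22, 33, 44}, and `addv_4s` on that result returns 110.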
There are 3 ways to use vector instructions:
- Autovectorization - Vectorization by the compiler
- Inline assembler - inserting assembly into a high-level language
- Compiler intrinsics - using function-like language extensions built into the compiler
Autovectorization
With autovectorization, the compiler identifies where SIMD can be used advantageously.
The benefit is that it takes the least effort from the programmer.
For the compiler to vectorize code, it has to be sure that doing so is safe. If it does not have enough information to judge that, it will not vectorize, so we may need to pass extra options or hints telling it that vectorization is safe.
In GCC
To turn on vectorization:
-ftree-vectorize /* Included with -O2 in GCC 12 and later (formerly only with -O3) */
To turn off:
-fno-tree-vectorize
For diagnostics:
-fopt-info-vec-all /* Information about all vectorization decisions */
-fopt-info-vec-missed /* Information about missed vectorization */
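As a small experiment with these options, here is the kind of loop I would expect GCC to consider for autovectorization. This is a sketch under my own assumptions: the function name `add_arrays` is made up, and I use the C99 `restrict` qualifier as one possible way of telling the compiler that the arrays do not overlap. Whether the loop is actually vectorized depends on the compiler version, the optimization level, and the target.

```c
#include <stddef.h>
#include <stdint.h>

/* Element-wise addition of two arrays.
   restrict is a promise from the programmer that dst, a, and b
   do not overlap, which removes one reason for the compiler to
   decide that vectorizing the loop might be unsafe. */
void add_arrays(int32_t *restrict dst,
                const int32_t *restrict a,
                const int32_t *restrict b,
                size_t n) {
    for (size_t i = 0; i < n; i++) {
        dst[i] = a[i] + b[i];
    }
}
```

Assuming the code is saved in a file called vec.c (a hypothetical name), something like `gcc -O3 -fopt-info-vec-all -c vec.c` should print the compiler's vectorization decisions for this loop.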
Inline Assembler
In GCC
__asm__ (template : outputs : inputs : clobbers)
OR
asm(template : outputs : inputs : clobbers)
template - the actual assembly language code, with placeholders for the operands (%0 %1 %2).
outputs - output variables in a comma-separated list ["=r"(Cvariable)].
inputs - input variables in a comma-separated list ["r"(Cvariable)].
clobbers - comma-separated list of the registers affected by the assembly-language code, excluding the ones already listed in outputs and inputs. The keyword "memory" can also be listed here if the code changes memory.
Example
#include <stdio.h>
int main(void) {
    int a = 3;
    int b = 12;
    int c;
    // Format of inline assembler:
    // __asm__ ("asm code template" : outputs : inputs : clobbers)
    // The next line is equivalent to: c = a + b
    __asm__ ("add %0, %1, %2" : "=r"(c) : "r"(a), "r"(b) : );
    printf("%d\n", c);
    return 0;
}
On an AArch64 system, we can compile it with: cc -O3 add.c -o add
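Since the assembly template above is written for AArch64, the program will not build on other architectures. One way to keep it portable is to guard the inline assembler with a preprocessor check and fall back to plain C elsewhere. The sketch below is my own variation, not something from the class: the function name `add_asm` is hypothetical, and I use the `%w` operand modifier so that GCC prints the 32-bit register names, which match the `int` operands.

```c
/* Hypothetical helper: adds two ints using inline assembler on
   AArch64 and plain C on every other architecture. */
int add_asm(int a, int b) {
    int c;
#if defined(__aarch64__)
    /* %w0, %w1, %w2 select the 32-bit (w) view of each register. */
    __asm__ ("add %w0, %w1, %w2" : "=r"(c) : "r"(a), "r"(b));
#else
    /* Fallback so the code still compiles on other platforms. */
    c = a + b;
#endif
    return c;
}
```

Calling `add_asm(3, 12)` gives 15 on either path, the same as the example above.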
Compiler Intrinsics
These are compiler extensions that look like functions. They are usually named with a double underscore at the start (for example, __name).
The idea behind them is to avoid writing assembler code directly. However, as far as I can tell there is no real advantage to using them instead of assembler, since we still need to know the assembly language instructions and the code is still platform-specific.
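To see what intrinsics look like in practice, here is a sketch of the earlier add and addv pair written with the Arm NEON intrinsics from arm_neon.h. Note that these particular intrinsics have names like `vaddq_s32` rather than a leading double underscore. The function name `add_and_sum_4s` is hypothetical, and the non-AArch64 branch is a plain C fallback that I added so the code still builds on other platforms.

```c
#include <stdint.h>
#if defined(__aarch64__)
#include <arm_neon.h>
#endif

/* Adds two vectors of 4 x 32-bit integers lane by lane, then
   returns the sum of the 4 resulting lanes. */
int32_t add_and_sum_4s(const int32_t a[4], const int32_t b[4]) {
#if defined(__aarch64__)
    int32x4_t va = vld1q_s32(a);          /* load 4 lanes from a      */
    int32x4_t vb = vld1q_s32(b);          /* load 4 lanes from b      */
    int32x4_t vsum = vaddq_s32(va, vb);   /* like add v0.4s,v1.4s,v2.4s */
    return vaddvq_s32(vsum);              /* like addv s0, v0.4s      */
#else
    /* Scalar fallback for non-AArch64 platforms. */
    int32_t total = 0;
    for (int i = 0; i < 4; i++) {
        total += a[i] + b[i];
    }
    return total;
#endif
}
```

With a = {1, 2, 3, 4} and b = {10, 20, 30, 40}, the function returns 110. The compiler picks the registers for us here, but the code still maps closely onto specific instructions, which is the platform-specific part mentioned above.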