RE-POST February: SPO600 Week 3 Part 1

This post will be about what I have learned during my third week of Software Portability and Optimization (SPO600) class.

Compilers: Targets & Tuning

Now that I have a bit of knowledge about the 6502 machine, it is time to learn the 64-bit assembler.

By default, the compiler picks up the characteristics of the machine it is running on and uses them as the defaults for building the software.

There are 2 reasons why we might not want the compiler to pick up those characteristics automatically:

  1. We want to build cross-platform with the toolchain, producing software for a different type of system than the one we are building on.
  2. The target is not a completely different type of system, but a different class of the same architecture.

There are 2 separate but related concepts that control what the compiler outputs when building the software. They are the main topic of this post:

  1. Target
  2. Tuning

There are 2 options:

  • Architecture flag - specify the architecture that we want to target
    • -march=target - controls which instructions can be included in the instruction stream
  • Tuning flag - specify the processor that the code should be tuned for
    • -mtune=target - tells the compiler something about the system that we are building for so that it can make decisions based on that. Controls the instruction tuning, not which instructions are allowed.
      • Useful for adjusting software so that it performs best on the latest and greatest machine.

-m means machine
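
A small C sketch can make the effect of -march visible. It is not from the class. It assumes GCC or Clang, which predefine macros such as __AVX2__ (x86_64) and __ARM_FEATURE_SVE2 (ARM64) when the chosen target allows those instructions. The function names are made up for this example.

```c
#include <stdio.h>

/* Each function reports what the compiler was ALLOWED to emit for this
   build, based on macros predefined from the target options. It says
   nothing about the CPU the program later runs on. */

int built_with_avx2(void) {
#if defined(__AVX2__)
    return 1;
#else
    return 0;
#endif
}

int built_with_sve2(void) {
#if defined(__ARM_FEATURE_SVE2)
    return 1;
#else
    return 0;
#endif
}

const char *target_family(void) {
#if defined(__aarch64__)
    return "aarch64";
#elif defined(__x86_64__)
    return "x86_64";
#else
    return "other";
#endif
}

/* Hypothetical helper: print a one-line summary of the build target. */
void print_target(void) {
    printf("family=%s avx2=%d sve2=%d\n",
           target_family(), built_with_avx2(), built_with_sve2());
}
```

Calling print_target() from main and building the file with and without an option such as -march=armv8-a+sve2 should change the sve2 value. -mtune on its own is not expected to change these values, since it only affects tuning.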

The x86_64 architecture has continued to advance over the years, largely through vector, or single instruction/multiple data (SIMD), extensions. SIMD is the ability to perform a single operation, such as an arithmetic, logical, or comparison operation, on multiple pieces of data in parallel.

Single Instruction/Multiple Data (SIMD)

The latest and greatest SIMD instructions do not run on older processors.

As a result, it is no longer possible to build a single version of software that fits all machines. We either lose performance by targeting an older version of the architecture, possibly 30-50%, or break compatibility with older machines, which is a significant problem.
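
The idea of one operation working on several pieces of data can be sketched in C. This example is my own illustration, not class material. It assumes GCC or Clang, because it uses their vector_size extension. The names v4si and add4 are made up. Whether the compiler really emits a single SIMD instruction depends on the target it is building for.

```c
#include <string.h>

/* GCC/Clang vector extension: four 32-bit ints treated as one 16-byte value. */
typedef int v4si __attribute__((vector_size(16)));

/* Adds four pairs of ints using a single vector '+'. */
void add4(const int a[4], const int b[4], int out[4]) {
    v4si va, vb, vr;
    memcpy(&va, a, sizeof va);   /* load 4 ints into one vector value */
    memcpy(&vb, b, sizeof vb);
    vr = va + vb;                /* one operation, four lanes at once */
    memcpy(out, &vr, sizeof vr); /* store the 4 results */
}
```

For example, adding {1, 2, 3, 4} and {10, 20, 30, 40} fills out with {11, 22, 33, 44}.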

Number of Registers

Another significant problem concerns the number of registers. In this course, we have been using 6502 assembler, and the 6502 has 3 registers available (A, X, and Y). Suppose a machine has 5 registers and some of the code uses the other 2. If we take that same code and run it on the 6502, the software blows up whenever it tries to access those 2 registers, because no such registers exist on the 6502.

ARM64 and x86-64 Architecture

In class, the instructor has given us access to a 64-bit ARM machine and a 64-bit x86 machine, where the steps are found here. A couple of steps were different on my machine, which will be discussed in another post.

Commands              Descriptions
ls <folder>           Lists the folders and files in the current directory. Can also mention the <folder> name to see the content inside that specified folder.
uname -a              Shows information about the kernel and architecture.
free                  Shows how much RAM each machine has.
less /proc/cpuinfo    Shows CPU info.
clear                 Clears the terminal.
ll                    AKA ls -l, which means long list.

Table 1: List of commands that can be used in both 64-bit ARM machines and 64-bit x86 machines.

ARM64 Machine

It has a Cortex A72 processor.

Commands                Descriptions
less adjust_channels    Views a file that contains handwritten assembler embedded right in the middle of C code.
make clean              Wipes out the previously built software.
make                    Builds the software and performs some tasks, such as running tests.
qemu-aarch64            Emulation tool used to get past illegal instructions.
time                    Times how long a command takes to run.
set|grep PATH           Shows the shell variables that mention PATH, including the library search path.
echo $PATH              Shows the search path for executables.

Table 2: List of commands used for an ARM 64 machine.

After running the make command, we can see the -march=armv8-a+sve2 option, where:

  • -m - flag that specifies information about the machine
  • armv8 - specifies the architecture
    • arm - the family of architectures
    • v8 - architectural level version 8, which is the first of the 64-bit architecture levels from ARM. Note that there may be a decimal number after the 8, which means that minor improvements have been made to the architecture over the years.
  • -a - the particular ARM architecture level is targeted at an application processor, as opposed to processors that are intended for an embedded context or a real-time context where ultra-fine timing control is important.
  • +sve2 - an additional feature, which tells the compiler it is ok to use instructions that belong to the SVE2 feature.

Therefore, all instructions that are compatible with an ARMv8 device, plus any machine language instructions that use the SVE2 capability, are ok to include in the software that is going to be emitted.

The make command will also run some tests:

./image-adjust4 tests/input/bree.jpg 2.0 2.0 2.0 tests/output/bree4c.jpg

Each 2.0 is an adjustment factor, so the red, green, and blue channels are each going to be doubled in brightness.
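
What an adjustment factor does to a pixel can be sketched in C. This is only my guess at the general idea, not the actual code of the image-adjust program. The function names are made up, and I am assuming that 8-bit channel values are clamped at 255 instead of wrapping around.

```c
#include <stdint.h>

/* Scales one 8-bit channel by a factor, clamping the result to 0..255.
   Clamping is an assumption made for this sketch. */
uint8_t adjust_channel(uint8_t value, float factor) {
    float scaled = (float)value * factor;
    if (scaled > 255.0f) return 255;
    if (scaled < 0.0f) return 0;
    return (uint8_t)(scaled + 0.5f);  /* round to nearest */
}

/* Applies separate factors to the red, green, and blue channels of a pixel. */
void adjust_pixel(uint8_t rgb[3], float red, float green, float blue) {
    rgb[0] = adjust_channel(rgb[0], red);
    rgb[1] = adjust_channel(rgb[1], green);
    rgb[2] = adjust_channel(rgb[2], blue);
}
```

With factors of 2.0, a channel value of 100 becomes 200, while a value of 200 is clamped to 255.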

However, when running this command, the output would be:

illegal instruction (core dumped)

This happens because the software was built for ARMv8 with SVE2 capabilities, but the CPU in this machine does not include SVE2, so it has no idea what those instructions are supposed to do.

Solutions:

  • Rebuild the software for just ARMv8.
  • Use a software emulation tool that jumps in every time there is an illegal instruction.
    • Put qemu-aarch64 before the testing line. This emulation would run at about 1% of the speed of the hardware.
    • Not really useful in real-life deployment. For example, on older x86 systems, floating-point operations were handled by a co-processor (second chip), and handling them in software instead was slow.

In this case, it would be ok because most of the instructions are going to execute fine. Only the few instructions that use SVE2 are going to be handled in software.
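
A program can also avoid the crash by checking for SVE2 before it takes an SVE2 code path. The sketch below was not shown in class. It assumes Linux on ARM64, where getauxval(AT_HWCAP2) reports CPU features and the kernel headers define HWCAP2_SVE2. On any other system, or with older headers, it simply reports that SVE2 is not available. The function name is made up.

```c
#if defined(__linux__) && defined(__aarch64__)
#include <sys/auxv.h>   /* getauxval, AT_HWCAP2 */
#include <asm/hwcap.h>  /* HWCAP2_SVE2 (on recent kernel headers) */
#endif

/* Returns 1 if the running CPU reports SVE2 support, 0 otherwise. */
int cpu_has_sve2(void) {
#if defined(__linux__) && defined(__aarch64__) && defined(HWCAP2_SVE2)
    return (getauxval(AT_HWCAP2) & HWCAP2_SVE2) != 0;
#else
    /* Not ARM64 Linux, or the headers do not know about SVE2. */
    return 0;
#endif
}
```

On a machine like the Cortex A72 one, which does not include SVE2, this function should return 0.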

Different Ways to Build Software

  1. Build for the lowest common denominator. Pick an architecture level that is reasonably modern, but not the latest and greatest that only 2% of the population own.
  2. Build in such a way that, at runtime, the software figures out the capabilities of the machine. It tests the machine to check which capabilities it has and, based on that, decides between different code paths:
    • This is the most common method currently being used.
    • Optimize at the library level and relax on the hardware capabilities.
    • There might be different versions of a function. The software assesses what the machine can do and, based on that, uses a certain version of that particular function.
    • Detection of the hardware is done at the beginning so that we don't constantly repeat it for each function. A function pointer is then used to get the version of the function that is appropriate for the machine.
    • This is a significant burden on the developer. Only use it for heavy-duty number crunching or data crunching (multimedia, cryptography, data compression, decompression, AI, etc.).
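
The second approach can be sketched in C with a function pointer. This is my own minimal example, not the instructor's code. All of the names are made up. The "fast" version is only a stand-in written in plain C, where a real project would compile it with a higher -march. The detection uses the GCC/Clang builtin __builtin_cpu_supports, which I am assuming is available on x86. Everywhere else the sketch falls back to the generic version.

```c
#include <stddef.h>

typedef void (*scale_fn)(float *data, size_t n, float factor);

/* Baseline version: safe on any machine. */
static void scale_generic(float *data, size_t n, float factor) {
    for (size_t i = 0; i < n; i++)
        data[i] *= factor;
}

/* Stand-in for a version built with newer instructions. Here it is the
   same plain C so that the sketch runs anywhere. */
static void scale_fast(float *data, size_t n, float factor) {
    for (size_t i = 0; i < n; i++)
        data[i] *= factor;
}

/* Asks the compiler's runtime whether this CPU supports AVX2 (x86 only). */
static int machine_has_avx2(void) {
#if (defined(__x86_64__) || defined(__i386__)) && defined(__GNUC__)
    __builtin_cpu_init();
    return __builtin_cpu_supports("avx2") != 0;
#else
    return 0;
#endif
}

/* Remembers the decision so that detection happens only once. */
static scale_fn chosen = NULL;

void scale(float *data, size_t n, float factor) {
    if (chosen == NULL)
        chosen = machine_has_avx2() ? scale_fast : scale_generic;
    chosen(data, n, factor);
}
```

Callers only ever use scale(). The first call picks a version, and every later call goes straight through the stored function pointer.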

To know more about the options, go to the following documentation:

The AMD64 documentation can be downloaded by going through the link below:

iFunc

iFunc is a toolchain feature that makes the decision once, when the software initializes, and remembers that decision by setting the function pointer appropriately. There will be more information about it next week.

