Modern CPUs offer many accelerating instruction-set extensions, such as Intel® SSE, AVX, AVX-512, and more. To make these instructions convenient to invoke, Intel provides many C-style functions (intrinsics) that wrap them, documented in the Intel Intrinsics Guide. In this article, we compare the performance of these C-style functions against the equivalent inline assembly. The test environment is Visual Studio 2019 on Windows 10.
We use _mm256_add_ps/vaddps as the test case, but I expect other instructions to show similar results. The test code is listed below:
#include <immintrin.h>
#include <cstdio>
#include <chrono>
#define high_res_now() std::chrono::high_resolution_clock::now()
#define time_elapsed(t) std::chrono::duration_cast<std::chrono::nanoseconds>(high_res_now()-t).count()
/*
Compare the performance between inline assembly and C style Intel intrinsic
Using _mm256_add_ps/vaddps as a test case
__m256 _mm256_add_ps (__m256 a, __m256 b)
#include <immintrin.h>
Instruction: vaddps ymm, ymm, ymm
CPUID Flags: AVX
Operation:
FOR j := 0 to 7
i := j*32
dst[i+31:i] := a[i+31:i] + b[i+31:i]
ENDFOR
dst[MAX:256] := 0
*/
int main()
{
const int test_num = 10000;
alignas(32) float mem_addr_a[8] = { 0.0f,0.1f,0.2f,0.3f,0.4f,0.5f,0.6f,0.7f }, mem_addr_b[8] = { 0.1f,0.2f,0.3f,0.4f,0.5f,0.6f,0.7f,0.8f }; // _mm256_load_ps requires 32-byte alignment
__m256 x = _mm256_load_ps(mem_addr_a), y = _mm256_load_ps(mem_addr_b), z; // cannot use volatile, further investigation needed
auto start = high_res_now();
for (int i = 0; i < test_num; i++)
{
z = _mm256_add_ps(x, y);
}
printf("_mm256_add_ps time: %lld ns\n", time_elapsed(start));
start = high_res_now();
__asm
{
vmovups ymm0, ymmword ptr[x]; load x into ymm0 before the loop
vmovups ymm1, ymmword ptr[y]; load y into ymm1 before the loop
};
for (int i = 0; i < test_num; i++)
{
// z = _mm256_add_ps(x, y);
__asm
{
vaddps ymm2, ymm1, ymm0; ymm2 = ymm1 + ymm0
vmovups ymmword ptr[z], ymm2; store the sum back to z
};
}
printf("inline assembly time: %lld ns\n", time_elapsed(start));
return 0;
}
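As an aside on the "cannot use volatile" comment in the code above: one common way to keep an optimizing build from discarding the intrinsic loop as dead code is to consume the result after timing. The snippet below is my suggestion, not part of the original test:

float sink[8];
_mm256_storeu_ps(sink, z);          // force the final z to be materialized
printf("checksum: %f\n", sink[0]);  // using the value keeps the loop from being elided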
The output is:
_mm256_add_ps time: 15800 ns
inline assembly time: 11000 ns
The output shows that the inline assembly performs better than the C-style function. But why? To investigate, I disassembled the intrinsic call in Visual Studio. (Note the 32-bit addresses and ebp-relative addressing below: MSVC supports __asm blocks only in x86 builds, so this test is compiled as 32-bit.)
z = _mm256_add_ps(x, y);
00201FCC vmovups ymm0,ymmword ptr [x]
00201FD4 vaddps ymm0,ymm0,ymmword ptr [y]
00201FDC vmovups ymmword ptr [ebp-380h],ymm0
00201FE4 vmovups ymm0,ymmword ptr [ebp-380h]
00201FEC vmovups ymmword ptr [z],ymm0
Compared with our inline assembly, the _mm256_add_ps version uses only the ymm0 SIMD register, incurring extra executions of vmovups: the result is spilled to a temporary stack slot ([ebp-380h]) and immediately reloaded before being stored to z. But considering that 256-bit SIMD registers are a scarce resource, if we invoke several SIMD instructions in the same program, perhaps such behavior (or optimization) could lead to better overall performance.
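Note that this spill/reload pattern is characteristic of an unoptimized (debug) build, where every intrinsic statement round-trips through memory. A follow-up check I would suggest (not performed in this test) is to rebuild with optimization and AVX code generation enabled and re-inspect the disassembly:

cl /O2 /arch:AVX test.cpp

With /O2, the compiler would typically keep x, y, and z in registers across the loop, which may narrow or eliminate the gap.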
I am working on implementing the Linux GCC version of this test…
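In the meantime, here is a minimal sketch of what the inner loop might look like under GCC. GCC does not accept MSVC's __asm syntax, so this uses GCC extended inline assembly with AT&T operand order; it is my assumption of the approach, not the final Linux version:

#include <immintrin.h>

// Hypothetical GCC counterpart of the MSVC __asm block.
// Build with: g++ -mavx test.cpp
static inline __m256 add_ps_asm(__m256 a, __m256 b)
{
    __m256 dst;
    // AT&T order: vaddps src2, src1, dst  =>  dst = src1 + src2
    asm volatile("vaddps %2, %1, %0"
                 : "=x"(dst)           // output in a ymm register
                 : "x"(a), "x"(b));    // inputs in ymm registers
    return dst;
}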