Comparing the performance of inline assembly and C-style Intel intrinsics

2021/05/26

Modern CPUs have many accelerating instructions, such as Intel® SSE, AVX, AVX-512, and more. To make these instructions convenient to invoke, Intel provides many C-style functions that wrap them, documented in the Intel Intrinsics Guide. In this article, we compare the performance of those C-style functions against the equivalent inline assembly. The test environment is Visual Studio 2019 on Windows 10.
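To illustrate the intrinsics style itself, here is a minimal standalone example. It uses the older 128-bit SSE _mm_add_ps rather than the AVX _mm256_add_ps from the test below, purely so that it builds without any special compiler flags; the array names are my own:

	#include <immintrin.h>
	#include <cstdio>

	int main()
	{
		// _mm_add_ps is the 128-bit analogue of _mm256_add_ps: it adds
		// four packed floats. loadu/storeu tolerate unaligned memory.
		float a[4] = { 1.0f, 2.0f, 3.0f, 4.0f };
		float b[4] = { 0.5f, 0.5f, 0.5f, 0.5f };
		float c[4];
		__m128 x = _mm_loadu_ps(a);
		__m128 y = _mm_loadu_ps(b);
		_mm_storeu_ps(c, _mm_add_ps(x, y));
		printf("%.1f %.1f %.1f %.1f\n", c[0], c[1], c[2], c[3]);
		return 0;
	}

Each intrinsic maps to (usually) one instruction, but the compiler still handles register allocation, which is exactly the behavior we want to compare against hand-written assembly.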

We use _mm256_add_ps/vaddps as a test case, but other instructions should show similar behavior. Note that MSVC's __asm blocks are only supported when targeting 32-bit x86, so the test is built as a Win32 binary. The test code is listed below:

#include <immintrin.h>
#include <cstdio>
#include <chrono>

#define high_res_now() std::chrono::high_resolution_clock::now()
#define time_elapsed(t) std::chrono::duration_cast<std::chrono::nanoseconds>(high_res_now()-t).count()

/* 
	Compare the performance between inline assembly and C style Intel intrinsic
	Using _mm256_add_ps/vaddps as a test case

	__m256 _mm256_add_ps (__m256 a, __m256 b)
	#include <immintrin.h>
	Instruction: vaddps ymm, ymm, ymm
	CPUID Flags: AVX

	Operation:
	FOR j := 0 to 7
		i := j*32
		dst[i+31:i] := a[i+31:i] + b[i+31:i]
	ENDFOR
	dst[MAX:256] := 0
*/

int main()
{
	const unsigned int test_num = 10000;
	float mem_addr_a[8] = { 0.0,0.1,0.2,0.3,0.4,0.5,0.6,0.7 }, mem_addr_b[8] = { 0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8 };
	__m256 x = _mm256_load_ps(mem_addr_a), y = _mm256_load_ps(mem_addr_b), z; // cannot use volatile, further investigation needed
	auto start = high_res_now();
	for (int i = 0; i < test_num; i++)
	{
		z = _mm256_add_ps(x, y);
	}
	printf("_mm256_add_ps time: %lld ns\n", time_elapsed(start));
	start = high_res_now();
	__asm
	{
		vmovups	ymm0, ymmword ptr[x];
		vmovups ymm1, ymmword ptr[y];
	};
	for (int i = 0; i < test_num; i++)
	{
		// z = _mm256_add_ps(x, y);
		__asm
		{
			vaddps ymm2, ymm1, ymm0;
			vmovups ymmword ptr[z], ymm2;
		};
	}
	printf("inline assembly time: %lld ns\n", time_elapsed(start));
	return 0;
}

The output is:

_mm256_add_ps time: 15800 ns
inline assembly time: 11000 ns
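For reference, the chrono harness itself can be checked with a plain scalar loop that adds the same eight-float arrays without SIMD. This baseline is my own sketch, not part of the original test:

	#include <cstdio>
	#include <chrono>

	#define high_res_now() std::chrono::high_resolution_clock::now()
	#define time_elapsed(t) std::chrono::duration_cast<std::chrono::nanoseconds>(high_res_now()-t).count()

	int main()
	{
		const unsigned int test_num = 10000;
		float a[8] = { 0.0f,0.1f,0.2f,0.3f,0.4f,0.5f,0.6f,0.7f };
		float b[8] = { 0.1f,0.2f,0.3f,0.4f,0.5f,0.6f,0.7f,0.8f };
		float z[8] = { 0 };
		auto start = high_res_now();
		// eight scalar adds per iteration, the work one vaddps does at once
		for (unsigned int i = 0; i < test_num; i++)
			for (int j = 0; j < 8; j++)
				z[j] = a[j] + b[j];
		printf("scalar add time: %lld ns\n", (long long)time_elapsed(start));
		printf("z[7] = %.1f\n", z[7]);
		return 0;
	}

In an unoptimized build this gives a sense of how much of the measured time is loop overhead rather than the add itself.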

The output shows that the inline assembly performs better than the C-style function. But why? To investigate, I disassembled the function in Visual Studio.

z = _mm256_add_ps(x, y);
    00201FCC  vmovups     ymm0,ymmword ptr [x]  
    00201FD4  vaddps      ymm0,ymm0,ymmword ptr [y]  
    00201FDC  vmovups     ymmword ptr [ebp-380h],ymm0  
    00201FE4  vmovups     ymm0,ymmword ptr [ebp-380h]  
    00201FEC  vmovups     ymmword ptr [z],ymm0 

Compared with our inline assembly, the code generated for _mm256_add_ps uses only the ymm0 SIMD register, spilling the result to a stack temporary and reloading it, which incurs extra executions of vmovups. Note that this is unoptimized code generation; with /O2 the compiler would likely keep the operands in registers and eliminate the spill entirely. But considering that SIMD registers are a scarce resource (only eight ymm registers are available in 32-bit mode), if we invoke several SIMD instructions in the same program, such behavior (or optimization) might lead to better overall performance.

I am working on implementing the Linux/GCC version of this test…
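As a starting point, here is a sketch of what the GCC port might look like using extended asm. This is an assumption about the eventual version, not final code, and it uses the 128-bit addps so it compiles without -mavx:

	#include <immintrin.h>
	#include <cstdio>

	int main()
	{
		float a[4] = { 1.0f, 2.0f, 3.0f, 4.0f };
		float b[4] = { 0.5f, 0.5f, 0.5f, 0.5f };
		__m128 x = _mm_loadu_ps(a);
		__m128 y = _mm_loadu_ps(b);
		// GCC extended asm: "+x" ties x to an SSE register used as both
		// input and output, "x" places y in another SSE register.
		// AT&T operand order is source, destination, so this computes x += y.
		__asm__("addps %1, %0" : "+x"(x) : "x"(y));
		float c[4];
		_mm_storeu_ps(c, x);
		printf("%.1f %.1f %.1f %.1f\n", c[0], c[1], c[2], c[3]);
		return 0;
	}

Unlike MSVC's __asm, GCC's extended asm lets the compiler choose the registers via constraints, so the comparison on Linux may show a smaller gap between the intrinsic and the hand-written instruction.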