A look at SSE code generation
August 11th, 2011

Recently I needed to code a YUY2 to 8-bit grayscale routine in SSE2. The fragment from the first version I came up with is below:
while (num_pixels >= 32)
{
    // y u y v
    const __m128i y_0 = _mm_and_si128(y_component_mask, _mm_loadu_si128(src_pixel++));
    const __m128i y_1 = _mm_and_si128(y_component_mask, _mm_loadu_si128(src_pixel++));
    const __m128i y_2 = _mm_and_si128(y_component_mask, _mm_loadu_si128(src_pixel++));
    const __m128i y_3 = _mm_and_si128(y_component_mask, _mm_loadu_si128(src_pixel++));
    _mm_store_si128(dst_pixel++, _mm_packus_epi16(y_0, y_1));
    _mm_store_si128(dst_pixel++, _mm_packus_epi16(y_2, y_3));
    num_pixels -= 32;
}
This seemed simple enough and I thought VC8 would generate efficient code. A quick look at the generated assembly shows:
$LL6@yuy2_to_gr:
    mov edx, eax
    movdqu xmm4, XMMWORD PTR [edx]
    add eax, 16 ; 00000010H
    mov esi, eax
    movdqu xmm5, XMMWORD PTR [esi]
    add eax, 16 ; 00000010H
    mov edi, eax
    movdqu xmm1, XMMWORD PTR [edi]
    add eax, 16 ; 00000010H
    mov edi, eax
    movdqu xmm2, XMMWORD PTR [edi]
    movdqa xmm3, xmm0
    pand xmm3, xmm4
    mov edi, ecx
    movdqa xmm4, xmm0
    pand xmm4, xmm5
    packuswb xmm3, xmm4
    movdqa XMMWORD PTR [edi], xmm3
    add ecx, 16 ; 00000010H
    movdqa xmm3, xmm0
    pand xmm3, xmm1
    mov edx, ecx
    movdqa xmm1, xmm0
    add eax, 16 ; 00000010H
    add ecx, 16 ; 00000010H
    pand xmm1, xmm2
    sub ebx, 32 ; 00000020H
    sub DWORD PTR tv149[esp+48], 1
    packuswb xmm3, xmm1
    movdqa XMMWORD PTR [edx], xmm3
    jne SHORT $LL6@yuy2_to_gr
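Now this code doesn’t look terribly nice: every load first copies the pointer into a scratch register and then bumps it by 16 with a separate add. I would have expected the compiler to use the indexed addressing modes ([eax+16], [eax+32] and so on) with a single add per pointer at the end of the loop.
After changing the source to the following, which is semantically the same: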
while (num_pixels >= 32)
{
    // y u y v
    const __m128i y_0 = _mm_and_si128(y_component_mask, _mm_loadu_si128(src_pixel));
    const __m128i y_1 = _mm_and_si128(y_component_mask, _mm_loadu_si128(src_pixel+1));
    const __m128i y_2 = _mm_and_si128(y_component_mask, _mm_loadu_si128(src_pixel+2));
    const __m128i y_3 = _mm_and_si128(y_component_mask, _mm_loadu_si128(src_pixel+3));
    _mm_store_si128(dst_pixel, _mm_packus_epi16(y_0, y_1));
    _mm_store_si128(dst_pixel+1, _mm_packus_epi16(y_2, y_3));
    src_pixel += 4;
    dst_pixel += 2;
    num_pixels -= 32;
}
we get the following code which looks a lot nicer:
$LL6@yuy2_to_gr:
    movdqu xmm4, XMMWORD PTR [eax]
    movdqu xmm5, XMMWORD PTR [eax+16]
    movdqu xmm1, XMMWORD PTR [eax+32]
    movdqu xmm2, XMMWORD PTR [eax+48]
    movdqa xmm3, xmm0
    pand xmm3, xmm4
    movdqa xmm4, xmm0
    pand xmm4, xmm5
    packuswb xmm3, xmm4
    movdqa XMMWORD PTR [ecx], xmm3
    movdqa xmm3, xmm0
    pand xmm3, xmm1
    movdqa xmm1, xmm0
    pand xmm1, xmm2
    packuswb xmm3, xmm1
    movdqa XMMWORD PTR [ecx+16], xmm3
    add eax, 64 ; 00000040H
    add ecx, 32 ; 00000020H
    sub edx, 32 ; 00000020H
    sub esi, 1
    jne SHORT $LL6@yuy2_to_gr
The company I work for also compiles with gcc-4.4 for Linux, so I did a quick check of what it generates with -O3. For the first version of the code:
.L4:
    movdqu (%eax), %xmm3
    movdqu 16(%eax), %xmm4
    movdqu 32(%eax), %xmm1
    subl $32, %ecx
    addl $64, %eax
    pand %xmm0, %xmm3
    movdqu -16(%eax), %xmm2
    pand %xmm0, %xmm4
    packuswb %xmm4, %xmm3
    movdqa %xmm3, (%edx)
    addl $32, %edx
    cmpl $31, %ecx
    pand %xmm0, %xmm1
    pand %xmm0, %xmm2
    packuswb %xmm2, %xmm1
    movdqa %xmm1, -16(%edx)
    ja .L4
and the second version:
.L4:
    subl $32, %ecx
    movdqu (%eax), %xmm3
    movdqu 16(%eax), %xmm4
    movdqu 32(%eax), %xmm1
    movdqu 48(%eax), %xmm2
    pand %xmm0, %xmm3
    pand %xmm0, %xmm4
    pand %xmm0, %xmm1
    pand %xmm0, %xmm2
    packuswb %xmm4, %xmm3
    packuswb %xmm2, %xmm1
    movdqa %xmm3, (%edx)
    movdqa %xmm1, 16(%edx)
    addl $64, %eax
    addl $32, %edx
    cmpl $31, %ecx
    ja .L4
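gcc generates essentially the same code for both versions, just scheduled in a different order: either way it folds the pointer increments into displacements and does a single add per pointer. From now on I will write my code using ptr+n to help the compiler pick the indexed addressing mode instructions.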
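
For anyone who wants to try this on their own compiler, here is a minimal, self-contained sketch of the ptr+n version. The fragments above don't show the surrounding function, so the names yuy2_to_gray_sse2, yuy2_to_gray_scalar and yuy2_self_check are made up for illustration. The value of y_component_mask (0x00FF in every 16-bit lane) and the scalar tail loop are my assumptions too. As in the fragments, the destination is assumed to be 16-byte aligned because of _mm_store_si128.

```c
#include <assert.h>
#include <emmintrin.h> /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Scalar reference: YUY2 is laid out as Y0 U Y1 V, so every even byte is luma. */
static void yuy2_to_gray_scalar(const uint8_t *src, uint8_t *dst, size_t num_pixels)
{
    size_t i;
    for (i = 0; i < num_pixels; ++i)
        dst[i] = src[2 * i];
}

/* SSE2 version written in the ptr+n style (hypothetical wrapper).
   dst must be 16-byte aligned; src may be unaligned. */
static void yuy2_to_gray_sse2(const uint8_t *src, uint8_t *dst, size_t num_pixels)
{
    /* Assumed mask: keep the low byte (Y) of each 16-bit lane, clear U/V. */
    const __m128i y_component_mask = _mm_set1_epi16(0x00FF);
    const __m128i *src_pixel = (const __m128i *)src;
    __m128i *dst_pixel = (__m128i *)dst;

    assert(((uintptr_t)dst & 15) == 0);

    while (num_pixels >= 32)
    {
        /* 4 loads of 16 bytes = 64 bytes of YUY2 = 32 pixels */
        const __m128i y_0 = _mm_and_si128(y_component_mask, _mm_loadu_si128(src_pixel));
        const __m128i y_1 = _mm_and_si128(y_component_mask, _mm_loadu_si128(src_pixel + 1));
        const __m128i y_2 = _mm_and_si128(y_component_mask, _mm_loadu_si128(src_pixel + 2));
        const __m128i y_3 = _mm_and_si128(y_component_mask, _mm_loadu_si128(src_pixel + 3));
        /* Masked lanes are in 0..255, so the saturating pack leaves them unchanged. */
        _mm_store_si128(dst_pixel, _mm_packus_epi16(y_0, y_1));
        _mm_store_si128(dst_pixel + 1, _mm_packus_epi16(y_2, y_3));
        src_pixel += 4;
        dst_pixel += 2;
        num_pixels -= 32;
    }

    /* Fewer than 32 pixels left: finish with the scalar loop. */
    yuy2_to_gray_scalar((const uint8_t *)src_pixel, (uint8_t *)dst_pixel, num_pixels);
}

/* Compares the SSE2 path against the scalar reference; returns 1 on a match. */
static int yuy2_self_check(void)
{
    enum { NUM_PIXELS = 70 }; /* two full SSE2 iterations plus a 6-pixel tail */
    uint8_t src[2 * NUM_PIXELS];
    uint8_t expected[NUM_PIXELS];
    union { __m128i align[5]; uint8_t bytes[80]; } out; /* 16-byte aligned storage */
    size_t i;

    for (i = 0; i < sizeof src; ++i)
        src[i] = (uint8_t)(i * 7 + 3);

    yuy2_to_gray_scalar(src, expected, NUM_PIXELS);
    yuy2_to_gray_sse2(src, out.bytes, NUM_PIXELS);
    return memcmp(expected, out.bytes, NUM_PIXELS) == 0;
}
```

Calling yuy2_self_check() from a small main is enough to confirm the vector path matches the scalar one. Swapping the ptr+n expressions for ptr++ and comparing the assembly output (/FA with Visual C++, -S with gcc) should show whether your compiler behaves like the listings above.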


