A look at SSE code generation

August 11th, 2011

Recently I needed to code a YUY2 to 8-bit grayscale routine in SSE2. The fragment from the first version I came up with is below:

while (num_pixels >= 32)
{
    // y u y v
    const __m128i y_0 = _mm_and_si128(y_component_mask, _mm_loadu_si128(src_pixel++));
    const __m128i y_1 = _mm_and_si128(y_component_mask, _mm_loadu_si128(src_pixel++));
    const __m128i y_2 = _mm_and_si128(y_component_mask, _mm_loadu_si128(src_pixel++));
    const __m128i y_3 = _mm_and_si128(y_component_mask, _mm_loadu_si128(src_pixel++));
    _mm_store_si128(dst_pixel++, _mm_packus_epi16(y_0, y_1));
    _mm_store_si128(dst_pixel++, _mm_packus_epi16(y_2, y_3));

    num_pixels -= 32;

}

This seemed simple enough and I thought VC8 would generate efficient code. A quick look at the generated assembly shows:

$LL6@yuy2_to_gr:
	mov	edx, eax
	movdqu	xmm4, XMMWORD PTR [edx]
	add	eax, 16					; 00000010H
	mov	esi, eax
	movdqu	xmm5, XMMWORD PTR [esi]
	add	eax, 16					; 00000010H
	mov	edi, eax
	movdqu	xmm1, XMMWORD PTR [edi]
	add	eax, 16					; 00000010H
	mov	edi, eax
	movdqu	xmm2, XMMWORD PTR [edi]
	movdqa	xmm3, xmm0
	pand	xmm3, xmm4
	mov	edi, ecx
	movdqa	xmm4, xmm0
	pand	xmm4, xmm5
	packuswb xmm3, xmm4
	movdqa	XMMWORD PTR [edi], xmm3
	add	ecx, 16					; 00000010H
	movdqa	xmm3, xmm0
	pand	xmm3, xmm1
	mov	edx, ecx
	movdqa	xmm1, xmm0
	add	eax, 16					; 00000010H
	add	ecx, 16					; 00000010H
	pand	xmm1, xmm2
	sub	ebx, 32					; 00000020H
	sub	DWORD PTR tv149[esp+48], 1
	packuswb xmm3, xmm1
	movdqa	XMMWORD PTR [edx], xmm3
	jne	SHORT $LL6@yuy2_to_gr

Now this code doesn’t look terribly nice. I would have expected the use of the indexed addressing modes.
After changing the source to the following which is semantically the same:

while (num_pixels >= 32)
{
    // y u y v
    const __m128i y_0 = _mm_and_si128(y_component_mask, _mm_loadu_si128(src_pixel));
    const __m128i y_1 = _mm_and_si128(y_component_mask, _mm_loadu_si128(src_pixel+1));
    const __m128i y_2 = _mm_and_si128(y_component_mask, _mm_loadu_si128(src_pixel+2));
    const __m128i y_3 = _mm_and_si128(y_component_mask, _mm_loadu_si128(src_pixel+3));
    _mm_store_si128(dst_pixel, _mm_packus_epi16(y_0, y_1));
    _mm_store_si128(dst_pixel+1, _mm_packus_epi16(y_2, y_3));
    src_pixel += 4;
    dst_pixel += 2;

    num_pixels -= 32;

}

we get the following code which looks a lot nicer:

$LL6@yuy2_to_gr:
	movdqu	xmm4, XMMWORD PTR [eax]
	movdqu	xmm5, XMMWORD PTR [eax+16]
	movdqu	xmm1, XMMWORD PTR [eax+32]
	movdqu	xmm2, XMMWORD PTR [eax+48]
	movdqa	xmm3, xmm0
	pand	xmm3, xmm4
	movdqa	xmm4, xmm0
	pand	xmm4, xmm5
	packuswb xmm3, xmm4
	movdqa	XMMWORD PTR [ecx], xmm3
	movdqa	xmm3, xmm0
	pand	xmm3, xmm1
	movdqa	xmm1, xmm0
	pand	xmm1, xmm2
	packuswb xmm3, xmm1
	movdqa	XMMWORD PTR [ecx+16], xmm3
	add	eax, 64					; 00000040H
	add	ecx, 32					; 00000020H
	sub	edx, 32					; 00000020H
	sub	esi, 1
	jne	SHORT $LL6@yuy2_to_gr

The company I work for also compiles with gcc-4.4 for linux so a quick check of what is generated with -O3 for the first version of the code:

.L4:
	movdqu	(%eax), %xmm3
	movdqu	16(%eax), %xmm4
	movdqu	32(%eax), %xmm1
	subl	$32, %ecx
	addl	$64, %eax
	pand	%xmm0, %xmm3
	movdqu	-16(%eax), %xmm2
	pand	%xmm0, %xmm4
	packuswb	%xmm4, %xmm3
	movdqa	%xmm3, (%edx)
	addl	$32, %edx
	cmpl	$31, %ecx
	pand	%xmm0, %xmm1
	pand	%xmm0, %xmm2
	packuswb	%xmm2, %xmm1
	movdqa	%xmm1, -16(%edx)
	ja	.L4

and the second version:

.L4:
	subl	$32, %ecx
	movdqu	(%eax), %xmm3
	movdqu	16(%eax), %xmm4
	movdqu	32(%eax), %xmm1
	movdqu	48(%eax), %xmm2
	pand	%xmm0, %xmm3
	pand	%xmm0, %xmm4
	pand	%xmm0, %xmm1
	pand	%xmm0, %xmm2
	packuswb	%xmm4, %xmm3
	packuswb	%xmm2, %xmm1
	movdqa	%xmm3, (%edx)
	movdqa	%xmm1, 16(%edx)
	addl	$64, %eax
	addl	$32, %edx
	cmpl	$31, %ecx
	ja	.L4

gcc generates the same code, in a different order, for both versions. From now on I will write my code using ptr+n to help the compiler to pick the indexed addressing mode instructions.

Finding heap growth in a C++ app on Windows

March 10th, 2011

For the last week I have been tasked with solving a memory growth problem in one of our products. The company I work for, not Initek, provides a computer vision API and our demo application shows off our tech. The problem is it slowly leaks memory.

To be honest, I have never really had to find this kind of problem before. My first approach was to create an allocation hook function using _CrtSetAllocHook and record the amount of memory freed from a given stack backtrace.

So after redirecting all our IPP memory allocations to an aligned debug malloc and grabbing some data, guess what.
The amount of memory missing could not be accounted for in the debug C run-time heap.

The next step was using UMDH from Debugging Tools for Windows. This is a great utility that I had not seen before. I have read the extra tools section of the Debugging Tools for Windows doco but somehow I missed this. Anyway, UMDH dumps a snapshot of the user mode heap with backtraces and allows a comparison between 2 snapshots.

So to prepare we enable user mode stack tracing for a particular executable with:
gflags -i demo.exe +ust

Now, all heap allocations have there backtrace stored using RtlCaptureStackBackTrace.

Unfortunately, reading the log comparison all of my memory leaks were attribute to a callstack
that looked like:

+  27cda4 ( 46fc65 - 1f2ec1)   1c67 allocs      BackTrace63BBA9C
+     d00 (  1c67 -   f67)      BackTrace63BBA9C        allocations
ntdll!RtlAllocateHeap+00000274

Looking at the disassembly for RtlCaptureStackBackTrace I noticed the implementation is based on walking non-FPO stack frames.
A typical stdcall function prologue looks like this:

push ebp,
mov ebp, esp
sub esp, <local variable space>
...
pop ebp
retn <num arguments>

The code expects to see pairs of dwords
<return_address>
<saved_ebp_of_previous_frame>

This code obviously breaks if a function in the backtrace use FPO and trashes the ebp register.

Further investigation in the code shows a check of the flag at ntdll!RtlpFuzzyStackTracesEnabled. Setting this to 1 in windbg using ed ntdll!RtlpFuzzyStackTracesEnabled 1 should enable the backtrace.

I tried this at home on Windows XP using a test app by starting the app in windbg, setting the flag at the initial loader breakpoint and running and all worked well. Unfortunately on Windows 7 this did not work as the debugger stops at a breakpoint in a function in ntdll.

So giving up too easily, I mean it worked on Windows XP, I set a conditional breakpoint and sampled the callstacks and memory size using windbg. After a lot of tedious looking at callstacks I actually found that only one of our OpenMP parallel regions was leaking 0×310 bytes every time it was executed. The actual allocation was inside the Intel OpenMP runtime. I did a bit of digging but could not see any real reason why that particular region was different from any other one we had in our code.

The next step was to slowly gut the contents of the OpenMP parallel region to determine why it behaved differently to the other regions. I ended up with the following code and still had the memory leak!

push_and_set_num_threads(2);
#pragma omp parallel
{
LOG_PRINT("hello");
}
pop_num_threads();

Now what this code does is save the previous max threads, sets a new maximum (equivalent to the num_threads clause) and restores the previous value. Now on my computer this is effectively alternating between setting the number of threads to 4 and 2.
After removing the num_threads code the memory leak suddenly disappeared!

I created the following simple app and tested against Intel libiomp5md.dll version 5.0.2010.924 and it leaks, while version 5.0.2007.1022 doesn’t.

int main(int argc, char*argv[])
{
volatile int count = 0;

while (1)
{

#pragma omp parallel num_threads(2)
{
++count;
}
#pragma omp parallel num_threads(4)
{
++count;
}
}
return 0;
}

VS2008 - Cannot find one or more components.

December 22nd, 2010

After spending 4 hours rebooting,  repairing, uninstalling and then reinstalling VS2008 and receiving the following dialog every time VS is launched I finally figured out what was going wrong and thought I would share.

Evil messagebox, I hate you:

VS2008 - Cannot find one or more components dialog.

Earlier today I looked at a minidump for a customer. I fired up VS2008, loaded the minidump and began debugging. All was working fine. The other VS2008 app I was using continued to work fine. Now, Visual Studio has been spraying pdb symbol files all over our debugging working directories for a while now and we just periodically remove them whenever we feel the need.

The symbol server cache path for some reason is ignored sometimes, haven’t really looked into it.

Symbol Options DialogAnyway, running the debugger over the minidump with the working directory being the same directory as devenv.exe allowed the symbol server to download the DLL’s used by the customers machine (as referenced by the minidump) and store them in subdirectories under devenv.exe. The next time I launched devenv.exe it was picking these DLL’s which were mismatched for my computer causing the evil dialog above.

Deleting the symbol server generated directories under C:\Program Files (x86)\Microsoft Visual Studio 9.0\Common7\IDE fixed the problem.

The key to finding this was to enable FLG_SHOW_LDR_SNAPS for the devenv.exe process, run devenv.exe from windbg and go through the mass of debug output  that appears in the debuggers output window.

To set the flag for devenv.exe add the key HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows NT\CurrentVersion\Image File Execution Options\devenv.exe with the value 2.

Below is a snippet of the output that points to what’s going on. The key here is in knowing uxtheme.dll is a windows system library and should be loaded from somwhere under c:\windows.

16f0:17bc @ 01261439 - LdrpHandleOneOldFormatImportDescriptor - INFO: DLL "c:\Program Files (x86)\Microsoft Visual Studio 9.0\Common7\IDE\msenv.dll" imports "UxTheme.dll"
16f0:17bc @ 01261439 - LdrpLoadImportModule - ENTER: DLL name: UxTheme.dll DLL path: C:\Program Files (x86)\Microsoft Visual Studio 9.0\Common7\IDE;C:\Windows\system32;C:\Windows\system;C:\Windows;.;C:\Program Files\Debugging Tools for Windows (x64)\winext\arcade;C:\Program Files (x86)\Windows Resource Kits\Tools\;c:\python26;C:\Perl\bin\;C:\Windows\system32;C:\Windows;C:\Windows\System32\Wbem;C:\Windows\System32\WindowsPowerShell\v1.0\;C:\Program Files (x86)\NTRU Cryptosystems\NTRU TCG Software Stack\bin
16f0:17bc @ 01261439 - LdrpFindOrMapDll - ENTER: DLL name: UxTheme.dll DLL path: C:\Program Files (x86)\Microsoft Visual Studio 9.0\Common7\IDE;C:\Windows\system32;C:\Windows\system;C:\Windows;.;C:\Program Files\Debugging Tools for Windows (x64)\winext\arcade;C:\Program Files (x86)\Windows Resource Kits\Tools\;c:\python26;C:\Perl\bin\;C:\Windows\system32;C:\Windows;C:\Windows\System32\Wbem;C:\Windows\System32\WindowsPowerShell\v1.0\;C:\Program Files (x86)\NTRU Cryptosystems\NTRU TCG Software Stack\bin\;C:
16f0:17bc @ 01261439 - LdrpFindKnownDll - ENTER: DLL name: UxTheme.dll
16f0:17bc @ 01261439 - LdrpFindKnownDll - RETURN: Status: 0xc0000135
16f0:17bc @ 01261439 - LdrpSearchPath - ENTER: DLL name: UxTheme.dll DLL path: C:\Program Files (x86)\Microsoft Visual Studio 9.0\Common7\IDE;C:\Windows\system32;C:\Windows\system;C:\Windows;.;C:\Program Files\Debugging Tools for Windows (x64)\winext\arcade;C:\Program Files (x86)\Windows Resource Kits\Tools\;c:\python26;C:\Perl\bin\;C:\Windows\system32;C:\Windows;C:\Windows\System32\Wbem;C:\Windows\System32\WindowsPowerShell\v1.0\;C:\Program Files (x86)\NTRU Cryptosystems\NTRU TCG Software Stack\bin\;C:\P
16f0:17bc @ 01261439 - LdrpResolveFileName - ENTER: DLL name: C:\Program Files (x86)\Microsoft Visual Studio 9.0\Common7\IDE\UxTheme.dll
16f0:17bc @ 01261439 - LdrpResolveFileName - RETURN: Status: 0x00000000
16f0:17bc @ 01261439 - LdrpResolveDllName - ENTER: DLL name: C:\Program Files (x86)\Microsoft Visual Studio 9.0\Common7\IDE\UxTheme.dll
16f0:17bc @ 01261439 - LdrpResolveDllName - RETURN: Status: 0x00000000
16f0:17bc @ 01261439 - LdrpSearchPath - RETURN: Status: 0x00000000
16f0:17bc @ 01261439 - LdrpFindOrMapDll - RETURN: Status: 0xc00000ba
16f0:17bc @ 01261439 - LdrpLoadImportModule - ERROR: Loading DLL UxTheme.dll from path C:\Program Files (x86)\Microsoft Visual Studio 9.0\Common7\IDE;C:\Windows\system32;C:\Windows\system;C:\Windows;.;C:\Program Files\Debugging Tools for Windows (x64)\winext\arcade;C:\Program Files (x86)\Windows Resource Kits\Tools\;c:\python26;C:\Perl\bin\;C:\Windows\system32;C:\Windows;C:\Windows\System32\Wbem;C:\Windows\System32\WindowsPowerShell\v1.0\;C:\Program Files (x86)\NTRU Cryptosystems\NTRU TCG Software Stack\b
16f0:17bc @ 01261439 - LdrpLoadImportModule - RETURN: Status: 0xc00000ba
16f0:17bc @ 01261439 - LdrpHandleOneOldFormatImportDescriptor - ERROR: Loading "?????l?????l????????????l??????l?????l??????????????????L??????l?????????????????n?n?n?n?n?n?n?n?n?n?n?n?n?n?n?n?n?n?n?n?n?n?n?n?n?n?n?n?n?n?n?n?n?n?n?n?n?n?n?n?n?n?n?n?n?n?n?n?n" from the import table of DLL "c:\Program Files (x86)\Microsoft Visual Studio 9.0\Common7\IDE\msenv.dll" failed with status 0xc00000ba

The other thing I should have noticed was in the modules window. The path for uxtheme.dll was a relative path.

Modules Window

The final solution I found was to change the symbol file locations to explicitly set the downstream stores to end at my local symbol cache directory for each pdb location.

c:\symbols\cache*s:\clib\symbols\internal
c:\symbols\cache*s:\clib\symbols\microsoft*http://msdl.microsoft.com/download/symbols

References:

http://support.microsoft.com/kb/147314

http://www.microsoft.com/msj/0999/hood/hood0999.aspx

Using SymSrv