(Edit 8 Jan 2011: Update protocol test with Buffer.BlockCopy)
(Edit 11 Oct 2012: Please vote for the x86 cpblk deficiency on Microsoft Connect)
Following my last post about an interesting use of the "cpblk" IL instruction as an unmanaged memcpy replacement, I have to admit that I didn't take the time to verify carefully that its performance is actually better. Well, I was probably too optimistic... so I ran some tests, and the results are quite surprising.
The memcpy protocol test in C#
When dealing with 3D calculations, large texture buffers, audio synthesis or anything else that requires a memcpy and interaction with the unmanaged world, you will most likely end up calling an unmanaged function like this one:
// memcpy's count parameter is a size_t, which is pointer-sized: UIntPtr matches it on both x86 and x64.
[DllImport("msvcrt.dll", EntryPoint = "memcpy", CallingConvention = CallingConvention.Cdecl, SetLastError = false), SuppressUnmanagedCodeSecurity]
public static unsafe extern void* CopyMemory(void* dest, void* src, UIntPtr count);
In this test, I'm going to compare this implementation with 5 challengers:
- The cpblk IL instruction
- A handmade memcpy function
- Array.Copy, although it is not really comparable because it doesn't have the same scope: Array.Copy works on managed arrays only, while memcpy copies chunks of data between managed and unmanaged memory as well as between unmanaged buffers.
- Marshal.Copy, with the same caveat as Array.Copy: one side of the copy has to be a managed array.
- Buffer.BlockCopy, which works on managed arrays but takes its offsets and count in bytes.
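To make the difference in scope between the three managed APIs concrete, here is a minimal sketch that copies the same four ints with each of them. The class and method names (CopyScopes, Demo) are made up for illustration and are not part of the test project.

```csharp
using System;
using System.Runtime.InteropServices;

static class CopyScopes
{
    // Copies the same source array three ways and returns the three results.
    public static int[][] Demo()
    {
        int[] src = { 1, 2, 3, 4 };
        int[] viaArrayCopy = new int[src.Length];
        int[] viaBlockCopy = new int[src.Length];
        int[] viaMarshal = new int[src.Length];

        // Array.Copy: managed array to managed array, length counted in elements.
        Array.Copy(src, viaArrayCopy, src.Length);

        // Buffer.BlockCopy: managed array to managed array, offsets and count in bytes.
        Buffer.BlockCopy(src, 0, viaBlockCopy, 0, src.Length * sizeof(int));

        // Marshal.Copy: managed array <-> unmanaged memory, length counted in elements.
        IntPtr unmanaged = Marshal.AllocHGlobal(src.Length * sizeof(int));
        try
        {
            Marshal.Copy(src, 0, unmanaged, src.Length);        // managed -> unmanaged
            Marshal.Copy(unmanaged, viaMarshal, 0, src.Length); // unmanaged -> managed
        }
        finally
        {
            Marshal.FreeHGlobal(unmanaged);
        }

        return new[] { viaArrayCopy, viaBlockCopy, viaMarshal };
    }
}
```

None of these three can copy between two unmanaged pointers, which is why they are only partial substitutes for memcpy.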
The test performs a series of memcpy calls with different block sizes, from 4 bytes to 2 MB. The interesting part is to run this test in both x86 and x64 mode. Both tests run on the same machine (Intel Core i5 750, 2.66 GHz) under the same Windows 7 x64 OS. The CLR used is runtime v4.0.30319.
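The measuring loop itself is not reproduced in this post, so the following is only a rough sketch of how such a throughput test could look, not the code from the downloadable project. The names CopyBenchmark, MeasureThroughput and PrintTable are hypothetical, 1 MB is assumed to be 1024 * 1024 bytes, and Buffer.BlockCopy stands in for whichever copy routine is being measured.

```csharp
using System;
using System.Diagnostics;

static class CopyBenchmark
{
    // Returns the throughput in MB/s (assumption: 1 MB = 1024 * 1024 bytes)
    // obtained by copying blockSize bytes until about totalBytes have been moved.
    public static double MeasureThroughput(int blockSize, long totalBytes,
                                           Action<byte[], byte[], int> copy)
    {
        var src = new byte[blockSize];
        var dst = new byte[blockSize];
        long iterations = Math.Max(1L, totalBytes / blockSize);

        copy(src, dst, blockSize); // warm-up call so JIT time is not measured

        Stopwatch sw = Stopwatch.StartNew();
        for (long i = 0; i < iterations; i++)
            copy(src, dst, blockSize);
        sw.Stop();

        double seconds = Math.Max(sw.Elapsed.TotalSeconds, 1e-9); // avoid dividing by zero
        return iterations * (double)blockSize / (1024.0 * 1024.0) / seconds;
    }

    // Prints one line per block size, doubling from 4 bytes up to 2 MB.
    public static void PrintTable()
    {
        for (int size = 4; size <= 2 * 1024 * 1024; size <<= 1)
        {
            double mbPerSecond = MeasureThroughput(size, 64L * 1024 * 1024,
                (s, d, n) => Buffer.BlockCopy(s, 0, d, 0, n));
            Console.WriteLine("{0,8} | {1,8:F0}", size, mbPerSecond);
        }
    }
}
```

Because this sketch always reuses the same two buffers, it keeps small blocks in cache, and the delegate call adds a fixed cost per copy, so its numbers should not be expected to match the tables in this post.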
The naive handmade memcpy is nothing more than the code below (not the best implementation ever, but at least safe for any buffer size):
static unsafe void CustomCopy(void* dest, void* src, int count)
{
    // Copy as many 8-byte words as possible first.
    int block = count >> 3;
    long* pDest = (long*)dest;
    long* pSrc = (long*)src;
    for (int i = 0; i < block; i++)
    {
        *pDest = *pSrc; pDest++; pSrc++;
    }

    // Then copy the remaining 0 to 7 bytes one at a time.
    dest = pDest;
    src = pSrc;
    count = count - (block << 3);
    if (count > 0)
    {
        byte* pDestB = (byte*)dest;
        byte* pSrcB = (byte*)src;
        for (int i = 0; i < count; i++)
        {
            *pDestB = *pSrcB; pDestB++; pSrcB++;
        }
    }
}
Results
For the x86 architecture, results are expressed as throughput in MB/s (higher is better); block size is in bytes:
BlockSize | x86-cpblk | x86-memcpy | x86-CustomCopy | x86-Array.Copy | x86-Marshal.Copy | x86-BlockCopy |
4 | 146 | 458 | 470 | 85 | 81 | 150 |
8 | 294 | 843 | 1122 | 168 | 167 | 298 |
16 | 587 | 1628 | 1904 | 306 | 327 | 577 |
32 | 950 | 1876 | 3184 | 631 | 558 | 1079 |
64 | 1451 | 3316 | 4295 | 1205 | 1059 | 1981 |
128 | 2245 | 5161 | 4848 | 2176 | 1933 | 3386 |
256 | 4353 | 7032 | 5333 | 3699 | 3386 | 5333 |
512 | 8205 | 13617 | 5517 | 5663 | 6666 | 7441 |
1024 | 13617 | 20000 | 6666 | 7710 | 12075 | 9275 |
2048 | 18823 | 24615 | 7191 | 9142 | 16842 | 9552 |
4096 | 2922 | 7529 | 5663 | 10491 | 7032 | 11034 |
8192 | 2990 | 7804 | 5714 | 11228 | 7441 | 11636 |
16384 | 2857 | 7901 | 5614 | 9142 | 7619 | 10322 |
32768 | 2379 | 6736 | 5333 | 8101 | 6666 | 8205 |
65536 | 2379 | 6808 | 5470 | 8205 | 6808 | 8205 |
131072 | 2509 | 17777 | 5818 | 8101 | 17777 | 8101 |
262144 | 2500 | 11636 | 5423 | 7032 | 11428 | 7111 |
524288 | 2539 | 11428 | 5423 | 7111 | 11428 | 7111 |
1048576 | 2539 | 11428 | 5470 | 7032 | 11428 | 7111 |
2097152 | 2529 | 11428 | 5333 | 7032 | 11034 | 6881 |
For the x64 architecture:
BlockSize | x64-cpblk | x64-memcpy | x64-CustomCopy | x64-Array.Copy | x64-Marshal.Copy | x64-BlockCopy |
4 | 583 | 346 | 599 | 99 | 111 | 219 |
8 | 1509 | 770 | 1876 | 212 | 224 | 469 |
16 | 2689 | 1451 | 3316 | 417 | 422 | 903 |
32 | 4705 | 2666 | 5000 | 802 | 864 | 1739 |
64 | 8205 | 4812 | 7272 | 1568 | 1748 | 3350 |
128 | 13333 | 8101 | 9014 | 3004 | 3184 | 6037 |
256 | 18823 | 11428 | 10000 | 5470 | 5245 | 8648 |
512 | 22068 | 16000 | 10491 | 9014 | 9552 | 13913 |
1024 | 22857 | 19393 | 7356 | 13333 | 13617 | 16842 |
2048 | 23703 | 21333 | 7710 | 17297 | 17777 | 20645 |
4096 | 23703 | 22068 | 7804 | 19393 | 20000 | 21333 |
8192 | 23703 | 22857 | 7619 | 22068 | 22068 | 22857 |
16384 | 23703 | 22857 | 7804 | 17297 | 21333 | 18285 |
32768 | 16410 | 16410 | 7710 | 12800 | 16000 | 12800 |
65536 | 13061 | 14883 | 7710 | 13061 | 14545 | 13061 |
131072 | 14222 | 13913 | 7710 | 12800 | 13617 | 12800 |
262144 | 5000 | 5039 | 7032 | 7901 | 5000 | 7804 |
524288 | 5079 | 5000 | 7356 | 8205 | 5079 | 7804 |
1048576 | 4885 | 4885 | 7272 | 7441 | 4671 | 7529 |
2097152 | 5039 | 5079 | 7272 | 7619 | 5000 | 7710 |
Graph comparison only for cpblk, memcpy and CustomCopy:
Don't be alarmed by the performance drop at larger block sizes for most of the implementations: it is mostly due to cache misses and to copies spanning different 4 KB pages.
Conclusion
Don't trust your .NET VM: check your code on both x86 and x64. It's interesting to see how differently the same task is implemented inside the CLR (compare Marshal.Copy, Array.Copy and Buffer.BlockCopy).
The most surprising result here is the poor performance of the cpblk IL instruction in x86 mode compared to x64 mode, where the best performer is... cpblk. So to summarize:
- On x86, you are better off using a memcpy function.
- On x64, you are better off using cpblk, which is roughly twice as fast as memcpy on small blocks and roughly on par with it on large ones.
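C# has no syntax for cpblk, so for readers who want to try it on their own machine, one way to get at the instruction is to emit it at runtime. The sketch below does that with a DynamicMethod. It is an illustration only, not necessarily the technique used in the previous post or in the test project, and the names CpblkCopy, CopyDelegate and Copy are made up.

```csharp
using System;
using System.Reflection.Emit;

static class CpblkCopy
{
    public delegate void CopyDelegate(IntPtr dest, IntPtr src, uint count);

    // Built once; each call then runs the emitted cpblk instruction.
    public static readonly CopyDelegate Copy = Build();

    static CopyDelegate Build()
    {
        var method = new DynamicMethod(
            "CpblkCopy",
            typeof(void),
            new[] { typeof(IntPtr), typeof(IntPtr), typeof(uint) },
            typeof(CpblkCopy).Module,
            true);

        ILGenerator il = method.GetILGenerator();
        il.Emit(OpCodes.Ldarg_0); // destination address
        il.Emit(OpCodes.Ldarg_1); // source address
        il.Emit(OpCodes.Ldarg_2); // number of bytes to copy
        il.Emit(OpCodes.Cpblk);   // copies the block, pops all three operands
        il.Emit(OpCodes.Ret);

        return (CopyDelegate)method.CreateDelegate(typeof(CopyDelegate));
    }
}
```

A call then looks like CpblkCopy.Copy(destPtr, srcPtr, byteCount), with both pointers referring to pinned or unmanaged memory. Going through a delegate adds a fixed cost to every call, so this form is likely to understate cpblk on the smallest block sizes.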
You may wonder why the x86 version is so unoptimized. This is because the x86 CLR generates an x86 instruction that performs the copy on a PER BYTE basis (rep movsb for x86 folks), even if you are moving a large memory chunk of 1 MB! In comparison, memcpy as implemented in MSVCRT is able to use SSE instructions that batch the copy through large 128-bit registers (with an optimized path that avoids polluting the CPU cache). The x64 CLR seems to use a correctly implemented memcpy, but the x86 CLR version is just poorly implemented. Please vote for this bug described on Microsoft Connect.
One important consequence of this arises when you are developing in C++/CLI and calling memcpy from a managed function: it will end up as a cpblk copy, which is almost the worst case on x86 platforms, so be careful if you are in this situation. To avoid this, you have to force the compiler to use the function from MSVCRTxx.dll.
Of course, the msvcrt memcpy is platform dependent, so it will not be an option for everyone...
Also, I didn't run this test on the CLR 2 runtime... we could be surprised there as well. One more thing I should try is a comparison against a pure C++ memcpy using the optimized SSE2 version shipped with later versions of msvcrt.
You can download the VS2010 project from here.