(Edit 11 Oct 2012: Please vote for the x86 cpblk deficiency on Microsoft Connect)
Following my last post about an interesting use of the "cpblk" IL instruction as an unmanaged memcpy replacement, I have to admit that I didn't take the time to carefully verify that performance is actually better. Well, I was probably too optimistic... so I have made some tests and the results are very surprising and not expected to be like these...
The memcpy protocol test in C#
When dealing with 3D calculations, large buffers of textures, audio synthesizing or whatever requires a memcpy and interaction with unmanaged world, you will most notably end up with a call to an unmanaged functions like this one:
[DllImport("msvcrt.dll", EntryPoint = "memcpy", CallingConvention = CallingConvention.Cdecl, SetLastError = false), SuppressUnmanagedCodeSecurity] public static unsafe extern void* CopyMemory(void* dest, void* src, ulong count);
In this test, I'm going to compare this implementation with 4 challengers :
- The cpblk IL instruction
- A handmade memcpy function
- Array.Copy, although It's not relevant because they don't have the same scope. Array.Copy is managed only for arrays only while memcpy is used to copy portion of datas between managed-unmanaged as well as unmanaged-unmanaged memory.
- Marshal.Copy, same as Array.Copy
- Buffer.BlockCopy, which is working on managed array but is working with a byte size block copy.
The naive handmade memcpy is nothing more than this code (not to be the best implem ever but at least safe for any kind of buffer size):
static unsafe void CustomCopy(void * dest, void* src, int count) { int block; block = count >> 3; long* pDest = (long*)dest; long* pSrc = (long*)src; for (int i = 0; i < block; i++) { *pDest = *pSrc; pDest++; pSrc++; } dest = pDest; src = pSrc; count = count - (block << 3); if (count > 0) { byte* pDestB = (byte*) dest; byte* pSrcB = (byte*) src; for (int i = 0; i < count; i++) { *pDestB = *pSrcB; pDestB++; pSrcB++; } } }
Results
For the x86 architecture, results are expressed as a throughput in Mo/s - higher is better, blocksize is in bytes :BlockSize | x86-cpblk | x86-memcpy | x86-CustomCopy | x86-Array.Copy | x86-Marshal.Copy | x86-BlockCopy |
4 | 146 | 458 | 470 | 85 | 81 | 150 |
8 | 294 | 843 | 1122 | 168 | 167 | 298 |
16 | 587 | 1628 | 1904 | 306 | 327 | 577 |
32 | 950 | 1876 | 3184 | 631 | 558 | 1079 |
64 | 1451 | 3316 | 4295 | 1205 | 1059 | 1981 |
128 | 2245 | 5161 | 4848 | 2176 | 1933 | 3386 |
256 | 4353 | 7032 | 5333 | 3699 | 3386 | 5333 |
512 | 8205 | 13617 | 5517 | 5663 | 6666 | 7441 |
1024 | 13617 | 20000 | 6666 | 7710 | 12075 | 9275 |
2048 | 18823 | 24615 | 7191 | 9142 | 16842 | 9552 |
4096 | 2922 | 7529 | 5663 | 10491 | 7032 | 11034 |
8192 | 2990 | 7804 | 5714 | 11228 | 7441 | 11636 |
16384 | 2857 | 7901 | 5614 | 9142 | 7619 | 10322 |
32768 | 2379 | 6736 | 5333 | 8101 | 6666 | 8205 |
65536 | 2379 | 6808 | 5470 | 8205 | 6808 | 8205 |
131072 | 2509 | 17777 | 5818 | 8101 | 17777 | 8101 |
262144 | 2500 | 11636 | 5423 | 7032 | 11428 | 7111 |
524288 | 2539 | 11428 | 5423 | 7111 | 11428 | 7111 |
1048576 | 2539 | 11428 | 5470 | 7032 | 11428 | 7111 |
2097152 | 2529 | 11428 | 5333 | 7032 | 11034 | 6881 |
For the x64 architecture:
BlockSize2 | x64-cpblk | x64-memcpy | x64-CustomCopy | x64-Array.Copy | x64-Marshal.Copy | x64-BlockCopy |
4 | 583 | 346 | 599 | 99 | 111 | 219 |
8 | 1509 | 770 | 1876 | 212 | 224 | 469 |
16 | 2689 | 1451 | 3316 | 417 | 422 | 903 |
32 | 4705 | 2666 | 5000 | 802 | 864 | 1739 |
64 | 8205 | 4812 | 7272 | 1568 | 1748 | 3350 |
128 | 13333 | 8101 | 9014 | 3004 | 3184 | 6037 |
256 | 18823 | 11428 | 10000 | 5470 | 5245 | 8648 |
512 | 22068 | 16000 | 10491 | 9014 | 9552 | 13913 |
1024 | 22857 | 19393 | 7356 | 13333 | 13617 | 16842 |
2048 | 23703 | 21333 | 7710 | 17297 | 17777 | 20645 |
4096 | 23703 | 22068 | 7804 | 19393 | 20000 | 21333 |
8192 | 23703 | 22857 | 7619 | 22068 | 22068 | 22857 |
16384 | 23703 | 22857 | 7804 | 17297 | 21333 | 18285 |
32768 | 16410 | 16410 | 7710 | 12800 | 16000 | 12800 |
65536 | 13061 | 14883 | 7710 | 13061 | 14545 | 13061 |
131072 | 14222 | 13913 | 7710 | 12800 | 13617 | 12800 |
262144 | 5000 | 5039 | 7032 | 7901 | 5000 | 7804 |
524288 | 5079 | 5000 | 7356 | 8205 | 5079 | 7804 |
1048576 | 4885 | 4885 | 7272 | 7441 | 4671 | 7529 |
2097152 | 5039 | 5079 | 7272 | 7619 | 5000 | 7710 |
Graph comparison only for cpblk, memcpy and CustomCopy:
Don't be afraid about the performance drop for most of the implem... It's mostly due to cache missing and copying around different 4k pages.
Conclusion
Don't trust your .NET VM, check your code on both x86 and x64. It's interesting to see how much the same task is implemented differently inside the CLR (see Marshal.Copy vs Array.Copy vs Buffer.Copy)The most surprising result here is the poor performance of cpblk IL instruction in x86 mode compare to the best one in x64 which is... cpblk. So to summarize:
- On x86, you should better use a memcpy function
- On x64, you should better use a cpblk function, which is performing better from small size (twice faster than memcpy) to large size.
One important consequence of this is when you are developping a C++/CLI and calling a memcpy from a managed function... It will end up in a cpblk copy functions... which is almost the worst case on x86 platforms... so be careful if you are dealing with this kind of issue. To avoir this, you have to force the compiler to use the function from the MSVCRTxx.dll.
Of course, the memcpy is platform dependent, which would not be an option for all...
Also, I didn't perform this test on a CLR 2 runtime... we could be surprised as well... There is also one thing that I should try against a pure C++ memcpy using the optimized SSE2 version that is shipped with later msvcrt.
You can download the VS2010 project from here
Hello Alexandre,
ReplyDeleteYour analysis is fascinating. Did you consider Buffer.BlockCopy? It is limited to managed arrays, but it is supposed to be more optimized for that purpose:
http://msdn.microsoft.com/en-us/library/system.buffer.blockcopy.aspx
Regards,
Frank Hileman
Thanks Frank for pointing me to this missing function... that I didn't know, cool!
ReplyDeleteI have slightly updated the comparison with this function. Still, It shows that cpblk is faster on x64 and memcpy on x86.
It still amazing to realize how much there are different implementation of memcpy into the CLR, moreover when you look at Array.Copy, Marshal.Copy, Buffer.BlockCopy that are not sharing the same implem.
Thanks for the interesting article. I downloaded the project and am puzzled by this line in DynamicInterop.cs:
ReplyDeleteif (IntPtr.Size==8) gen.Emit(Opcodes.Unaligned, 1);
The VS2010 documentation on the CpBlk opcode is that you need the unaligned prefix only if either of the operand addresses is not aligned the machine's native integer size (presumably 8 or 4 for x64 and x86 respectively) and is not related to the pointer size itself. Could this be the source of the weird performance result?
@unknown, I remember that I used the unaligned opcode because this was the way a C++/CLi memcpy was generated. I agree that it is not perfect, but I don't think that it will influence the results. The issue with x86 cpblk is that it is not optimized in the CLR, as they are using the simple x86 instruction ”rep movsb” (which is extremely not optimized!) unlike the x64 version that is using SSE2 instructions.
ReplyDeleteThis comment has been removed by the author.
ReplyDeletei got here by accident, so accident or not (: i have done that :
ReplyDeletedid not know as to what to be replaced instead here
if (IntPtr.Size==8) gen.Emit(Opcodes.Unaligned, 1);
so i tried onece without that line(commented it)
though no condition at all is bad for the country (:
and i've implemented system.buffer.blockcopy.aspx to extent ...(: (third from the right)
BlcSz cpblk memcpy ArrCpy BufBlcp CustCpy MrshlCpy (actually its a mistake
4----362----180----92 228 165 137
8 881 374 194 493 609 287
.... ......
...... ......
cause the headers are misplaced Cpblck is first header but memcpy is first parameter in console.WriteLine ({memcpy}{cpblck}{}{}{}
Console.WriteLine("{0}-----{1}-----{2}-----{3}-----{4}-----{5}-----{6}", blockSize, (long)memCpyOut, (long)copyMemoryOut, << i've just replaced "\t" with
"-----"
but thats how the parameters was originaly so insted
more accurate would be (:
BlcSz memcpy cpblk ArrCpy BufBlcp CustCpy MrshlCpy (actually its a mistake
4 362 180 92 228 165 137
8 881 374 194 493 609 287
so this is how the plot goes
BlcSz memcpy cpblk ArrCpy BufBlcp CustCpy MrshlCpy
--4-----362------82------92------228------165-----138
--8-----869-----169-----194------491------617-----288
--16---1600-----751-----378------939-----1007-----560
--32---2711----1422-----737-----1782-----1444----1109
--64---4705----2711----1447-----3316-----1910----2112
--128---7901----4740----2770-----5818-----1592----3720
-256---14545----8205----5039----10000-----2214----6736
-512---13617---12307----8533----14222-----2452---10491
-1024--14883---16410---12800----18285-----2601---14883
-2048--21333---19393---16000----20000-----2529---17297
-4096--22857---22068---19393----22068-----2560---20645
-8192--23703---22857---22068----23703-----2591---22857
-16384-23703---22068---19393----20645-----2591---22068
-32768-18823---18823---18823----18823-----2490---18285
-65536-18823---19393---18823----18823-----2480---18823
131072-19393---19393---18285----18285-----2500---18823
262144--6095----5981---10322----10666-----2490----5981----------here everyone falls
but less impact on bufferCopy()
-524288-----6037-----6037-----10491-----10491-----2500-----5871
1048576-----6037-----6095-----10666-----10491-----2500-----6037
2097152-----6037-----6095-----10322-----10000-----2480-----6153
4194304-----6095-----6037------9846------9846-----2344-----6095
8388608-----5663-----6274------7804------8101-----2327-----6274
16777216----6213-----6464------6400------7111-----2335-----6464
33554432----6153-----6597------6530------6808-----2335-----6597
67108864----5333-----6464------6464------6530-----2269-----6336
and here i think there was somthing bad happend to my test (:
134217728------9223372036854775808------9223372036854775808------922337203685477
5808<<---nochange to this one though]-----9223372036854775808------9223372036854775808------9223372036854775808
268435456------9223372036854775808------9223372036854775808------922337203685477
5808------9223372036854775808------9223372036854775808------9223372036854775808
536870912------9223372036854775808------9223372036854775808------922337203685477
5808------9223372036854775808------9223372036854775808------9223372036854775808
i replaced counter iteration with 30(was counting <20)
so i think i will stick to Buffer.Copy();
Hi, it would be awesome if you could repeat this test in VS2013 where cpblk has been fixed, please ;)
ReplyDelete