Saturday, October 23, 2010

High performance memcpy gotchas in C#

(Edit 8 Jan 2011: Update protocol test with Buffer.BlockCopy)
(Edit 11 Oct 2012: Please vote for the x86 cpblk deficiency on Microsoft Connect)
Following my last post about an interesting use of the "cpblk" IL instruction as an unmanaged memcpy replacement, I have to admit that I didn't take the time to carefully verify that performance is actually better. Well, I was probably too optimistic... so I have made some tests and the results are very surprising and not expected to be like these...

The memcpy protocol test in C#


When dealing with 3D calculations, large buffers of textures, audio synthesizing or whatever requires a memcpy and interaction with unmanaged world, you will most notably end up with a call to an unmanaged functions like this one:

[DllImport("msvcrt.dll", EntryPoint = "memcpy", CallingConvention = CallingConvention.Cdecl, SetLastError = false), SuppressUnmanagedCodeSecurity]
public static unsafe extern void* CopyMemory(void* dest, void* src, ulong count);

In this test, I'm going to compare this implementation with 4 challengers :
  • The cpblk IL instruction
  • A handmade memcpy function
  • Array.Copy, although It's not relevant because they don't have the same scope. Array.Copy is managed only for arrays only while memcpy is used to copy portion of datas between managed-unmanaged as well as unmanaged-unmanaged memory.
  • Marshal.Copy, same as Array.Copy
  • Buffer.BlockCopy, which is working on managed array but is working with a byte size block copy.
The test is performing a series of memcpy with different size of block : from 4 bytes to 2Mo. The interesting part is to run this test on a x86 and x64 mode. Both tests are running on the same Windows 7 OS x64, same machine Intel Core I5 750 (2.66Ghz). The CLR used for this is the Runtime v4.0.30319.

The naive handmade memcpy is nothing more than this code (not to be the best implem ever but at least safe for any kind of buffer size):

static unsafe void CustomCopy(void * dest, void* src, int count)
{
    int block;

    block = count >> 3;

    long* pDest = (long*)dest;
    long* pSrc = (long*)src;

    for (int i = 0; i < block; i++)
    {
        *pDest = *pSrc; pDest++; pSrc++;
    }
    dest = pDest;
    src = pSrc;
    count = count - (block << 3);

    if (count > 0)
    {
        byte* pDestB = (byte*) dest;
        byte* pSrcB = (byte*) src;
        for (int i = 0; i < count; i++)
        {
            *pDestB = *pSrcB; pDestB++; pSrcB++;
        }
    }
}

Results

For the x86 architecture, results are expressed as a throughput in Mo/s - higher is better, blocksize is in bytes :

BlockSize x86-cpblk x86-memcpy x86-CustomCopy x86-Array.Copy x86-Marshal.Copy x86-BlockCopy
4 146 458 470 85 81 150
8 294 843 1122 168 167 298
16 587 1628 1904 306 327 577
32 950 1876 3184 631 558 1079
64 1451 3316 4295 1205 1059 1981
128 2245 5161 4848 2176 1933 3386
256 4353 7032 5333 3699 3386 5333
512 8205 13617 5517 5663 6666 7441
1024 13617 20000 6666 7710 12075 9275
2048 18823 24615 7191 9142 16842 9552
4096 2922 7529 5663 10491 7032 11034
8192 2990 7804 5714 11228 7441 11636
16384 2857 7901 5614 9142 7619 10322
32768 2379 6736 5333 8101 6666 8205
65536 2379 6808 5470 8205 6808 8205
131072 2509 17777 5818 8101 17777 8101
262144 2500 11636 5423 7032 11428 7111
524288 2539 11428 5423 7111 11428 7111
1048576 2539 11428 5470 7032 11428 7111
2097152 2529 11428 5333 7032 11034 6881




For the x64 architecture:


BlockSize2 x64-cpblk x64-memcpy x64-CustomCopy x64-Array.Copy x64-Marshal.Copy x64-BlockCopy
4 583 346 599 99 111 219
8 1509 770 1876 212 224 469
16 2689 1451 3316 417 422 903
32 4705 2666 5000 802 864 1739
64 8205 4812 7272 1568 1748 3350
128 13333 8101 9014 3004 3184 6037
256 18823 11428 10000 5470 5245 8648
512 22068 16000 10491 9014 9552 13913
1024 22857 19393 7356 13333 13617 16842
2048 23703 21333 7710 17297 17777 20645
4096 23703 22068 7804 19393 20000 21333
8192 23703 22857 7619 22068 22068 22857
16384 23703 22857 7804 17297 21333 18285
32768 16410 16410 7710 12800 16000 12800
65536 13061 14883 7710 13061 14545 13061
131072 14222 13913 7710 12800 13617 12800
262144 5000 5039 7032 7901 5000 7804
524288 5079 5000 7356 8205 5079 7804
1048576 4885 4885 7272 7441 4671 7529
2097152 5039 5079 7272 7619 5000 7710

Graph comparison only for cpblk, memcpy and CustomCopy:


Don't be afraid about the performance drop for most of the implem... It's mostly due to cache missing  and copying around different 4k pages.

Conclusion

Don't trust your .NET VM, check your code on both x86 and x64. It's interesting to see how much the same task is implemented differently inside the CLR (see Marshal.Copy vs  Array.Copy vs Buffer.Copy)

The most surprising result here is the poor performance of cpblk IL instruction in x86 mode compare to the best one in x64 which is... cpblk. So to summarize:
  • On x86, you should better use a memcpy function
  • On x64, you should better use a cpblk function, which is performing better from small size (twice faster than memcpy) to large size.
You may wonder why the x86 version is so unoptimized? This is because the x86 CLR is generating a x86 instruction that is performing a memcpy on a PER BYTE basis (rep movb for x86 folks), even if you are moving a large memory chunk of 1Mo! In comparison, a memcpy as implemented in MSVCRT is able to use SSE instructions that are able to batch copy with large 128 bits  registers (with also an optimized case for not poluting CPU cache). This is the case for x64 that seems to use a correct implemented memcpy, but the x86 CLR memcpy is just poorly implemented. Please vote for this bug described on Microsoft Connect.

One important consequence of this is when you are developping a C++/CLI and calling a memcpy from a managed function... It will end up in a cpblk copy functions... which is almost the worst case on x86 platforms... so be careful if you are dealing with this kind of issue. To avoir this, you have to force the compiler to use the function from the MSVCRTxx.dll.

Of course, the memcpy is platform dependent, which would not be an option for all...

Also, I didn't perform this test on a CLR 2 runtime... we could be surprised as well... There is also one thing that I should try against a pure C++ memcpy using the optimized SSE2 version that is shipped with later msvcrt.

You can download the VS2010 project from here

6 comments:

  1. Hello Alexandre,

    Your analysis is fascinating. Did you consider Buffer.BlockCopy? It is limited to managed arrays, but it is supposed to be more optimized for that purpose:

    http://msdn.microsoft.com/en-us/library/system.buffer.blockcopy.aspx

    Regards,
    Frank Hileman

    ReplyDelete
  2. Thanks Frank for pointing me to this missing function... that I didn't know, cool!

    I have slightly updated the comparison with this function. Still, It shows that cpblk is faster on x64 and memcpy on x86.

    It still amazing to realize how much there are different implementation of memcpy into the CLR, moreover when you look at Array.Copy, Marshal.Copy, Buffer.BlockCopy that are not sharing the same implem.

    ReplyDelete
  3. Thanks for the interesting article. I downloaded the project and am puzzled by this line in DynamicInterop.cs:

    if (IntPtr.Size==8) gen.Emit(Opcodes.Unaligned, 1);

    The VS2010 documentation on the CpBlk opcode is that you need the unaligned prefix only if either of the operand addresses is not aligned the machine's native integer size (presumably 8 or 4 for x64 and x86 respectively) and is not related to the pointer size itself. Could this be the source of the weird performance result?

    ReplyDelete
  4. @unknown, I remember that I used the unaligned opcode because this was the way a C++/CLi memcpy was generated. I agree that it is not perfect, but I don't think that it will influence the results. The issue with x86 cpblk is that it is not optimized in the CLR, as they are using the simple x86 instruction ”rep movsb” (which is extremely not optimized!) unlike the x64 version that is using SSE2 instructions.

    ReplyDelete
  5. This comment has been removed by the author.

    ReplyDelete
  6. i got here by accident, so accident or not (: i have done that :

    did not know as to what to be replaced instead here
    if (IntPtr.Size==8) gen.Emit(Opcodes.Unaligned, 1);
    so i tried onece without that line(commented it)
    though no condition at all is bad for the country (:


    and i've implemented system.buffer.blockcopy.aspx to extent ...(: (third from the right)

    BlcSz cpblk memcpy ArrCpy BufBlcp CustCpy MrshlCpy (actually its a mistake
    4----362----180----92 228 165 137

    8 881 374 194 493 609 287
    .... ......
    ...... ......
    cause the headers are misplaced Cpblck is first header but memcpy is first parameter in console.WriteLine ({memcpy}{cpblck}{}{}{}
    Console.WriteLine("{0}-----{1}-----{2}-----{3}-----{4}-----{5}-----{6}", blockSize, (long)memCpyOut, (long)copyMemoryOut, << i've just replaced "\t" with
    "-----"
    but thats how the parameters was originaly so insted

    more accurate would be (:
    BlcSz memcpy cpblk ArrCpy BufBlcp CustCpy MrshlCpy (actually its a mistake
    4 362 180 92 228 165 137

    8 881 374 194 493 609 287

    so this is how the plot goes

    BlcSz memcpy cpblk ArrCpy BufBlcp CustCpy MrshlCpy

    --4-----362------82------92------228------165-----138

    --8-----869-----169-----194------491------617-----288

    --16---1600-----751-----378------939-----1007-----560

    --32---2711----1422-----737-----1782-----1444----1109

    --64---4705----2711----1447-----3316-----1910----2112

    --128---7901----4740----2770-----5818-----1592----3720

    -256---14545----8205----5039----10000-----2214----6736

    -512---13617---12307----8533----14222-----2452---10491

    -1024--14883---16410---12800----18285-----2601---14883

    -2048--21333---19393---16000----20000-----2529---17297

    -4096--22857---22068---19393----22068-----2560---20645

    -8192--23703---22857---22068----23703-----2591---22857

    -16384-23703---22068---19393----20645-----2591---22068

    -32768-18823---18823---18823----18823-----2490---18285

    -65536-18823---19393---18823----18823-----2480---18823

    131072-19393---19393---18285----18285-----2500---18823

    262144--6095----5981---10322----10666-----2490----5981----------here everyone falls

    but less impact on bufferCopy()

    -524288-----6037-----6037-----10491-----10491-----2500-----5871

    1048576-----6037-----6095-----10666-----10491-----2500-----6037

    2097152-----6037-----6095-----10322-----10000-----2480-----6153

    4194304-----6095-----6037------9846------9846-----2344-----6095

    8388608-----5663-----6274------7804------8101-----2327-----6274

    16777216----6213-----6464------6400------7111-----2335-----6464

    33554432----6153-----6597------6530------6808-----2335-----6597

    67108864----5333-----6464------6464------6530-----2269-----6336

    and here i think there was somthing bad happend to my test (:

    134217728------9223372036854775808------9223372036854775808------922337203685477
    5808<<---nochange to this one though]-----9223372036854775808------9223372036854775808------9223372036854775808

    268435456------9223372036854775808------9223372036854775808------922337203685477
    5808------9223372036854775808------9223372036854775808------9223372036854775808

    536870912------9223372036854775808------9223372036854775808------922337203685477
    5808------9223372036854775808------9223372036854775808------9223372036854775808




    i replaced counter iteration with 30(was counting <20)
    so i think i will stick to Buffer.Copy();

    ReplyDelete