code4k: High performance memcpy gotchas in C#

Saturday, October 23, 2010

High performance memcpy gotchas in C#

(Edit 8 Jan 2011: Update protocol test with Buffer.BlockCopy)
(Edit 11 Oct 2012: Please vote for the x86 cpblk deficiency on Microsoft Connect)
Following my last post about an interesting use of the "cpblk" IL instruction as an unmanaged memcpy replacement, I have to admit that I didn't take the time to carefully verify that performance is actually better. Well, I was probably too optimistic... so I have made some tests and the results are very surprising and not expected to be like these...

The memcpy protocol test in C#

When dealing with 3D calculations, large buffers of textures, audio synthesizing or whatever requires a memcpy and interaction with unmanaged world, you will most notably end up with a call to an unmanaged functions like this one:

[DllImport("msvcrt.dll", EntryPoint = "memcpy", CallingConvention = CallingConvention.Cdecl, SetLastError = false), SuppressUnmanagedCodeSecurity]
public static unsafe extern void* CopyMemory(void* dest, void* src, ulong count);

In this test, I'm going to compare this implementation with 4 challengers :

The cpblk IL instruction
A handmade memcpy function
Array.Copy, although It's not relevant because they don't have the same scope. Array.Copy is managed only for arrays only while memcpy is used to copy portion of datas between managed-unmanaged as well as unmanaged-unmanaged memory.
Marshal.Copy, same as Array.Copy
Buffer.BlockCopy, which is working on managed array but is working with a byte size block copy.

The test is performing a series of memcpy with different size of block : from 4 bytes to 2Mo. The interesting part is to run this test on a x86 and x64 mode. Both tests are running on the same Windows 7 OS x64, same machine Intel Core I5 750 (2.66Ghz). The CLR used for this is the Runtime v4.0.30319.

The naive handmade memcpy is nothing more than this code (not to be the best implem ever but at least safe for any kind of buffer size):

static unsafe void CustomCopy(void * dest, void* src, int count)
{
    int block;

    block = count >> 3;

    long* pDest = (long*)dest;
    long* pSrc = (long*)src;

    for (int i = 0; i < block; i++)
    {
        *pDest = *pSrc; pDest++; pSrc++;
    }
    dest = pDest;
    src = pSrc;
    count = count - (block << 3);

    if (count > 0)
    {
        byte* pDestB = (byte*) dest;
        byte* pSrcB = (byte*) src;
        for (int i = 0; i < count; i++)
        {
            *pDestB = *pSrcB; pDestB++; pSrcB++;
        }
    }
}

Results

For the x86 architecture, results are expressed as a throughput in Mo/s - higher is better, blocksize is in bytes :

BlockSize	x86-cpblk	x86-memcpy	x86-CustomCopy	x86-Array.Copy	x86-Marshal.Copy	x86-BlockCopy
4	146	458	470	85	81	150
8	294	843	1122	168	167	298
16	587	1628	1904	306	327	577
32	950	1876	3184	631	558	1079
64	1451	3316	4295	1205	1059	1981
128	2245	5161	4848	2176	1933	3386
256	4353	7032	5333	3699	3386	5333
512	8205	13617	5517	5663	6666	7441
1024	13617	20000	6666	7710	12075	9275
2048	18823	24615	7191	9142	16842	9552
4096	2922	7529	5663	10491	7032	11034
8192	2990	7804	5714	11228	7441	11636
16384	2857	7901	5614	9142	7619	10322
32768	2379	6736	5333	8101	6666	8205
65536	2379	6808	5470	8205	6808	8205
131072	2509	17777	5818	8101	17777	8101
262144	2500	11636	5423	7032	11428	7111
524288	2539	11428	5423	7111	11428	7111
1048576	2539	11428	5470	7032	11428	7111
2097152	2529	11428	5333	7032	11034	6881

For the x64 architecture:

BlockSize2	x64-cpblk	x64-memcpy	x64-CustomCopy	x64-Array.Copy	x64-Marshal.Copy	x64-BlockCopy
4	583	346	599	99	111	219
8	1509	770	1876	212	224	469
16	2689	1451	3316	417	422	903
32	4705	2666	5000	802	864	1739
64	8205	4812	7272	1568	1748	3350
128	13333	8101	9014	3004	3184	6037
256	18823	11428	10000	5470	5245	8648
512	22068	16000	10491	9014	9552	13913
1024	22857	19393	7356	13333	13617	16842
2048	23703	21333	7710	17297	17777	20645
4096	23703	22068	7804	19393	20000	21333
8192	23703	22857	7619	22068	22068	22857
16384	23703	22857	7804	17297	21333	18285
32768	16410	16410	7710	12800	16000	12800
65536	13061	14883	7710	13061	14545	13061
131072	14222	13913	7710	12800	13617	12800
262144	5000	5039	7032	7901	5000	7804
524288	5079	5000	7356	8205	5079	7804
1048576	4885	4885	7272	7441	4671	7529
2097152	5039	5079	7272	7619	5000	7710

Graph comparison only for cpblk, memcpy and CustomCopy:

Don't be afraid about the performance drop for most of the implem... It's mostly due to cache missing and copying around different 4k pages.

Conclusion

Don't trust your .NET VM, check your code on both x86 and x64. It's interesting to see how much the same task is implemented differently inside the CLR (see Marshal.Copy vs Array.Copy vs Buffer.Copy)

The most surprising result here is the poor performance of cpblk IL instruction in x86 mode compare to the best one in x64 which is... cpblk. So to summarize:

On x86, you should better use a memcpy function
On x64, you should better use a cpblk function, which is performing better from small size (twice faster than memcpy) to large size.

You may wonder why the x86 version is so unoptimized? This is because the x86 CLR is generating a x86 instruction that is performing a memcpy on a PER BYTE basis (rep movb for x86 folks), even if you are moving a large memory chunk of 1Mo! In comparison, a memcpy as implemented in MSVCRT is able to use SSE instructions that are able to batch copy with large 128 bits registers (with also an optimized case for not poluting CPU cache). This is the case for x64 that seems to use a correct implemented memcpy, but the x86 CLR memcpy is just poorly implemented. Please vote for this bug described on Microsoft Connect.

One important consequence of this is when you are developping a C++/CLI and calling a memcpy from a managed function... It will end up in a cpblk copy functions... which is almost the worst case on x86 platforms... so be careful if you are dealing with this kind of issue. To avoir this, you have to force the compiler to use the function from the MSVCRTxx.dll.

Of course, the memcpy is platform dependent, which would not be an option for all...

Also, I didn't perform this test on a CLR 2 runtime... we could be surprised as well... There is also one thing that I should try against a pure C++ memcpy using the optimized SSE2 version that is shipped with later msvcrt.

You can download the VS2010 project from here

7 comments:

Frank HilemanJanuary 9, 2011 at 1:20 AM
Hello Alexandre,

Your analysis is fascinating. Did you consider Buffer.BlockCopy? It is limited to managed arrays, but it is supposed to be more optimized for that purpose:

http://msdn.microsoft.com/en-us/library/system.buffer.blockcopy.aspx

Regards,
Frank Hileman
ReplyDelete
Replies
xoofx - Alexandre MUTELJanuary 9, 2011 at 9:55 AM
Thanks Frank for pointing me to this missing function... that I didn't know, cool!

I have slightly updated the comparison with this function. Still, It shows that cpblk is faster on x64 and memcpy on x86.

It still amazing to realize how much there are different implementation of memcpy into the CLR, moreover when you look at Array.Copy, Marshal.Copy, Buffer.BlockCopy that are not sharing the same implem.
ReplyDelete
Replies
UnknownApril 16, 2012 at 8:17 PM
Thanks for the interesting article. I downloaded the project and am puzzled by this line in DynamicInterop.cs:

if (IntPtr.Size==8) gen.Emit(Opcodes.Unaligned, 1);

The VS2010 documentation on the CpBlk opcode is that you need the unaligned prefix only if either of the operand addresses is not aligned the machine's native integer size (presumably 8 or 4 for x64 and x86 respectively) and is not related to the pointer size itself. Could this be the source of the weird performance result?
ReplyDelete
Replies
xoofx - Alexandre MUTELApril 17, 2012 at 9:38 AM
@unknown, I remember that I used the unaligned opcode because this was the way a C++/CLi memcpy was generated. I agree that it is not perfect, but I don't think that it will influence the results. The issue with x86 cpblk is that it is not optimized in the CLR, as they are using the simple x86 instruction ”rep movsb” (which is extremely not optimized!) unlike the x64 version that is using SSE2 instructions.
ReplyDelete
Replies
AnonymousJune 21, 2013 at 10:36 PM
This comment has been removed by the author.
ReplyDelete
Replies
AnonymousJune 21, 2013 at 11:12 PM
i got here by accident, so accident or not (: i have done that :

did not know as to what to be replaced instead here
if (IntPtr.Size==8) gen.Emit(Opcodes.Unaligned, 1);
so i tried onece without that line(commented it)
though no condition at all is bad for the country (:

and i've implemented system.buffer.blockcopy.aspx to extent ...(: (third from the right)

BlcSz cpblk memcpy ArrCpy BufBlcp CustCpy MrshlCpy (actually its a mistake
4----362----180----92 228 165 137

8 881 374 194 493 609 287
.... ......
...... ......
cause the headers are misplaced Cpblck is first header but memcpy is first parameter in console.WriteLine ({memcpy}{cpblck}{}{}{}
Console.WriteLine("{0}-----{1}-----{2}-----{3}-----{4}-----{5}-----{6}", blockSize, (long)memCpyOut, (long)copyMemoryOut, << i've just replaced "\t" with
"-----"
but thats how the parameters was originaly so insted

more accurate would be (:
BlcSz memcpy cpblk ArrCpy BufBlcp CustCpy MrshlCpy (actually its a mistake
4 362 180 92 228 165 137

8 881 374 194 493 609 287

so this is how the plot goes

BlcSz memcpy cpblk ArrCpy BufBlcp CustCpy MrshlCpy

--4-----362------82------92------228------165-----138

--8-----869-----169-----194------491------617-----288

--16---1600-----751-----378------939-----1007-----560

--32---2711----1422-----737-----1782-----1444----1109

--64---4705----2711----1447-----3316-----1910----2112

--128---7901----4740----2770-----5818-----1592----3720

-256---14545----8205----5039----10000-----2214----6736

-512---13617---12307----8533----14222-----2452---10491

-1024--14883---16410---12800----18285-----2601---14883

-2048--21333---19393---16000----20000-----2529---17297

-4096--22857---22068---19393----22068-----2560---20645

-8192--23703---22857---22068----23703-----2591---22857

-16384-23703---22068---19393----20645-----2591---22068

-32768-18823---18823---18823----18823-----2490---18285

-65536-18823---19393---18823----18823-----2480---18823

131072-19393---19393---18285----18285-----2500---18823

262144--6095----5981---10322----10666-----2490----5981----------here everyone falls

but less impact on bufferCopy()

-524288-----6037-----6037-----10491-----10491-----2500-----5871

1048576-----6037-----6095-----10666-----10491-----2500-----6037

2097152-----6037-----6095-----10322-----10000-----2480-----6153

4194304-----6095-----6037------9846------9846-----2344-----6095

8388608-----5663-----6274------7804------8101-----2327-----6274

16777216----6213-----6464------6400------7111-----2335-----6464

33554432----6153-----6597------6530------6808-----2335-----6597

67108864----5333-----6464------6464------6530-----2269-----6336

and here i think there was somthing bad happend to my test (:

134217728------9223372036854775808------9223372036854775808------922337203685477
5808<<---nochange to this one though]-----9223372036854775808------9223372036854775808------9223372036854775808

268435456------9223372036854775808------9223372036854775808------922337203685477
5808------9223372036854775808------9223372036854775808------9223372036854775808

536870912------9223372036854775808------9223372036854775808------922337203685477
5808------9223372036854775808------9223372036854775808------9223372036854775808

i replaced counter iteration with 30(was counting <20)
so i think i will stick to Buffer.Copy();
ReplyDelete
Replies
Harry SingletonJuly 5, 2014 at 2:42 AM
Hi, it would be awesome if you could repeat this test in VS2013 where cpblk has been fixed, please ;)
ReplyDelete
Replies

Add comment

Comments are disabled

Note: Only a member of this blog may post a comment.