RyuJIT CTP4
.NET Native Developer Preview 2
While the .NET JIT performs quite well on Windows, it is still behind a fully optimized C++ program (though the efficiency of a compiled program is not only about code generation, but also about memory management and data locality). The .NET team has recently introduced two new technologies that can help on the code-gen side: .NET Native, an offline .NET compiler (similar to ngen, but using the backend optimizer of the C++ compiler), and the next generation of the .NET JIT, called "RyuJIT". In this post I would like to present the results of some microbenchmarks that roughly evaluate the performance benefits of these two new technologies.
First of all, you may already have read a few benchmark results about RyuJIT and .NET Native; here is a non-exhaustive list of the ones I have found (if you have more pointers, let me know!):
 A first look at RyuJIT CTP3 and SIMD (SSE2) support for .NET by Frank Niemeyer
 Lies, damn lies and benchmarks by Kevin Frei
 .NET Native Performance by Sasha Goldstein
The microbenchmark protocol
Microbenchmarking is not the best way to measure overall benefits, but it can help to dig into particular patterns. For this benchmark I haven't developed anything new; instead I built a "freak benchmark" composed of microbenchmarks found on the Internet, mainly:
 "Head-to-head benchmark: C++ vs .NET" by Qwertie, a nice collection of microbenchmarks that I decided to use as a basis.
 "A Collection of Phoenix-Compatible C# Benchmarks", from which I used the port of a subset of the Java Grande benchmarks.
 Two custom benchmarks measuring the cost of interop, which matters when you call lots of native methods (as is the case when using SharpDX, for example).
I don't claim that these microbenchmarks are exhaustive, nor that they are all correctly implemented (some of the Java Grande benchmarks don't seem robust), but since we are measuring relative performance, that should be fine. In the end we just want to know how well .NET Native or RyuJIT performs compared to the same program running on the legacy JIT.
Also, as both .NET Native and RyuJIT are still in development, we can't really draw any definitive conclusions.
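For context, the relative numbers in the tables below come from timing the same test on each runtime and dividing by the legacy-JIT baseline. A minimal sketch of such a measurement loop (the harness shown here is hypothetical, not the actual benchmark code):

```csharp
using System;
using System.Diagnostics;

static class MiniHarness
{
    // Times a test body: one warm-up call (so JIT compilation is not
    // measured), then an averaged timed loop.
    static double MeasureMs(Action test, int iterations)
    {
        test(); // warm-up
        var sw = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
            test();
        sw.Stop();
        return sw.Elapsed.TotalMilliseconds / iterations;
    }

    static void Main()
    {
        double ms = MeasureMs(() => { int s = 0; for (int i = 0; i < 1000000; i++) s += i; }, 20);
        // The relative score reported in the tables is baselineMs / candidateMs,
        // where each runtime produces its own timing in a separate process.
        Console.WriteLine("{0:F3} ms", ms);
    }
}
```

Each runtime (desktop JIT, AppStore, Native, RyuJIT) runs in its own process, so the ratios are computed offline from the recorded timings.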
.NET Native is only available for Windows Store apps, while RyuJIT is only available on x64, hence the platforms tested in this bench are:
 .NET 32 Desktop
 .NET 32 AppStore
 .NET 32 AppStore Native
 .NET 64 Desktop
 .NET 64 AppStore
 .NET 64 AppStore Native
 .NET 64 Desktop RyuJit
.NET32/.NET64 use .NET Framework 4.5.1. The machine is an Intel(R) Core(TM) i7-4770 CPU @ 3.4GHz with 16 GB of RAM.
The source of these benchmarks is available on GitHub BenchNativeApp.
Comparison .NET32 (x86)
Comparison between:
 .NET 32 Desktop
 .NET 32 AppStore
 .NET 32 AppStore Native
 In green, results above +10%
 In red, results below -10%
Name  .NET 32 (Desktop)  .NET 32 (AppStore)  .NET 32 Native (AppStore)
00Big int Dictionary: 1 Adding items  1.00  0.90  0.72 
00Big int Dictionary: 2 Running queries  1.00  0.96  0.81 
00Big int Dictionary: 3 Removing items  1.00  0.96  0.92 
01Big string Dictionary: 0 Ints to strings  1.00  0.79  1.13 
01Big string Dictionary: 1 Adding/setting  1.00  0.91  1.02 
01Big string Dictionary: 2 Running queries  1.00  0.83  1.07 
01Big string Dictionary: 3 Removing items  1.00  0.85  1.05 
02Big int sorted map: 1 Adding items  1.00  1.01  0.89 
02Big int sorted map: 2 Running queries  1.00  1.01  0.90 
02Big int sorted map: 3 Removing items  1.00  1.01  0.77 
03Square root: double  1.00  1.00  1.00 
03Square root: FPL16  1.00  1.02  0.97 
03Square root: uint  1.00  1.03  0.97 
03Square root: ulong  1.00  1.01  0.94 
04Simple arithmetic: double  1.00  1.01  0.51 
04Simple arithmetic: float  1.00  1.01  0.86 
04Simple arithmetic: FPI8  1.00  0.99  1.23 
04Simple arithmetic: FPL16  1.00  1.01  1.55 
04Simple arithmetic: int  1.00  1.02  2.04 
04Simple arithmetic: long  1.00  0.99  0.83 
05Generic sum: double  1.00  1.01  3.34 
05Generic sum: FPI8  1.00  1.01  1.30 
05Generic sum: int  1.00  1.01  0.93 
05Generic sum: int via IMath  1.00  1.01  0.57 
05Generic sum: int without generics  1.00  1.00  1.10 
06Simple parsing: 3 Parse (x1000000)  1.00  1.00  1.00 
06Simple parsing: 4 Sort (x1000000)  1.00  1.00  1.09 
07Trivial method calls: Interface NoOp  1.00  1.00  0.71 
07Trivial method calls: Noinline NoOp  1.00  1.00  1.03 
07Trivial method calls: Static NoOp  1.00  1.00  Not Applicable 
07Trivial method calls: Virtual NoOp  1.00  1.11  1.07 
08Matrix multiply:  1.00  1.00  2.43 
08Matrix multiply:  1.00  1.00  2.68 
08Matrix multiply:  1.00  0.95  0.59 
08Matrix multiply: Array2D  1.00  1.01  0.64 
08Matrix multiply: double[n*n]  1.00  1.01  0.78 
08Matrix multiply: double[n][n]  1.00  1.01  1.04 
08Matrix multiply: int[n][n]  1.00  1.00  1.21 
09Sudoku  1.00  1.00  1.04 
10Polynomials  1.00  1.00  1.03 
11JGFArithBench  1.00  1.00  23.81 
12JGFAssignBench  1.00  0.80  1.20 
13JGFCastBench  1.00  1.00  1.27 
14JGFCreateBench  1.00  0.98  0.81 
15JGFFFTBench  1.00  1.01  0.99 
16JGFHeapSortBench  1.00  0.99  1.01 
17JGFLoopBench  1.00  1.00  1.04 
18JGFRayTracerBench  1.00  0.98  0.88 
19float4x4 matrix mul, Managed Standard  1.00  1.00  0.63 
20float4x4 matrix mul, Managed unsafe  1.00  1.01  0.96 
21float4x4 matrix mul, Interop Standard  1.00  1.23  1.42 
22float4x4 matrix mul, Interop SSE2  1.00  1.36  1.82 
23managed add  1.00  1.01  7.00 
24managed noinline add  1.00  1.00  1.10 
25interop add  1.00  1.01  1.21 
26interop indirect add  1.00  1.00  2.26 
Quick analysis
We would probably expect a column full of green lights for .NET Native, but this is unfortunately not the case! Some notes:
 .NET Native is as efficient as a C++ compiler at coalescing arithmetic instructions (tests 11 and 23). Test 23 is able to reduce the addition chain x += 1, x += 2, x += 3, x += 1, x += 2, x += 3, x += 1 to a single x += 13, resulting in an impressive speedup. Coalescing of instructions is probably the factor that helps most in these tests.
 Some float/double x87 calculations seem to perform badly with .NET Native.
 Pure interop seems slightly more efficient, which is good whenever native functions are called frequently (e.g. when using SharpDX/Direct3D11). Note that indirect interop (a DllImport wrapped by another function) is also faster, which is great: with the current interop, such wrappers are not inlined by the JIT, resulting in lots of duplicated prologue/epilogue code for unmanaged/managed transitions (whereas when the wrapper is correctly inlined, consecutive calls to interop functions can share the cost of switching between unmanaged and managed contexts).
 Some tests are 2x slower with .NET Native, though I haven't looked at the generated x86 code.
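The interop patterns measured in tests 25 and 26 boil down to a direct P/Invoke versus a wrapped one; a sketch of the two shapes (the library and function names are hypothetical, not SharpDX's actual API):

```csharp
using System.Runtime.InteropServices;

static class InteropSketch
{
    // Test 25 pattern: direct P/Invoke, one managed/unmanaged transition
    // per call.
    [DllImport("nativelib", CallingConvention = CallingConvention.Cdecl)]
    public static extern int native_add(int x, int y);

    // Test 26 pattern: the DllImport wrapped by a managed method. If the
    // JIT does not inline this wrapper, every call pays an extra managed
    // prologue/epilogue on top of the transition itself; an offline
    // compiler can inline it away and keep only the transition cost.
    public static int Add(int x, int y)
    {
        return native_add(x, y);
    }
}
```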
Comparison .NET64 (x64)
Comparison between:
 .NET 64 Desktop
 .NET 64 AppStore
 .NET 64 AppStore Native
 .NET 64 Desktop RyuJit
 In green, results above +10%
 In red, results below -10%
Name  .NET 64 (Desktop)  .NET 64 (AppStore)  .NET 64 Native (AppStore)  .NET 64 RyuJit (Desktop)
00Big int Dictionary: 1 Adding items  1.00  1.04  1.00  1.02 
00Big int Dictionary: 2 Running queries  1.00  0.91  1.00  0.95 
00Big int Dictionary: 3 Removing items  1.00  1.00  0.95  0.95 
01Big string Dictionary: 0 Ints to strings  1.00  0.69  0.72  1.00 
01Big string Dictionary: 1 Adding/setting  1.00  0.85  0.84  0.99 
01Big string Dictionary: 2 Running queries  1.00  0.82  0.90  0.95 
01Big string Dictionary: 3 Removing items  1.00  0.81  0.91  1.00 
02Big int sorted map: 1 Adding items  1.00  0.98  1.10  1.04 
02Big int sorted map: 2 Running queries  1.00  1.02  1.06  0.97 
02Big int sorted map: 3 Removing items  1.00  1.01  1.02  1.16 
03Square root: double  1.00  1.00  1.00  1.00 
03Square root: FPL16  1.00  1.01  1.15  1.03 
03Square root: uint  1.00  1.00  0.97  0.94 
03Square root: ulong  1.00  1.00  1.15  0.95 
04Simple arithmetic: double  1.00  1.00  4.20  1.10 
04Simple arithmetic: float  1.00  1.00  1.36  0.99 
04Simple arithmetic: FPI8  1.00  1.00  0.91  1.42 
04Simple arithmetic: FPL16  1.00  0.96  1.21  5.19 
04Simple arithmetic: int  1.00  1.00  0.83  0.89 
04Simple arithmetic: long  1.00  1.00  0.96  0.93 
05Generic sum: double  1.00  1.00  1.34  1.33 
05Generic sum: FPI8  1.00  1.00  1.29  1.00 
05Generic sum: int  1.00  0.98  1.28  0.99 
05Generic sum: int via IMath  1.00  1.00  0.65  0.99 
05Generic sum: int without generics  1.00  1.00  1.70  1.00 
06Simple parsing: 3 Parse (x1000000)  1.00  1.00  0.50  1.00 
06Simple parsing: 4 Sort (x1000000)  1.00  0.95  1.30  0.95 
07Trivial method calls: Interface NoOp  1.00  1.00  0.69  0.85 
07Trivial method calls: Noinline NoOp  1.00  0.92  0.96  0.96 
07Trivial method calls: Static NoOp  1.00  1.00  Not Applicable  0.20 
07Trivial method calls: Virtual NoOp  1.00  1.00  0.92  0.74 
08Matrix multiply:  1.00  0.99  1.14  1.17 
08Matrix multiply:  1.00  1.00  5.01  4.95 
08Matrix multiply:  1.00  1.00  1.34  1.16 
08Matrix multiply: Array2D  1.00  1.00  3.83  2.75 
08Matrix multiply: double[n*n]  1.00  1.00  1.00  1.00 
08Matrix multiply: double[n][n]  1.00  0.99  0.96  0.98 
08Matrix multiply: int[n][n]  1.00  1.00  1.19  1.12 
09Sudoku  1.00  1.00  1.38  1.48 
10Polynomials  1.00  1.00  0.94  0.99 
11JGFArithBench  1.00  1.00  1.02  1.12 
12JGFAssignBench  1.00  1.00  1.02  0.53 
13JGFCastBench  1.00  1.00  0.99  1.39 
14JGFCreateBench  1.00  0.96  0.81  0.99 
15JGFFFTBench  1.00  1.16  1.18  1.16 
16JGFHeapSortBench  1.00  1.00  1.01  0.99 
17JGFLoopBench  1.00  1.00  1.08  1.01 
18JGFRayTracerBench  1.00  1.00  0.87  1.13 
19float4x4 matrix mul, Managed Standard  1.00  0.99  1.04  1.36 
20float4x4 matrix mul, Managed unsafe  1.00  1.01  0.90  1.00 
21float4x4 matrix mul, Interop Standard  1.00  1.20  1.46  1.03 
22float4x4 matrix mul, Interop SSE2  1.00  1.36  1.92  1.05 
23managed add  1.00  1.00  1.00  4.05 
24managed noinline add  1.00  0.89  1.48  1.00 
25interop add  1.00  0.99  1.17  1.11 
26interop indirect add  1.00  1.00  1.28  0.38 
Quick analysis
Slightly better than the x86 code gen: .NET Native x64 and RyuJIT perform on average better than their JIT counterpart. Some notes:
 Unexpectedly, coalescing of arithmetic instructions (tests 11 and 23) does not happen with .NET Native, but it does with RyuJIT.
 Performance on float/double is better; most likely SSE registers are better used.
 The Sudoku test gets a nice +40-50% speedup with .NET Native and RyuJIT.
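The coalescing behavior noted above (tests 11 and 23) amounts to folding a chain of constant additions into a single one; a minimal reproduction of the pattern:

```csharp
static class CoalesceSketch
{
    // Source form of the test 23 pattern: seven constant additions.
    public static int AddChain(int x)
    {
        x += 1; x += 2; x += 3;
        x += 1; x += 2; x += 3;
        x += 1;
        return x;
    }

    // What a coalescing optimizer reduces it to: the same result,
    // computed by a single addition.
    public static int AddFolded(int x)
    {
        return x + 13;
    }
}
```

Both methods return the same value for any input; the speedup comes entirely from the code generator recognizing the constant chain.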
Comparison .NET32 Native vs .NET64 Native
Here .NET 32 Native is the reference (1.0) and is compared to .NET 64 Native. Results are normalized relative to .NET 32 Native; higher is better (2.0 means a test on x64 Native is 2 times faster than on x86).
 In green, results above +10%
 In red, results below -10%
Name  .NET 32 Native (AppStore)  .NET 32 vs 64 Native (AppStore)
00Big int Dictionary: 1 Adding items  1.00  1.31 
00Big int Dictionary: 2 Running queries  1.00  1.35 
00Big int Dictionary: 3 Removing items  1.00  1.19 
01Big string Dictionary: 0 Ints to strings  1.00  1.13 
01Big string Dictionary: 1 Adding/setting  1.00  1.22 
01Big string Dictionary: 2 Running queries  1.00  1.17 
01Big string Dictionary: 3 Removing items  1.00  1.16 
02Big int sorted map: 1 Adding items  1.00  1.01 
02Big int sorted map: 2 Running queries  1.00  0.99 
02Big int sorted map: 3 Removing items  1.00  0.96 
03Square root: double  1.00  1.00 
03Square root: FPL16  1.00  2.10 
03Square root: uint  1.00  1.12 
03Square root: ulong  1.00  2.12 
04Simple arithmetic: double  1.00  2.07 
04Simple arithmetic: float  1.00  1.40 
04Simple arithmetic: FPI8  1.00  1.00 
04Simple arithmetic: FPL16  1.00  1.49 
04Simple arithmetic: int  1.00  0.97 
04Simple arithmetic: long  1.00  7.65 
05Generic sum: double  1.00  1.01 
05Generic sum: FPI8  1.00  1.00 
05Generic sum: int  1.00  1.01 
05Generic sum: int via IMath  1.00  1.07 
05Generic sum: int without generics  1.00  1.12 
06Simple parsing: 3 Parse (x1000000)  1.00  1.00 
06Simple parsing: 4 Sort (x1000000)  1.00  1.96 
07Trivial method calls: Interface NoOp  1.00  1.00 
07Trivial method calls: Noinline NoOp  1.00  1.20 
07Trivial method calls: Static NoOp  1.00  not applicable 
07Trivial method calls: Virtual NoOp  1.00  1.16 
08Matrix multiply:  1.00  1.29 
08Matrix multiply:  1.00  1.24 
08Matrix multiply:  1.00  2.38 
08Matrix multiply: Array2D  1.00  1.99 
08Matrix multiply: double[n*n]  1.00  1.30 
08Matrix multiply: double[n][n]  1.00  1.00 
08Matrix multiply: int[n][n]  1.00  2.37 
09Sudoku  1.00  1.13 
10Polynomials  1.00  0.99 
11JGFArithBench  1.00  0.61 
12JGFAssignBench  1.00  0.86 
13JGFCastBench  1.00  1.57 
14JGFCreateBench  1.00  0.93 
15JGFFFTBench  1.00  1.05 
16JGFHeapSortBench  1.00  1.08 
17JGFLoopBench  1.00  0.97 
18JGFRayTracerBench  1.00  1.07 
19float4x4 matrix mul, Managed Standard  1.00  1.34 
20float4x4 matrix mul, Managed unsafe  1.00  1.02 
21float4x4 matrix mul, Interop Standard  1.00  1.09 
22float4x4 matrix mul, Interop SSE2  1.00  1.15 
23managed add  1.00  0.16 
24managed noinline add  1.00  1.35 
25interop add  1.00  1.15 
26interop indirect add  1.00  1.15 
Quick analysis
.NET 64 Native code gen is better than .NET 32 Native code gen. I haven't dug into the generated code, but the additional registers on x64 probably help the optimizer, while x86 is still fighting with a limited register set (and the x86 code is not using SSE instructions, which doesn't help). It is good to see that interop is also better on x64, while this is not the case for the x64 JIT, where interop is usually much slower.
Comparison .NET64 Native vs .NET64 RyuJit
Use .NET 64 Native as a reference (1.0) and compare it to the .NET 64 RyuJit.
Normalized with performance relative to .NET 64 Native; higher is better (2.0 means a test on x64 RyuJit is 2 times faster than on x64 Native).
 In green, results above +10%
 In red, results below -10%
Name  .NET 64 Native (AppStore)  .NET 64 vs 64 RyuJit
00Big int Dictionary: 1 Adding items  1.00  1.02 
00Big int Dictionary: 2 Running queries  1.00  0.95 
00Big int Dictionary: 3 Removing items  1.00  1.00 
01Big string Dictionary: 0 Ints to strings  1.00  1.38 
01Big string Dictionary: 1 Adding/setting  1.00  1.18 
01Big string Dictionary: 2 Running queries  1.00  1.05 
01Big string Dictionary: 3 Removing items  1.00  1.10 
02Big int sorted map: 1 Adding items  1.00  0.94 
02Big int sorted map: 2 Running queries  1.00  0.91 
02Big int sorted map: 3 Removing items  1.00  1.14 
03Square root: double  1.00  1.00 
03Square root: FPL16  1.00  0.89 
03Square root: uint  1.00  0.97 
03Square root: ulong  1.00  0.83 
04Simple arithmetic: double  1.00  0.26 
04Simple arithmetic: float  1.00  0.73 
04Simple arithmetic: FPI8  1.00  1.56 
04Simple arithmetic: FPL16  1.00  4.31 
04Simple arithmetic: int  1.00  1.07 
04Simple arithmetic: long  1.00  0.96 
05Generic sum: double  1.00  0.99 
05Generic sum: FPI8  1.00  0.78 
05Generic sum: int  1.00  0.78 
05Generic sum: int via IMath  1.00  1.53 
05Generic sum: int without generics  1.00  0.59 
06Simple parsing: 3 Parse (x1000000)  1.00  2.00 
06Simple parsing: 4 Sort (x1000000)  1.00  0.73 
07Trivial method calls: Interface NoOp  1.00  1.24 
07Trivial method calls: Noinline NoOp  1.00  1.00 
07Trivial method calls: Static NoOp  1.00  0.00 
07Trivial method calls: Virtual NoOp  1.00  0.81 
08Matrix multiply:  1.00  1.03 
08Matrix multiply:  1.00  0.99 
08Matrix multiply:  1.00  0.86 
08Matrix multiply: Array2D  1.00  0.72 
08Matrix multiply: double[n*n]  1.00  0.99 
08Matrix multiply: double[n][n]  1.00  1.02 
08Matrix multiply: int[n][n]  1.00  0.94 
09Sudoku  1.00  1.07 
10Polynomials  1.00  1.05 
11JGFArithBench  1.00  1.10 
12JGFAssignBench  1.00  0.52 
13JGFCastBench  1.00  1.40 
14JGFCreateBench  1.00  1.23 
15JGFFFTBench  1.00  0.98 
16JGFHeapSortBench  1.00  0.98 
17JGFLoopBench  1.00  0.93 
18JGFRayTracerBench  1.00  1.30 
19float4x4 matrix mul, Managed Standard  1.00  1.31 
20float4x4 matrix mul, Managed unsafe  1.00  1.12 
21float4x4 matrix mul, Interop Standard  1.00  0.71 
22float4x4 matrix mul, Interop SSE2  1.00  0.55 
23managed add  1.00  4.05 
24managed noinline add  1.00  0.68 
25interop add  1.00  0.95 
26interop indirect add  1.00  0.30 
Quick analysis
Surprisingly, RyuJIT performs quite well, sometimes even better than .NET 64 Native. It might be interesting to dig into this.
Summary
As both .NET Native and RyuJIT are still in alpha/beta stages, we can't really draw any definitive conclusions here. We can see a trend of improvements in some specific areas, while some tests still perform a bit worse than on the legacy JIT. [Edit] The release of .NET Native Developer Preview 3 on June 30, 2014 shows some improvements in code gen, so .NET Native and RyuJIT are definitely being improved between updates, which is great! [/Edit]
It is good to see .NET 64 getting better and performing well with .NET Native and RyuJIT. Until now I have been a bit reluctant to use it, but it now looks more robust compared to the x86 code gen.
While code gen can undoubtedly be improved with offline compilers or a more modern JIT like RyuJIT, we probably can't expect the moon. As I said in the introduction, code gen is only one part of the overall performance cake. The other part, most likely not yet covered by these new compiler architectures, is data locality: things like the ability to create fat objects (embedding object instances directly into another instance) or to allocate short-lived objects (not value types) on the stack instead of the heap are still areas where .NET could be improved. I will hopefully take more time in a future post to explain why this is an important area of improvement and what could be done.
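To illustrate the "fat object" idea: the closest approximation available in C# today is embedding value types, since reference-typed fields always point at separately allocated heap objects. The types below are hypothetical examples, not part of any benchmark here:

```csharp
struct Vector3 { public float X, Y, Z; }

// Reference layout: Position lives in its own heap object, so reading it
// costs an extra indirection, and the two allocations may end up far
// apart in memory.
class Vector3Boxed { public Vector3 Value; }
class NodeWithReference
{
    public Vector3Boxed Position = new Vector3Boxed();
}

// Embedded ("fat") layout: the struct's fields are stored inline inside
// the node, so node and position data share the same cache lines.
class NodeWithEmbedded
{
    public Vector3 Position; // no extra allocation, no indirection
}
```

Extending this inline layout to reference types (and stack allocation of short-lived objects) is what the paragraph above argues .NET could still gain.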
Anyway, it is great to see .NET performance back in the ring! I'm also eager to be able to use .NET Native on the desktop.
Comments
Thanks for your effort. I'm putting a lot of hope in RyuJIT, as it is becoming a promising compiler for managed DirectX game development.
Hi Alexandre,
ReplyDeleteMy name is Pooya Zandevakili and I am one of the developers from the .Net Native team at Microsoft. More specifically, I work on code generation and optimizations for both C++ and (now) C#.
Thank you for sharing this information. I would like to re-emphasize a point that you have mentioned as well: .Net Native is still in preview and there is still work to be done. We are actively working to make sure that it meets the high quality standards that our customers demand, and we will definitely be drilling into the benchmark data you have reported. Community feedback like this is very helpful, so once again thank you. I would also like to encourage you to keep up with our (frequently released) developer previews as we continually add new improvements. In fact, we just released our third Developer Preview today, incorporating other community feedback regarding code quality, which you might find interesting:
Developer Preview 3: http://go.microsoft.com/fwlink/?LinkID=393600
Thank you again and best regards,
Pooya
Thanks Pooya for your feedback. Indeed, I'm glad to see that the new "Developer Preview 3" improves some of the benchmarks here. I have added a disclaimer at the beginning of this post to emphasize the preview cycle.
Can we see some updated numbers, using Developer Preview 3?