code4k

Update your bookmarks! This blog is now hosted on http://xoofx.com/blog

Sunday, September 14, 2014

Why Windows UI Matters: Part 2

This post comes into 2 parts:

Typography
Colors and Tiles (this one)

Following my first post about Typography in Windows 8, I would like in this post to talk more about colors and the tile concept in Windows Phone and Windows 8.x.

If you create a new account on your Windows 8.x, you will get the following start screen (on a 27")

First remark, it looks empty. We know the problem for a long time: this was designed for smaller form factors - Windows Phone but unfortunately we got it as-is on desktop.

Second remark, we also start to get a glimpse of the color problem with the Metro/Tile concept: It looks flashy.

Why Windows UI Matters: Part 1

This post comes into 2 parts:

Typography (this one)
Colors and Tiles

You have probably seen some leaks of screenshots of the next Windows and while this could be - or not - the technical preview that is going to be released later this month for enterprise preview, I have been a bit shocked by the poorness of the UI visuals. Lots of commentators are saying that Microsoft is usually tweaking the OS UI few months before releasing to the public, fine, but I hope this technical preview will unveil a better sneak peak of next Windows UI. If it does not, that's really an unfortunate move, because all these images are already generating misunderstanding.

We are lots around waiting and watching for the next Windows. We are not just waiting for the desktop to come back, we are waiting for something that will make us happy and excited! I'm not a UI designer, I'm just a developer using Windows, but I love good visuals and I'm concerned by the OS UI that accompanies me all along more than 12 hours per day!

I won't comment more on the next Windows OS, but I would like to take the opportunity to share some of the unpleasant things I'm feeling about for the past years with the Windows Metro UI era.

Packages vNext: Power-up our .NET Builds

In the sequel of my previous post "Managing multiple platforms in Visual Studio", having done lots of cross-platform development in .NET in the recent years both at work or for SharpDX (with platform specific assemblies, PCL, assemblies using native compiled code... etc) while trying to trick and abuse nuget and msbuild as much as possible, I have realized that in order to provide a smooth integration of "build packages", this require to be more tightly integrated at the core of a build system.

Unfortunately, we only have today a patchwork of this integration, still quite incomplete and far from what it could be, and this is hurting a lot our development process. We really need something brand new here: we have lots of inputs, usecases, and while it is of course not possible to cover every aspects of all build workflows, It is certainly possible to address most of the common issues we are facing today. Let's try to figure out where this could lead!

Micro-benchmarking .NET Native and RyuJit

[Disclaimer] As both RyuJIT and .NET Native are improving a lot between their updates/previews, results of benchmarks in this post may no longer be relevant. Benchmarks were ran with the following versions:

RyuJIT CTP4

.NET Native developer preview 2

While .NET JIT is performing quite well on Windows, it is still behind a fully optimized C++ program, though an efficient compiled program is not only about code generation but also memory management and data locality, the .NET Team has recently introduced two new technologies that might help for the code gen part: The introduction of .NET Native, an offline .NET compiler (similar to ngen, but using the backend optimizer that is used by the C++ compiler) and the next generation of .NET JIT called "RyuJit". In this post I would like to present the results of some micro-benchmarks that are roughly evaluating some performance benefits of these two new technologies.

First of all, you may have already read about a few benchmarks results available around about RyuJit and .NET Native, here is a non-exhaustive list I have found (if you have more pointers, let me know!):

A first look at RyuJIT CTP3 and SIMD (SSE2) support for .NET by Frank Niemeyer
Lies, damn lies and benchmarks by Kevin Frei
.NET Native Performance by Sasha Goldstein

Managing multiple platforms in Visual Studio

Who has not struggled to correctly manage multiple platform configurations in Visual Studio without ending to edit a solution file or tweak some msbuild files by hand? Recently, I decided to cleanup the antique SharpDX.sln in SharpDX that was starting to be a bit fat and not easy to manage. The build is not extremely bizarre there, but as it needs to cover the combinations of NetPlatform x OSPlatform x DirectXVersion x Debug/Release with around 40 projects (without the samples), it is an interesting case of study. It turns out that modifying the solution to make a clean multi-platform build was impossible without hacking msbuild in order to circumvent unfortunate designs found in Microsoft msbuild files (and later to found at work in Xamarin build files as well). In this post, we will go through the gotchas found, and we will see also why Visual Studio should really improve the configuration manager if they want to improve our developers experience.

Advanced HLSL II: Shader compound parameters

A very short post in the sequel of my previous post "Advanced HLSL using closures and function pointers", there is again a little neat trick by using the "class" keyword in HLSL: It is possible to use a class to regroup a set of parameters (shader resources as well as constant buffers) and their associate methods, into what is called a compound parameter. This feature of the language is absolutely not documented, I discovered the name "compound parameter" while trying to hack this technique, as the HLSL compiler was complaining about a restriction about this "compound parameter". So at least, It seems to be implemented up to the point that it is quite usable. Let's see how we can use this...

Going Native 2.0, The future of WinRT

In the recent years, we have seen lots of fuzz about the return of “Going native” after the managed era popularized by Java and .NET. When WinRT was revealed last year, there was some shortsighted comments to claim that “.NET is dead” and to glorify the comeback of the C++, the true and only real way to develop an application, while at the same time, JIT was being more and more introduced in the scripted world (JavaScript being one of the most prominent JIT user). While in the end, everything is going native anyway - the difference being the length of the path to go native and how much optimized it will be - the meaning of the “native” word has slightly shifted to be strongly and implicitly coupled with the word “performance”. Even being a strong advocator for managed language, the performance level is indeed below a well written C++ application, so should we just accept this fact and get back to work with C++, with things like WinRT being the backbone of the interop? To tell you the truth, I want .NET to die and this post is about why and for what.

The Managed Era

Let’s just begin by revisiting recent history of managed development that will highlight current challenges. Remember the Java slogan? “write once runs everywhere”, it was the introduction of a paradigm where a complete “safe” single language-stack based on a virtual machine associated with a large set of API would allow to easily develop an application and target any kind of platforms/OS. It was the beginning of the “managed” era. While Java has been quite successfully adopted in several development industries, it was also quite rejected by lots of developers that were aware of memory management caveats and the JIT not being as optimized as it should be (though they did some impressive improvements over the years) with also a tremendous amount of bad design choice, like the lack of native struct, unsafe access or the route to go native through JNI extremely laborious and inefficient (and even recently, that they were considering to get rid off all native types and make everything an object, what a terrible direction!).

Advanced HLSL using closures and function pointers

Shader languages like HLSL, Cg or GLSL are nowadays driving the most powerful processors in the world, but if you are developing with them, you may have been already a little bit frustrated by one of their expressiveness limitations: the common problem of abstraction and code reuse. In order to overcome this problem, solutions so far were mostly using a glue combination of #define/#include preprocessors directives in order to generate combinations of code, permutation of shaders, so called UberShaders. Recently, this problem has been addressed, for HLSL (new in Direct3D11), by providing the concept of Dynamic Linking, and for GLSL, the concept of SubRoutines, For Direct3D11, the new mechanism has been only available for Shader Model 5.0, meaning that even if this could greatly simplified the problem of abstraction, It is unfortunately only available for Direct3D11 class graphics card, which is of course a huge limitation...

But, here is the good news: While the classic usage of dynamic linking is not really possible from earlier version (like SM4.0 or SM3.0), I have found an interesting hack to bring some kind of closures and functions pointers to HLSL(!). This solution doesn't involve any kind of preprocessing directive and is able to work with SM3.0 and SM4.0, so It might be interesting for folks like me that like to abstract and reuse the code as often as possible! But let's see how It can be achieved...

Direct3D11 multithreading micro-benchmark in C# with SharpDX

Multi-threading is an important feature added to Direct3D11 two years ago and has been increasingly used on recent game engine in order to achieve better performance on PC. You can have a look at "DirectX 11 Rendering in Battlefield 3" from Johan Anderson/DICE which gives a great insight about how it was effectively used in practice in their game engine. Usage of the Direct3D11 multithreading API is pretty straightforward, and while we are also using it successfully at our work in our R&D 3D Engine, I didn't take the time to sit down with this feature and check how to get the best of it.

I recently came across a question on the gamedev forum about "[DX11] Command Lists on a Single Threaded Renderer": If command lists are an efficient way to store replayable drawing commands, would it be efficient to use them even in a single threaded scenario where lots of drawing commands are repeatable?

In order to verify this, among other things, I did a simple micro-benchmark using C#/SharpDX, but while the results are somehow expectable, there are a couple of gotchas that deserve a more in-depth look...

Managed DirectX with Win8 Metro Style App

Since my last post is quite old, it's time to give more insight about the latest features included in the recent release of SharpDX 2.0 beta. This release is a major step for SharpDX as it includes lots of new APIs but also provides a preview of developing Managed DirectX from a Windows 8 Metro style application. In the meantime, I have been pretty busy as I got a job in a gamedev company located in Japan that is developing a new rendering engine in C#... which is using SharpDX for its windows rendering, and that's extremely exciting to work on a project like this (and of course rewarding for all the investement done in SharpDX!).

While SharpDX is starting to cover almost the whole DirectX API as well as some new Windows Multimedia API (like WIC), I put some effort, just after the Microsoft Build conference, on providing a preview of using SharpDX from a Win8 Metro style application, which is a fantastic opportunity for a C#/.Net developer to use DirectX from both a Metro style application and Desktop application using the same API.

Finally SharpDX is also going to get some new APIs and some interesting features in the following months, especially in the performance domain, to reduce as much as possible the difference of performance between a native C++ DirectX application and a managed C# DirectX application, that will deserve a post on its own. But lets start with what has been achieved in the latest 2.0 version!

Benchmarking C#/.Net Direct3D 11 APIs vs native C++

[Update 2012/05/15: Note that the original code was fine tuned to a particular config and may not give you the same results. I have rewritten this sample to give more accurate and predictible results. Comparison with XNA was also not fair and inaccurate, you should have something like x4 slower instead of x9. SharpDX latest version 2.1.0 is also x1.35 slower than C++ now. An update of this article will follow on new sharpdx.org website]
[Update 2014/06/17: Remove XNA comparison, as it is not fair and relevant]

If you are working with a managed language like C# and you are concerned by performance, you probably know that, even if the Microsoft JIT CLR is quite efficient, It has a significant cost over a pure C++ implementation. If you don't know much about this cost, you have probably heard about a mean cost for managed languages around 15-20%. If you are really concern by this, and depending on the cases, you know that the reality of a calculation-intensive managed application is more often around x2 or even x3 slower than its C++ counterpart. In this post, I'm going to present a micro-benchmark that measure the cost of calling a native Direct3D 11 API from a C# application, using various API, ranging from SharpDX, SlimDX, WindowsCodecPack.

Why this benchmark is important? Well, if you intend like me to build some serious 3D games with a C# managed API (don't troll me on this! ;) ), you need to know exactly what is the cost of calling intensively a native Direct3D API (mainly, the cost of the interop between a managed language and a native API) from a managed language. If your game is GPU bounded, you are unlikely to see any differences here. But if you want to apply lots of effects, with various models, particles, materials, playing with several rendering targets and a heavy deferred rendering technique, you are likely to perform lots of draw calls to the Direct3D API. For a AAA game, those calls could be as high as 3000-7000 draw submissions in instancing scenarios (look at latest great DICE publications in "DirectX 11 Rendering in Battlefield 3" from Johan Andersson). If you are running at 60fps (or lower 30fps), you just have 17ms (or 34ms) per frame to perform your whole rendering. In this short time range, drawing calls can take a significant amount of time, and this is a main reason why multi-threading batching command were introduced in DirectX11. We won't use such a technique here, as we want to evaluate raw calls.

As you are going to see, results are pretty interesting for someone that is concerned by performance and writing C# games (or even efficient tools for a 3D Middleware)

Crinkler secrets, 4k intro executable compressor at its best

(Edit 5 Jan 2011: New Compression results section and small crinkler x86 decompressor analysis)

If you are not familiar with 4k intros, you may wonder how things are organized at the executable level to achieve this kind of packing-performance. Probably the most important and essential aspect of 4k-64k intros is the compressor, and surprisingly, 4k intros have been well equipped for the past five years, as Crinkler is the best compressor developed so far for this category. It has been created by Blueberry (Loonies) and Mentor (tbc), two of the greatest demomakers around.

Last year, I started to learn a bit more about the compression technique used in Crinkler. It started from some pouet's comments that intrigued me, like "crinkler needs several hundred of mega-bytes to compress/decompress a 4k intros" (wow) or "when you want to compress an executable, It can take hours, depending on the compressor parameters"... I observed also bad comrpession result, while trying to convert some part of C++ code to asm code using crinkler... With this silly question, I realized that in order to achieve better compression ratio, you better need a code that is comrpession friendly but is not necessarily smaller. Or in other term, the smaller asm code is not always the best candidate for better compression under crinkler... so right, I needed to understand how crinkler was working in order to code crinkler-friendly code...

I just had a basic knowledge about compression, probably the last book I bought about compression was more than 15 years ago to make a presentation about jpeg compression for a physics courses (that was a way to talk about computer related things in a non-computer course!)... I remember that I didn't go further in the book, and stopped just before arithmetic encoding. Too bad, that's exactly one part of crinkler's compression technique, and has been widely used for the past few years (and studied for the past 40 years!), especially in compressors like H.264!

So wow, It took me a substantial amount of time to jump again on the compressor's train and to read all those complicated-statistical articles to understand how things are working... but that was worth it! In the same time, I spent a bit of my time to dissect crinkler's decompressor, extract the code decompressor in order to comment it and to compare its implementation with my little-own-test in this field... I had a great time to do this, although, in the end, I found that whatever I could do, under 4k, Crinkler is probably the best compressor ever.

You will find here an attempt to explain a little bit more what's behind Crinkler. I'm far from being a compressor expert, so if you are familiar with context-modeling, this post may sounds a bit light, but I'm sure It could be of some interest for people like me, that are discovering things like this and want to understand how they make 4k intros possible!

Official release of SharpDX 1.0

After three months of intense development, I'm really excited to announce the availability of SharpDX 1.0 , a new platform independent .Net managed DirectX API, directly generated from DirectX SDK headers.

This first version can be considered as stable. The Direct3D10 / Direct3D10.1 API has been entirely tested on a large 3D engine that was using previously SlimDX (thanks patapom!). Migration was quite straightforward, with tiny minor changes to the engine's code.

The key features and benefits of this new API are:

API is generated from DirectX SDK headers : meaning a complete and reliable API and an easy support for future API.
Full support for the following DirectX API:

Direct3D10
Direct3D10.1
Direct3D11
Direct2D1 (including custom rendering, tessellation callbacks)
DirectWrite (including custom client callbacks)
D3DCompiler
DXGI
DXGI 1.1
DirectSound
XAudio2
XAPO
An integrated math API directly ported from SlimMath

Pure managed .NET API, platform independent : assemblies are compiled with AnyCpu target. You can run your code on a x64 or a x86 machine with the same assemblies, without recompiling your project.
Lightweight individual assemblies : a core assembly - SharpDX - containing common classes and an assembly for each subgroup API (Direct3D10, Direct3D11, DXGI, D3DCompiler...etc.). Assemblies are also lightweight.
C++/CLI Speed : the framework is using a genuine way to avoid any C++/CLI while still achieving comparable performance.
API naming convention mostly compatible with SlimDX API.
Raw DirectX object life management : No overhead of ObjectTable or RCW mechanism, the API is using direct native management with classic COM method "Release".
Easily mergeable / obfuscatable : If you need to obfuscate SharpDX assemblies, they are easily obfusctable due to the fact the framework is not using any mixed assemblies. You can also merge SharpDX assemblies into a single exe using with tool like ILMerge.

You will also find a growing collection of samples in the Samples Gallery of SharpDX. Most notably with some additional support for Direct2D1 and DirectWrite client callbacks.

Instead of providing a monolithic assembly, SharpDX is providing lightweight individual and interdependent assemblies. All SharpDX assemblies are dependent from the core SharpDX assembly. You just need to add the required assemblies to your project, without embedding the whole DirectX API stack. Here is a chart that explains SharpDX assembly dependencies:

Next versions will provide support for DirectInput, XInput, X3DAudio, XACT3.

About performance

Someone asked me how SharpDX compares to SlimDX in terms of performance. Here is a micro-benchmark on two methods, ID3D10Device1::GetFeatureLevel (alias Device.FeatureLevel) and ID3D10Device::CheckCounterInfo (alias Device.GetCounterCapabilities).

The test consist of 100,000,000 calls on each methods (inside a for, with (10 calls to device.FeatureLevel) * 10,000,000 times) and is repeated 10 times and averaged. Repeated two times.

Method	SlimDX	SharpDX	SharpDX vs SlimDX
device.FeatureLevel	3700	3650	1,37%
device.GetCounterCapabilities()	4684	4259	9,98%

For FeatureLevel, the test was sometimes around +/-0.5%.
For GetCounterCapabilities(), the main difference between SharpDX and SlimDX implementation is that SlimDX perform a copy from the native struct to .Net struct while SharpDX is directly passing a pointer to the .Net struct.

This test is of course a micro benchmark and doesn't reflect a real-world usage. Some part of the API could be in favor of SlimDX, but I'm pretty confident that SharpDX is much more consistent in the way structures are passed to the native functions, avoiding as much as possible marshaling structures that doesn't need any custom marshaling (unlike SlimDX that is performing most of a time a marshaling between .Net/Native structure, besides they are binary compatible).

Next?

Finally, I'm going to be able to use this project to make some demos with it! Next target is to develop a XNA like based framework based on SharpDX.Direct3D11.

Stay tuned!

Thursday, November 18, 2010

SharpDX, a new managed .Net DirectX API available

If you have followed my previous work on a new .NET API for Direct3D 11, I proposed SlimDX team this solution for the v2 of their framework, joined their team around one month ago, and I was actively working to widen the coverage of the DirectX API. I have been able to extend the API coverage almost up to the whole API, being able to develop Direct2D samples, as well as XAudio2 and XAPO samples using it. But due to some incompatible directions that the SlimDX team wanted to follow, I have decided to release also my work under a separate project called SharpDX. Now, you may wonder why I'm releasing this new API under a separate project from SlimDX?

Well, I have been working really hard on this from the beginning of September, and I explained why in my previous post about Direct3D 11. I have checked-in lots of code under the v2 branch on SlimDX, while having lots of discussion with the team (mostly Josh which is mostly responsible for v2) on their devel mailing list. The reason I'm leaving SlimDX team is that It was in fact not clear for me that I was not enrolled as part of the decision for the v2 directions, although I was bringing a whole solution (by "whole", I mean a large proof of concept, not something robust, finished). At some point, Josh told me that Promit, Mike and himself, co-founders of SlimDX, were the technical leaders of this project and they would have the last word on the direction as well as for decisions on the v2 API.

Unfortunately, I was not expecting to work in such terms with them, considering that I had already made 100% of the whole engineering prototype for the next API. From the last few days, we had lots of -small- technical discussions, but for some of them, I clearly didn't agree about the decisions that were taken, whatever the arguments I was trying to give to them. This is a bit of disappointment for me, but well, that's life of open source projects. This is their project and they have other plans for it. So, I have decided to release the project on my own with SharpDX although you will see that the code is also currently exactly the same on the v2 branch of SlimDX (of course, because until yesterday, I was working on the SlimDX v2 branch).

But things are going to change for both projects : SlimDX is taking the robust way (for which I agree) but with some decisions that I don't agree (in terms of implementation and direction). Although, as It may sound weird, SharpDX is not intended to compete with SlimDX v2 : They have clearly a different scope (supporting for example Direct3D 9, which I don't really care in fact), different target and also different view on exposing the API and a large existing community already on SlimDX. So SharpDX is primarily intended for my own work on demomaking. Nothing more. I'm releasing it, because SlimDX v2 is not going to be available soon, even for an alpha version. On my side, I'm considering that the current state (although far to be as clean as It should be) of the SharpDX API is usable and I'm going to use it on my own, while improving the generator and parser, to make the code safer and more robust.

So, I did lots of work to bring new API into this system, including :

Direct3D 10
Direct3D 10.1
Direct3D 11
Direct2D 1
DirectWrite
DXGI
DXGI 1.1
D3DCompiler
DirectSound
XAudio2
XAPO

And I have been working also on some nice samples, for example using Direct2D and Direct3D 10, including the usage of the tessellate Direct2D API, in order to see how well It works compared to the gluTessellation methods that are most commonly used. You will find that the code is extremely simple in SharpDX to do such a thing :

using System;
using System.Drawing;
using SharpDX.Direct2D1;
using SharpDX.Samples;

namespace TessellateApp
{
    /// 
    /// Direct2D1 Tessellate Demo.
    /// 
    public class Program : Direct2D1DemoApp, TessellationSink
    {
        EllipseGeometry Ellipse { get; set; }
        PathGeometry TesselatedGeometry{ get; set; }
        GeometrySink GeometrySink { get; set; }

        protected override void Initialize(DemoConfiguration demoConfiguration)
        {
            base.Initialize(demoConfiguration);

            // Create an ellipse
            Ellipse = new EllipseGeometry(Factory2D,
                                          new Ellipse(new PointF(demoConfiguration.Width/2, demoConfiguration.Height/2), demoConfiguration.Width/2 - 100,
                                                      demoConfiguration.Height/2 - 100));

            // Populate a PathGeometry from Ellipse tessellation 
            TesselatedGeometry = new PathGeometry(Factory2D);
            GeometrySink = TesselatedGeometry.Open();
            // Force RoundLineJoin otherwise the tesselated looks buggy at line joins
            GeometrySink.SetSegmentFlags(PathSegment.ForceRoundLineJoin); 

            // Tesselate the ellipse to our TessellationSink
            Ellipse.Tessellate(1, this);

            // Close the GeometrySink
            GeometrySink.Close();
        }


        protected override void Draw(DemoTime time)
        {
            base.Draw(time);

            // Draw the TextLayout
            RenderTarget2D.DrawGeometry(TesselatedGeometry, SceneColorBrush, 1, null);
        }

        void TessellationSink.AddTriangles(Triangle[] triangles)
        {
            // Add Tessellated triangles to the opened GeometrySink
            foreach (var triangle in triangles)
            {
                GeometrySink.BeginFigure(triangle.Point1, FigureBegin.Filled);
                GeometrySink.AddLine(triangle.Point2);
                GeometrySink.AddLine(triangle.Point3);
                GeometrySink.EndFigure(FigureEnd.Closed);                
            }
        }

        void TessellationSink.Close()
        {            
        }

        [STAThread]
        static void Main(string[] args)
        {
            Program program = new Program();
            program.Run(new DemoConfiguration("SharpDX Direct2D1 Tessellate Demo"));
        }
    }
}

This simple example is producing the following ouput :

which is pretty cool, considering the amount of code (although the Direct3D 10 and D2D initialization part would give a larger code), I found this to be much simpler than the gluTessellation API.

You will find also some other samples, like the XAudio2 ones, generating a synthesized sound with the usage of the reverb, and even some custom XAPO sound processors!

You can grab those samples on SharpDX code repository (there is a SharpDXBinAndSamples.zip with a working solutions with all the samples I have been developing so far, with also MiniTris sample from SlimDX).

Wednesday, November 3, 2010

Hacking Direct2D to use directly Direct3D 11 instead of Direct3D 10.1 API

Disclaimer about this hack: This hack was nothing more than a proof of concept and I *really* don't have time to dig into any kind of bugs related to it.

[Edit]13 Jan 2011, After Windows Update KB2454826, this hack was not working. I have patched the sample to make it work again. Of course, you shouldn't consider this hack for anykind of production use. Use the standard DXGI shared sync keyed mutex instead. This hack is just for fun![/Edit]

If you know Direct3D 11 and Direct 2D - they were released almost at the same time - you already know that there is a huge drawback to use Direct 2D : It's in fact only working with Direct3D 10.1 API (although It's working with older hardware thanks to the new feature level capability of the API).

From a coding user point of view, this is really disappointing that such a good API doesn't rely on the latest Direct3D API... moreover when you know that the Direct3D 11 API is really close to the Direct3D 10.1 API... In the end, more work are required for a developer that would like to work with Direct3D 11, as It doesn't have any more Text API for example, meaning that in D3D11, you have to do it yourself, which isn't a huge task itself, if you go to the easy precalculated-texture-of-fonts generated by some GDI+ calls or whatever, but still... this is annoying specially when you need to display some information/FPS on the screen and you can't wait to build a nice font-texture-based system...

I'm not completely fair with Direct2D interoperability with Direct3D 11 : there is in fact a well known solution proposed by one guy from DirectX Team that imply the use of DXGI mutex to synchronized a surface shared between D3D10.1 and D3D11. I was expecting this issue to be solved in some DirectX SDK release this year, but It seems that there is no plan to release in the near future an update for Direct2D (see my question in the comments and the anwser...)... WP7 and XNA are probably getting much more attention here...

So last week, I took some time on the Direct2D API and found that It's in fact fairly easy to hack Direct2D and redirect all the D3D10.1 API calls to a real Direct3D 11 instance... and this is a pretty cool news! Here is the story of this little hack...

Implementing an unmanaged C++ interface callback in C#/.Net

Ever wanted to implement a C++ interface callback in a managed C# application? Well, although that's not so hard, this is a solution that you will probably hardly find over the Internet... the most common answer you will get is that it's not possible to do it or you should use C++/CLI in order to achieve it... In fact, in C#, you can only implement a C function delegate through the use of Marshal.GetFunctionPointerForDelegate but you won't find anything like Marshal.GetInterfacePointerFromInterface. You may wonder why do I need such a thing?

In my previous post about implementing a new DirectX fully managed API, I forgot to mention the case of interfaces callbacks. There are not so many cases in Direct3D 11 API where you need to implement a callback. You will more likely find more use-cases in audio APIs like XAudio2, but in Direct3D 11, afaik, you will only find 3 interfaces that are used for callback:

ID3DInclude which is used by D3DCompiler API in order to provide a callback for includes while using preprocessor or compiler API (see for example D3DCompile).
ID3DX11DataLoader and ID3DX11DataProcessor, which are used by some D3DX functions in order to perform asynchronous loading/processing of texture resources. The nice thing about C# is that those interfaces are useless, as it is much easier and trivial to directly implement them in C# instead

So I'm going to take the example of ID3DInclude, and how It has been successfully implemented for the SharpDX.

High performance memcpy gotchas in C#

(Edit 8 Jan 2011: Update protocol test with Buffer.BlockCopy)
(Edit 11 Oct 2012: Please vote for the x86 cpblk deficiency on Microsoft Connect)
Following my last post about an interesting use of the "cpblk" IL instruction as an unmanaged memcpy replacement, I have to admit that I didn't take the time to carefully verify that performance is actually better. Well, I was probably too optimistic... so I have made some tests and the results are very surprising and not expected to be like these...

The memcpy protocol test in C#

When dealing with 3D calculations, large buffers of textures, audio synthesizing or whatever requires a memcpy and interaction with unmanaged world, you will most notably end up with a call to an unmanaged functions like this one:

[DllImport("msvcrt.dll", EntryPoint = "memcpy", CallingConvention = CallingConvention.Cdecl, SetLastError = false), SuppressUnmanagedCodeSecurity]
public static unsafe extern void* CopyMemory(void* dest, void* src, ulong count);

In this test, I'm going to compare this implementation with 4 challengers :

The cpblk IL instruction
A handmade memcpy function
Array.Copy, although It's not relevant because they don't have the same scope. Array.Copy is managed only for arrays only while memcpy is used to copy portion of datas between managed-unmanaged as well as unmanaged-unmanaged memory.
Marshal.Copy, same as Array.Copy
Buffer.BlockCopy, which is working on managed array but is working with a byte size block copy.

The test is performing a series of memcpy with different size of block : from 4 bytes to 2Mo. The interesting part is to run this test on a x86 and x64 mode. Both tests are running on the same Windows 7 OS x64, same machine Intel Core I5 750 (2.66Ghz). The CLR used for this is the Runtime v4.0.30319.

The naive handmade memcpy is nothing more than this code (not to be the best implem ever but at least safe for any kind of buffer size):

static unsafe void CustomCopy(void * dest, void* src, int count)
{
    int block;

    block = count >> 3;

    long* pDest = (long*)dest;
    long* pSrc = (long*)src;

    for (int i = 0; i < block; i++)
    {
        *pDest = *pSrc; pDest++; pSrc++;
    }
    dest = pDest;
    src = pSrc;
    count = count - (block << 3);

    if (count > 0)
    {
        byte* pDestB = (byte*) dest;
        byte* pSrcB = (byte*) src;
        for (int i = 0; i < count; i++)
        {
            *pDestB = *pSrcB; pDestB++; pSrcB++;
        }
    }
}

Results

For the x86 architecture, results are expressed as a throughput in Mo/s - higher is better, blocksize is in bytes :

BlockSize	x86-cpblk	x86-memcpy	x86-CustomCopy	x86-Array.Copy	x86-Marshal.Copy	x86-BlockCopy
4	146	458	470	85	81	150
8	294	843	1122	168	167	298
16	587	1628	1904	306	327	577
32	950	1876	3184	631	558	1079
64	1451	3316	4295	1205	1059	1981
128	2245	5161	4848	2176	1933	3386
256	4353	7032	5333	3699	3386	5333
512	8205	13617	5517	5663	6666	7441
1024	13617	20000	6666	7710	12075	9275
2048	18823	24615	7191	9142	16842	9552
4096	2922	7529	5663	10491	7032	11034
8192	2990	7804	5714	11228	7441	11636
16384	2857	7901	5614	9142	7619	10322
32768	2379	6736	5333	8101	6666	8205
65536	2379	6808	5470	8205	6808	8205
131072	2509	17777	5818	8101	17777	8101
262144	2500	11636	5423	7032	11428	7111
524288	2539	11428	5423	7111	11428	7111
1048576	2539	11428	5470	7032	11428	7111
2097152	2529	11428	5333	7032	11034	6881

For the x64 architecture:

BlockSize2	x64-cpblk	x64-memcpy	x64-CustomCopy	x64-Array.Copy	x64-Marshal.Copy	x64-BlockCopy
4	583	346	599	99	111	219
8	1509	770	1876	212	224	469
16	2689	1451	3316	417	422	903
32	4705	2666	5000	802	864	1739
64	8205	4812	7272	1568	1748	3350
128	13333	8101	9014	3004	3184	6037
256	18823	11428	10000	5470	5245	8648
512	22068	16000	10491	9014	9552	13913
1024	22857	19393	7356	13333	13617	16842
2048	23703	21333	7710	17297	17777	20645
4096	23703	22068	7804	19393	20000	21333
8192	23703	22857	7619	22068	22068	22857
16384	23703	22857	7804	17297	21333	18285
32768	16410	16410	7710	12800	16000	12800
65536	13061	14883	7710	13061	14545	13061
131072	14222	13913	7710	12800	13617	12800
262144	5000	5039	7032	7901	5000	7804
524288	5079	5000	7356	8205	5079	7804
1048576	4885	4885	7272	7441	4671	7529
2097152	5039	5079	7272	7619	5000	7710

Graph comparison only for cpblk, memcpy and CustomCopy:

Don't be afraid about the performance drop for most of the implem... It's mostly due to cache missing and copying around different 4k pages.

Conclusion

Don't trust your .NET VM, check your code on both x86 and x64. It's interesting to see how much the same task is implemented differently inside the CLR (see Marshal.Copy vs Array.Copy vs Buffer.Copy)

The most surprising result here is the poor performance of cpblk IL instruction in x86 mode compare to the best one in x64 which is... cpblk. So to summarize:

On x86, you should better use a memcpy function
On x64, you should better use a cpblk function, which is performing better from small size (twice faster than memcpy) to large size.

You may wonder why the x86 version is so unoptimized? This is because the x86 CLR is generating a x86 instruction that is performing a memcpy on a PER BYTE basis (rep movb for x86 folks), even if you are moving a large memory chunk of 1Mo! In comparison, a memcpy as implemented in MSVCRT is able to use SSE instructions that are able to batch copy with large 128 bits registers (with also an optimized case for not poluting CPU cache). This is the case for x64 that seems to use a correct implemented memcpy, but the x86 CLR memcpy is just poorly implemented. Please vote for this bug described on Microsoft Connect.

One important consequence of this is when you are developping a C++/CLI and calling a memcpy from a managed function... It will end up in a cpblk copy functions... which is almost the worst case on x86 platforms... so be careful if you are dealing with this kind of issue. To avoir this, you have to force the compiler to use the function from the MSVCRTxx.dll.

Of course, the memcpy is platform dependent, which would not be an option for all...

Also, I didn't perform this test on a CLR 2 runtime... we could be surprised as well... There is also one thing that I should try against a pure C++ memcpy using the optimized SSE2 version that is shipped with later msvcrt.

You can download the VS2010 project from here

code4k

Sunday, September 14, 2014

Saturday, September 13, 2014

Thursday, August 14, 2014

Monday, June 16, 2014

Sunday, May 25, 2014

Saturday, April 20, 2013

Wednesday, August 8, 2012

The Managed Era

Thursday, November 24, 2011

Friday, November 18, 2011

Sunday, October 2, 2011

Tuesday, March 15, 2011

Wednesday, December 29, 2010

Wednesday, December 1, 2010