code4k

Why Windows UI Matters: Part 2

2014-09-14T23:58:00.000+11:00

This post comes in 2 parts:

Typography
Colors and Tiles (this one)

Following my first post about Typography in Windows 8, I would like to talk in this post about colors and the tile concept in Windows Phone and Windows 8.x.

If you create a new account on Windows 8.x, you will get the following start screen (on a 27" monitor):

First remark: it looks empty. We have known about this problem for a long time: the UI was designed for smaller form factors (Windows Phone), but unfortunately we got it as-is on the desktop.

Second remark: we also start to get a glimpse of the color problem with the Metro/Tile concept. It looks flashy.

Back in 2012, when the Windows 8 UI was revealed, one of the first jokes we quickly got on social networks was a comparison with an AOL screenshot from 1996:

And when you see this kind of comparison immediately popping up all around, it should ring a bell...

But let's just zoom in on the Windows 8 tiles area:

If you are a developer, designer or working on colorization, you will immediately see that some colors have been pushed close to the range #FF0000 or #00FF00 or #0000FF. Visually, colors on this screen are attacking us.

Colors

I have put this image into the "Image Color Summarizer" by Martin Krzywinski:

RGB (Win8)

Value (Win8)

Saturation (Win8)

Hue (Win8)

If we pay more attention to the hue and saturation:

For the hue: We are getting several spikes over the whole range. Red, Green, Blue, Pink are popping up.
For the saturation: We are getting mostly high values, meaning that the image is highly saturated.

This is the immediate feeling we have when looking at this image. This is confirmed by some naive analysis. On Windows 8, dominant colors are coming from the tiles.
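For readers who want to reproduce this kind of naive analysis without the online tool, here is a minimal sketch in Python. It is not how the "Image Color Summarizer" works internally; it only assumes we already have a list of RGB pixels (how they are read from a screenshot is left out), and the function name and bin count are my own choices for illustration.

```python
import colorsys
from collections import Counter

def hue_saturation_summary(pixels, hue_bins=12):
    """Summarize (r, g, b) pixels (0-255 ints) as a hue histogram and a mean saturation.

    hue_bins is a hypothetical choice: 12 bins of 30 degrees each.
    """
    hue_hist = Counter()
    saturations = []
    for r, g, b in pixels:
        h, s, _v = colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)
        saturations.append(s)
        if s > 0:  # hue is meaningless for pure greys, so skip them
            hue_hist[int(h * hue_bins) % hue_bins] += 1
    mean_saturation = sum(saturations) / len(saturations) if saturations else 0.0
    return dict(hue_hist), mean_saturation

# Made-up sample: three fully saturated "tile" colors and one grey pixel.
sample = [(255, 0, 0), (0, 255, 0), (0, 0, 255), (128, 128, 128)]
hist, mean_sat = hue_saturation_summary(sample)
```

Several well-populated hue bins combined with a high mean saturation is the "spikes everywhere, highly saturated" pattern described above.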

If we perform this kind of analysis on the front page of an iPad, we get completely different results. Most notably, we don't get these spikes, because the dominant color depends on the wallpaper image, and we usually don't put a #FF0000 image in the background:

iPad "Start" Screen

On the iPad screenshot, we can see that icons are colorful, but they are not attacking us, because on the iPad the dominant color is coming from the background. The image is more desaturated than saturated, and we can effectively confirm that spikes don't spawn everywhere in the hue range:

Saturation (iPad)

Hue (iPad)

Again, I'm not an expert in design and UI, but I don't feel that vibrant hue/saturation fits well for a welcome/main desktop screen. I also don't believe that this is the only way to provide a flat design interface. Or is it because of its flatness that they need to exaggerate colors to make it less "emotionally flat"?

The use of different hues in the same image with lots of saturated colors could work for a one-time-visit website/logo... but for our OS welcome screen, which we visit several times a day... it is nothing less than highly repulsive (hence one reason I have set up my machine to boot directly to the desktop and have not used a single Metro app in 2 years). Checking a bit more about saturated colors in design guidelines, I found an article, "When to Use Saturated Colors?" by Curtis Newbold, and he recommends using saturated colors when:

You need to attract attention: We don't need our start/welcome screen to attract our attention. No, thanks. We are using it mainly for working.
You want to create an exciting atmosphere: don't think this is good either...
You want to simplify emotional response: I believe that the emotional response so far for the start screen has not been good

But

Too many saturated colors next to each other can cause eye fatigue [...]

Probably we should insist more that we don't want our welcome/start screen to look like a shiny-marketing-website?

Tiles

But this is not all about colors... Tiles are actually accentuating the problem. Instead of having colorful small icons, we get monochromatic big rectangles that cover several hundred pixels on our screen. We can't escape from the space they cover!

The background is completely hidden. We can't customize this. On our good old desktop (or hey, an iPad "Start" screen looks like our good old Windows Desktop with organized icons, no?), we can put a wallpaper and have icons on top of it, but the icons will not hide the wallpaper. On the Windows 8 Start Screen, you really don't have a choice.

In Windows Phone 8.1, they are trying to get around this problem by allowing a background image to be shown in the tiles... not sure it works any better. Here are a screenshot of Windows Phone 8.1, one of Windows Phone 8.1 with an Android background, and finally a screenshot from an Android device:

Windows Phone 8.1

Windows Phone 8.1 + Android bg

Android

Maybe for a phone, the Metro UI concept could make sense for some people. While I found it attractive and "distinctive" at the beginning, I quickly got bored by the overall theming. Lots of tiles use the main background color, so many applications are a bit more difficult to distinguish from one another (only by their monochromatic white icon). We could choose to place icons based on their colors, but I usually place icons based on their position relative to my fingers. I'm also not particularly convinced by the horizontal/vertical black lines, as the result looks less smooth than a clean continuous background.

I was not supposed to talk about the next Windows UI, but let's be clear: when I saw some screenshots at Microsoft Build last year showing tiles integrated into a new Frankenstein-esque Start Menu, I was frankly not really happy about the idea:

Windows 9 Tech Preview Leak (Source: WinFuture.de)

I don't know if you are like me about this, but I would really love our main Windows OS playground going back to some fundamentals:

Just get rid of the Metro UI, tiles, colors... as they don't match the overall UI of all the other parts of the OS and the applications we are using
Better leverage the desktop area. Maybe with virtual desktops, it could make more sense (at some point, it will look like an iPad!... oh boy, who could believe this?)
Add more power to the Taskbar. Just one example: sub-folders/app groups (this is not new, I know!)

There are also a bunch of old Windows UI discrepancies floating around that would deserve a separate post. I hope I will have enough motivation to write it, but I need to go back to coding, enough blabla about UI design!

Why Windows UI Matters: Part 1

2014-09-13T03:03:00.000+11:00

This post comes in 2 parts:

Typography (this one)
Colors and Tiles

You have probably seen some leaked screenshots of the next Windows. While this may or may not be the technical preview that is going to be released later this month as an enterprise preview, I have been a bit shocked by the poor quality of the UI visuals. Lots of commentators are saying that Microsoft usually tweaks the OS UI a few months before releasing to the public. Fine, but I hope this technical preview will give a better sneak peek of the next Windows UI. If it does not, that's a really unfortunate move, because all these images are already generating misunderstanding.

Many of us are waiting and watching for the next Windows. We are not just waiting for the desktop to come back, we are waiting for something that will make us happy and excited! I'm not a UI designer, I'm just a developer using Windows, but I love good visuals and I care about the OS UI that accompanies me for more than 12 hours per day!

I won't comment more on the next Windows OS, but I would like to take the opportunity to share some of the unpleasant things I have been feeling over the past years of the Windows Metro UI era.

Preamble

I remember reading the old document of the Windows Phone design philosophy when it came out (I have found just one around) and the first two major pillars of the Metro philosophy were:

1) Clean, light, open, and fast: It is visually distinctive, contains ample white space, reduces clutter and elevates typography as a key design element
2) Content, not chrome: It accentuates focus on the content that the user cares most about, making the product simple and approachable for everyone

That's the original seed and... original sin.

First, it considers that an OS UI is like a transit area, a metro or an airport. I rarely spend a whole day in a metro/airport and I don't think I would love to. While this design philosophy can be successful for some applications (like news, because this is one place where the content is more important than the chrome), it is questionable to apply it everywhere.

Do you typography?

Let's take one of my favorite examples: the settings of Windows Phone, and this applies to some extent to the settings of Windows 8.x. Here are the screenshots of all the settings that are scrollable on a Windows developer phone I have:

It is composed of 7 different screens (!), 48 individual plain text entries just for the front settings! I have just added the second screen for applications, but some sub-screen settings suffer from the same syndrome...

I have found this UI to be one of the most unpleasant settings areas I have used in the past years across the three major mobile OSes. Every time I have to use this area, I'm struggling to scroll down, scroll up, blinking my eyes to find - and miss - the entry I'm looking for. It was even worse with Windows Phone 8, as it didn't have a notification/simplified control center accessible from the home screen, so every day I had to dive deep into these settings to find the flight mode (2nd screen, first line!)

This is where typography is abused. It has only content, and absolutely no chrome. Where are the:

categories?
icons?
colors?

Let's just have a look at the Windows 8.1 version.

Not surprisingly, whenever I have to change settings on my PC, I'm still going to the good old control panel. While it is lacking flat designs and refreshed icons, it is still much easier to access all your settings there (and you have many more of them) than going around the new Windows 8.1 Metro settings.

Apart from the categories, it still feels cold on colors and icons. Why ban them so hard?

Let's just have a look at the settings on my Android device:

Settings on an Android

Visually, it is a bit more pleasant, even if I'm not a fan of the toggle buttons (not really a flat design), but overall, it is functionally a lot more usable. My grandpa, with his weak eyesight, is much more able to handle these settings than the ones from Windows.

Perceptually, psychologically, without being an expert, I believe this is wrong. Many people, probably starting with myself, are not comfortable with pure text, language, reading... etc. My kids, who are not yet able to read, would not be able to navigate these settings (even if they should not have to!).

Our brains have different ways of deciphering information, based on text, form, colors, spatial placement, sound. Some people can perfectly handle all this text, some can't. Relying only on typography, on a single axis of chrome, leaves some people confused.

This division of content and chrome hurts more than it sounds: drying out the chrome is drying out the content! It is essential that substance and form come together.

For the next part, let's talk about colors and tiles!

Stay tuned!

Packages vNext: Power-up our .NET Builds

2014-08-14T02:49:00.000+11:00

As a sequel to my previous post "Managing multiple platforms in Visual Studio": having done lots of cross-platform development in .NET in recent years, both at work and for SharpDX (with platform-specific assemblies, PCL, assemblies using native compiled code... etc.), while trying to trick and abuse nuget and msbuild as much as possible, I have realized that a smooth integration of "build packages" requires them to be more tightly integrated at the core of the build system.

Unfortunately, today we only have a patchwork of this integration, still quite incomplete and far from what it could be, and this is hurting our development process a lot. We really need something brand new here: we have lots of inputs and use cases, and while it is of course not possible to cover every aspect of all build workflows, it is certainly possible to address most of the common issues we are facing today. Let's try to figure out where this could lead!

What is a platform?

Hey, looks like the Wikipedia definition is quite good:

A computing platform is, in the most general sense, whatever pre-existing environment a piece of software is designed to run within, obeying its constraints, and making use of its facilities. Typical platforms include a hardware architecture, an operating system (OS), and runtime libraries.^[1]

So this could be:

Targeting different CPU: like x86/x64/ARM...
Targeting different OS: Windows Desktop, Windows Phone, Windows Store Apps, Android, iOS, XBoxOne OS, PS4...etc.
Targeting other specific HW through an existing API (the runtime libraries of the Wikipedia's definition), like GPU through OpenGL, Direct3D, Metal...etc.

How do we target a platform in .NET?

Here is the short story. Our daily life is of course a bit more complex.

For the CPU part:

"Any CPU" is most of the time our time-saver (digression: why oh why must "Any CPU" be defined with a space in the solution and be expected as "AnyCPU" without a space in an xxproj?!)
But when we have to use some external native code (dlls), we have to "DllImport" these functions. Problem is: native code comes with target CPUs, so

Either the library we are using is on the OS. For example, a DllImport of Direct3D from a .NET application is transparent, as the OS is handling the x86/x64/ARM switch for us
Or using a custom external native dll:

Best case: We are lucky enough to be able to "LoadLibrary" (looking at you Windows RT/Store) to preload the x86/x64 dll, and then let the DllImport use the already loaded dll
Lazy/lame case: Patching the environment PATH variable (not always working)
Worst case: We are forced to compile our application against x86/x64/ARM because the target platform doesn't support multiple CPU assemblies in the same package (doh! looking at you Windows RT/Store) or DllImport is not working (doh! Silverlight CLR on Windows Phone 8.0). Even if 90% of our code could be AnyCPU and we just want a tiny dll function, we are good for compiling/distributing 3 packages. That's our life... needless to say, painful.
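To make the "best case" a bit more concrete, here is a rough sketch of the preload idea, written in Python rather than C# to keep it short. The folder layout (an "x86" and an "x64" sub-folder next to the application) and the function names are assumptions for illustration only; in .NET the actual preload would go through a P/Invoke to LoadLibrary before the first DllImport call.

```python
import os
import struct

def process_bits():
    """Return 32 or 64 depending on the pointer size of the running process."""
    return struct.calcsize("P") * 8

def native_library_path(base_dir, library_name, bits=None):
    """Pick the CPU-specific copy of a native library.

    Assumes a hypothetical layout: <base_dir>/x86/<name> and <base_dir>/x64/<name>.
    """
    bits = process_bits() if bits is None else bits
    arch = {32: "x86", 64: "x64"}[bits]
    return os.path.join(base_dir, arch, library_name)

# The resolved path would then be loaded once, up front (for example with
# ctypes.CDLL(path) in Python), so that later bindings reuse the loaded module.
path = native_library_path("deps", "native.dll")
```

The point is only that the CPU switch happens once at startup, while the rest of the code stays CPU agnostic.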

For the OS and runtime libraries part:

If we are developing a library and are lucky enough not to use any OS-specific APIs (looking at you, FileStream, no longer portable because of the Windows RT/Store mess!), we can go with Portable Class Libraries (PCL). Of course, if we failed to compile to Any CPU, we are good for the next choice.
If we are developing an application (an exe, a dll activity... etc.) or a non PCL-friendly library, we are good for compiling against specific tool-chains (the little msbuild files imported at the end of our xxproj, remember?) and assemblies

But wait, that's a little short on the real coding journey here: In order to develop, build and distribute cross-platform libraries/applications, we are often juggling through different processes and constraints:

Use external assemblies, libraries, tools

Most of the time by having an "external" or "deps" folder in our product repo, storing dlls for a specific version, or being able to recompile these dependencies from the sources in an internal repo. Care must be taken about versioning
Potentially integrating them in our build process (UsingTask, pre-source process, post-exe process, ILMerge...etc.)
Potentially using NuGet to get all-in-one packages

If we do so, be ready to accept xxproj to be messed up by nuget, in several places (see next part) and prepare to suffer after a package update with our VCS...etc.

Develop platform-specific assemblies that require platform-specific projects (for desktop, for WinRT, for WP8.x, for Android, for iOS...) with potentially some cross-platform parts (PCL) and sometimes with native code to compile and/or to link to.

Best case: we can build everything from a single solution (sln), and in some cross-platform cases, using the kind of tricks I described in my previous post.
Worst case: we need to handle different solutions for different platforms, sometimes requiring us to develop custom tools to synchronize projects between platforms
Depending on some defines, we could have different builds for the same platform (like debug with logs/release no logs... etc.)

Use a build system to compile our solution/projects, most of the time using msbuild

Potentially developing custom msbuild targets and distributing them as part of our product

Distribute our work, potentially using NuGet or some installers

Prepare to manage custom PowerShell and msbuild target files in NuGet package if you have anything platform specific (like x86) like in SharpDX.targets used by the nuget package.

All our work is version controlled, right? So every step above can lead to some specific cases and annoyances (lock the sln, lock this csproj... hm, no, git era dude, merge conflict or die!)

So, we somewhat end-up with:

Best-case: We have a single PCL library. Go back home from work, kiss your family.
Social-case: We are publishing our PCL to nuget
Worst-case: We have (multiple CPUs to support) x (multiple OS/Store rules) x (multiple platform-specific APIs) assemblies to

develop (hey, #ifdef we still love you, you know)
build (hey, Condition="'$(XXX)'=='true'" is our friend, and oh, don't expect to avoid the msbuild's underground, msbuild is like our grandma, she still needs lots of love)
deploy (hey... hm, ok, I gave up, too many options for a one-liner "hey")

You may have gone through what is described here, and you may have had to handle far worse cases than I can imagine, but... can we really improve things here?

Build packages vNext

As a preamble, a little note about NuGet. NuGet has been helping a lot in this area and is a super contribution to the develop/build/deploy chain, but NuGet still has to struggle with legacy builds, sometimes through no fault of its own, in particular:

We are still referencing lots of assemblies through the regular "Add Reference..." because they don't have nugets
NuGet is much more intrusive in the xxproj files than a simple "Add Reference": it has to store relative paths (bad), and if the nuget package has target files, it needs to add some significant code to our xxproj (for example, in SharpDX)
NuGet still needs to add references to our packaged assemblies, so if our package "Dummy" has 50 Dummy.ABC.*.dll assemblies, we will see a lot of them in our "References"
NuGet doesn't have a probing path for looking up installed local assemblies, but needs to store the assembly reference paths directly in the xxproj, forcing the package storage location (that can be configured in a nuget.config but still, no probing path). For example, if we move the project within a directory structure, it doesn't compile any more.
NuGet is not VCS friendly. Updating a version of a package can cause *lots* of updates in our xxproj: pray that nobody else is doing the same thing on the same project on a different branch.

Also

PCL are good because they are surfacing the API, exposing a lightweight cross-platform core.
We still need to live with platform specific assemblies

Note that ASP vNext is easing the definition of dependencies and simple project compilation, but it fails to provide a fully unified and integrated build system that spans the different problems of developing cross-platform packages with more complex builds.

So, we can somewhat improve the process here by unifying the old and new in a Package vNext concept.

A Package vNext would be pretty much as the NuGet package we have today and would contain:

A version number
All meta descriptions found in NuGet (Owners, Project urls...etc)
Dependencies to other packages/versions
A set of assemblies, compiled for different platforms (or a single platform if it is really platform specific).
Potentially a set of public properties/flags exposed by the package that could be set by the referencing project, and would allow configuring the way to link against this package (some specific assemblies or not... etc., depending on the platform... etc.)
Potentially PDBs with direct source code included in the package (but unlike NuGet, not stored on a PDB Symbol server)
Potentially documentation that would be automatically accessible from the IDE
Potentially user files to add to the current project
Potentially providing different additional build files (msbuild target files), transparently added to the build (but unlike today, not modifying the host msbuild files)
Potentially an install plugin helper (like powershell, but I would prefer a .NET interface/plugin system instead of the unfriendly powershell syntax)
Potentially providing IDE extensions (recognized by some IDE, that could provide specific IDE extension for VS or Xamarin Studio...etc.)
Working also for C++ package: providing includes, libs...etc. (and here, C++ would gain a *lot*)
Package could be signed (non-modifiable)

All our xxproj projects (C#, VB, F#... etc.) would reference a package vNext (but usually not a path to a package, though it could be possible in some cases), just like this:

<ItemGroup>
  <!-- Package loaded from probing paths -->
  <PackageReference Include=".NET" Version="4.0" />
  <PackageReference Include="YourPackage" Version="1.0" /> 

  <!-- Package loaded from probing paths but with the version defined at solution level -->
  <PackageReference Include="YourPackageSpecialVersion" /> 

  <!-- Package loaded from specific path -->
  <PackageReference File="path/to/location/FixedLocalPackage-1.0.0" />
</ItemGroup>
...
<Import Project="$(MSBuildToolsPath)\Microsoft.CommonvNext.targets" />

This is the only modification that would be required to reference a package. Everything else (target, custom tasks, files...etc.) would be automatically handled and integrated by the build system (here the CommonvNext.targets).

When we are targeting a platform-specific application, or providing a PCL library, this should only be specified by some properties at the beginning of the project. We would not have to reference explicit targets/dlls in the project (currently, we need to include CSharp.targets, or Xaml.targets, or WindowsPhone.targets... etc.); this would be handled by the build system.

The package version could be defined directly at:

the xxproj project level
the solution level, in order to avoid the multiplication of versions all around in all projects from a solution (like a sealed version that could not be overridden unless specified explicitly with an "override" attribute, exactly like in our languages)
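A minimal sketch of how this version resolution could behave, in Python for brevity. Everything here is hypothetical (the function name, the dictionaries, and the decision to silently prefer the solution version when a project declares a different one without an override); it is just one possible reading of the rule above.

```python
def resolve_version(package, project_versions, solution_versions, project_overrides=()):
    """Resolve the version of a package for one project.

    project_versions / solution_versions: dict mapping package name -> version string.
    project_overrides: package names the project explicitly marks as "override".
    """
    solution_version = solution_versions.get(package)
    project_version = project_versions.get(package)
    if solution_version is not None:
        # The solution-level version is "sealed" unless the project explicitly overrides it.
        if project_version is not None and package in project_overrides:
            return project_version
        return solution_version
    if project_version is None:
        raise KeyError("no version defined for package " + package)
    return project_version
```

For example, a project asking for "YourPackage" 1.0 in a solution that pins 2.0 would get 2.0, unless it opts in to the override.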

A package local probing path would be used in the same way the PATH is used to locate native dlls. This probing path could be:

Provided by the system
Overridden locally at the solution level
Overridden locally at the project level

Like NuGet, it would be possible to query a remote probing path in order to automatically download missing packages.
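Here is a small sketch of what such a probing-path lookup could look like, again in Python. The ".pkg" file extension, the "name-version" file naming and the choice that an override fully replaces (rather than extends) the less specific path list are all assumptions of mine, not part of any existing tool.

```python
import os

def probing_paths(system_paths, solution_paths=None, project_paths=None):
    """Return the effective probing paths: the most specific override wins."""
    if project_paths:
        return list(project_paths)
    if solution_paths:
        return list(solution_paths)
    return list(system_paths)

def find_package(name, version, paths, exists=os.path.isfile):
    """Look for <name>-<version>.pkg in each probing path, in order (like PATH for dlls)."""
    file_name = "{}-{}.pkg".format(name, version)
    for directory in paths:
        candidate = os.path.join(directory, file_name)
        if exists(candidate):
            return candidate
    return None  # a real system could fall back to a remote probing path here
```

Because the project only stores a package name and version, moving the project around in the directory structure would not break the lookup.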

A Package vNext in the probing path would not be expanded/unzipped to disk, at least not visibly to the user. Instead, packages would stay as single plain files (unlike NuGet, which requires explicitly expanding the packages into a "visible" folder). It is the build infrastructure that would take care of transparently unzipping them somewhere (for example, in a .vstmp folder at the root of a solution, easily ignorable from a VCS, or in a fixed central temp repo on the disk... etc.)

When compiling a project to target multiple platforms, the IDE should provide a way to easily identify which files are going to which platform from a xxxproj. This is a bit orthogonal to the Package vNext, but quite important to it if we want a project to target multiple platforms easily.
Packaging and publishing a Package vNext should be part of the build system, as for NuGet, which uses nuspec files or xxproj files directly. It means that building a solution, or a project, would produce one or several package vNext directly consumable by other projects. A Package itself in a solution could contain one or several projects... etc. But a project would reference other packages, not projects.

Digression on the implementation of such a system with the current msbuild system: one limitation of msbuild is that it cannot import a variable list of *.targets files; the whole list must be known in advance. But a workaround would be for the build system to generate an intermediate build file (only used internally), exactly like it does for solution files (which are converted into a single msbuild file when building a solution).
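To illustrate this workaround, here is a tiny sketch that generates such an intermediate msbuild file from a list of target files discovered in the referenced packages. It is written in Python with the standard XML library; the function name and the idea of passing the list in directly are hypothetical, only the msbuild Project/Import elements and namespace are the real ones.

```python
import xml.etree.ElementTree as ET

MSBUILD_NS = "http://schemas.microsoft.com/developer/msbuild/2003"

def generate_imports_file(targets_files):
    """Build the XML of an intermediate msbuild file importing each given *.targets file."""
    project = ET.Element("Project", {"xmlns": MSBUILD_NS})
    for path in targets_files:
        # One explicit <Import> per targets file found in the packages
        ET.SubElement(project, "Import", {"Project": path})
    return ET.tostring(project, encoding="unicode")

# Hypothetical list, as it could be collected from unzipped packages.
xml_text = generate_imports_file(["PackageA.targets", "PackageB.targets"])
```

The host xxproj would then only need a single, fixed import of this generated file, so it never changes when packages are added or removed.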

With such a system, we would be able:

To develop a cross-platform application from a single solution, and even from the same project able to target multiple platforms
To enhance the experience of working with libraries (core .NET framework, external libs... etc.) through a unified system instead of having several systems/workarounds (add reference, target files, nuget packages)
To reduce the changes/friction on xxproj when we are switching package versions... etc., leading to a much more VCS-friendly build system

A build dream to build!

Ok, let's face it: this post is describing a "nice to have" concept. It is always easy to write this kind of article, but way more difficult to implement it! When looking at the NuGet source code, we can see that it is *lots* of work to provide this kind of infrastructure.

Still, I believe that a full integration of the notion of package is a key direction for developing, building and deploying cross platform/platform-specific applications in .NET and we should embrace it at the core of our build system.

So what do you think about this? I'm sure there are lots of ideas that could improve this whole concept, so please share them!

Micro-benchmarking .NET Native and RyuJit

2014-06-16T01:21:00.000+11:00

[Disclaimer] As both RyuJIT and .NET Native are improving a lot between their updates/previews, results of benchmarks in this post may no longer be relevant. Benchmarks were run with the following versions:

RyuJIT CTP4

.NET Native developer preview 2

While the .NET JIT performs quite well on Windows, it is still behind a fully optimized C++ program. An efficient compiled program is not only about code generation but also about memory management and data locality, but the .NET Team has recently introduced two new technologies that might help with the code gen part: .NET Native, an offline .NET compiler (similar to ngen, but using the backend optimizer that is used by the C++ compiler), and the next generation of the .NET JIT, called "RyuJit". In this post I would like to present the results of some micro-benchmarks that roughly evaluate the performance benefits of these two new technologies.

First of all, you may have already read about a few benchmark results available around about RyuJit and .NET Native. Here is a non-exhaustive list I have found (if you have more pointers, let me know!):

A first look at RyuJIT CTP3 and SIMD (SSE2) support for .NET by Frank Niemeyer
Lies, damn lies and benchmarks by Kevin Frei
.NET Native Performance by Sasha Goldshtein

The micro-benchmark protocol

Micro-benchmarking is not the best way to measure the overall benefits, but it can help to dig into some particular patterns. For this benchmark, I haven't developed a new one but instead built a "freak-benchmark" composed of some micro-benchmarks I found on the Internet, mainly:

"Head-to-head benchmark: C++ vs .NET" by Qwertie has a nice collection of micro-benchmarks. So I decided to use it as a basis
"A Collection of Phoenix-Compatible C# Benchmarks" I used the port of a subset of Java Grande benchmarks.
Two custom benchmarks measuring the cost of interop which is important in cases where you can possibly call lots of native methods (which is the case when using SharpDX for example)

RyuJit is also coming with SIMD, but I will reserve a dedicated post to test this new feature.

I don't claim that these micro-benchmarks are exhaustive, nor that they are all correctly implemented (some of the JavaGrande benchmarks seem not to be robust), but as we are measuring relative performance, that should be fine. In the end we just want to know how well .NET Native or RyuJit can perform compared to the same program running on the legacy JIT.

Also as both .NET Native and RyuJit are in development, we can't really draw any definitive conclusions.

.NET Native is only available for Windows Store App while RyuJit is only available on x64, hence the platforms tested in this bench are:

.NET 32 Desktop
.NET 32 AppStore
.NET 32 AppStore Native
.NET 64 Desktop
.NET 64 AppStore
.NET 64 AppStore Native
.NET 64 Desktop RyuJit

.NET32/.NET64 using .NET Framework 4.5.1. The machine is an Intel(R) Core(TM) i7-4770 CPU @ 3.4GHz with 16 GB of RAM.

The source of these benchmarks is available on GitHub BenchNativeApp.

Comparison .NET32 (x86)

Comparison between:

.NET 32 Desktop
.NET 32 AppStore
.NET 32 AppStore Native

Normalized with performance relative to desktop. Higher is better. (2.0 means that a test is 2 times faster than desktop). I checked that the standard deviation was in a reasonable range.

In green, results above +10%
In red, results below -10%
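To be explicit about how to read the tables below, here is a small sketch of the normalization and of the green/red rule, in Python. I am assuming here that each test produces an elapsed time (lower is better) which is then turned into a relative score; the function names are only for illustration and are not taken from the benchmark source.

```python
def relative_performance(baseline_time, measured_time):
    """Score relative to the desktop baseline: 2.0 means twice as fast as desktop."""
    return baseline_time / measured_time

def classify(score, threshold=0.10):
    """Apply the coloring rule: green above +10%, red below -10%, neutral otherwise."""
    if score > 1.0 + threshold:
        return "green"
    if score < 1.0 - threshold:
        return "red"
    return "neutral"

# A test taking 2.0s on desktop and 1.0s on another platform scores 2.0 (green).
score = relative_performance(2.0, 1.0)
color = classify(score)
```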

Name	.NET 32 (Desktop)	.NET 32 (AppStore)	.NET 32 Native (AppStore)
00-Big int Dictionary: 1 Adding items	1.00	0.90	0.72
00-Big int Dictionary: 2 Running queries	1.00	0.96	0.81
00-Big int Dictionary: 3 Removing items	1.00	0.96	0.92
01-Big string Dictionary: 0 Ints to strings	1.00	0.79	1.13
01-Big string Dictionary: 1 Adding/setting	1.00	0.91	1.02
01-Big string Dictionary: 2 Running queries	1.00	0.83	1.07
01-Big string Dictionary: 3 Removing items	1.00	0.85	1.05
02-Big int sorted map: 1 Adding items	1.00	1.01	0.89
02-Big int sorted map: 2 Running queries	1.00	1.01	0.90
02-Big int sorted map: 3 Removing items	1.00	1.01	0.77
03-Square root: double	1.00	1.00	1.00
03-Square root: FPL16	1.00	1.02	0.97
03-Square root: uint	1.00	1.03	0.97
03-Square root: ulong	1.00	1.01	0.94
04-Simple arithmetic: double	1.00	1.01	0.51
04-Simple arithmetic: float	1.00	1.01	0.86
04-Simple arithmetic: FPI8	1.00	0.99	1.23
04-Simple arithmetic: FPL16	1.00	1.01	1.55
04-Simple arithmetic: int	1.00	1.02	2.04
04-Simple arithmetic: long	1.00	0.99	0.83
05-Generic sum: double	1.00	1.01	3.34
05-Generic sum: FPI8	1.00	1.01	1.30
05-Generic sum: int	1.00	1.01	0.93
05-Generic sum: int via IMath	1.00	1.01	0.57
05-Generic sum: int without generics	1.00	1.00	1.10
06-Simple parsing: 3 Parse (x1000000)	1.00	1.00	1.00
06-Simple parsing: 4 Sort (x1000000)	1.00	1.00	1.09
07-Trivial method calls: Interface NoOp	1.00	1.00	0.71
07-Trivial method calls: No-inline NoOp	1.00	1.00	1.03
07-Trivial method calls: Static NoOp	1.00	1.00	no-op
07-Trivial method calls: Virtual NoOp	1.00	1.11	1.07
08-Matrix multiply: [n][n]	1.00	1.00	2.43
08-Matrix multiply: [n][n]	1.00	1.00	2.68
08-Matrix multiply: [n][n]	1.00	0.95	0.59
08-Matrix multiply: Array2D	1.00	1.01	0.64
08-Matrix multiply: double[n*n]	1.00	1.01	0.78
08-Matrix multiply: double[n][n]	1.00	1.01	1.04
08-Matrix multiply: int[n][n]	1.00	1.00	1.21
09-Sudoku	1.00	1.00	1.04
10-Polynomials	1.00	1.00	1.03
11-JGFArithBench	1.00	1.00	23.81
12-JGFAssignBench	1.00	0.80	1.20
13-JGFCastBench	1.00	1.00	1.27
14-JGFCreateBench	1.00	0.98	0.81
15-JGFFFTBench	1.00	1.01	0.99
16-JGFHeapSortBench	1.00	0.99	1.01
17-JGFLoopBench	1.00	1.00	1.04
18-JGFRayTracerBench	1.00	0.98	0.88
19-float4x4 matrix mul, Managed Standard	1.00	1.00	0.63
20-float4x4 matrix mul, Managed unsafe	1.00	1.01	0.96
21-float4x4 matrix mul, Interop Standard	1.00	1.23	1.42
22-float4x4 matrix mul, Interop SSE2	1.00	1.36	1.82
23-managed add	1.00	1.01	7.00
24-managed no-inline add	1.00	1.00	1.10
25-interop add	1.00	1.01	1.21
26-interop indirect add	1.00	1.00	2.26

Quick analysis

We would probably expect a column full of green lights for .NET Native, but this is unfortunately not the case! Some notes:

.NET Native is as efficient as a C++ compiler at coalescing arithmetic instructions (test 11 or 23). For example, test 23 reduces the sequence of additions x+=1, x+=2, x+=-3, x+=1, x+=2, x+=-3, x+=1 to a single x+=1, resulting in an impressive speedup. Coalescing of instructions is probably the factor that helps in most of these tests.
Some float/double x87 calculations seem to perform badly with .NET Native.
Pure interop seems slightly more efficient, which is good when we frequently call native functions (as when using SharpDX/Direct3D11). Note that indirect interop (wrapping a DllImport in another function) is also faster, which is great: it was an issue with the current JIT, which does not inline interop calls when they are wrapped, resulting in lots of duplicate prologue/epilogue code for unmanaged/managed transitions (whereas when they are correctly inlined, consecutive calls to interop functions are handled as a group when switching between managed and unmanaged context).
Some tests are 2x slower with .NET Native, though I haven't looked at the generated x86 code.
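
To make the coalescing in test 23 concrete: a run of constant additions to the same variable is equivalent to one addition of their sum. The snippet below only illustrates that arithmetic in Python (the function names are made up for this post); it says nothing about how the .NET Native compiler actually implements the optimization.

```python
def coalesce_increments(increments):
    """Fold a run of constant `x += c` steps into one constant."""
    return sum(increments)


def run_naive(x, increments):
    """Apply every increment one by one, as unoptimized code would."""
    for c in increments:
        x += c
    return x


# The sequence of additions quoted for test 23.
steps = [1, 2, -3, 1, 2, -3, 1]
folded = coalesce_increments(steps)

# Both paths end on the same value, so seven additions can become one.
assert run_naive(10, steps) == 10 + folded
print(folded)  # 1
```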

Overall it is still promising, we can see some significant boosts in some tests while some others are performing a bit worse.

Comparison .NET64 (x64)

Comparison between:

.NET 64 Desktop
.NET 64 AppStore
.NET 64 AppStore Native
.NET 64 Desktop RyuJit

Normalized with performance relative to desktop. Higher is better. (2.0 means that a test is 2 times faster than desktop)

In green, results above +10%
In red, results below -10%
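
As a rough sketch of how such a score and its color can be derived: I am assuming here that the raw measurements are elapsed times (so a smaller time gives a higher score), and the function names are hypothetical.

```python
def relative_performance(baseline_seconds, candidate_seconds):
    """Score of a candidate relative to the baseline; 2.0 means twice as fast."""
    return baseline_seconds / candidate_seconds


def color(score, tolerance=0.10):
    """Green above +10%, red below -10%, neutral in between."""
    if score > 1.0 + tolerance:
        return "green"
    if score < 1.0 - tolerance:
        return "red"
    return "neutral"


# A test taking 3.0 s on the reference and 1.5 s on the candidate scores 2.0.
score = relative_performance(3.0, 1.5)
print(score, color(score))
```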

Name	.NET 64 (Desktop)	.NET 64 (AppStore)	.NET 64 Native (AppStore)	.NET 64 RyuJit (Desktop)
00-Big int Dictionary: 1 Adding items	1.00	1.04	1.00	1.02
00-Big int Dictionary: 2 Running queries	1.00	0.91	1.00	0.95
00-Big int Dictionary: 3 Removing items	1.00	1.00	0.95	0.95
01-Big string Dictionary: 0 Ints to strings	1.00	0.69	0.72	1.00
01-Big string Dictionary: 1 Adding/setting	1.00	0.85	0.84	0.99
01-Big string Dictionary: 2 Running queries	1.00	0.82	0.90	0.95
01-Big string Dictionary: 3 Removing items	1.00	0.81	0.91	1.00
02-Big int sorted map: 1 Adding items	1.00	0.98	1.10	1.04
02-Big int sorted map: 2 Running queries	1.00	1.02	1.06	0.97
02-Big int sorted map: 3 Removing items	1.00	1.01	1.02	1.16
03-Square root: double	1.00	1.00	1.00	1.00
03-Square root: FPL16	1.00	1.01	1.15	1.03
03-Square root: uint	1.00	1.00	0.97	0.94
03-Square root: ulong	1.00	1.00	1.15	0.95
04-Simple arithmetic: double	1.00	1.00	4.20	1.10
04-Simple arithmetic: float	1.00	1.00	1.36	0.99
04-Simple arithmetic: FPI8	1.00	1.00	0.91	1.42
04-Simple arithmetic: FPL16	1.00	0.96	1.21	5.19
04-Simple arithmetic: int	1.00	1.00	0.83	0.89
04-Simple arithmetic: long	1.00	1.00	0.96	0.93
05-Generic sum: double	1.00	1.00	1.34	1.33
05-Generic sum: FPI8	1.00	1.00	1.29	1.00
05-Generic sum: int	1.00	0.98	1.28	0.99
05-Generic sum: int via IMath	1.00	1.00	0.65	0.99
05-Generic sum: int without generics	1.00	1.00	1.70	1.00
06-Simple parsing: 3 Parse (x1000000)	1.00	1.00	0.50	1.00
06-Simple parsing: 4 Sort (x1000000)	1.00	0.95	1.30	0.95
07-Trivial method calls: Interface NoOp	1.00	1.00	0.69	0.85
07-Trivial method calls: No-inline NoOp	1.00	0.92	0.96	0.96
07-Trivial method calls: Static NoOp	1.00	1.00	Not Applicable	0.20
07-Trivial method calls: Virtual NoOp	1.00	1.00	0.92	0.74
08-Matrix multiply: [n][n]	1.00	0.99	1.14	1.17
08-Matrix multiply: [n][n]	1.00	1.00	5.01	4.95
08-Matrix multiply: [n][n]	1.00	1.00	1.34	1.16
08-Matrix multiply: Array2D	1.00	1.00	3.83	2.75
08-Matrix multiply: double[n*n]	1.00	1.00	1.00	1.00
08-Matrix multiply: double[n][n]	1.00	0.99	0.96	0.98
08-Matrix multiply: int[n][n]	1.00	1.00	1.19	1.12
09-Sudoku	1.00	1.00	1.38	1.48
10-Polynomials	1.00	1.00	0.94	0.99
11-JGFArithBench	1.00	1.00	1.02	1.12
12-JGFAssignBench	1.00	1.00	1.02	0.53
13-JGFCastBench	1.00	1.00	0.99	1.39
14-JGFCreateBench	1.00	0.96	0.81	0.99
15-JGFFFTBench	1.00	1.16	1.18	1.16
16-JGFHeapSortBench	1.00	1.00	1.01	0.99
17-JGFLoopBench	1.00	1.00	1.08	1.01
18-JGFRayTracerBench	1.00	1.00	0.87	1.13
19-float4x4 matrix mul, Managed Standard	1.00	0.99	1.04	1.36
20-float4x4 matrix mul, Managed unsafe	1.00	1.01	0.90	1.00
21-float4x4 matrix mul, Interop Standard	1.00	1.20	1.46	1.03
22-float4x4 matrix mul, Interop SSE2	1.00	1.36	1.92	1.05
23-managed add	1.00	1.00	1.00	4.05
24-managed no-inline add	1.00	0.89	1.48	1.00
25-interop add	1.00	0.99	1.17	1.11
26-interop indirect add	1.00	1.00	1.28	0.38

Quick analysis

Slightly better than with x86 code gen: .NET Native x64 and RyuJit perform on average better than their JIT counterpart. Some notes:

Unexpectedly, coalescing of arithmetic instructions (test 11 or 23) is not happening with .NET Native, but it is with RyuJit.
Performance on float/double is better, most likely because SSE registers are better used.
The Sudoku test is a nice 40-50% faster with .NET Native and RyuJit.

Comparison .NET32 Native vs .NET64 Native

Just use .NET 32 Native as a reference (1.0) and compare it to the .NET 64 Native.
Normalized with performance relative to .NET 32 Native. Higher is better. (2.0 means that a test on x64 Native is 2 times faster than x86 )

In green, results above +10%
In red, results below -10%

Name	.NET 32 Native (AppStore)	.NET 32 vs 64 Native (AppStore)
00-Big int Dictionary: 1 Adding items	1.00	1.31
00-Big int Dictionary: 2 Running queries	1.00	1.35
00-Big int Dictionary: 3 Removing items	1.00	1.19
01-Big string Dictionary: 0 Ints to strings	1.00	1.13
01-Big string Dictionary: 1 Adding/setting	1.00	1.22
01-Big string Dictionary: 2 Running queries	1.00	1.17
01-Big string Dictionary: 3 Removing items	1.00	1.16
02-Big int sorted map: 1 Adding items	1.00	1.01
02-Big int sorted map: 2 Running queries	1.00	0.99
02-Big int sorted map: 3 Removing items	1.00	0.96
03-Square root: double	1.00	1.00
03-Square root: FPL16	1.00	2.10
03-Square root: uint	1.00	1.12
03-Square root: ulong	1.00	2.12
04-Simple arithmetic: double	1.00	2.07
04-Simple arithmetic: float	1.00	1.40
04-Simple arithmetic: FPI8	1.00	1.00
04-Simple arithmetic: FPL16	1.00	1.49
04-Simple arithmetic: int	1.00	0.97
04-Simple arithmetic: long	1.00	7.65
05-Generic sum: double	1.00	1.01
05-Generic sum: FPI8	1.00	1.00
05-Generic sum: int	1.00	1.01
05-Generic sum: int via IMath	1.00	1.07
05-Generic sum: int without generics	1.00	1.12
06-Simple parsing: 3 Parse (x1000000)	1.00	1.00
06-Simple parsing: 4 Sort (x1000000)	1.00	1.96
07-Trivial method calls: Interface NoOp	1.00	1.00
07-Trivial method calls: No-inline NoOp	1.00	1.20
07-Trivial method calls: Static NoOp	1.00	not applicable
07-Trivial method calls: Virtual NoOp	1.00	1.16
08-Matrix multiply: [n][n]	1.00	1.29
08-Matrix multiply: [n][n]	1.00	1.24
08-Matrix multiply: [n][n]	1.00	2.38
08-Matrix multiply: Array2D	1.00	1.99
08-Matrix multiply: double[n*n]	1.00	1.30
08-Matrix multiply: double[n][n]	1.00	1.00
08-Matrix multiply: int[n][n]	1.00	2.37
09-Sudoku	1.00	1.13
10-Polynomials	1.00	0.99
11-JGFArithBench	1.00	0.61
12-JGFAssignBench	1.00	0.86
13-JGFCastBench	1.00	1.57
14-JGFCreateBench	1.00	0.93
15-JGFFFTBench	1.00	1.05
16-JGFHeapSortBench	1.00	1.08
17-JGFLoopBench	1.00	0.97
18-JGFRayTracerBench	1.00	1.07
19-float4x4 matrix mul, Managed Standard	1.00	1.34
20-float4x4 matrix mul, Managed unsafe	1.00	1.02
21-float4x4 matrix mul, Interop Standard	1.00	1.09
22-float4x4 matrix mul, Interop SSE2	1.00	1.15
23-managed add	1.00	0.16
24-managed no-inline add	1.00	1.35
25-interop add	1.00	1.15
26-interop indirect add	1.00	1.15

Quick analysis

The .NET 64 Native code gen is better than the .NET 32 Native code gen. I haven't dug into the generated code, but the extra registers on x64 probably help optimization, while x86 is still fighting with a limited set of registers (and the x86 code is not using SSE instructions, which doesn't help). It is good to see that interop is also better on x64, which is not the case for the x64 JIT, where it is usually much slower.

Comparison .NET64 Native vs .NET64 RyuJit

Use .NET 64 Native as a reference (1.0) and compare it to the .NET 64 RyuJit.
Normalized with performance relative to .NET 64 Native. Higher is better. (2.0 means that a test on x64 RyuJit is 2 times faster than x64 Native )

In green, results above +10%
In red, results below -10%
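
The second column of this table appears to be the RyuJit column of the previous x64 table divided by its .NET Native column. This is my reading of the numbers, and small differences are expected because the published scores are already rounded. A minimal sketch of that rebasing, with a hypothetical helper name and three rows copied from the x64 table:

```python
def rebase(score, new_reference):
    """Re-express a desktop-relative score against another column."""
    return round(score / new_reference, 2)


# (.NET 64 Native, .NET 64 RyuJit), both relative to .NET 64 Desktop.
x64_scores = {
    "04-Simple arithmetic: double": (4.20, 1.10),
    "06-Simple parsing: 3 Parse (x1000000)": (0.50, 1.00),
    "23-managed add": (1.00, 4.05),
}

for name, (native, ryujit) in x64_scores.items():
    # RyuJit relative to Native: above 1.0 means RyuJit is faster.
    print(name, rebase(ryujit, native))
```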

Name	.NET 64 Native (AppStore)	.NET 64 vs 64 RyuJit
00-Big int Dictionary: 1 Adding items	1.00	1.02
00-Big int Dictionary: 2 Running queries	1.00	0.95
00-Big int Dictionary: 3 Removing items	1.00	1.00
01-Big string Dictionary: 0 Ints to strings	1.00	1.38
01-Big string Dictionary: 1 Adding/setting	1.00	1.18
01-Big string Dictionary: 2 Running queries	1.00	1.05
01-Big string Dictionary: 3 Removing items	1.00	1.10
02-Big int sorted map: 1 Adding items	1.00	0.94
02-Big int sorted map: 2 Running queries	1.00	0.91
02-Big int sorted map: 3 Removing items	1.00	1.14
03-Square root: double	1.00	1.00
03-Square root: FPL16	1.00	0.89
03-Square root: uint	1.00	0.97
03-Square root: ulong	1.00	0.83
04-Simple arithmetic: double	1.00	0.26
04-Simple arithmetic: float	1.00	0.73
04-Simple arithmetic: FPI8	1.00	1.56
04-Simple arithmetic: FPL16	1.00	4.31
04-Simple arithmetic: int	1.00	1.07
04-Simple arithmetic: long	1.00	0.96
05-Generic sum: double	1.00	0.99
05-Generic sum: FPI8	1.00	0.78
05-Generic sum: int	1.00	0.78
05-Generic sum: int via IMath	1.00	1.53
05-Generic sum: int without generics	1.00	0.59
06-Simple parsing: 3 Parse (x1000000)	1.00	2.00
06-Simple parsing: 4 Sort (x1000000)	1.00	0.73
07-Trivial method calls: Interface NoOp	1.00	1.24
07-Trivial method calls: No-inline NoOp	1.00	1.00
07-Trivial method calls: Static NoOp	1.00	0.00
07-Trivial method calls: Virtual NoOp	1.00	0.81
08-Matrix multiply: [n][n]	1.00	1.03
08-Matrix multiply: [n][n]	1.00	0.99
08-Matrix multiply: [n][n]	1.00	0.86
08-Matrix multiply: Array2D	1.00	0.72
08-Matrix multiply: double[n*n]	1.00	0.99
08-Matrix multiply: double[n][n]	1.00	1.02
08-Matrix multiply: int[n][n]	1.00	0.94
09-Sudoku	1.00	1.07
10-Polynomials	1.00	1.05
11-JGFArithBench	1.00	1.10
12-JGFAssignBench	1.00	0.52
13-JGFCastBench	1.00	1.40
14-JGFCreateBench	1.00	1.23
15-JGFFFTBench	1.00	0.98
16-JGFHeapSortBench	1.00	0.98
17-JGFLoopBench	1.00	0.93
18-JGFRayTracerBench	1.00	1.30
19-float4x4 matrix mul, Managed Standard	1.00	1.31
20-float4x4 matrix mul, Managed unsafe	1.00	1.12
21-float4x4 matrix mul, Interop Standard	1.00	0.71
22-float4x4 matrix mul, Interop SSE2	1.00	0.55
23-managed add	1.00	4.05
24-managed no-inline add	1.00	0.68
25-interop add	1.00	0.95
26-interop indirect add	1.00	0.30

Quick analysis

Surprisingly, RyuJit performs quite well, sometimes even better than .NET 64 Native. It might be interesting to dig into this.

Summary

As both .NET Native and RyuJit are still in alpha/beta stages, we can't really draw any definitive conclusions here. We can see a trend of improvements in some specific areas, while some tests are still performing a bit worse than the legacy JIT. [Edit]The release of .NET Native Developer Preview 3 on June 30, 2014 shows some improvements in code gen, so .NET Native and RyuJIT are definitely being improved between updates, and that is great! [/Edit]

It is good to see .NET 64 getting better and performing well with .NET Native and RyuJit. Until now I have been a bit reluctant to use it, but it looks more robust compared to x86 code gen.

While code gen can undoubtedly be improved with offline compilers or a more modern JIT like RyuJit, we probably can't expect the moon. As I said in the introduction, code gen is only a part of the overall performance cake. The other part, which is most likely not yet covered by these new compiler architectures, is data locality: things like the ability to create fat objects - embedding an object instance directly inside another instance - or the creation of short-lived objects (not value types) on the stack instead of the heap are still areas where .NET can probably be improved. I will hopefully take more time in a future post to explain why this is an important area of improvement and what could be done.

Anyway, this is great to see .NET performance back into the ring! I'm eager also to be able to use .NET Native on desktop.

Managing multiple platforms in Visual Studio

2014-05-25T20:08:00.000+11:00

Who has not struggled to correctly manage multiple platform configurations in Visual Studio without ending up editing a solution file or tweaking some msbuild files by hand? Recently, I decided to clean up the antique SharpDX.sln in SharpDX, which was starting to get a bit fat and hard to manage. The build is not extremely bizarre there, but as it needs to cover the combinations of NetPlatform x OSPlatform x DirectXVersion x Debug/Release with around 40 projects (without the samples), it is an interesting case study. It turns out that modifying the solution to make a clean multi-platform build was impossible without hacking msbuild in order to circumvent unfortunate designs found in Microsoft msbuild files (and later found at work in Xamarin build files as well). In this post, we will go through the gotchas I found, and we will also see why Visual Studio should really improve the Configuration Manager if they want to improve our developer experience.

Preliminaries

There are a couple of things to understand about how Visual Studio and msbuild work with solution files and configurations. This is just a short overview of the key settings and how they affect your build. I found a good introduction to this in the post "Targeting Platforms in Visual Studio", which is worth a read.

If we look at a simple solution containing only a single project:

Microsoft Visual Studio Solution File, Format Version 12.00
# Visual Studio 2012
Project("{FAE04EC0-301F-11D3-BF4B-00C04F79EFBC}") = "TestConsole", "TestConsole\TestConsole.csproj", "{56849035-CEF7-446D-AF0A-51EE9DC1DDB7}"
EndProject
Global
 GlobalSection(SolutionConfigurationPlatforms) = preSolution
  Debug|Any CPU = Debug|Any CPU
  Release|Any CPU = Release|Any CPU
 EndGlobalSection
 GlobalSection(ProjectConfigurationPlatforms) = postSolution
  {56849035-CEF7-446D-AF0A-51EE9DC1DDB7}.Debug|Any CPU.ActiveCfg = Debug|Any CPU
  {56849035-CEF7-446D-AF0A-51EE9DC1DDB7}.Debug|Any CPU.Build.0 = Debug|Any CPU
  {56849035-CEF7-446D-AF0A-51EE9DC1DDB7}.Release|Any CPU.ActiveCfg = Release|Any CPU
  {56849035-CEF7-446D-AF0A-51EE9DC1DDB7}.Release|Any CPU.Build.0 = Release|Any CPU
 EndGlobalSection
 GlobalSection(SolutionProperties) = preSolution
  HideSolutionNode = FALSE
 EndGlobalSection
EndGlobal

What we can see from the solution is that it defines:

In SolutionConfigurationPlatforms, the mapping from solution configuration/platforms to project configuration/platforms. When you read the line:

  Debug|Any CPU = Debug|Any CPU

It means that the solution configuration/platform Debug|Any CPU will map to the project configuration/platform Debug|Any CPU.

The project configuration and platform are the actual values that are later exposed as the Configuration and Platform properties in the msbuild project (csproj, etc.), as we can see in this extract from TestConsole.csproj:

  
  <PropertyGroup Condition=" '$(Configuration)|$(Platform)' == 'Debug|AnyCPU' ">
    <PlatformTarget>AnyCPU</PlatformTarget>
    <DebugSymbols>true</DebugSymbols>

The solution also defines, in the ProjectConfigurationPlatforms section, the projects that will be built for each solution configuration/platform, as well as a mapping to the actual project configuration/platform. In SharpDX, the configuration/platforms in SharpDX.sln are configured like this:

  
GlobalSection(SolutionConfigurationPlatforms) = preSolution
  Debug|DIRECTX11_2 = Debug|DIRECTX11_2
  Debug|Net20 = Debug|Net20
  Debug|Net40 = Debug|Net40
  Debug|Win8 = Debug|Win8
  Debug|WP81 = Debug|WP81
  Debug|WP8-ARM = Debug|WP8-ARM
  Debug|WP8-x86 = Debug|WP8-x86
  Release|DIRECTX11_2 = Release|DIRECTX11_2
  Release|Net20 = Release|Net20
  Release|Net40 = Release|Net40
  Release|Win8 = Release|Win8
  Release|WP81 = Release|WP81
  Release|WP8-ARM = Release|WP8-ARM
  Release|WP8-x86 = Release|WP8-x86
EndGlobalSection

As you can see, we are just using different configuration/platforms in order to target multiple .NET frameworks, different DirectX versions and specific OSes. But surprisingly, if you try to use this kind of configuration in your solution, it will not work out of the box.
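
To check which configuration/platform pairs a solution declares without opening Visual Studio, a few lines of script are enough. The following is a minimal sketch, not a full .sln parser: the function name is made up, and it only assumes the plain-text section layout shown in the listings above.

```python
import re


def solution_configurations(sln_text):
    """Return the (configuration, platform) pairs declared by the solution."""
    section = re.search(
        r"GlobalSection\(SolutionConfigurationPlatforms\)[^\n]*\n(.*?)EndGlobalSection",
        sln_text,
        re.DOTALL,
    )
    if section is None:
        return []
    pairs = []
    for line in section.group(1).splitlines():
        left, separator, _right = line.partition("=")
        if not separator:
            continue  # skip blank or unexpected lines
        configuration, _, platform = left.strip().partition("|")
        pairs.append((configuration, platform))
    return pairs


sample = """\
GlobalSection(SolutionConfigurationPlatforms) = preSolution
  Debug|Net40 = Debug|Net40
  Debug|Win8 = Debug|Win8
  Release|Net40 = Release|Net40
EndGlobalSection
"""
print(solution_configurations(sample))
```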

Problem #1: Where is the solution platform?

By default, the Visual Studio settings for C# hide the solution platform. Instead, what you get is only the solution configuration:

This is really annoying, because someone who just opens your solution will not realize that there are actually different platforms. The solution will just select the first defined platform. In order to get the solution platform selector back in Visual Studio, you need to re-enable the button by selecting the drop-down button "Add or Select buttons" on the right side of the solution configuration:

While I understand the original ergonomic reasons for hiding this button, in the era of multi-platform development it should no longer be hidden by default. I hope that Visual Studio will fix this in a future release.

Problem #2: Project Platform semantic

This is the problem that made the refactoring of SharpDX.sln quite laborious. On the surface, solution platforms look nice: they provide a way to organize your project to target multiple platforms/configurations from the same solution. Behind the scenes, it does not work as expected, mainly because some msbuild files interpret the value of the project platform.

I would like to take the opportunity to explain why project platforms should have no semantic value for Visual Studio or Xamarin build files. Project platforms should be considered as user-defined platforms: they are a way to organize our projects in whatever combinations we like, and their semantics should be owned by the developer of the project.

Unfortunately, Visual Studio msbuild files don't allow a custom project platform because they expect some specific platforms. For example, if you are developing a Windows Store App, you will find that the project won't compile if the platform is anything other than "Any CPU/x86/x64/Win32/arm"! This is hardcoded in the file C:\Program Files (x86)\MSBuild\Microsoft\VisualStudio\v11.0\AppxPackage\Microsoft.AppxPackage.Targets at line 1270, like this (the Windows Phone platform and Xamarin suffer from the same problem):

<PropertyGroup>
 <_ProjectArchitectureOutput>Invalid</_ProjectArchitectureOutput>
 <_ProjectArchitectureOutput Condition="'$(Platform)' == 'AnyCPU'">neutral</_ProjectArchitectureOutput>
 <_ProjectArchitectureOutput Condition="'$(Platform)' == 'x86'">x86</_ProjectArchitectureOutput>
 <_ProjectArchitectureOutput Condition="'$(Platform)' == 'Win32'">x86</_ProjectArchitectureOutput>
 <_ProjectArchitectureOutput Condition="'$(Platform)' == 'x64'">x64</_ProjectArchitectureOutput>
 <_ProjectArchitectureOutput Condition="'$(Platform)' == 'arm'">arm</_ProjectArchitectureOutput>
</PropertyGroup>
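
The property group above behaves like a lookup table with an "Invalid" default, which is why a custom platform name cannot get through. Here is a small Python model of it; the function and table names are mine, and the lower-casing reflects my understanding that MSBuild compares strings in conditions case-insensitively.

```python
# Same mapping as the conditions in the targets file quoted above.
KNOWN_ARCHITECTURES = {
    "anycpu": "neutral",
    "x86": "x86",
    "win32": "x86",
    "x64": "x64",
    "arm": "arm",
}


def project_architecture_output(platform):
    """Return the architecture for a Platform value, or "Invalid" if unknown."""
    # Assumption: condition comparisons ignore case, hence lower().
    return KNOWN_ARCHITECTURES.get(platform.lower(), "Invalid")


# Custom platforms such as Win8 or DIRECTX11_2 fall through to "Invalid".
for platform in ("AnyCPU", "Win32", "Win8", "DIRECTX11_2"):
    print(platform, "->", project_architecture_output(platform))
```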

Using the Platform property directly in a core Visual Studio msbuild file is a mistake (the same goes for Configuration, which is used in some Visual Studio msbuild targets), as it forces the original solution to use only these platforms. Instead, build files from Visual Studio should use a property that can be redefined by the project (like the PlatformTarget property that is used by the C# compiler). We should have a way to redefine the mapping in whatever way we like. In other words, solution platforms and configurations should be fully owned by the developer of the solution. Their semantics are project specific, and Visual Studio should allow us to define the remapping to a target platform (like AnyCPU) in our project like this:

  
  <PropertyGroup Condition=" '$(Configuration)|$(Platform)' == 'Debug|ThisIsMyConfig' ">
    <PlatformTarget>AnyCPU</PlatformTarget>
    ...

Fortunately, there is a hack to manage this, though it is not completely safe. By default, the properties Platform and Configuration are immutable in msbuild, because they are considered global properties passed to msbuild, so they cannot be modified. But there is a way to override the platform "ThisIsMyConfig" to "AnyCPU" for some specific builds (like Windows Store Apps). In SharpDX, this is made possible by the target "SharpDXForcePlatform", as can be seen in this file. The trick works as follows:

Add a target that will be executed automatically whenever there is a build. This is done by declaring the msbuild project with the attribute InitialTargets="YourTargetToForcePlatform".

In YourTargetToForcePlatform, we can override the Platform property programmatically (it is mutable only when using this trick from a target). In the following code, we remap the platforms Win8 and WP81 to AnyCPU:

  
<Target Name="SharpDXForcePlatform">
    <!--
Windows 8 App Store => AnyCPU
Windows Phone 8.1 => AnyCPU
-->
    <CreateProperty Condition=" '$(Platform)' == 'Win8' or '$(Platform)' == 'WP81'" Value="AnyCPU">
      <Output
          TaskParameter="Value"
          PropertyName="Platform" />
    </CreateProperty>

This way, when the build starts, the Platform property is correctly set up for the platform being compiled. Beware that the Platform property used outside a target (in property groups, etc.) is still linked to the original semantic, which is actually good. But if a Visual Studio build file uses the Platform property outside a target, this trick will not work.
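
The two-phase behavior described here can be pictured with a toy model. It is a deliberate simplification of what msbuild does, with hypothetical names: property groups see the original platform, while anything running after the initial target sees the forced one.

```python
# Custom solution platforms rewritten by the force-platform target.
FORCED_PLATFORMS = {"Win8": "AnyCPU", "WP81": "AnyCPU"}


def force_platform(platform):
    """Platform value once the initial target has run."""
    return FORCED_PLATFORMS.get(platform, platform)


def build(platform):
    """Show which Platform value each phase of the build observes."""
    return {
        # Property groups are evaluated before any target runs.
        "evaluation": platform,
        # Targets run after the initial target has overridden the property.
        "execution": force_platform(platform),
    }


print(build("Win8"))   # the custom platform is only visible at evaluation
print(build("Net40"))  # untouched: not part of the remapping
```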

So the bottom line of this problem is that Visual Studio builds should really take care of this and avoid forcing any semantic on the configuration/platform. Without this, we are forced to use the hack described above or, worse, to duplicate the solution (this was the case for SharpDX, which made the full build quite a pain).

Problem #3: The unwanted Mixed Platforms

When you are using custom platform names and you want to add a new project to your solution, you will most likely end up with a new solution platform, Mixed Platforms. This is really annoying: when we are already dealing with multiple platforms, we don't want Visual Studio to add a useless one. The solution is to remove it by hand in the .sln, but we should not have to do this. At worst, Visual Studio should ask the developer "Do you really want to add a new mixed platform to your solution?"; at best, it should not add Mixed Platforms at all.
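
Removing it by hand can be scripted. The sketch below is only one way to do it, under the assumption that every line mentioning the unwanted solution platform can simply be dropped; the function name is hypothetical, and it should be run on a copy of the .sln (or under source control).

```python
def drop_solution_platform(sln_text, platform="Mixed Platforms"):
    """Remove every line of a .sln that refers to the given platform."""
    marker = "|" + platform
    kept = [
        line
        for line in sln_text.splitlines(keepends=True)
        if marker not in line
    ]
    return "".join(kept)


sample = (
    "  Debug|Mixed Platforms = Debug|Mixed Platforms\n"
    "  Debug|Win8 = Debug|Win8\n"
)
# Only the Win8 line survives.
print(drop_solution_platform(sample), end="")
```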

Problem #4: The Configuration Manager

When managing several platforms with several dozen projects, the Configuration Manager is a real pain to use, and we are always forced to edit the sln by hand and perform some regexp replacements on the file to clean it up or fix it.

There are lots of issues with the current Configuration Manager:

The window is not resizable! If you have more than 12 projects in your solution, you will be using the scrollbar quite a lot.
It is not possible to get a global view of all your projects and which ones are activated for which platforms, etc. Considering that you need to check (Debug AND Release) x the number of platforms, you have to go around for a while clicking, waiting, clicking, scrolling... a nightmare!
It is not possible to bulk edit your projects. You have to go through every single project, with single clicks, drop-downs, etc. for each one.
Switching configuration or platform is slow when you have lots of projects (or some custom .targets). I don't understand why Visual Studio seems to reevaluate all the projects: it can take 2-3 seconds to switch the configuration/platform, while everything should already be accessible from memory (both solution and projects).
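
Until the Configuration Manager offers such a global view, it can be approximated from the ProjectConfigurationPlatforms section of the .sln. This is a rough sketch with a hypothetical function name; it only looks at the Build.0 entries, which mark the projects that are actually built for a given solution configuration/platform.

```python
import re
from collections import defaultdict

# {GUID}.<Config>|<Platform>.<ActiveCfg or Build.0> = <project Config|Platform>
ENTRY = re.compile(
    r"^\s*(\{[0-9A-Fa-f-]+\})\.([^|]+)\|(.+?)\.(ActiveCfg|Build\.0)\s*=\s*(.+?)\s*$"
)


def build_matrix(sln_text):
    """Map project GUID -> {solution config|platform: project config|platform}."""
    matrix = defaultdict(dict)
    for line in sln_text.splitlines():
        match = ENTRY.match(line)
        if match and match.group(4) == "Build.0":
            guid, configuration, platform, _kind, target = match.groups()
            matrix[guid][f"{configuration}|{platform}"] = target
    return dict(matrix)


sample = """\
  {56849035-CEF7-446D-AF0A-51EE9DC1DDB7}.Debug|Any CPU.ActiveCfg = Debug|Any CPU
  {56849035-CEF7-446D-AF0A-51EE9DC1DDB7}.Debug|Any CPU.Build.0 = Debug|Any CPU
  {56849035-CEF7-446D-AF0A-51EE9DC1DDB7}.Release|Any CPU.ActiveCfg = Release|Any CPU
"""
for guid, builds in build_matrix(sample).items():
    print(guid, sorted(builds))
```

In this sample only the Debug entry has a Build.0 line, so the Release configuration does not show up as built.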

vNext

Whoever has done some cross-platform development (even just inside .NET, by targeting different .NET frameworks) with Visual Studio will most likely have struggled with the issues described above.

With the rise of Xamarin, more tightly integrated into Visual Studio, and more development targeting the whole Windows ecosystem (Windows Desktop, Windows AppStore, Windows Phone Store) as well as Android/iOS, all these issues should really be fixed to improve our productivity. Fingers crossed for VS2014, if someone on the Visual Studio team is reading this!

How do you manage these issues in your projects? Do you have any other ideas to improve the situation when targeting multiple platforms in Visual Studio?

PS: I will have to double check whether there are some UserVoice or Connect bugs for the issues described in this post. If you already have a link, I'm interested!

Advanced HLSL II: Shader compound parameters

2013-04-20T03:21:00.000+11:00

A very short post following up on my previous post "Advanced HLSL using closures and function pointers": there is again a neat little trick using the "class" keyword in HLSL. It is possible to use a class to group a set of parameters (shader resources as well as constant buffers) and their associated methods into what is called a compound parameter. This feature of the language is absolutely not documented. I discovered the name "compound parameter" while trying to hack this technique, as the HLSL compiler was complaining about a restriction on this "compound parameter". So at least it seems to be implemented to the point that it is quite usable. Let's see how we can use this...

Group of input parameters in shaders, the usual way

Suppose the following code (not really useful):

// Shader Resources
SamplerState PointClamp;

// First set of parameters
// -----------------------
Texture2D<float> DepthBuffer;
float2 TexelSize;

// Associated methods with these parameters
float SampleDepthBuffer(float2 texCoord, int2 offsets = 0)
{
  return DepthBuffer.SampleLevel(PointClamp, texCoord + offsets * TexelSize, 0.0);
}

// Second set of parameters
// ------------------------
Texture2D<float> DepthBuffer1;
float2 TexelSize1;

// Associated methods with these parameters
float SampleDepthBuffer1(float2 texCoord, int2 offsets = 0)
{
  return DepthBuffer1.SampleLevel(PointClamp, texCoord + offsets * TexelSize1, 0.0);
}

float4 PSMain(float2 texCoord: TEXCOORD) : SV_TARGET
{
   return float4(SampleDepthBuffer(texCoord, int2(1, 0)), SampleDepthBuffer1(texCoord, int2(1, 0)), 0, 1);
}

What we have is a set of parameters that are grouped together, for example:

A resource DepthBuffer
A TexelSize that gives the size of a texel in uv coordinates for the previous texture (float2(1/width, 1/height))
A method "SampleDepthBuffer" that will sample the depth buffer.

And this set of parameters is duplicated in another set with just the postfix number "1", so we need to duplicate the code. Of course, as usual, there are some workarounds:

Using the preprocessor and token pasting: this approach is often used, but it means that the code is sometimes less readable, especially if you have to embed a function in a #define.
For the SampleDepthBuffer methods, rewriting the signature to accept a Texture2D as well as a TexelSize as parameters. Of course, if this function was using more textures and more parameters, we would have to pass them all as parameters...

The generated code produced by fxc.exe HLSL compiler is like this:

//
// Generated by Microsoft (R) HLSL Shader Compiler 9.29.952.3111
//
//
//   fxc /Tps_5_0 /EPSMain test.fx
//
//
// Buffer Definitions:
//
// cbuffer $Globals
// {
//
//   float2 TexelSize;                  // Offset:    0 Size:     8
//   float2 TexelSize1;                 // Offset:    8 Size:     8
//
// }
//
//
// Resource Bindings:
//
// Name                                 Type  Format         Dim Slot Elements
// ------------------------------ ---------- ------- ----------- ---- --------
// PointClamp                        sampler      NA          NA    0        1
// DepthBuffer                       texture   float          2d    0        1
// DepthBuffer1                      texture   float          2d    1        1
// $Globals                          cbuffer      NA          NA    0        1
//
//
//
// Input signature:
//
// Name                 Index   Mask Register SysValue Format   Used
// -------------------- ----- ------ -------- -------- ------ ------
// TEXCOORD                 0   xy          0     NONE  float   xy
//
//
// Output signature:
//
// Name                 Index   Mask Register SysValue Format   Used
// -------------------- ----- ------ -------- -------- ------ ------
// SV_TARGET                0   xyzw        0   TARGET  float   xyzw
//

ps_5_0
dcl_globalFlags refactoringAllowed
dcl_constantbuffer cb0[1], immediateIndexed
dcl_sampler s0, mode_default
dcl_resource_texture2d (float,float,float,float) t0
dcl_resource_texture2d (float,float,float,float) t1
dcl_input_ps linear v0.xy
dcl_output o0.xyzw
dcl_temps 1
mad r0.xyzw, cb0[0].xyzw, l(1.000000, 0.000000, 1.000000, 0.000000), v0.xyxy
sample_l_indexable(texture2d)(float,float,float,float) r0.x, r0.xyxx, t0.xyzw, s0, l(0.000000)
sample_l_indexable(texture2d)(float,float,float,float) r0.y, r0.zwzz, t1.yxzw, s0, l(0.000000)
mov o0.xy, r0.xyxx
mov o0.zw, l(0,0,0,1.000000)
ret
// Approximately 6 instruction slots used

When we have to deal with lots of grouped parameters, and these groups need to be duplicated along with their associated methods, it becomes almost impossible to maintain clean and reusable HLSL code. Fortunately, the "class" keyword is here to the rescue!

Shader input compound parameters container, the neat way

Let's rewrite the previous code using the keyword "class":

SamplerState PointClamp;

// Declare a container for our set of parameters
class TextureSet
{
    Texture2D<float> DepthBuffer;
    float2 TexelSize;

    float SampleDepthBuffer(float2 texCoord, int2 offsets = 0)
    {
        return DepthBuffer.SampleLevel(PointClamp, texCoord + offsets * TexelSize, 0.0);
    }
};

// Define two instance of compound parameters
TextureSet Texture1;
TextureSet Texture2;

float4 PSMain2(float2 texCoord: TEXCOORD) : SV_TARGET
{
    return float4(Texture1.SampleDepthBuffer(texCoord, int2(1, 0)), Texture2.SampleDepthBuffer(texCoord, int2(1, 0)), 0, 1);
}

And the resulting compiled HLSL is almost equivalent:

//
// Generated by Microsoft (R) HLSL Shader Compiler 9.29.952.3111
//
//
//   fxc /Tps_5_0 /EPSMain2 test.fx
//
//
// Buffer Definitions:
//
// cbuffer $Globals
// {
//
//   struct TextureSet
//   {
//
//       float2 TexelSize;              // Offset:    0
//
//   } Texture1;                        // Offset:    0 Size:     8
                                        // Texture:   t0
//
//   struct TextureSet
//   {
//
//       float2 TexelSize;              // Offset:   16
//
//   } Texture2;                        // Offset:   16 Size:     8
                                        // Texture:   t1
//
// }
//
//
// Resource Bindings:
//
// Name                                 Type  Format         Dim Slot Elements
// ------------------------------ ---------- ------- ----------- ---- --------
// PointClamp                        sampler      NA          NA    0        1
// Texture1.DepthBuffer              texture   float          2d    0        1
// Texture2.DepthBuffer              texture   float          2d    1        1
// $Globals                          cbuffer      NA          NA    0        1
//
//
//
// Input signature:
//
// Name                 Index   Mask Register SysValue Format   Used
// -------------------- ----- ------ -------- -------- ------ ------
// TEXCOORD                 0   xy          0     NONE  float   xy
//
//
// Output signature:
//
// Name                 Index   Mask Register SysValue Format   Used
// -------------------- ----- ------ -------- -------- ------ ------
// SV_TARGET                0   xyzw        0   TARGET  float   xyzw
//

ps_5_0
dcl_globalFlags refactoringAllowed
dcl_constantbuffer cb0[2], immediateIndexed
dcl_sampler s0, mode_default
dcl_resource_texture2d (float,float,float,float) t0
dcl_resource_texture2d (float,float,float,float) t1
dcl_input_ps linear v0.xy
dcl_output o0.xyzw
dcl_temps 1
mad r0.xy, cb0[0].xyxx, l(1.000000, 0.000000, 0.000000, 0.000000), v0.xyxx
sample_l_indexable(texture2d)(float,float,float,float) r0.x, r0.xyxx, t0.xyzw, s0, l(0.000000)
mov o0.x, r0.x
mad r0.xy, cb0[1].xyxx, l(1.000000, 0.000000, 0.000000, 0.000000), v0.xyxx
sample_l_indexable(texture2d)(float,float,float,float) r0.x, r0.xyxx, t1.xyzw, s0, l(0.000000)
mov o0.y, r0.x
mov o0.zw, l(0,0,0,1.000000)
ret
// Approximately 8 instruction slots used

There are a couple of things to highlight:

The main difference is how the constant buffer variables are packed: the variables of each compound parameter are packed together, as a struct, and aligned on a float4 boundary. So in this specific case, the two float2 TexelSize variables cannot be swizzled/merged into a single register (if they were float4, the code would be strictly equivalent). We need to be aware of this behavior and careful about it.
Input resources are nicely prefixed by their compound parameter name, like "Texture1.DepthBuffer" or "Texture2.DepthBuffer", so it is also really easy to access them when using named resource bindings in an effect. Note that a resource declared but unused inside a compound parameter will still occupy a register slot (this is not a big deal, as arrays of resources behave almost the same way).
We can still enclose "TextureSet Texture1" in a constant buffer declaration: the variables defined inside TextureSet for the Texture1 instance will correctly end up in the corresponding constant buffer.
Global variables are accessible from methods defined in a compound parameter (for example the PointClamp SamplerState used by the SampleDepthBuffer method)
Compound parameters can only be compiled using SM5.0 (unlike the previous post about the closures).
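To make the packing remark from the first point more concrete, here is a minimal C++ sketch of how offsets get assigned inside a constant buffer. It only models two rules (a struct always starts on a new float4 register, and a plain vector cannot straddle a register boundary), and `PlaceInCBuffer` and `kRegisterBytes` are names made up for this illustration, not part of any Direct3D API.

```cpp
#include <cstddef>

// Size of one constant register (a float4) in bytes.
constexpr std::size_t kRegisterBytes = 16;

// Returns the byte offset at which a variable of `size` bytes lands when it
// is appended at `cursor`. Assumption: only these two rules are modeled:
//  - a struct always starts on a new 16-byte register
//  - a plain vector may not straddle a register boundary
std::size_t PlaceInCBuffer(std::size_t cursor, std::size_t size, bool isStruct) {
    const std::size_t offsetInRegister = cursor % kRegisterBytes;
    const bool straddles = offsetInRegister + size > kRegisterBytes;
    if (offsetInRegister != 0 && (isStruct || straddles)) {
        cursor += kRegisterBytes - offsetInRegister;  // skip to the next register
    }
    return cursor;
}
```

With two 8-byte structs, the second one lands at offset 16, which matches the offsets reported by fxc in the listing above, while two plain float2 variables would land at offsets 0 and 8 and share a single register.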

This is really a handy feature that could help to better organize some of our shaders. It's always surprising to still discover this kind of syntax construction accessible from the current HLSL compiler. Let me know if you find any issues using this trick!

Going Native 2.0, The future of WinRT

2012-08-08T18:59:00.001+11:00

In recent years, we have seen lots of fuss about the return of “Going native” after the managed era popularized by Java and .NET. When WinRT was revealed last year, there were some shortsighted comments claiming that “.NET is dead” and glorifying the comeback of C++, the one and only true way to develop an application, while at the same time, JIT was being introduced more and more in the scripted world (JavaScript being one of the most prominent JIT users). In the end, everything is going native anyway - the difference being the length of the path to go native and how optimized the result will be - but the meaning of the word “native” has slightly shifted to be strongly and implicitly coupled with the word “performance”. Even as a strong advocate of managed languages, I have to admit that their performance level is indeed below that of a well written C++ application, so should we just accept this fact and get back to work with C++, with things like WinRT being the backbone of the interop? To tell you the truth, I want .NET to die and this post is about why and for what.

The Managed Era

Let’s begin by revisiting the recent history of managed development, which will highlight the current challenges. Remember the Java slogan? “Write once, run anywhere”: it was the introduction of a paradigm where a complete “safe” single language stack, based on a virtual machine associated with a large set of APIs, would make it easy to develop an application and target any kind of platform/OS. It was the beginning of the “managed” era. While Java has been quite successfully adopted in several development industries, it was also rejected by lots of developers who were aware of its memory management caveats and of a JIT not as optimized as it should be (though it has seen some impressive improvements over the years), along with a tremendous amount of bad design choices, like the lack of native structs and unsafe access, or the extremely laborious and inefficient route to native code through JNI (and even recently, they were considering getting rid of all native types and making everything an object, what a terrible direction!).

Java also failed at the heart of its slogan: it was in fact not possible to embrace in a single unified API all the usages of each target platform/OS, leading to things like Swing, which can hardly be called an optimal UI framework. Also, from the beginning, Java was designed with a single language in mind, though lots of people saw the JIT/bytecode as an opportunity to port scripting languages to the JVM.

In the early days of Java, Microsoft tried to enter the Java market by integrating some custom language extensions (with the end of the story we know) and finally came up with its own managed technology, which was in several aspects better conducted and designed from the ground up: bytecode, unsafe constructs, native interop, a lightweight but very efficient JIT + NGEN, rapid C# language evolution, C++/CLI... etc., taking multiple language interop into account from the beginning and without the burden of the Java slogan (though Silverlight on MacOS or Moonlight were a good try).

Both systems share a similar monolithic managed stack: metadata, bytecode, JIT and GC are tightly coupled. Also, performance wise, it is far from perfect: the JIT implies a startup cost and the executed code is not as fast as it should be, mainly because:

The JIT performs poor optimizations compared to a full C++ -O2, because it needs to be fast when generating code (also, unlike the Java HotSpot JVM, the .NET JIT is not able to hot swap existing JIT code with better optimized code)
Managed types, like arrays, always check bounds on access (apart from simple loops where the JIT can suppress the check if the for-limit is less than or equal to the array’s length)
The GC can pause all threads to collect objects (though the new GC in 4.5 made some improvements), which can cause unexpected slowdowns in an application.
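The bounds checking point can be pictured with a small C++ analogy (this is not .NET code, only a sketch): `std::vector::at` validates the index on every access, much like a managed array does, while `operator[]` simply trusts the caller. `SumChecked` and `SumUnchecked` are hypothetical helper names.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Checked access: each read validates the index and throws
// std::out_of_range when it is past the end of the vector.
int SumChecked(const std::vector<int>& values, std::size_t count) {
    int sum = 0;
    for (std::size_t i = 0; i < count; ++i) {
        sum += values.at(i);
    }
    return sum;
}

// Unchecked access: the caller must guarantee count <= values.size(),
// otherwise the behavior is undefined.
int SumUnchecked(const std::vector<int>& values, std::size_t count) {
    int sum = 0;
    for (std::size_t i = 0; i < count; ++i) {
        sum += values[i];
    }
    return sum;
}
```

Both functions return the same result for valid input; the difference is that the checked version pays for a comparison on each access, which is the kind of cost a JIT tries to remove when it can prove that the loop limit stays within the array length.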

But even with this performance deficiency, a managed ecosystem with its comprehensive framework is the king of productivity and language interop, with a decent overall performance for all languages running inside it. The apogee of the managed era was probably around the launch of Windows Phone and Visual Studio 2010 (using WPF for its rendering, though WPF is also built on top of lots of native code), where managed languages were the only authorized way to develop an application. That was not the best thing that could happen, considering the long list of pending issues with .NET performance, enough to stimulate all the “native coders” to strike back, and they were absolutely within their rights.

It turns out that this somehow marked the "decline" of .NET. I don’t know much about Microsoft's organization internals, but what is commonly reported is that there is some serious competition between divisions, good or bad. For .NET, for the past few years, Microsoft seemed to be running out of gas (for example, almost no significant improvements in the JIT/NGEN, lots of pending requests for performance improvements, including things like SIMD that have been asked for a long time), and my guess is that the required changes could only take place as part of a global strategy, with deep support and commitment from all divisions.

In the meantime, Google was starting to push its Native Client technology, allowing sandboxed native code to run from the browser. Last year, in this delirious trend of going native, Microsoft revealed that even HTML5 as implemented in the next IE was going native! Sic.

In "Reader Q&A: When will better JITs save managed code?", Herb Sutter, one of the "Going Native" evangelists, provides some interesting insights into what the "Going Native" philosophy thinks about JIT, with lots of inaccurate facts, but let's just focus on the main one: even if JITs could improve in the future, managed languages made such a choice of safety over performance that they are intrinsically doomed not to play in the big leagues. Miguel de Icaza posted a response in "Can JITs be faster?", explaining lots of relevant things about why some of Herb Sutter's statements were misleading.

Then WinRT came along to somewhat smooth the lines. By taking part of the .NET philosophy (metadata and some common “managed” types like strings and arrays) and the good old COM model (for a common denominator of native interop), WinRT is trying to solve the problem of language interoperability outside the CLR world (thus without the performance penalties for C++) and to provide a more “modern” OS API. Is this the definitive answer, the one that will rule them all? So far, not really: it goes in the direction of a certain convergence that could lead to great things, but it is still uncertain whether it will take the right track. But what could this “right track” be?

Going native 2.0, Performance for All

Though safety rules can have a negative impact on performance, managed code is not doomed to be run by a poor JIT compiler (for example, Mono is able to run C# code natively compiled through LLVM on iOS/Linux) and it would be fairly easy to extend the bytecode with more "unsafe" levels to provide fine grained performance speedups (like suppressing array bounds checking... etc.).

But the first problem that can currently be identified is the lack of a strong cross-language compiler infrastructure: this ranges from the compiler used in the IE10 JavaScript JIT, to the .NET JIT and NGEN compilers, to the Visual C++ compilers (to name a few), all using different code for almost the same laborious and difficult problem of generating efficient machine code. Having a single common compiler is a very important step towards providing high performance code accessible from all languages.

Felix9 on Channel9 found that Microsoft could actually be working on this problem, so that's good news, but the problem of "performance for all" is a small part of a bigger picture. In fact, the previously mentioned "right track" is a broader integrated architecture, not only an enhanced LLVM stack, but one backed by Microsoft's experience in several fields (C++ compiler, JIT, GC, metadata... etc.), a system that would expose a completely externalized and modularized “CLR” composed of:

An intermediate mid level language, entirely queryable/reflectable, very similar to LLVM IR or .NET bytecode, defining common datatypes (primitives, string, array... etc.). An API similar to System.Reflection.Emit should be available. Vectorized types (SIMD) should be first class types, as int or double are. This IL code should not be limited to CPU target usage, but should allow GPU computing (similar to AMP): it should be possible to express HLSL bytecode with this IL, with the benefit of leveraging a common compiler infrastructure (see following points). Typeless IL should also be possible, to allow dynamic languages to be expressed more directly.
A dynamically linked library/executable, like assemblies in .NET, providing metadata and IL code, query/reflection friendly. When developing, code should be linked against assemblies/IL code (and not against crappy C/C++ headers).
An IL to native code compiler, which could be integrated in a JIT, an offline or a cloud compiler, or a mixed combination. This compiler should provide vectorization whenever the target platform supports it. IL code would be compiled to native code at install/deploy time, based on the target machine architecture (at dev time, it could be done after the whole application has been compiled to IL). The compiler stages should be accessible from an API and offer extension points as much as possible (providing access to IL to IL optimization, or to provide pluggable IL to native code transforms). The compiler would be responsible for performing global program optimization at deploy time (or at runtime for JIT scenarios). Optimization options should range from fast compilation (like a JIT) to aggressive (offline, or hot swap code in a JIT). A profile of the application could also be used to automatically tune localized optimizations. This compiler should support advanced JIT scenarios, like dynamic hotspot analysis and On Stack Replacement (aka OSR, allowing heavy computation code to be replaced at runtime by better optimized code), unlike the current .NET JIT that only compiles a method on its first run. These kinds of optimizations are really important in dynamic scenarios where type inference is sometimes discovered later (like JavaScript).
An extensible allocator/memory component, allowing concurrent allocators, where the Garbage Collector (GC) would be one implementation: a major part of applications would use it to manage most of their objects' lifecycle, leaving the most performance critical objects to be managed by other allocator schemes (like the reference counting used by COM/WinRT). There is no restriction on using different allocator models in the same application (and this is already what happens when, in a .NET application, we need to deal with native interop to allocate objects using OS functions).
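As a rough illustration of the last point, here is a small C++ sketch of an allocator component where several allocation schemes coexist in the same program. All the names (`IAllocator`, `HeapAllocator`, `ArenaAllocator`, `New`) are hypothetical and only serve the example; in the system described above, a GC would simply be one more implementation of the same interface.

```cpp
#include <cstddef>
#include <new>
#include <utility>
#include <vector>

// Hypothetical allocator interface: any allocation scheme plugs in here.
struct IAllocator {
    virtual ~IAllocator() = default;
    virtual void* Allocate(std::size_t size) = 0;
    virtual void Release(void* ptr) = 0;
};

// Scheme 1: plain heap allocation, released object by object.
struct HeapAllocator : IAllocator {
    std::size_t live = 0;  // number of blocks not yet released
    void* Allocate(std::size_t size) override { ++live; return ::operator new(size); }
    void Release(void* ptr) override { --live; ::operator delete(ptr); }
};

// Scheme 2: bump arena, all blocks are dropped at once with Reset().
struct ArenaAllocator : IAllocator {
    std::vector<unsigned char> buffer;
    std::size_t used = 0;
    explicit ArenaAllocator(std::size_t capacity) : buffer(capacity) {}
    void* Allocate(std::size_t size) override {
        const std::size_t align = alignof(std::max_align_t);
        const std::size_t start = (used + align - 1) / align * align;
        if (start + size > buffer.size()) throw std::bad_alloc();
        used = start + size;
        return buffer.data() + start;
    }
    void Release(void*) override {}  // individual release is a no-op
    void Reset() { used = 0; }
};

// A "configurable new": the caller chooses the allocation scheme.
// Objects with non-trivial destructors would also need an explicit
// destructor call before their memory is released.
template <typename T, typename... Args>
T* New(IAllocator& allocator, Args&&... args) {
    return new (allocator.Allocate(sizeof(T))) T(std::forward<Args>(args)...);
}
```

The point of the sketch is only that the calling code stays the same, `New<T>(allocator, ...)`, whatever scheme sits behind the interface.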

The philosophy is very similar to a CLR stack, however it doesn't force an application to be run by a JIT compiler (yes, there is NGEN in .NET, but it was designed for startup reasons, not for high performance reasons, plus it is a black box only working on assemblies installed into the GAC) and it allows mixed GC/non-GC memory allocation scenarios.

In this system, full native interoperability between languages would then be straightforward, without sacrificing performance for simplicity and vice versa. Ideally, an OS should be built from the ground up with such a core infrastructure. This is what was (is?) probably behind a project like Redhawk (for the compiler part), or Midori (for the OS part). In such an integrated system, probably only drivers would require some kind of unsafe behavior.

[Update 9 Aug 2012: Felix9 again found that an intermediate bytecode, lower level than the MSIL .NET bytecode, called MDIL, could already be in use, and that it could be the intermediate bytecode mentioned just above, though looking at the related patent "INTERMEDIATE LANGUAGE SUPPORT FOR CHANGE RESILIENCE", there are some native x86 registers in the specs that don't fit well with an architecture independent bytecode. Maybe they would keep MSIL as-is and leverage a lower level MDIL. We will see.].

So what is WinRT tackling in this big picture? Metadata, a bit of sandboxed API and an embryo of interoperability (through common datatypes and metadata): as we can see, not so much, a basic COM++. And as we can obviously realize, WinRT is not able to provide advanced optimizations in scenarios where we use a WinRT API: for example, we cannot have a plain structure that exposes inlinable methods. Every method call in WinRT is a virtual call, forced to go through a vtable (and sometimes several virtual calls are needed, for example when a static method is used), so even a simple property get/set will go through a virtual call. This is clearly inefficient. It looks like WinRT is only targeting coarse level APIs, leaving all the fine grained APIs at the mercy of performance heterogeneity, restricting common scenarios where we want to access high performance code everywhere, without going through a layer of virtual calls and non-inlinable code. Using an extended COM model is not what we can call “Building the Future”.
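To picture the difference, here is a minimal C++ sketch comparing a COM-style interface, where even a property getter goes through a vtable, with a plain structure whose accessor body is visible to the compiler and can be inlined. The types `IPoint`, `PointObject` and `PointValue` are made up for this example; they are not WinRT types.

```cpp
// COM-style interface: every accessor is a virtual call through a vtable.
struct IPoint {
    virtual ~IPoint() = default;
    virtual float GetX() const = 0;
};

struct PointObject final : IPoint {
    float x;
    explicit PointObject(float value) : x(value) {}
    float GetX() const override { return x; }
};

// Plain structure: the accessor can be inlined at the call site.
struct PointValue {
    float x;
    float GetX() const { return x; }
};

// The caller only knows the interface, so each GetX() is an indirect call.
float SumThroughInterface(const IPoint& point, int count) {
    float sum = 0.0f;
    for (int i = 0; i < count; ++i) sum += point.GetX();
    return sum;
}

// Here the compiler sees the body of GetX() and is free to inline it.
float SumInline(const PointValue& point, int count) {
    float sum = 0.0f;
    for (int i = 0; i < count; ++i) sum += point.GetX();
    return sum;
}
```

Both functions compute the same value. Inside a single program an optimizing compiler can sometimes devirtualize the first form, but when the implementation lives behind a binary API boundary, the call generally has to stay indirect.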

Productivity and Performance for C# 6.0

A language like C# would be a perfect candidate in such a modular CLR system, and could be mapped easily to the previous intermediate bytecode. Though to use such a system efficiently, C# should be improved in several aspects:

More unsafe power where we could turn off “managed” behaviors like array access checking (kind of “super unsafe mode”, where we could possibly use CPU pre-caching instructions before accessing next array elements, kind of "advanced" stuff impossible to do with current managed arrays without using unsupported tricks)
A configurable new operator that would integrate different allocator schemes.
Vectorized types (like HLSL float4) should be added to the core types. This has been asked for a long time (with ugly patches in XNA WP to "solve" this problem).
Lightweight interop to native code (in the case we would still be calling native code from C#, unlike in an integrated OS): the current managed to unmanaged transition is costly when calling native methods, even without any "fixed" variables. An unsafe transition should be possible without the burden of the current x86/x64 prologue/epilogue of the unmanaged transition generated by the current .NET JIT.

From a general language perspective, not strictly related to performance, there are lots of small areas that would be important to address as well:

Generics everywhere (in constructors, in implicit conversions) with more advanced constructs (contracts on operators... etc), closer to C++ template versatility but safer and less cluttered.
Struct inheritance and finalizers (to allow lightweight code to be executed on exit of a method, without going through the cumbersome "try/finally" or "using" patterns).
Add more meta-programming: allow static method extensions (not only for "this"), allow class mixins (mixing the content of a class inside another, useful for things like math functions), allow modification of class/type/method construction at compile time (for example, methods that would be called at compile time to add methods/properties to a class, very similar to eigenclasses in Ruby meta-programming, instead of using things like T4 template code generation), and more extensively, allow DSL-like syntax language extensions at several points in the C# parser (Roslyn doesn't currently provide any extension point inside the parser) so that we could express language extensions in C# as well (for example, instead of having the Linq syntax hardcoded, we should be able to write it as an extension parser plugin, fully written in C#). [Edit] I have posted a discussion "Meta-Programming and parser extensibility in C# and Roslyn" about what is intended behind this meta-programming thing at the Microsoft Roslyn forum. Check it out![/Edit]
A builtin symbol or link type where we could express a link to a language object (a class, a property, a method) by using a simple construction like: symbol LinkToMyMethod = @MyClass.MyMethod; instead of using Linq expressions (like (myMethod) => MyMethod inside MyClass). This would make more robust code using INotifyPropertyChanged or simplify all property based systems like WPF (which is currently an ugly duplication of the method definition).
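For the struct finalizer idea in the list above, C++ destructors give a good picture of what is meant: code attached to a value runs automatically when the enclosing scope exits, without any try/finally. The following is only a C++ analogy, not a C# proposal, and `ScopeTrace` and `Compute` are hypothetical names.

```cpp
#include <string>
#include <utility>
#include <vector>

// Hypothetical tracer: records when a scope is entered and left.
struct ScopeTrace {
    std::vector<std::string>& log;
    std::string name;
    ScopeTrace(std::vector<std::string>& target, std::string scopeName)
        : log(target), name(std::move(scopeName)) {
        log.push_back("enter " + name);
    }
    // Runs on every exit path of the enclosing scope.
    ~ScopeTrace() { log.push_back("leave " + name); }
};

int Compute(std::vector<std::string>& log, int value) {
    ScopeTrace trace(log, "Compute");
    if (value < 0) return 0;  // early exit still runs the destructor
    log.push_back("work");
    return value * 2;
}
```

Calling `Compute(log, 21)` leaves "enter Compute", "work", "leave Compute" in the log, in that order, and the "leave" entry is also written when the function returns early.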

The bottom line is that there is less to add to C# than there is to remove from C++ to fully leverage such a system and to greatly improve developers’ productivity, again without burning efficiency. One could argue that C++ already offers all of this and much more, but this is exactly why C++ is so cluttered (syntax wise) and dangerous for the vast majority of developers. It allows unsafe code everywhere, while unsafe code is always localized in an application (and is always a source of memory corruption, so it is much easier to fix if it is clearly identified and strictly localized in the code, same as using the asm keyword in non standard C/C++). It is easier and safer to track exceptional usages in a large codebase than to have them allowed everywhere.

Next?

We can hope that Microsoft took a top-down approach, by addressing a unified OS API for all languages and simple interoperability first, and that they will introduce these more advanced features in later versions of their OS. But this is an ideal expectation and it will be interesting to see whether Microsoft will effectively take up this challenge. Even if it was recently revealed that WP8 .NET applications would benefit from some cloud compilers, so far we don't know much about it: is it just a repackaging of NGEN (which is, again, not performance oriented, generating code very similar to the current JIT) or a non public RedHawk compiler?

Microsoft has lots of gold in their backyard, with years of advanced native code compilations with their C++ compiler, JIT, GC, and all the related R&D projects they have...

So to summarize this post: .NET must die in favor of a better integrated, performance oriented, common runtime where managed (safety/productivity) vs native (performance) is no longer a border, and this should be a structural part of the next WinRT architecture evolution.

Advanced HLSL using closures and function pointers

2011-11-24T23:09:00.001+11:00

Shader languages like HLSL, Cg or GLSL are nowadays driving the most powerful processors in the world, but if you are developing with them, you may already have been a little frustrated by one of their expressiveness limitations: the common problem of abstraction and code reuse. In order to overcome this problem, solutions so far were mostly using a glue combination of #define/#include preprocessor directives in order to generate combinations of code, permutations of shaders, the so-called UberShaders. Recently, this problem has been addressed, for HLSL (new in Direct3D11), by providing the concept of Dynamic Linking, and for GLSL, the concept of SubRoutines. For Direct3D11, the new mechanism is only available for Shader Model 5.0, meaning that even if it could greatly simplify the problem of abstraction, it is unfortunately only available for Direct3D11 class graphics cards, which is of course a huge limitation...

But here is the good news: while the classic usage of dynamic linking is not really possible from earlier versions (like SM4.0 or SM3.0), I have found an interesting hack to bring some kind of closures and function pointers to HLSL(!). This solution doesn't involve any kind of preprocessing directive and is able to work with SM3.0 and SM4.0, so it might be interesting for folks like me who like to abstract and reuse code as often as possible! But let's see how it can be achieved...

A simple problem of abstraction and code reuse in HLSL

I have recently been working, at my job, on a GPU implementation of a versatile perlin/simplex/fbm/turbulence noise in HLSL. While some of the individual algorithms are pretty simple, it is common to use several permutations of those functions in order to produce some nice noise and turbulence functions (like the worm-lava texture I did for the Ergon 4k intro). Thus, they are an ideal candidate to demonstrate the use of closures and function pointers. I won't explain here the basic principles of perlin and fbm noise generation, in order to focus on the problem of code reuse in HLSL.

Here is a simplified version of a Turbulence Noise implemented in a Pixel Shader:

float PerlinNoise(float2 pos){
  ....
}

float AbsNoise(float2 pos) {
    return abs(PerlinNoise(pos));
}

float FBMNoise(float2 pos) {
    float value = 0.0f;
    float frequency = InitialFrequency;
    float amplitude = 1.0f;
    // Classic FBM loop
    for ( int i=0; i < Octaves; i++ )
    {
        // Sample the noise at the current octave frequency
        float noiseValue = AbsNoise(pos * frequency);
        value += amplitude * noiseValue;
        frequency *= Lacunarity;
        amplitude *= Amplitude;
    }
    return value;
}

// Turbulence noise:
// Fbm + Abs + Perlin
float TurbulenceAbsPerlinNoisePS(float4 pos : SV_POSITION, float2 texPos : TEXCOORD0)
 : SV_Target
{
    return FBMNoise(texPos);
}

The problem with the previous code is that if we want to change the code behind AbsNoise called from FBMNoise (for example, apply cos/sin on the coordinates, or use a simplex noise instead of the old Perlin noise), we would have to duplicate the FBMNoise function to call the other function. Of course, we could use the preprocessor to inline the code, but it would end up in something less readable, less debuggable, error prone... etc.

Another example: Ken Perlin introduced some really cool functions to modify the noise, like the famous marble effect:

static float stripes(float x, float f) {
    float t = .5 + .5 * sin(f * 2*PI * x);
    return t * t - .5;
}

float MarbleNoise(float2 pos) { 
    return stripes(pos.x + 2 * FBMNoise(pos), 1.6f);
}

But wait! The MarbleNoise function could even be used in place of the AbsNoise function, in order to get another noise effect. So we could have a marble function calling an FBM... but we could also have a marble function called by an FBM... or both... ugh... So as we can see, it is possible to permute those functions to generate interesting patterns, but unfortunately, the shading language doesn't provide us with a way to make those functions pluggable!... Almost! In fact, there is a small breach in the HLSL language and we are going to use it!
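As a point of comparison, in a general purpose language this kind of pluggability is straightforward. Here is a minimal C++ sketch (not HLSL, and not the technique this post builds up to) where each noise stage takes its source as a parameter, so the stages can be permuted freely. `Float2`, `Noise`, `BaseNoise`, `Abs`, `Fbm` and `Marble` are names chosen for the example, and `BaseNoise` is a cheap placeholder function, not a real Perlin noise.

```cpp
#include <cmath>
#include <functional>

struct Float2 { float x, y; };
using Noise = std::function<float(Float2)>;

// Placeholder base noise (NOT Perlin noise): any deterministic function
// is enough to show the composition.
Noise BaseNoise() {
    return [](Float2 p) { return std::sin(p.x * 12.9898f + p.y * 78.233f); };
}

// abs() applied on top of any source.
Noise Abs(Noise source) {
    return [source](Float2 p) { return std::fabs(source(p)); };
}

// Classic FBM loop over any source.
Noise Fbm(Noise source, int octaves, float lacunarity, float gain) {
    return [=](Float2 p) {
        float value = 0.0f;
        float frequency = 1.0f;
        float amplitude = 1.0f;
        for (int i = 0; i < octaves; ++i) {
            value += amplitude * source(Float2{p.x * frequency, p.y * frequency});
            frequency *= lacunarity;
            amplitude *= gain;
        }
        return value;
    };
}

float Stripes(float x, float f) {
    const float kPi = 3.14159265f;
    const float t = 0.5f + 0.5f * std::sin(f * 2.0f * kPi * x);
    return t * t - 0.5f;
}

// Marble effect on top of any source.
Noise Marble(Noise source) {
    return [source](Float2 p) { return Stripes(p.x + 2.0f * source(p), 1.6f); };
}
```

With these building blocks, `Fbm(Abs(BaseNoise()), 4, 2.0f, 0.5f)` gives a turbulence, `Marble(Fbm(BaseNoise(), 4, 2.0f, 0.5f))` a marble calling an FBM, and `Fbm(Marble(BaseNoise()), 4, 2.0f, 0.5f)` a marble called by an FBM, all without duplicating the FBM loop.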

Introduction to Dynamic Linking in HLSL

So as I said in the introduction, Direct3D11 has introduced the concept of dynamic linking. I suggest the reader look at the explanation on MSDN, "Interfaces and classes". Basically, the main feature introduced in the HLSL language is a bit of Object Oriented Programming (OOP) in order to address the problem of abstraction: HLSL now has the class and interface keywords. But they were mainly introduced for dynamic linking of a shader, and as I said, dynamic linking is only available with the SM5.0 profile.

// An interface describing a light
interface ILight {
    float3 ComputeAmbient(...);
    float3 ComputeDiffuse(...);
    float3 ComputeSpecular(...);
};

// A 1st implem of the ILight interface
class MyModelLight1 : ILight { 
    float3 ComputeAmbient(...) {
        ...
        return color;
    } 
    ...
};

// A 2nd implem of the ILight interface
class MyModelLight2 : ILight { 
    float3 ComputeAmbient(...) {
        ...
        return color;
    } 
    ...
};

// The variable through which we are going to access the light model
ILight abstractLight;

// We need to declare the two implems in order to get a reference 
// to them from C++ code
MyModelLight1  modelLight1;
MyModelLight2 modelLight2;

float4 PixelShader(PS_INPUT Input ) : SV_Target
{
    // Call the abstractLight that was previously setup by C++ at 
    // PixelShader creation time
    float3 ambient = abstractLight.ComputeAmbient(Input.Pos);
    float3 diffuse = abstractLight.ComputeDiffuse(Input.Pos);
    float3 specular = abstractLight.ComputeSpecular(Input.Pos);

    return float4(saturate( ambient + diffuse + specular ), 1.0);
}

To be able to use this shader, we need to set up the abstractLight variable from the C++/C# code, through ID3D11Device::CreateClassLinkage and at the instantiation of the pixel shader with ID3D11Device::CreatePixelShader.

As we can see, we need to declare the interface and class variables globally, so that they can be accessed by the C++ program. This is the standard way to use dynamic linking in HLSL... but what if we want to use this differently?

Hacking function pointers in HLSL

The principle is very simple: instead of using interfaces and classes as global variables, we can in fact use them as function parameters and even as local variables inside a method. The way to use it is then straightforward:

// Base class for a calculator
interface ICalculator {
    float Compute(...);
};

// 1st implem of the calculator
class ClassicCalculator : ICalculator { 
    float Compute(...) {
        ...
        return value;
    } 
};

// 2nd implem of the calculator
class ComplexCalculator : ICalculator { 
    float Compute(...) {
        ...
        return value;
    } 
};

// A function using the interface ICalculator 
float MyFunctionUsingICalculator(ICalculator calculator, ...) {
    ...
    value += calculator.Compute(...);
    ...
    return value;
} 

// A Pixel shader using the ClassicCalculator
float PixelShader1(PS_INPUT Input ) : SV_Target
{
    ClassicCalculator classic;
    return MyFunctionUsingICalculator(classic, ...);
}

// A Pixel shader using the ComplexCalculator
float PixelShader2(PS_INPUT Input ) : SV_Target
{
    ComplexCalculator complex;
    return MyFunctionUsingICalculator(complex, ...);
}

The previous example compiles just fine with ps_4_0 (Shader Model 4) or ps_3_0 (with some minor changes for the pixel shader)! So basically, the interface ICalculator is acting as a function pointer that has two implementations available through the ClassicCalculator and ComplexCalculator classes. MyFunctionUsingICalculator doesn't have to change its signature to adapt to the underlying function, so as we can see, we have a suitable solution for developing function pointers in HLSL.

Now, let's try to see if we could use this model to build our flexible noise functions. Replace ICalculator with an INoise interface. We can see that an implementation would have to call another INoise interface. In fact, ideally, we would like to code something like this:

// Base class for a noise function
interface INoise {
    float Compute(...);
};

// Perlin noise implem
class PerlinNoise : INoise { 
    float Compute(...) {
        ...
        return value;
    } 
};

// FBM noise implem
class FBMNoise : INoise { 
    // Would be ideal to be able to do that
    // We could even make an abstract generic class 
    // that could provide a base Source INoise
    // BUT, THIS IS NOT COMPILING!!!
    INoise Source;

    float Compute(...) {
        float value = 0.0f;
        float frequency = InitialFrequency;
        float amplitude = 1.0f;
        // Classic FBM loop
        for ( int i=0; i < Octaves; i++ )
        {
            // Call the source abstract INoise
            float noiseValue = Source.Compute(pos * frequency);
            value += amplitude * noiseValue;
            frequency *= Lacunarity;
            amplitude *= Amplitude;
        }
        return value;
    } 
};


// A Pixel shader using the FBMNoise combined with PerlinNoise
float PixelShader1(PS_INPUT Input ) : SV_Target
{
    FBMNoise fbmNoise;
    PerlinNoise perlin;
    // This is not possible, interface variable members are not allowed
    fbmNoise.Source = perlin;
    return fbmNoise.Compute(...);
}

Unfortunately, HLSL doesn't permit the use of interfaces as member variables! This limitation is quite annoying, as it excludes a whole range of combinations, like aggregation, composition... making these function pointers useful only for a very limited set of cases...
I have tried to overcome this problem using an abstract class instead of an interface, as classes can be declared as member variables of classes... but, again, there is a huge limitation: the class variable is in fact acting as a final or const variable that cannot be changed, thus making it almost useless...
But I knew that HLSL permits lots of unusual constructions, and this is where closures are going to resolve this.

Hacking Closures in HLSL

So we know that interfaces can be used as function pointers, but their usage is limited as we cannot use any kind of composition. An interesting fact is that we can declare local variables in methods as being classes or interfaces... The trick is to use a quite uncommon feature of HLSL: it is possible to declare local classes inside a method, and they can access local parameters! Therefore, it is possible to achieve a kind of deferred composition/aggregation using this technique. Let's rewrite our noise functions using this new closure technique:

1. Declare an INoise interface that is able to compute the noise by using a next INoise implementation.

// It is possible to compile this code under ps_4_0 and ps_3_0

// Declare our INoise interface
interface INoise {
    // Here an interesting hack: We can declare a method that is returning a INoise 
    // interface. This method will be implemented by the pixel shaders. 
    INoise Next();
    
    // The compute method of a Noise
    float Compute(float2 pos);
};

2. Declare NoiseBase as an abstract implementation of INoise that implements both methods. If we had the abstract keyword in HLSL, we wouldn't have to implement the methods of this class.

// We are creating an abstract class from INoise in order
// to implement both methods
class NoiseBase : INoise {
    INoise Next() {
        // This code will never be used. It is only 
        // used to declare this class
        NoiseBase base;
        return base;
    }

    float Compute(float2 pos) {
        // This code will never be used. It is only 
        // used to declare this class
        return Next().Compute(pos);
    }
};

3. Use NoiseBase to implement the final INoise functions. If you look at AbsNoise, FbmNoise or MarbleNoise, they use the INoise::Next() method to get an instance of the INoise interface they rely on. This is where function pointers become extremely useful.

// PerlinNoise implem
class PerlinNoise : NoiseBase {
    float Compute(float2 pos) {
        // call a standard perlin_noise implemented as a simple external function
        return perlin_noise(pos);
    }
};

// AbsNoise implem
class AbsNoise : NoiseBase {
    float Compute(float2 pos) {
        // Note: We are using Next to access the next underlying function pointer
        return abs(Next().Compute(pos));
    }
};

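// FbmNoise implem
// InitialFrequency, Octaves, Lacunarity and Amplitude are assumed to be
// global constants declared elsewhere in the shader.
class FbmNoise : NoiseBase {
    float Compute(float2 pos) {
        float value = 0.0f;
        float amplitude = 1.0f;
        float frequency = InitialFrequency;
        for ( int i=0; i < Octaves; i++ )
        {
            // Sample the next noise function at the current frequency
            float noiseValue = Next().Compute(pos * frequency);
            value += amplitude * noiseValue;
            frequency *= Lacunarity;
            amplitude *= Amplitude;
        }
        return value;
    }
};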

// MarbleNoise implem
class MarbleNoise : NoiseBase {
    float Compute(float2 pos) {
        // INoise::Compute takes a single position argument
        return stripes(2 * Next().Compute(pos), 1.6f);
    }

    static float stripes(float x, float f) {
        float t = .5 + .5 * sin(f * 2*PI * x);
        return t * t - .5;
    }
};

4. Implement the pixel shaders with the closure mechanism. We declare local classes that override the INoise::Next() method in order to chain INoise function pointers together.

// Fbm -> PerlinNoise
float FbmPerlinNoise2DPS( float4 pos : SV_POSITION, float2 texPos : TEXCOORD0 )
 : SV_Target
{
    // Look! We are declaring a local class
    class Noise1 : PerlinNoise {} noise1;
    // and this local class can access local variables!
    // For example, Noise2 can access previous noise1 variable.
    class Noise2 : FbmNoise { INoise Next() { return noise1; } } noise2;

    // Allowing us to cascade the calls and making a kind of deferred composition.
    return noise2.Compute(texPos);
}

// Fbm -> Abs -> PerlinNoise
float FbmAbsPerlinNoise2DPS( float4 pos : SV_POSITION, float2 texPos : TEXCOORD0 )
 : SV_Target
{
    class Noise1 : PerlinNoise {} noise1;
    class Noise2 : AbsNoise { INoise Next() { return noise1; } } noise2;
    class Noise3 : FbmNoise { INoise Next() { return noise2; } } noise3;

    // FbmNoise is calling indirectly AbsNoise that will call PerlinNoise.
    return noise3.Compute(texPos);
}

// Marble -> Fbm -> Abs -> PerlinNoise
float MarbleFbmAbsPerlinNoise2DPS( float4 pos : SV_POSITION, float2 texPos : TEXCOORD0 )
 : SV_Target
{
    class Noise1 : PerlinNoise {} noise1;
    class Noise2 : AbsNoise { INoise Next() { return noise1; } } noise2;
    class Noise3 : FbmNoise { INoise Next() { return noise2; } } noise3;
    class Noise4 : MarbleNoise { INoise Next() { return noise3; } } noise4;

    // MarbleNoise is calling FbmNoise that is calling indirectly AbsNoise 
    // that will call PerlinNoise.
    return noise4.Compute(texPos);
}


// Fbm -> Marble -> Abs -> PerlinNoise
float FbmMarbleAbsPerlinNoise2DPS( float4 pos : SV_POSITION, float2 texPos : TEXCOORD0 )
 : SV_Target
{
    class Noise1 : PerlinNoise {} noise1;
    class Noise2 : AbsNoise { INoise Next() { return noise1; } } noise2;
    class Noise3 : MarbleNoise { INoise Next() { return noise2; } } noise3;
    class Noise4 : FbmNoise { INoise Next() { return noise3; } } noise4;

    // FbmNoise is calling MarbleNoise that is calling indirectly AbsNoise 
    // that will call PerlinNoise.
    return noise4.Compute(texPos);
}

Et voila! As you can see, we are able to declare local classes from a pixel shader that act as closures. It is even possible, for example, to declare local classes that have specific code in their Compute() methods.
Behind the scenes, when chaining the INoise::Next() methods, the fxc HLSL compiler sees all those classes as "INoise*".
It is then possible to perform a fbm(marble(abs(perlin_noise()))) as well as a marble(fbm(abs(perlin_noise()))).

In the end, it is effectively possible to implement closures in HLSL that can be used in SM4.0 as well as SM3.0!
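For readers more used to general-purpose languages, the chaining above behaves like ordinary function composition. The following Python sketch is only an analogy and not HLSL: `fake_perlin` is a hypothetical stand-in for `perlin_noise`, and the default values for `octaves`, `lacunarity` and `gain` are made-up constants chosen for illustration.

```python
import math

def fake_perlin(x, y):
    # Hypothetical stand-in for perlin_noise: any smooth function works here.
    return math.sin(x * 1.7) * math.cos(y * 2.3)

def abs_noise(next_noise):
    # Counterpart of AbsNoise: wraps the next noise function in abs().
    def compute(x, y):
        return abs(next_noise(x, y))
    return compute

def fbm_noise(next_noise, octaves=4, lacunarity=2.0, gain=0.5):
    # Counterpart of FbmNoise: sums several octaves of the next noise function.
    def compute(x, y):
        value, amplitude, frequency = 0.0, 1.0, 1.0
        for _ in range(octaves):
            value += amplitude * next_noise(x * frequency, y * frequency)
            frequency *= lacunarity
            amplitude *= gain
        return value
    return compute

def marble_noise(next_noise):
    # Counterpart of MarbleNoise: applies a stripe pattern to the next noise.
    def stripes(x, f):
        t = 0.5 + 0.5 * math.sin(f * 2.0 * math.pi * x)
        return t * t - 0.5
    def compute(x, y):
        return stripes(2.0 * next_noise(x, y), 1.6)
    return compute

# The two chains mentioned in the text, built by capturing the "next" function.
fbm_marble = fbm_noise(marble_noise(abs_noise(fake_perlin)))
marble_fbm = marble_noise(fbm_noise(abs_noise(fake_perlin)))
```

Each factory captures `next_noise` the same way a local HLSL class captures the previous local variable through its `Next()` override.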

Improving closures chaining

From the previous example, we can extend the concept by:
1. Adding static local constructors to each noise function:

// PerlinNoise implem
class PerlinNoise : NoiseBase {
    float Compute(float2 pos) {
        // call a standard perlin_noise implemented as a simple external function
        return perlin_noise(pos);
    }
    // Add local "constructor"
    static INoise New() {
        PerlinNoise noise;
        return noise;
    }
};

// AbsNoise implem
class AbsNoise : NoiseBase {
    float Compute(float2 pos) {
        // Note: We are using Next to access the next underlying function pointer
        return abs(Next().Compute(pos));
    }
    // Add local constructor and chain with From INoise
    static INoise New(INoise from) {
        class LocalNoise : AbsNoise { INoise Next() { return from; } } noise;
        return noise;
    }
};

// Add the same constructors to FbmNoise and MarbleNoise.
// ....

2. And then we can rewrite the Pixel shader functions to chain operators in a shorter form:

// Fbm -> Marble -> Abs -> PerlinNoise
float FbmMarbleAbsPerlinNoise2DPS( float4 pos : SV_POSITION, float2 texPos : TEXCOORD0 )
 : SV_Target
{
    // FbmNoise is calling MarbleNoise that is calling indirectly AbsNoise 
    // that will call PerlinNoise.
    return FbmNoise::New(MarbleNoise::New(AbsNoise::New(PerlinNoise::New()))).Compute(texPos);
}

This allows a syntax that is even more concise and modular!

Further Considerations

This is a very exciting technique that could open lots of abstraction opportunities while developing in HLSL. However, in order to use this technique, there are a couple of advantages and limitations to take into account:

An interface cannot inherit from another interface (that would be really interesting)
An interface can only have method members.
A class can inherit from another class and from several interfaces.
Unlike in C/C++, we cannot forward-declare an interface, but we can use a declaration while it is being declared (see the example of the method INoise::Next, returning an INoise).
The compiler has a limitation against reusing an implementation in a call chain and will complain about a recursive call (even if there is no recursive call at all). For example, it is not possible to use the same type of class closure twice in a call chain, meaning that a chain like Marble => FBM => Marble => Abs => Perlin is not possible. The fxc compiler would complain about the second "Marble" as it would see it as a kind of recursive call. In order to reuse a function, we need to duplicate it, which is probably the only annoying point here.
The compiled asm output generated from closures is exactly the same as with standard inlined methods.
Before settling on local class closures, I tried several techniques that sometimes crashed the fxc compiler.
As this is a way of hacking HLSL, there is no guarantee that it will be supported in the future. But at least, since it works for SM5.0, SM4.0 and SM3.0, we can expect to be safe for a while!
Also, compilation under the vs_3_0/ps_3_0 profiles seems to take more time; I am not sure whether this is due to the language construction or is the regular behavior of the 3.0 profiles.

Let me know if you are able to use this technique and if you find other interesting constructions or problems. It would be very interesting to dig a little more into the opportunities it opens. Lastly, I did a quick Google search about this kind of technique and didn't find anything, but it could already have been used by someone else. So this is only hypothetically a new discovery, but I enjoyed discovering it a lot!

Direct3D11 multithreading micro-benchmark in C# with SharpDX

2011-11-18T00:46:00.001+11:00

Multi-threading is an important feature added to Direct3D11 two years ago and has been increasingly used in recent game engines in order to achieve better performance on PC. You can have a look at "DirectX 11 Rendering in Battlefield 3" from Johan Andersson/DICE, which gives great insight into how it was effectively used in practice in their game engine. Usage of the Direct3D11 multithreading API is pretty straightforward, and while we are also using it successfully at work in our R&D 3D engine, I had never taken the time to sit down with this feature and check how to get the best out of it.

I recently came across a question on the gamedev forum about "[DX11] Command Lists on a Single Threaded Renderer": If command lists are an efficient way to store replayable drawing commands, would it be efficient to use them even in a single threaded scenario where lots of drawing commands are repeatable?

In order to verify this, among other things, I wrote a simple micro-benchmark using C#/SharpDX. While the results are mostly what one would expect, there are a couple of gotchas that deserve a more in-depth look...

Direct3D11 Multi-threading : The basics

I assume that general multi-threading concepts and advantages are already understood, in order to focus on the Direct3D11 multi-threading API.

There is already a nice "Introduction to Multithreading in Direct3D11" on MSDN that is worth reading if you are already a little bit familiar with the Direct3D11 API.

In Direct3D10, we had only a single class, ID3D10Device, to perform object/resource creation and draw calls. The API was not thread safe, but it was possible to emulate some kind of deferred rendering by using mutexes and simplified command buffers to access the device safely.

In Direct3D11, preparation of the draw calls is now parallelizable, while object/resource creation is thread safe. The API is now split between:

ID3D11Device, which is responsible for creating objects/resources/shaders and device contexts.
ID3D11DeviceContext, which holds all the commands to set up the shader pipeline and perform all draw calls (including constant buffer updates, setup of shader resource views, samplers, blend state, etc.)

When a Direct3D11 device is created, it provides a default ID3D11DeviceContext called an immediate context that is effectively used for immediate rendering. There is only one immediate context available per device.

In order to use deferred rendering, we need to create new ID3D11DeviceContext instances called deferred contexts, one for each thread responsible for preparing a set of draw calls.

The sequence of multithreaded draw calls is then executed like this:

Each secondary thread is responsible for preparing draw calls into an ID3D11CommandList that is then executed by the immediate context (in order to push the commands to the driver).

The simplified version of the code to write is fairly easy:

// Thread-1
context[threadId1].InputAssembler.InputLayout = layout1;
context[threadId1].InputAssembler.PrimitiveTopology = PrimitiveTopology.TriangleList;
context[threadId1].InputAssembler.SetVertexBuffers(0, new VertexBufferBinding(vertices1, Utilities.SizeOf<Vector4>() * 2, 0));
[...]
context[threadId1].Draw(...);
// Close the recording on this deferred context and keep its command list
commandLists[threadId1] = context[threadId1].FinishCommandList(false);
[...]
// Thread-n
context[threadIdn].InputAssembler.InputLayout = layoutn;
context[threadIdn].InputAssembler.PrimitiveTopology = PrimitiveTopology.TriangleList;
context[threadIdn].InputAssembler.SetVertexBuffers(0, new VertexBufferBinding(verticesn, Utilities.SizeOf<Vector4>() * 2, 0));
[...]
context[threadIdn].Draw(...);
// Close the recording on this deferred context and keep its command list
commandLists[threadIdn] = context[threadIdn].FinishCommandList(false);

// Rendering Thread
for (int i = 0; i < threadCount; i++)
{
 var commandList = commandLists[i];
 // Execute the deferred command list on the immediate context
 immediateContext.ExecuteCommandList(commandList, false);
 commandList.Dispose();
}
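The record-then-replay flow above can be mimicked without any graphics API. The Python sketch below is a conceptual simulation only: `record_commands` and `render_frame` are hypothetical names, and plain tuples stand in for Direct3D commands. It shows worker threads building command lists in parallel and a single thread replaying them in a fixed order.

```python
from concurrent.futures import ThreadPoolExecutor

def record_commands(thread_id, draw_count):
    # Plays the role of a deferred context: nothing is executed here,
    # commands are only recorded into a list (the "command list").
    commands = [("set_input_layout", thread_id)]
    for i in range(draw_count):
        commands.append(("draw", thread_id, i))
    return commands

def render_frame(thread_count, draws_per_thread):
    executed = []
    with ThreadPoolExecutor(max_workers=thread_count) as pool:
        # Recording happens in parallel, one command list per worker.
        # pool.map returns the results in the order of the inputs.
        command_lists = list(pool.map(
            lambda tid: record_commands(tid, draws_per_thread),
            range(thread_count)))
    # Playback happens on a single thread, in a fixed order, in the same
    # spirit as ExecuteCommandList on the immediate context.
    for command_list in command_lists:
        executed.extend(command_list)
    return executed

frame = render_frame(thread_count=3, draws_per_thread=2)
```

The point of the design is that only the cheap replay step is serialized; the potentially expensive preparation runs on the worker threads.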

The API provides several key advantages:

We can easily switch the code between an immediate context and a deferred context. Thus, using the multi-threading part of the Direct3D11 API doesn't hurt our code.
The API is supported on downlevel hardware (from Direct3D11 down to Direct3D9).
The underlying driver can take advantage of the call to FinishCommandList to build some native layout that will help the deferred ExecuteCommandList command run faster.

About the "native support from driver": it can be checked using CheckFeatureSupport (or directly in SharpDX using CheckThreadingSupport), but it seems that almost only NVIDIA supports this feature natively (and only quite recently, around this year). Neither my previous ATI 6850 nor my current 6900M supports it. Is this bad? We will see that the default Direct3D11 runtime performs just fine in this case, but doesn't provide any extra boost.

We will also see that there is an interesting issue with the usage of Map/Unmap or UpdateSubresource to update constant buffers: depending on which one is used in a multithreading scenario, performance can be hurt.

MultiCube, a Direct3D11 Multi-threading micro-benchmark

In order to stress-test multi-threading using Direct3D11, I have developed a simple application called MultiCube (available as part of the SharpDX samples: see Program.cs).

This application performs the following benchmark: it renders n x n cubes on the screen, each cube having its own rotation matrix. You can modify the number of cubes from 1 (1x1) to 65536 (256x256). The title bar includes some benchmark measurements (FPS / time per frame), and you can change the behavior of the application with the following keys:

F1: Switch between Immediate Test (no threading), Deferred Test (Threading), and Frozen-Deferred Test (execute a pre-prepared CommandList on the ImmediateContext)
F2: Switch between Map/Unmap mode and UpdateSubresource mode to update constant buffers.
F3: Burn the CPU on/off. This is where multithreading makes the difference, and we are going to analyse the results a little bit more. When this option is on, it simulates lots of CPU calculations on the deferred threads. If it is off, it will just batch the draw calls (which are simple, they are just cubes!)
Left-Right arrows: Decrease/Increase the number of cubes to display (default 64x64)
Down-Up arrows: Decrease/Increase the number of threads used (only for Deferred Test mode)

When the deferred mode is selected, each thread renders a set of rows in a batch. For example, if you have 100x100 cubes to render and 5 threads, each thread will draw 20x100 cubes.
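As a small illustration of this row-based split, here is a Python sketch. The function name `split_rows` is hypothetical, and the way leftover rows are handed to the first threads is an assumption for illustration; the MultiCube sample may distribute them differently.

```python
def split_rows(row_count, thread_count):
    # Returns one half-open (first_row, end_row) range per thread.
    base, extra = divmod(row_count, thread_count)
    ranges = []
    start = 0
    for t in range(thread_count):
        # The first 'extra' threads take one additional row (assumption).
        size = base + (1 if t < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges

rows_per_thread = split_rows(100, 5)
```

For 100 rows and 5 threads, every thread gets a block of 20 rows, which matches the 20x100 cubes per thread mentioned above.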

If your graphics driver doesn't natively support multithreading, you will see a "*" just after the Deferred mode.

You can download the application here. It is a single exe that doesn't need any kind of install (apart from the DirectX June 2010 runtime). Also, being able to pack this application into a single exe is a unique feature of SharpDX: static linking of a .NET exe with the SharpDX DLLs.

Results

I ran 2 types of tests:

Draw 65536 cubes with the Burn-Cpu option ON and OFF, and comparing Immediate and Deferred rendering (ranging from 1 thread to 6 threads).
Draw 1024 cubes switching between Map/Unmap and UpdateSubresource, and comparing the results between Immediate and Deferred rendering.

Two machines with the same Intel i7-2600K processor and 8 GB of RAM were used, one with an NVIDIA GTX 570 and the other with an ATI 6900M graphics card.

65536 Drawcalls - BurnCpu: On	Threads
Type	1	2	3	4	6
Nvidia-GTX 570 Deferred	232ms	130ms	98ms	92ms	82ms
Nvidia-GTX 570 Immediate	220ms	220ms	220ms	220ms	220ms
ATI 6900M Deferred	231ms	131ms	98ms	93ms	84ms
ATI 6900M Immediate	228ms	228ms	228ms	228ms	228ms

Fig2. 65536 draw calls with CPU intensive threads, comparison between Immediate and Deferred rendering

65536 Drawcalls - BurnCpu: Off	Threads
Type	1	2	3	4	6
Nvidia-GTX 570 Deferred	31ms	24ms	21ms	20ms	20ms
Nvidia-GTX 570 Immediate	19ms	19ms	19ms	19ms	19ms
ATI 6900M Deferred	32ms	28ms	28ms	28ms	28ms
ATI 6900M Immediate	28ms	28ms	28ms	28ms	28ms

Fig2. 65536 draw calls with CPU light threads, comparison between Immediate and Deferred rendering

And finally the Map/Unmap and UpdateSubresource test:

1024 Drawcalls - Type	Map	Update
Nvidia-GTX 570 Immediate - 1024	0.6ms	1.1ms
Nvidia-GTX 570 Deferred - 1024	0.92ms	7.32ms
ATI 6900M Immediate - 1024	0.6ms	0.6ms
ATI 6900M Deferred - 1024	0.6ms	0.6ms

Analysis

If we examine the results a little more carefully, there are a couple of interesting things to highlight:

Using multithreading and deferred context rendering is only relevant when the CPU is effectively used on each thread (that sounds obvious, but at least it is clear!). When we are not using the CPU, immediate rendering is in fact faster!
Multithreaded rendering in a CPU-intensive application can perform up to about 2.7x faster than the single-threaded version in these measurements (220ms down to 82ms with 6 threads on the GTX 570), provided that we have enough CPU cores to dispatch the rendering jobs.
The "native support from driver" of Direct3D11 multithreading doesn't seem to change much: comparing the NVIDIA graphics card, which supports it, with the AMD one, which doesn't, we don't see a huge difference.
Usage of UpdateSubresource on an NVIDIA card is about 8x slower than Map/Unmap in a multithreading situation (7.32ms vs 0.92ms) and hurts the performance of the application a lot: use Map/Unmap instead!

Of course, as usual, this is a synthetic micro-benchmark that should be taken with caution and cannot reflect every use case, so you need to run your own benchmark if you have to decide whether to use multithreaded rendering!
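The ratios behind these observations can be recomputed from the tables. The short Python sketch below only copies the measured frame times from the BurnCpu: On table and the GTX 570 deferred Map/Update row; the dictionary layout and the `speedups` helper are illustrative choices.

```python
# Frame times in milliseconds, copied from the "BurnCpu: On" table.
burn_on = {
    "Nvidia-GTX 570": {"immediate": 220,
                       "deferred": {1: 232, 2: 130, 3: 98, 4: 92, 6: 82}},
    "ATI 6900M": {"immediate": 228,
                  "deferred": {1: 231, 2: 131, 3: 98, 4: 93, 6: 84}},
}

def speedups(entry):
    # Speedup of deferred over immediate rendering, per thread count.
    return {threads: round(entry["immediate"] / ms, 2)
            for threads, ms in entry["deferred"].items()}

for card, entry in burn_on.items():
    print(card, speedups(entry))

# UpdateSubresource vs Map/Unmap on the deferred path (GTX 570, 1024 cubes).
update_penalty = round(7.32 / 0.92, 2)
print("UpdateSubresource / Map:", update_penalty)
```

With a single deferred thread the ratio is slightly below 1, which reflects the small overhead of recording and replaying a command list when there is no parallelism to gain.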

Finally, to respond to the original gamedev question, I provided a "Frozen Deferred" test in MultiCube to check whether replaying a pre-prepared CommandList is actually faster than issuing the same commands with an immediate context. It seems that it currently doesn't make any difference (but to be sure, I would have to run this benchmark on several different machine/CPU/graphics card/driver configurations).

Managed DirectX with Win8 Metro Style App

2011-10-02T21:19:00.002+11:00

Since my last post is quite old, it's time to give more insight into the latest features included in the recent release of SharpDX 2.0 beta. This release is a major step for SharpDX, as it includes lots of new APIs but also provides a preview of developing with Managed DirectX from a Windows 8 Metro style application. In the meantime, I have been pretty busy as I got a job at a gamedev company located in Japan that is developing a new rendering engine in C#... which uses SharpDX for its Windows rendering. It is extremely exciting to work on a project like this (and of course rewarding for all the investment made in SharpDX!).

While SharpDX is starting to cover almost the whole DirectX API as well as some new Windows Multimedia API (like WIC), I put some effort, just after the Microsoft Build conference, on providing a preview of using SharpDX from a Win8 Metro style application, which is a fantastic opportunity for a C#/.Net developer to use DirectX from both a Metro style application and Desktop application using the same API.

Finally, SharpDX is also going to get some new APIs and some interesting features in the coming months, especially in the performance domain, to reduce as much as possible the performance gap between a native C++ DirectX application and a managed C# DirectX application. That will deserve a post of its own. But let's start with what has been achieved in the latest 2.0 version!

SharpGen, a new C# code generation tool from C++

The core change behind the 2.0 version was the rewrite of the code generator used to generate C# interop code from DirectX/Windows SDK headers, a tool called SharpGen. When I released the first version of SharpDX, I used a handwritten custom C++ parser that was somewhat able to parse almost all DirectX headers, but it was of course a temporary workaround... At the end of December last year, I was evaluating two options to rewrite the parser stage:

Clang seemed to be a suitable solution for the future, but at that time, I was not sure that it would be able to parse all Windows headers flawlessly (while being able to grab all the Microsoft-specific SAL annotations from their headers in order to extract useful information used by the parser), and it was not really easy to handle (I would have needed to wrap the library into a managed component).
gccxml is an older project, but it is able to parse almost all Microsoft headers with a direct command line, and it is quite easy to patch the headers that are not working. gccxml generates an xml file, which is pretty easy to parse.

So I made the choice to go with gccxml. But this is fortunately not a huge design commitment, as it is quite easy in the current system to plug in another parser, because the C++ model used by SharpGen is independent from the parser.

The parser was one thing to rewrite in the old system. The other part was to use new data-driven config files for all the mapping rules (used to translate C++ objects to C#). The previous version was quickly developed using hard-coded rules written directly in the C# program. The new version uses simple xml config files to express all the mappings and dependencies. The good thing is that I didn't have to change much in the code generator, though lots of work was required to efficiently manage those config files and their dependencies. You can have a look at the result, for example with the SharpDX.DXGI mapping file. The new system has lots of features to handle all the cases that were found during the construction of these mapping rules. It is also able to efficiently regenerate only a subset of the files when only parts of the config files are changed and they don't affect other dependent projects (for example, SharpDX.Direct3D11 has dependencies on D3DCompiler, DXGI and the core SharpDX assembly; if any of those config files is changed, Direct3D11 has to be regenerated as well). All this hard work was done in order to have a system that is easier to maintain and to fix. The new generator is also able to handle all Windows header files, including the multimedia headers that are not part of the DirectX SDK (like WIC).

Thanks to this new tool, it is now possible to integrate into the generated SharpDX assemblies some APIs that are not part of DirectX but are highly related to the development of multimedia applications, and that's really good news for any developer looking for a unified managed API for all the Windows graphics and multimedia APIs.

I also plan to release SharpGen as an external tool so that you can use it in your own projects. It will allow a developer to easily generate interop mappings from C++ headers (for COM-oriented APIs).

Mapping all DirectX APIs

The next step was to write mapping rules and C# extensions for all the remaining APIs. The first version of SharpDX was providing managed APIs for DXGI, Direct3D10, Direct3D11, Direct2D, DirectWrite, DirectSound, XAudio2 but I had to work on older APIs that were more laborious to map like Direct3D9, DirectInput or minor APIs like X3DAudio, XACT3, RawInput, XInput.

I also decided to provide a managed API for WIC (Windows Imaging Component), which is closely tied to Direct2D.

So far, the result is great: except for Direct3D9, which still requires more work, all the DirectX APIs are now provided. SharpDX is also the only managed API to provide an implementation for all the callbacks used in Direct2D/DirectWrite, allowing full access to these APIs.

I also took the opportunity to rewrite the callback system (C++ calling back into C# objects, which are exposed to C++ as COM objects). The new system is more efficient and reliable.

Using SharpDX from a Win8 Metro Style Application

A new exciting feature that was really easy to provide is the ability to use SharpDX from a Win8 Metro style application. In the meantime, the 2.0 version also provides access to the upcoming DirectX 11.1 APIs (including DXGI 1.2, Direct2D 1.1, Direct3D11.1, the new D3DCompiler, the new XAudio2, etc.).

Two samples were ported: one from the Win8 DirectX tutorials that is only clearing the screen. The other is a direct port of SharpDX Direct3D11 MiniCube, a simple application that displays a 3D spinning cube on the screen. You can download the preview archive, compile and run it from a Win8 machine!

In order to be fully compatible with Win8 Metro applications and to be able to develop applications that can get certified for the Microsoft AppStore, some SharpDX internals are being rewritten to remove dependencies on legacy Win32 and .NET Framework APIs that are not supported by Metro style applications. The next version of SharpDX will provide assemblies that are fully compatible and certification-ready for Win8 Metro style applications.

The great thing is that it will be possible to target desktop applications and Win8 Metro style applications using the same managed DirectX API.

You may ask whether pure DirectX WinRT components could be generated using SharpDX technology. In short, yes, it is possible (as SharpGen could be modified to generate interop code for other languages), but I don't think this is a good opportunity for .NET developers, as the WinRT system will only be available to pure Win8 Metro style applications. Plus, WinRT has a significant performance cost when you want to access static properties/methods (which implies several method calls: QueryInterface, etc.).

Also, I'm not entirely convinced that the new WinRT COM interop projection in .NET is as fast as the custom interop used by SharpDX. I haven't yet had the time to write a full benchmark, but I found some unnecessary generated interop methods when using WinRT components from .NET. So when I have a chance to install Win8 on a real machine (and not in a VM), I will post a more detailed benchmark about the cost of WinRT calls from a managed application.

Next?

There are lots of exciting new features that will also be available in the next versions of SharpDX, but I can't talk about them yet. In short, SharpDX will get new API bindings for some Windows multimedia APIs (for example, someone suggested on Twitter that I provide the UIAnimation Manager API, which I just started today) and will provide some great performance benefits as well.

Also, I have seen a growing interest in SharpDX from XNA developers. It seems that the lack of communication about the future of XNA during and after the Build conference has caused lots of concern in the XNA community. I don't know if XNA is going to be abandoned by Microsoft; it is probably a bit too early to anticipate this. On the other hand, a couple of months ago I wrote a significant part of the graphics layer of XNA on top of SharpDX, using Direct3D11 internally, but I didn't release it, as it was not yet in a usable state. Once the final 2.0 version of SharpDX is released, I will have a look to see if this framework could be released as-is, in a state that would require some work from the SharpDX community.

Finally, I'm receiving some great feedback from developers using SharpDX in commercial projects, and that's really great! I would be glad to hear more from users, even if it can't be public: it is extremely interesting to understand how SharpDX is used and to improve the API experience as well.

Stay tuned!

Benchmarking C#/.Net Direct3D 11 APIs vs native C++

2011-03-15T00:11:00.004+11:00

[Update 2012/05/15: Note that the original code was fine-tuned to a particular config and may not give you the same results. I have rewritten this sample to give more accurate and predictable results. The comparison with XNA was also unfair and inaccurate: you should see something like x4 slower instead of x9. The latest SharpDX version, 2.1.0, is also now x1.35 slower than C++. An update of this article will follow on the new sharpdx.org website]
[Update 2014/06/17: Removed the XNA comparison, as it is neither fair nor relevant]

If you are working with a managed language like C# and you are concerned about performance, you probably know that, even if the Microsoft CLR JIT is quite efficient, it has a significant cost compared to a pure C++ implementation. If you don't know much about this cost, you have probably heard of an average cost for managed languages of around 15-20%. If you are really concerned about this, you know that, depending on the case, a calculation-intensive managed application is in reality more often around x2 or even x3 slower than its C++ counterpart. In this post, I'm going to present a micro-benchmark that measures the cost of calling the native Direct3D 11 API from a C# application, using various APIs: SharpDX, SlimDX and WindowsCodePack.

Why is this benchmark important? Well, if you intend, like me, to build some serious 3D games with a managed C# API (don't troll me on this! ;) ), you need to know exactly what it costs to intensively call a native Direct3D API from a managed language (mainly, the cost of the interop between the managed language and the native API). If your game is GPU bound, you are unlikely to see any difference here. But if you want to apply lots of effects, with various models, particles and materials, playing with several render targets and a heavy deferred rendering technique, you are likely to perform lots of draw calls to the Direct3D API. For a AAA game, those calls could be as high as 3000-7000 draw submissions in instancing scenarios (look at the great recent DICE publication "DirectX 11 Rendering in Battlefield 3" from Johan Andersson). If you are running at 60fps (or at a lower 30fps), you have just about 17ms (or 33ms) per frame to perform your whole rendering. In this short time range, draw calls can take a significant amount of time, and this is a main reason why multi-threaded command batching was introduced in DirectX11. We won't use such a technique here, as we want to evaluate raw calls.
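To make the frame budget concrete, here is a small Python sketch of the arithmetic. The helper names are hypothetical, and the per-call cost passed to `draw_call_share` is a made-up figure for illustration, not a measurement from this benchmark.

```python
def frame_budget_ms(fps):
    # Time available to render one frame, in milliseconds.
    return 1000.0 / fps

def draw_call_share(fps, draw_calls, cost_per_call_us):
    # Fraction of the frame budget consumed by draw-call submission alone.
    # cost_per_call_us is a hypothetical CPU cost per call, in microseconds.
    total_ms = draw_calls * cost_per_call_us / 1000.0
    return total_ms / frame_budget_ms(fps)

budget_60 = frame_budget_ms(60)
budget_30 = frame_budget_ms(30)
share = draw_call_share(fps=60, draw_calls=5000, cost_per_call_us=1.0)
```

With an assumed cost of 1 microsecond per call, 5000 calls would already consume about 30% of a 60fps frame, which is why the per-call interop overhead matters.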

As you are going to see, the results are pretty interesting for anyone who is concerned about performance and writes C# games (or even efficient tools for a 3D middleware).

The Managed (C#) to Native (C++) interop cost

When a managed application needs to call a native API, it needs to:

Marshal method/function arguments from the managed world to the unmanaged world.
Switch the CLR from managed execution to an unmanaged environment (change exception handling, stacktrace state, etc.).
Effectively call the native method.
Marshal output arguments and results back from the unmanaged world to the managed one.

To perform a native call from a managed language, there are currently 3 solutions:

Using P/Invoke, the default interop mechanism provided in C#, which is in charge of performing all the previous steps. But P/Invoke comes at a huge cost when you have to pass structures, arrays by value, strings, etc.
Using a C++/CLI assembly that performs hand-written marshaling to the native C++ methods. This is used by SlimDX, WindowsCodePack and XNA.
Using the SharpDX technique, which generates all the marshaling and interop at compile time in a structured and consistent way, using some CLR bytecode instructions that are missing from C# and are usually only available in C++/CLI.

The marshaling cost is in fact the most expensive one. Usually, directly calling a native function without performing any marshaling has a cost of about 10%, which is fine. But if you take a slightly more complex function, like ID3D11DeviceContext::OMSetRenderTargets, you can see that marshaling takes a significant amount of code:

/// <unmanaged>void ID3D11DeviceContext::OMSetRenderTargets([In] int NumViews,[In, Buffer, Optional] const ID3D11RenderTargetView** ppRenderTargetViews,[In, Optional] ID3D11DepthStencilView* pDepthStencilView)</unmanaged>
public void SetRenderTargets(int numViews, SharpDX.Direct3D11.RenderTargetView[] renderTargetViewsRef, SharpDX.Direct3D11.DepthStencilView depthStencilViewRef) {
    unsafe {
        IntPtr* renderTargetViewsRef_ = (IntPtr*)0;
        if ( renderTargetViewsRef != null ) {
            IntPtr* renderTargetViewsRef__ = stackalloc IntPtr[renderTargetViewsRef.Length];
            renderTargetViewsRef_ = renderTargetViewsRef__;
            for (int i = 0; i < renderTargetViewsRef.Length; i++)                        
                renderTargetViewsRef_[i] =  (renderTargetViewsRef[i] == null)? IntPtr.Zero : renderTargetViewsRef[i].NativePointer;
        }
        SharpDX.Direct3D11.LocalInterop.Callivoid(_nativePointer, numViews, renderTargetViewsRef_, (void*)((depthStencilViewRef == null)?IntPtr.Zero:depthStencilViewRef.NativePointer),((void**)(*(void**)_nativePointer))[33]);
    }
}

In the previous sample, there is no structure marshaling involved (which is even more costly than pure method argument marshaling), and as you can see, the marshaling code is pretty heavy: it has to handle null parameters, transform an array of managed DirectX interfaces into the corresponding array of native COM pointers...etc.

Fortunately, in SharpDX, unlike any other DirectX .NET API, this code has been generated to be consistent over the whole API, and was carefully designed to be quite efficient... but still, it obviously has a cost, and we need to know it!

Protocol used for this micro-benchmark

Writing a benchmark is error prone, often subject to caution and relatively "narrow minded". Of course, this benchmark is not perfect; I just hope that it doesn't contain any mistake that would skew the overall trend of the results!

In order for this test to be closer to real 3D application usage, I chose to perform a very basic test on a sequence of calls that are usually involved in common drawing scenarios. This test consists of drawing triangles using 10 successive effects (Vertex Shaders/Pixel Shaders), each with its own vertex buffer, setting the viewport and the render target to the backbuffer. This loop is then run thousands of times in order to get a reliable average.

The SharpDX main loop is coded like this:

var clock = new Stopwatch();
clock.Start();
for (int j = 0; j < (CommonBench.NbTests + 1); j++)
{
    for (int i = 0; i < CommonBench.NbEffects; i++)
    {
        context.InputAssembler.SetInputLayout(layout);
        context.InputAssembler.SetPrimitiveTopology(PrimitiveTopology.TriangleList);
        context.InputAssembler.SetVertexBuffers(0, vertexBufferBindings[i]);
        context.VertexShader.Set(vertexShaders[i]);
        context.Rasterizer.SetViewports(viewPort);
        context.PixelShader.Set(pixelShaders[i]);
        context.OutputMerger.SetTargets(renderView);
        context.ClearRenderTargetView(renderView, blackColor);
        context.Draw(3, 0);
    }
    if (j > 0 && (j % CommonBench.FlushLimit) == 0)
    {
        clock.Stop();
        Console.Write("{0} ({3}) - Time per pass {1:0.000000}ms - {2:000}%\r", programName, (double)clock.ElapsedMilliseconds / (j * CommonBench.NbEffects), j * 100 / (CommonBench.NbTests), arch);
        context.Flush();
        clock.Start();
    }
}

The Vertex Shaders/Pixel Shaders involved are basic (just passing a color between VS and PS, with no WorldProjectionTransform applied). The timer is stopped around context.Flush so that flushing commands to the GPU is not measured. The CommonBench.FlushLimit value was selected to avoid any stalls from the GPU.

I have ported this benchmark to:

C++, using raw native calls and the Direct3D11 API
SharpDX, using Direct3D11 running under Microsoft .NET CLR 4.0 and under Mono 2.10 (trying both with llvm on and off). SharpDX is the only managed API able to run under Mono.
SlimDX, using Direct3D11 running under Microsoft .NET CLR 4.0. SlimDX is "NGENed", meaning that it is compiled to native code when you install it.
WindowsCodePack 1.1, using Direct3D11 running under Microsoft .NET CLR 4.0

It has been tested on Windows 7 64-bit, with an i5-750 2.6GHz and an AMD HD6950 graphics card. All tests were done both in x86 and x64 mode, in order to measure the platform impact of the calling conventions. Tests were run 4 times for each API, taking the average of the 3 lowest results.

Results

You can see the raw results in the following table. Time is measured for the simple drawing sequence (inside the for(i) loop over nbEffects). Lower is better. The ratio on the right indicates how much slower the tested API is compared to the C++ one. For example, SharpDX in x86 mode is running 1.52 times slower than its pure C++ counterpart.

Direct3D11 Simple Bench	x86 (ms)	x64 (ms)	x86-ratio	x64-ratio
Native C++ (MSVC VS2010)	0.000386	0.000262	x1.00	x1.00
Managed SharpDX (1.3 MS .Net CLR)	0.000585	0.000607	x1.52	x2.32
Managed SlimDX (June 2010 - Ngen)	0.000945	0.000886	x2.45	x3.38
Managed SharpDX (1.3 Mono-2.10)	0.002404	0.001872	x6.23	x7.15
Managed Windows API CodePack 1.1	0.002551	0.003219	x6.61	x12.29

And the associated graphs comparison both for x86 and x64 platforms:

Results are pretty self-explanatory, although we can highlight some interesting facts:

Managed Direct3D API calls are much slower than native API calls, ranging from x1.52 to x12.29 depending on the API you are using.
SharpDX is providing the fastest Direct3D managed API, ranging only from x1.52 to x2.32 slower than C++, and roughly 1.5 times faster than the next fastest managed API (SlimDX).
All other Direct3D managed APIs are significantly slower, ranging from x2.45 to x12.29.
Running this benchmark with SharpDX and Mono 2.10 is x3 to x4 slower than SharpDX with the Microsoft JIT, and x6 to x7 slower than native C++ (!)

OK, so if you are a .NET programmer and are not aware of the performance penalty of using a managed language, you are probably surprised by these results, which could be... scary! We can put things in perspective, though: your 3D engine is unlikely to be CPU bound on draw calls, but 3000-7000 calls could lead to an impact of up to about 4ms per frame even with the fastest managed API, which is something we need to know when we design a game.

This test could also be extrapolated to other parts of a 3D engine, which will probably be slower by a factor of x2 compared to a brute force C++ engine. For a AAA game, this would of course be an unacceptable performance penalty, but if you are a small/independent studio, this cost is relatively low compared to the benefit of developing a game efficiently in C#, and in the end, that's a trade-off.

If you are using the SharpDX API, you can still run at a reasonable performance. And if you really want to circumvent this interop cost for chatty API scenarios, you can design your engine to call a native function that will batch calls to the Direct3D native API.
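As a rough sketch of that batching idea: the managed side fills a plain array of commands and crosses the managed/native boundary once per batch instead of once per call. Everything here is hypothetical (DrawCommand, NativeContext and SubmitBatch are names I made up), and NativeContext is only a stand-in that counts calls; a real version would forward to the Direct3D11 device context instead.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical plain-data command, trivially marshalable from C#.
struct DrawCommand {
    std::uint32_t effectIndex;
    std::uint32_t vertexCount;
    std::uint32_t startVertex;
};

// Stand-in for the native rendering context. It only counts draw calls here;
// a real implementation would call the Direct3D11 API.
struct NativeContext {
    std::size_t drawCalls = 0;
    void bindEffect(std::uint32_t /*effectIndex*/) {}
    void draw(std::uint32_t /*vertexCount*/, std::uint32_t /*startVertex*/) { ++drawCalls; }
};

// Single entry point called from managed code: one interop transition
// for `count` draw submissions.
extern "C" void SubmitBatch(NativeContext* context, const DrawCommand* commands, std::size_t count) {
    if (context == nullptr || commands == nullptr) return;
    for (std::size_t i = 0; i < count; ++i) {
        context->bindEffect(commands[i].effectIndex);
        context->draw(commands[i].vertexCount, commands[i].startVertex);
    }
}
```

The trade-off is that the engine has to describe its work as data up front, which is less flexible than calling the API directly.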

You can download this benchmark Sharp3DBench.7z.

Crinkler secrets, 4k intro executable compressor at its best

2010-12-29T04:41:00.027+11:00

(Edit 5 Jan 2011: New Compression results section and small crinkler x86 decompressor analysis)

If you are not familiar with 4k intros, you may wonder how things are organized at the executable level to achieve this kind of packing-performance. Probably the most important and essential aspect of 4k-64k intros is the compressor, and surprisingly, 4k intros have been well equipped for the past five years, as Crinkler is the best compressor developed so far for this category. It has been created by Blueberry (Loonies) and Mentor (tbc), two of the greatest demomakers around.

Last year, I started to learn a bit more about the compression technique used in Crinkler. It started from some pouet comments that intrigued me, like "crinkler needs several hundred megabytes to compress/decompress a 4k intro" (wow) or "when you want to compress an executable, it can take hours, depending on the compressor parameters"... I also observed bad compression results while trying to convert some parts of C++ code to asm code using crinkler... With this silly question in mind, I realized that in order to achieve a better compression ratio, you need code that is compression friendly but not necessarily smaller. In other words, the smallest asm code is not always the best candidate for better compression under crinkler... so right, I needed to understand how crinkler was working in order to write crinkler-friendly code...

I just had a basic knowledge of compression; the last book I bought about it was probably more than 15 years ago, to make a presentation about jpeg compression for a physics course (that was a way to talk about computer related things in a non-computer course!)... I remember that I didn't go further in the book, and stopped just before arithmetic encoding. Too bad, that's exactly one part of crinkler's compression technique, and it has been widely used for the past few years (and studied for the past 40 years!), especially in video codecs like H.264!

So wow, it took me a substantial amount of time to jump back on the compression train and to read all those complicated statistical articles to understand how things work... but it was worth it! At the same time, I spent a bit of my time dissecting crinkler's decompressor, extracting the decompressor code in order to comment it and to compare its implementation with my own little tests in this field... I had a great time doing this, although, in the end, I found that whatever I could do, under 4k, Crinkler is probably the best compressor ever.

You will find here an attempt to explain a little bit more what's behind Crinkler. I'm far from being a compression expert, so if you are familiar with context modeling, this post may sound a bit light, but I'm sure it could be of some interest for people like me, who are discovering things like this and want to understand how they make 4k intros possible!

Crinkler main principles

If you want a bit more information, you should have a look at the "manual.txt" file in the crinkler archive. You will find there lots of valuable information, ranging from why this project was created to what kind of options you can set up for crinkler. There is also an old but still accurate and worthwhile powerpoint presentation from the authors themselves that is available here.

First of all, you will find that crinkler is not strictly speaking an executable compressor but rather an integrated linker-compressor. In fact, in the intro dev tool chain, it is used as part of the build process, in place of your traditional linker, with the added ability to compress its output. Why is crinkler better suited at this place? Most notably because at the linker level, crinkler has access to portions of your code and your data, and is able to move them around in order to achieve better compression. I'm not completely sure about this choice, though: this could also be implemented as a standard exe compressor, relying on the relocation tables in the PE sections of the executable and a good disassembler like beaengine in order to move the code around and update references... So, crinkler, cr-linker, compressor-linker, is a linker with an integrated compressor.

Secondly, crinkler is using a compression method that is far more aggressive and efficient than old dictionary-based LZ methods: it's called context modeling coupled with an arithmetic coder. As mentioned in the crinkler manual, the best place I found to learn about this was Matt Mahoney's resource website. This is definitely the place to start when you want to play with context modeling, as there is lots of source code, including previous versions of the PAQ program, from which you can learn gradually how to build such a compressor (particularly in the earlier versions of the program, when the design was still simple to handle). Building a context-modeling based compressor/decompressor is within reach of almost any developer, but one of the strengths of crinkler is its decompressor size: around 210-220 bytes, which probably makes it the smallest and most efficient context-modeling decompressor in the world. We will also see that crinkler made one of the simplest choices for a context-modeling compressor, using a semi-static model in order to achieve better compression for 4k of data, resulting in less complex decompressor code as well.

Lastly, crinkler is optimizing the usage of the exe-PE file (the Windows Portable Executable format, the binary format of a Windows executable file; the official description is available here). It does so mostly by removing the standard import table and dll loading in favor of a custom loader that exploits internal Windows structures, and by storing function hashes in the header of the PE file to recover dll functions.

Compression method

Arithmetic coding

The whole compression problem in crinkler can be summarized like this: what is the probability that the next bit to compress/decompress is a 1? The better the probability (meaning the better it matches the actual bit), the better the compression ratio. Hence, Crinkler needs to be a little bit psychic?!

First of all, you probably wonder why probability is important here. This is mainly due to one compression technique called arithmetic coding. I won't go into the details here and encourage the reader to read the wikipedia article and related links. The main principle of arithmetic coding is its ability to encode into a single number a set of symbols for which you know the probability of occurrence. The higher the probability of a known symbol, the fewer bits are required to encode it.

At the bit level, things are getting even simpler, since the symbols are only 1 or 0. So if you can provide a probability for the next bit (even if this probability is completely wrong), you are able to encode it through an arithmetic coder.
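To put a number on "the higher the probability, the fewer the bits": an ideal arithmetic coder spends about -log2(p) bits on a symbol it predicted with probability p. The small helper below (bitCost is my own name) is just this standard information-theory formula, not crinkler code.

```cpp
#include <cmath>

// Ideal cost, in bits, of coding `bit` when the model predicted a 1 with
// probability p1 (0 < p1 < 1). A good prediction costs almost nothing,
// a bad one costs several bits.
double bitCost(int bit, double p1) {
    const double p = bit ? p1 : 1.0 - p1;
    return -std::log2(p);
}
```

With p1 = 0.99, a 1 costs about 0.015 bits while a 0 costs about 6.6 bits; with p1 = 0.5, either bit costs exactly 1 bit, which means no compression at all.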

A simple binary arithmetic coder interface could look like this:

/// Simple ArithmeticCoder interface
class ArithmeticCoder {

   /// Decode a bit for a given probability.
   /// Decode returns the decoded bit 1 or 0
   int Decode(Bitstream inputStream, double probabilityForNextBit);

   /// Encode a bit (nextBit) with a given probability
   void Encode(Bitstream outputStream, int nextBit, double probabilityForNextBit);
}

And a simple usage of this ArithmeticCoder could look like this:

// Initialize variables
Bitstream inputCompressedStream = ...;
Bitstream outputStream = ...;
ArithmeticCoder coder;
Context context = ...;

// Simple decoder implem using an arithmetic coder
for(int i = 0; i < numberOfBitsToDecode; i++) { 
    // Make use of our psychic, alias the Context class
    double nextProbability = context.ComputeProbability(); 

    // Decode the next bit from the compressed stream, based on this 
    // probability 
    int nextBit = coder.Decode( inputCompressedStream, nextProbability); 

    // Update the psychic and tell it how wrong or right it was!
    context.UpdateModel( nextBit, nextProbability); 

    // Output the decoded bit 
    outputStream.Write(nextBit); 
}

So a binary arithmetic coder is able to compress a stream of bits, as long as you are able to tell it the probability of the next bit in the stream. Its usage is fairly simple, although implementations are often really tricky and sometimes quite obscure (a real arithmetic coder has to face lots of small problems: renormalization, underflow, overflow...etc.).

Working at the bit level here wouldn't have been possible 20 years ago, as it requires a tremendous amount of CPU (and memory for the psychic-context) to calculate/encode a single bit, but with today's computing power, it's less of a problem... Lots of implementations work at the byte level for better performance; some of them work at the bit level while still batching the decoding/encoding results at the byte level. Crinkler doesn't care about this and works at the bit level, implementing the arithmetic decoder in less than 20 x86 ASM instructions.

The C++ pseudo-code for an arithmetic decoder is like this:

int ArithmeticCoder::Decode(Bitstream inputStream, double nextProbability) {
    int output = 0; // the decoded symbol

    // renormalization
    while (range < 0x80000000) {
        range <<= 1; 
        value <<= 1;
        value += inputStream.GetNextBit();
    }

    unsigned int subRange = (range * nextProbability);
    range = range - subRange;
    if (value >= range) { // we have the symbol 1
        value = value - range;
        range = subRange;
        output++;     // output = 1
    }

    return output;
}

This is almost exactly what is used in crinkler, but it is done in only 18 asm instructions! The crinkler arithmetic coder uses a 33-bit precision. The decoder only needs to handle renormalization up to the 0x80000000 limit, while the encoder needs to work on 64 bits to handle the 33-bit precision. Working at this precision is much more convenient for the decoder, as it can easily detect renormalization (as a signed 32-bit integer, 0x80000000 is in fact a negative number, so the loop could have been formulated like while (range >= 0), and this is how it is done in asm).

So the arithmetic coder is the basic component used in crinkler. You will find plenty of arithmetic coder examples on the Internet. Even if you don't fully understand the theory behind them, you can use them quite easily. I found for example an interesting project called flavor, which provides a tool to produce arithmetic coder code based on a formal description (for example, a 32-bit precision arithmetic coder description in flavor), pretty handy to understand how things are translated from different coder behaviors.
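To show that such a coder can indeed be used without much theory, here is a small round-trip sketch. Be careful: this is NOT crinkler's coder. It is a carryless 32-bit binary coder in the style of the PAQ/lpaq family, chosen because it avoids carry propagation and fits in a few lines. The probability is given as a 12-bit integer p1 (0..4095) that the bit is a 1, and, unlike the decoder above, a 1 takes the lower part of the range. Class and method names are mine.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Encoder: narrows [low_, high_] for each bit and emits the leading bytes
// as soon as both bounds agree on them.
class BitEncoder {
public:
    void encode(int bit, std::uint32_t p1) {
        const std::uint32_t mid = low_ + ((high_ - low_) >> 12) * p1;
        if (bit) high_ = mid; else low_ = mid + 1;
        while (((low_ ^ high_) & 0xFF000000u) == 0) {
            out_.push_back(static_cast<std::uint8_t>(high_ >> 24));
            low_ <<= 8;
            high_ = (high_ << 8) | 0xFFu;
        }
    }
    // Flush the 4 bytes of low_ so that the decoder can finish.
    std::vector<std::uint8_t> finish() {
        for (int i = 0; i < 4; ++i) {
            out_.push_back(static_cast<std::uint8_t>(low_ >> 24));
            low_ <<= 8;
        }
        return out_;
    }
private:
    std::uint32_t low_ = 0, high_ = 0xFFFFFFFFu;
    std::vector<std::uint8_t> out_;
};

// Decoder: mirrors the encoder, using the same probabilities in the same order.
class BitDecoder {
public:
    explicit BitDecoder(const std::vector<std::uint8_t>& data) : data_(data) {
        for (int i = 0; i < 4; ++i) code_ = (code_ << 8) | nextByte();
    }
    int decode(std::uint32_t p1) {
        const std::uint32_t mid = low_ + ((high_ - low_) >> 12) * p1;
        const int bit = code_ <= mid ? 1 : 0;
        if (bit) high_ = mid; else low_ = mid + 1;
        while (((low_ ^ high_) & 0xFF000000u) == 0) {
            low_ <<= 8;
            high_ = (high_ << 8) | 0xFFu;
            code_ = (code_ << 8) | nextByte();
        }
        return bit;
    }
private:
    std::uint32_t nextByte() { return pos_ < data_.size() ? data_[pos_++] : 0u; }
    std::vector<std::uint8_t> data_;
    std::size_t pos_ = 0;
    std::uint32_t low_ = 0, high_ = 0xFFFFFFFFu, code_ = 0;
};
```

Feeding it a stream made mostly of 1s with p1 = 4000 (about 98%) shrinks 1000 bits to a few dozen bytes, and the decoder recovers every bit as long as it is given exactly the same probabilities as the encoder.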

But, ok, the real brain here is not the arithmetic coder... but the psychic-context (the Context class above), which is responsible for providing a probability and updating its model based on the previous expectation. This is where a compressor makes the difference.

Context modeling - Context mixing

This is one great point about using an arithmetic coder: it can be decoupled from the component responsible for providing the probability of the next symbol. This component is called a context modeler.

What is the context? It is whatever data can help your context-modeler to evaluate the probability for the next symbol to occur. Thus, the most obvious data for a compressor-decompressor is to use previous decoded data to update its internal probability table.

Suppose you have the following sequence of 8 bytes, 0x7FFFFFFF,0xFFFFFFFF, that is already decoded. What will be the next bit? It is almost certainly a 1, and you could bet on it with a probability as high as 98%.

So this is not a surprise that using history of data is the key point for the context modeler to predict next bit (and well, we have to admit that our computer-psychic is not as good as he claims, as he needs to know the past to predict the future!).

Now that we know that we need historic data to produce a probability for the next bit, how is crinkler using it? Crinkler is in fact maintaining a table of probabilities, keyed on up to 8 previous bytes + the current bits already read before the next bit. In the context-modeling jargon, the number of previous bytes used is often called the order (before context modeling, there were techniques like PPM, for Prediction by Partial Matching, and DMC, for Dynamic Markov Compression). But crinkler is not only using the last x bytes (up to 8); it is using a sparse mode (as it is called in PAQ compressors): any combination of the last 8 bytes + the current bits already read. Crinkler calls this a model, and it is stored in a single byte:

The 0x00 model says that It doesn't use any previous bytes other than the current bits being read.
The 0x80 model says that it is using the previous byte + the current bits being read.
The 0x81 model says that it is using the previous byte and the -8th byte + the current bits being read.
The 0xFF model says that all 8 previous bytes are used
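The model byte can be read as a mask over the history. Here is a minimal sketch of that reading, assuming (consistently with the 0x80, 0x81 and 0xC0 examples in this post) that bit 7 selects the previous byte and bit 0 the byte 8 positions back. The function name is mine, and treating bytes before the start of the data as 0 is my own assumption.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Returns the history bytes selected by `model` when we are about to decode
// the byte at index `pos`. Bit 7 of the mask is offset -1, bit 0 is offset -8.
std::vector<std::uint8_t> selectContext(const std::vector<std::uint8_t>& history,
                                        std::size_t pos, std::uint8_t model) {
    std::vector<std::uint8_t> context;
    for (std::size_t offset = 1; offset <= 8; ++offset) {
        const bool used = ((model >> (8 - offset)) & 1) != 0;
        if (!used) continue;
        // Assumption: positions before the start of the data read as 0.
        context.push_back(pos >= offset ? history[pos - offset] : std::uint8_t{0});
    }
    return context;
}
```

With the history 01 02 03 04 05 06 07 08 and pos = 8, the model 0x80 selects {08}, 0xC0 selects {08, 07} and 0x81 selects {08, 01}. The bits already read in the current byte are part of the context too, but are left out of this sketch.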

You probably don't see yet how this is used. We are going to take a simple case here: Use the previous byte to predict the next bit (called the model 0x80).

Suppose the following sequence of data:


0xFF, 0x80, 0xFF, 0x85, 0xFF, 0x88, 0xFF, ???nextBit???
         (0)         (1)         (2)    (3) | => decoder position

At position 0, we know that 0xFF is followed by bit 1 (0x80 <=> 10000000b). So n0 = 0, n1 = 1 (n0 denotes the number of 0 that follows 0xFF, n1 denotes the number of 1 that usually follows 0xFF)
At position 1, we know that 0xFF is still followed by bit 1: n0 = 0, n1 = 2
At position 2, n0 = 0, n1 = 3
At position 3, we have n0 = 0, n1 = 3, making the probability for one p(1) = (n1 + eps) / (n0+eps + n1+eps), with eps for epsilon; let's take 0.01. We have p(1) = (3+0.01)/(0+0.01 + 3+0.01) = 99.67%

So we have a probability of 99.67% at position (3) that the next bit is a 1.

The principle here is simple: for each model and each historic value, we associate n0 and n1, the number of times a bit 0 (n0) or a bit 1 (n1) was found. Updating those n0/n1 counters needs to be done carefully: a naive approach would be to increment the corresponding counter each time a training bit is found... but recent values are more likely to be relevant than older ones... Matt Mahoney explained this in "The PAQ1 Data Compression Program" (2002), and describes how to efficiently update those counters for a non-stationary source of data:

If the training bit is y (0 or 1) then increment ny (n0 or n1).
If n(1-y) > 2, then set n(1-y) = n(1-y) / 2 + 1 (rounding down if odd).

Suppose for example that n0 = 3 and n1 = 4 and we have a new bit 1. Then n0 will be = n0/2 + 1 = 3/2+1=2 and n1 = n1 + 1 = 5
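Written as code, the two rules above fit in a handful of lines. This is a direct transcription of the rule as quoted in this post, with names of my own (BitCounter, update); integer division does the "rounding down".

```cpp
// Pair of counters attached to one context.
struct BitCounter {
    int n0 = 0;
    int n1 = 0;
};

// Non-stationary update: count the observed bit, and shrink the opposite
// counter when it is larger than 2, so that old evidence fades away.
void update(BitCounter& counter, int bit) {
    int& seen = bit ? counter.n1 : counter.n0;
    int& other = bit ? counter.n0 : counter.n1;
    ++seen;
    if (other > 2) other = other / 2 + 1;
}
```

Starting from n0 = 3, n1 = 4, update(counter, 1) leaves n0 = 2 and n1 = 5, as in the example above.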

Now, we know how to produce a single probability for a single model... but working with a single model (for example, only the previous byte) wouldn't be enough to evaluate the next bit correctly. Instead, we need a way to combine different models (different selections of historic data). This is called context-mixing, and this is the real power of context modeling: whatever your method to collect and calculate a probability, you can, at some point, mix several estimators to calculate a single probability.

There are several ways to mix those probabilities. In the pure context-modeling jargon, the model is the way you mix probabilities and for each model, you have a weight :

static: you determine the weights whatever the data are.
semi-static: you perform a 1st pass over the data to compress to determine the weights for each model, and then a 2nd pass with the best weights
adaptive: weights are updated dynamically as new bits are discovered.

Crinkler is using a semi-static context-mixing but is somewhat also "semi-adaptive", because it is using different weights for the code and for the data of your exe, as they have a different binary layout.

So how is this mixed? Crinkler needs to determine the best context-models (the combinations of historic data) that it will use, and assign a weight to each of those contexts. The weight is then used to calculate the final probability.

For each selected historic model (i) with an associated model weight wi, and ni0/ni1 bit counters, the final probability p(1) is calculated like this :

p(1) = Sum(  wi * ni1 / (ni0 + ni1))  / Sum ( wi )

This is what context.ComputeProbability() computes in the code above, and this is exactly what crinkler is doing.

In the end, crinkler is selecting a list of models for each type of section in your exe: a set of models for the code section, a set of models for the data section.

How many models is crinkler selecting? It depends on your data. For example, for the ergon intro, crinkler is selecting the following models:

For the code section:
           0    1    2    3    4    5    6    7    8    9   10   11   12   13 
Model  {0x00,0x20,0x60,0x40,0x80,0x90,0x58,0x4a,0xc0,0xa8,0xa2,0xc5,0x9e,0xed,}
Weight {   0,   0,   0,   1,   2,   2,   2,   2,   3,   3,   3,   4,   6,   6,}

For the data section:
           0    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16   17   18   19 
Model  {0x40,0x60,0x44,0x22,0x08,0x84,0x07,0x00,0xa0,0x80,0x98,0x54,0xc0,0xe0,0x91,0xba,0xf0,0xad,0xc3,0xcd,}
Weight {   0,   0,   0,   0,   0,   0,   0,   1,   1,   2,   2,   2,   3,   3,   3,   4,   4,   4,   4,   5,}

(note that in crinkler, the final weight used to multiply n1/(n0+n1) is 2^wi, and not wi itself).

Wow, does it mean that crinkler needs to store all this data in your exe, (14 bytes + 20 bytes) * 2 = 68 bytes? Well, the crinkler authors are smarter than this! In fact, the models are stored, but the weights are only stored in a single int (32 bits for each section). Yep, a single int to store those weights? Indeed: if you look at those weights, they are increasing, and sometimes they are equal... So they found a clever way to store a compact representation of those weights in a 32-bit form. Starting with a weight of 1, the 32-bit value is shifted by one bit to the left for each model: if the bit is 0, then currentWeight doesn't change; if the bit is 1, then currentWeight is incremented by 1 (in this pseudo-code, the shift is done to the right):

int currentWeight = 1;
int compactWeight = ....;
foreach (model in models) {
  if ( compactWeight & 1 )
    currentWeight++;
  compactWeight = compactWeight >> 1;

  //  ... use currentWeight for the current model
}

This way, crinkler is able to store a compact form of pairs (model/weight) for each type of data in your executable (code or pure data).

Model selection

Model selection is one of the key processes of crinkler. For a particular set of data, what is the best selection of models? You start with 256 models (all the combinations of the 8 previous bytes) and you need to determine the best selection among them. You have to take into account that each time you use a model, you need 1 byte in your final executable to store it. Model selection is part of the crinkler compressor but not of the crinkler decompressor. The decompressor just needs to know the list of the final models used to compress the data, and doesn't care about intermediate results. On the other hand, the compressor needs to test many combinations of models, and find an appropriate weight for each model.

I tested several methods in my test code, trying to recover the method used in crinkler, without achieving a comparable compression ratio... I tried some brute force algorithms without any success... The selection algorithm is probably a bit more clever than the ones I tested, and finding it would probably require laying out the mathematics/statistics behind the combinations to select an accurate method.

Finally, blueberry has given their method (thanks!):

"To answer your question about the model selection process, it is actually not very clever. We step through the models in bit-mirrored numerical order (i.e. 00, 80, 40, C0, 20 etc.) and for each step do the following:

- Check if compression improves by adding the model to the current set of models (taking into account the one extra byte to store the model).

- If so, add the model, and then step through every model in the current set and remove it if compression improves by doing so.

The difference between FAST and SLOW compression is that SLOW optimizes the model weights for every comparison between model sets, whereas FAST uses a heuristic for the model weights (number of bits set in the model mask). "
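To make the quoted procedure more tangible, here is a sketch of the greedy loop. It is my own reading of blueberry's description, not crinkler source code: the compressedSize callback is a hypothetical stand-in for "compress with this set of models and return the size in bytes, including one byte per stored model", and the weight optimization (FAST vs SLOW) is assumed to happen inside that callback.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

// Reverses the 8 bits of a byte, so that 0, 1, 2, 3, 4... is visited as
// 00, 80, 40, C0, 20...
std::uint8_t mirrorBits(std::uint8_t value) {
    std::uint8_t result = 0;
    for (int i = 0; i < 8; ++i) {
        result = static_cast<std::uint8_t>((result << 1) | (value & 1));
        value = static_cast<std::uint8_t>(value >> 1);
    }
    return result;
}

using ModelSet = std::vector<std::uint8_t>;

// Greedy selection: try to add each model in bit-mirrored order and, after
// each successful addition, try to drop every model of the current set.
ModelSet selectModels(const std::function<double(const ModelSet&)>& compressedSize) {
    ModelSet current;
    double best = compressedSize(current);
    for (int i = 0; i < 256; ++i) {
        ModelSet trial = current;
        trial.push_back(mirrorBits(static_cast<std::uint8_t>(i)));
        const double trialSize = compressedSize(trial);
        if (trialSize >= best) continue;  // adding this model does not help
        current = trial;
        best = trialSize;
        for (std::size_t k = 0; k < current.size();) {
            ModelSet reduced = current;
            reduced.erase(reduced.begin() + static_cast<std::ptrdiff_t>(k));
            const double reducedSize = compressedSize(reduced);
            if (reducedSize < best) {
                current = reduced;  // removal helped, re-test the same index
                best = reducedSize;
            } else {
                ++k;
            }
        }
    }
    return current;
}
```

Each call to compressedSize is a full compression pass over the section, which is where the long compression times come from, especially when the weights are optimized for every comparison.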

On the other hand, I tried a fully adaptive context-modeling approach, using the dynamic weight calculation explained by Matt Mahoney, with neural networks and stretch/squash functions (look at PAQ on wikipedia). It was really promising, as I was sometimes able to achieve a better compression ratio than crinkler... but at the cost of a decompressor 100 bytes heavier... and even if I was able to save 30 to 60 bytes on the compressed data, I was still off by 40-70 bytes... so under 4k, this approach was definitely not as efficient as the semi-static approach chosen by crinkler.

Storing probabilities

If you have correctly followed the previous model selection, crinkler is now working with a set of models (selections of history data), and for each bit that is found, each model's probabilities must be updated...

But think about it: for example, if to predict the following bit, we are using the probabilities for the 8 previous bytes, it means that for every combination of 8 bytes already found in the decoded data, we would have a pair of n0/n1 counters?

That would mean that we could have the following probabilities to update for the context 0xFF (8 previous bytes):
- "00 00 00 00 c0 00 00 50 00" => some n0/n1
- "00 00 70 00 00 00 00 F2 01" => another n0/n1
- "00 00 00 40 00 00 00 30 02" => another n0/n1
...etc.

and if we have other models like 0x80 (previous byte), or 0xC0 (the last 2 previous bytes), we would have also different counters for them:

// For model 0x80
- "00" => some n0/n1
- "01" => another n0/n1
- "02" => yet another n0/n1
...

// For model 0xC0
- "50 00" => some bis n0/n1
- "F2 01" => another bis n0/n1
- "30 02" => yet another bis n0/n1
...

From the previous model context, I have slightly oversimplified things: not only the previous bytes are used, but also the current bits being read. In fact, when we are using for example the model 0x80 (using the previous byte), the context of the historic data is composed not only of the previous byte, but also of the bits already read in the current byte. This obviously implies that for every bit read, there is a different context. Suppose we have the sequence 0x75, 0x86 (in binary 10000110b), that the position of the encoded bits is just after the 0x75 value, and that we are using the previous byte + the bits currently read:

First, we start on a byte boundary
- 0x75 with 0 bit (we start with 0) is followed by bit 1 (the 8 of 0x86). The context is 0x75 + 0 bit read
- We read one more bit, we have a new context : 0x75 + bit 1. This context is followed by a 0
- We read one more bit, we have a new context : 0x75 + bit 10. This context is followed by a 0.
...
- We read one more bit, we have a new context : 0x75 + bit 1000011, that is followed by a 0 (and we are ending on a byte boundary).

Reading 0x75 followed by 0x86, with a model using only the previous byte, we finally have 8 contexts, each with its own n0/n1 to store in the probability table.
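The 8 contexts of this example can be enumerated mechanically. The sketch below does it for the model 0x80 (previous byte + bits already read); the BitContext structure and the function name are mine, and the bits read are kept as a string only to make the output easy to inspect.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// One context seen while decoding a byte, and the bit that followed it.
struct BitContext {
    std::uint8_t previousByte;
    std::string bitsRead;  // bits of the current byte already decoded, MSB first
    int nextBit;           // the bit that follows this context
};

// Lists the 8 (context, next bit) pairs met while decoding `currentByte`
// right after `previousByte`.
std::vector<BitContext> contextsForByte(std::uint8_t previousByte, std::uint8_t currentByte) {
    std::vector<BitContext> contexts;
    std::string prefix;
    for (int i = 7; i >= 0; --i) {
        const int bit = (currentByte >> i) & 1;
        contexts.push_back(BitContext{previousByte, prefix, bit});
        prefix.push_back(bit ? '1' : '0');
    }
    return contexts;
}
```

contextsForByte(0x75, 0x86) starts with the context (0x75, no bit read) followed by a 1, and ends with (0x75, "1000011") followed by a 0. Each of these 8 entries owns its own n0/n1 pair.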

As you can see, it is obviously difficult to store all the contexts found (i.e. for each single bit decoded, there is a different context of historic bytes) and their respective exact probability counters without exploding the RAM, even more so if you think about the number of models that are used by crinkler: 14 different selections of previous bytes for ergon's code!

This kind of problem is often handled using a hashtable with collision handling. This is what is done in some of the PAQ compressors. Crinkler is also using a hashtable to store the probability counters, with the association context_history_of_bytes => (n0/n1), but it is not handling collisions, in order to keep the size of the decompressor minimal. As usual, the hash function used by crinkler is really tiny while still giving really good results.

So instead of having the association context_history_of_bytes => n0/n1, we are using a hashing function, hash(context_history_of_bytes) => n0/n1. Then, the dictionary that is storing all those associations needs to be correctly dimensioned, large enough to store as many as possible of the associations found while decoding/encoding the data.

Like in PAQ compressors, crinkler is using one byte for each counter, meaning that n0 and n1 together take 16 bits, 2 bytes. So if you instruct crinkler to use a hashtable of 100MB, it will be possible to store 50 million different keys, meaning different historic contexts of bytes and their respective probability counters. There is a small remark to make about crinkler and the byte counter: in PAQ compressors, limits are handled, meaning that if a counter goes above 255, it sticks to 255... but crinkler made the choice not to test the limits, in order to keep the code smaller (although it would take less than 6 bytes to test the limit). What is the impact of this choice? Well, if you know crinkler, you are aware that it doesn't handle well large sections of "zeros" or other empty initialized data. This is just because the counters wrap around from 255 to 0, meaning that you jump from an almost 100% probability (probably accurate) to an almost 0% probability (probably wrong) every 256 bytes. Is this really hurting the compression? Well, it would hurt a lot if crinkler was used for larger executables, but in a 4k, it's not hurting so much (although it could hurt if you really have large portions of initialized data). Also, not all the contexts are reset at the same time (an 8-byte context will probably not reset as often as a 1-byte context), so the final probability calculation is still fairly accurate: while one probability is reset, other models with their own probabilities are still counting... so this is not a huge issue.
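The difference between the two behaviors is a single comparison. The sketch below only illustrates byte arithmetic, with function names of my own; it is not the actual counter code of crinkler or PAQ.

```cpp
#include <cstdint>

// No limit test: a byte counter at 255 silently wraps around to 0.
std::uint8_t incrementWrapping(std::uint8_t n) {
    return static_cast<std::uint8_t>(n + 1);
}

// With a limit test: the counter sticks to 255 once it gets there.
std::uint8_t incrementSaturating(std::uint8_t n) {
    return n == 255 ? n : static_cast<std::uint8_t>(n + 1);
}
```

After 256 identical bits in the same context, the wrapping counter is back to 0 and the model has forgotten everything it learned about this context, which is the "jump" described above.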

What happens if the hash of a different context gives the same value? Well, the model then updates the wrong probability counters. If the hashtable is too small, the probability counters may be disturbed so much that they provide a less accurate final probability. But if the hashtable is large enough, collisions are less likely to happen.
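A minimal C++ sketch of such a table, assuming a generic FNV-1a hash (crinkler uses its own, much smaller hash function, which is not reproduced here). The type and function names are hypothetical. No key is stored in the slots, so two contexts that hash to the same slot simply share the same counters:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// One slot: two byte counters, 2 bytes in total.
struct Counters {
    std::uint8_t n0 = 0;
    std::uint8_t n1 = 0;
};

// Illustrative FNV-1a hash over the context bytes (an assumption, not crinkler's hash).
inline std::uint32_t hash_context(const std::uint8_t* ctx, std::size_t len) {
    std::uint32_t h = 2166136261u;
    for (std::size_t i = 0; i < len; ++i) {
        h ^= ctx[i];
        h *= 16777619u;
    }
    return h;
}

class CounterTable {
public:
    explicit CounterTable(std::size_t slots) : slots_(slots) { assert(slots > 0); }

    // The slot is chosen by the hash only: colliding contexts share counters.
    Counters& lookup(const std::uint8_t* ctx, std::size_t len) {
        return slots_[hash_context(ctx, len) % slots_.size()];
    }

    // Record one observed bit for this context (unchecked, wraps at 255).
    void update(const std::uint8_t* ctx, std::size_t len, int bit) {
        Counters& c = lookup(ctx, len);
        if (bit) ++c.n1; else ++c.n0;
    }

private:
    std::vector<Counters> slots_;
};
```

With a single slot every context collides, which makes the "wrong counters get updated" effect easy to observe; with a large table the same updates land in separate slots most of the time.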

Thus, it is quite common to use a hashtable as large as 256 to 512 MB, although 256 MB is often enough: the larger your hashtable, the fewer the collisions and the more accurate your probabilities. Recall the beginning of this post, and you should now understand why "crinkler can take several hundreds of megabytes to decompress"... simply because of this hashtable, which stores all the next-bit probabilities for all the model combinations used.

If you are familiar with crinkler, you already know the option that searches for the best possible hashsize given an initial hashtable size and a number of tries (the hashtries option). This part tests different hashtable sizes (for example starting from 100 MB and reducing the size by 2 bytes 30 times) and checks the final compression result for each. It is a way to empirically reduce collision effects by selecting the hashsize that gives the best compression ratio (meaning fewer collisions in the hash). This option can only help you save a couple of bytes, though, no more.
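The idea of this search can be sketched in C++ as follows. This is only an approximation under a stated assumption: crinkler measures the real compressed size for each candidate, while this sketch uses the number of slot collisions as a stand-in score. The function names are hypothetical, and the input is assumed to hold the hashes of distinct contexts.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_set>
#include <vector>

// Number of hashes that land in an already occupied slot for a given table size.
inline std::size_t count_collisions(const std::vector<std::uint32_t>& hashes,
                                    std::size_t slots) {
    assert(slots > 0);
    std::unordered_set<std::size_t> used;
    std::size_t collisions = 0;
    for (std::uint32_t h : hashes)
        if (!used.insert(h % slots).second) ++collisions;
    return collisions;
}

// Try `tries` table sizes, starting at `start_slots` and shrinking by one slot
// (2 bytes with byte counters) each time; keep the size with the fewest collisions.
inline std::size_t best_table_size(const std::vector<std::uint32_t>& hashes,
                                   std::size_t start_slots, int tries) {
    assert(tries > 0 && start_slots >= static_cast<std::size_t>(tries));
    std::size_t best = start_slots;
    std::size_t best_collisions = count_collisions(hashes, start_slots);
    for (int i = 1; i < tries; ++i) {
        const std::size_t slots = start_slots - static_cast<std::size_t>(i);
        const std::size_t collisions = count_collisions(hashes, slots);
        if (collisions < best_collisions) {
            best = slots;
            best_collisions = collisions;
        }
    }
    return best;
}
```

For example, the hashes 0, 8, 16 and 24 all collide in a table of 8 slots but spread out perfectly over 7 slots, so a slightly smaller table can be the better one.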

Data reordering and type of data

Reordering or reorganizing the data to get better compression is a common technique in compression methods. Sometimes, for example, it's better to store the deltas between values than the values themselves, etc.
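As a small illustration of the delta idea (a generic sketch, not something taken from crinkler), here is a byte-wise delta encoder and its inverse in C++. A slowly increasing sequence becomes a run of identical small values, which a context model predicts much more easily:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Replace each byte by its difference with the previous one (modulo 256).
inline std::vector<std::uint8_t> delta_encode(const std::vector<std::uint8_t>& values) {
    std::vector<std::uint8_t> out(values.size());
    std::uint8_t previous = 0;
    for (std::size_t i = 0; i < values.size(); ++i) {
        out[i] = static_cast<std::uint8_t>(values[i] - previous);
        previous = values[i];
    }
    return out;
}

// Inverse transform: accumulate the differences to rebuild the original bytes.
inline std::vector<std::uint8_t> delta_decode(const std::vector<std::uint8_t>& deltas) {
    std::vector<std::uint8_t> out(deltas.size());
    std::uint8_t previous = 0;
    for (std::size_t i = 0; i < deltas.size(); ++i) {
        previous = static_cast<std::uint8_t>(previous + deltas[i]);
        out[i] = previous;
    }
    return out;
}
```

Encoding 10, 12, 14, 16 gives 10, 2, 2, 2, and decoding restores the original values.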

Crinkler uses this principle to perform data reordering. At the linker level, crinkler has access to the individual portions of data and code, and is able to move those portions around in order to achieve a better compression ratio. This is really easy to understand: suppose that you have a series of zero-initialized values in your data section. If those values are interleaved with non-zero values, the counter probabilities will switch from "there are plenty of zeros here" to "oops, there is some other data"... and the final probability will swing between 90% and 20%. Grouping similar data together is a way to improve the overall accuracy of the probabilities.
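The effect of grouping can be estimated with a toy model. The C++ sketch below is an assumption-laden illustration, not crinkler's model: it uses a single adaptive probability that moves towards each observed bit (the update rate and the clamp are arbitrary choices) and sums the ideal coding cost, -log2(p), of every bit.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Ideal coding cost, in bits, of a bit sequence under one adaptive estimator.
inline double adaptive_cost_bits(const std::vector<int>& bits, double rate = 0.25) {
    assert(rate > 0.0 && rate < 1.0);
    const double floor_p = 1.0 / 4096.0;  // keep probabilities away from 0 and 1
    double p1 = 0.5;                      // current estimate of P(next bit == 1)
    double cost = 0.0;
    for (int bit : bits) {
        const double p = bit ? p1 : 1.0 - p1;   // probability given to the actual bit
        cost -= std::log2(p);
        p1 += ((bit ? 1.0 : 0.0) - p1) * rate;  // move the estimate towards the bit
        p1 = std::min(std::max(p1, floor_p), 1.0 - floor_p);
    }
    return cost;
}
```

With this model, 64 zeros followed by 64 ones cost far fewer bits than the same 128 bits interleaved as 0, 1, 0, 1..., because the estimate only has to be "surprised" once instead of on every bit.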

This part is the most time consuming, as it needs to move and arrange all the portions of your executable and test which arrangement gives the best compression result. But it pays to use this option, as you may be able to save 100 bytes in the end just with it.

One thing that is also related to data reordering is the way crinkler handles the binary code and the data of your executable separately. Why? Because their binary representations are different, leading to completely different sets of probabilities. If you look at the selected models for ergon, you will find that the code and data models are quite different. Crinkler uses this to achieve better compression. In fact, crinkler compresses the code and the data completely separately. Code has its own models and weights, data another set of models and weights. What does it mean internally? Crinkler uses one set of models and weights to decode the code section of your executable. Once finished, it erases the probability counters stored in the hashtable-dictionary and moves on to the data section, with new models and weights. Resetting all counters to 0 in the middle of decompression improves compression by 2-4%, which is quite impressive and valuable for a 4k (around 100 to 150 bytes).

I found that even with an adaptive model (with a neural network dynamically updating the weights), it is still worth resetting the probabilities between code and data decompression. In fact, resetting the probabilities is an empirical way to tell the context modeling that the data are so different that it's better to start from scratch with new probability counters. If you think about it, an improved demo compressor (for larger executables, for example under 64k) could be clever enough to detect the portions of data that are different enough that it would be better to reset the dictionary than to keep it as it is.

There is just one last thing about weight handling in crinkler. When decoding/encoding, it seems that crinkler artificially increases the weights for the first discovered bit. This little trick improves the compression ratio by about 1 to 2%, which is not bad. Having higher weights at the beginning gives the compressor/decompressor a better response, even if it doesn't yet have enough data to compute a correct probability. Increasing the weights helps the compression ratio at cold start.

Crinkler is also able to transform the x86 code of the executable part to improve the compression ratio. This technique is widely used and consists of replacing relative jumps (conditional jumps, function calls, etc.) with absolute jumps, which leads to a better compression ratio.
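Here is a hedged C++ sketch of this kind of transform, limited to the E8 opcode (CALL rel32). It is the generic filter used by many executable packers, not crinkler's exact implementation, and the function names are hypothetical. It naively treats every 0xE8 byte as a call, which is harmless because the inverse pass makes exactly the same decisions.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Read a 32-bit little-endian value (x86 byte order).
inline std::uint32_t read_le32(const std::uint8_t* p) {
    return static_cast<std::uint32_t>(p[0]) | (static_cast<std::uint32_t>(p[1]) << 8) |
           (static_cast<std::uint32_t>(p[2]) << 16) | (static_cast<std::uint32_t>(p[3]) << 24);
}

// Write a 32-bit little-endian value.
inline void write_le32(std::uint8_t* p, std::uint32_t v) {
    for (int i = 0; i < 4; ++i) p[i] = static_cast<std::uint8_t>(v >> (8 * i));
}

// Rewrite the operand of every E8 opcode from a relative displacement to an
// absolute target offset (to_absolute == true), or back again (false).
inline void transform_calls(std::vector<std::uint8_t>& code, bool to_absolute) {
    std::size_t i = 0;
    while (i + 5 <= code.size()) {
        if (code[i] != 0xE8) { ++i; continue; }
        const std::uint32_t next = static_cast<std::uint32_t>(i + 5);  // next instruction
        std::uint32_t operand = read_le32(&code[i + 1]);
        operand = to_absolute ? operand + next : operand - next;
        write_le32(&code[i + 1], operand);
        i += 5;  // skip the operand so it is never mistaken for an opcode
    }
}
```

Two calls to the same function from different places have different relative displacements but the same absolute target, so after the transform they produce identical bytes, which the context models predict more easily.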

Custom DLL LoadLibrary and PE file optimization

In order to strip down the size of an executable, it's necessary to exploit the organization of a PE file as much as possible.

The first thing crinkler exploits is that lots of parts of a PE file are not used at all. If you want to know how a Windows PE executable can be reduced, I suggest you read the Tiny PE article, which is a good way to understand what is actually used by the PE loader. Unlike the Tiny PE sample, where the author moves the PE header into the DOS header, crinkler made the choice to use this unused space to store the hash values that reference the DLL functions used.

This trick is called import by hashing and is quite common in intro compressors. What probably makes crinkler a little bit more advanced is that, to perform the "GetProcAddress" (which is responsible for getting the pointer to a function from its name), crinkler navigates inside internal Windows process structures in order to get the addresses of the functions directly from the in-memory import table. Indeed, you won't find any import section table in a crinklerized executable. Everything is rediscovered through internal Windows structures. Those structures are not officially documented, but you can find some valuable information around, most notably here.
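The principle of import by hashing can be sketched in a few lines of C++. Everything here is illustrative: the rotate-and-add hash is a common choice in this kind of loader but is not claimed to be crinkler's, and the export table is a plain list of (name, address) pairs standing in for the real in-memory tables.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Hypothetical name hash: rotate left by 5 bits, then add the character.
inline std::uint32_t hash_name(const std::string& name) {
    assert(!name.empty());  // function names are never empty
    std::uint32_t h = 0;
    for (unsigned char c : name) h = ((h << 5) | (h >> 27)) + c;
    return h;
}

// Stand-in for a DLL export table: function name and entry point address.
using ExportTable = std::vector<std::pair<std::string, std::uintptr_t>>;

// Only the 32-bit hash is stored in the executable; the loader rehashes every
// exported name until one matches. Returns 0 when nothing matches.
inline std::uintptr_t resolve_by_hash(const ExportTable& exports, std::uint32_t wanted) {
    for (const auto& entry : exports)
        if (hash_name(entry.first) == wanted) return entry.second;
    return 0;
}
```

The saving comes from storing 4 bytes per imported function instead of its full name, at the price of a small loop at startup.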

If you look at the code stored in the crinkler import section, which is the code injected just before the intro starts in order to load all DLL functions, you will find cryptic calls like this:

    (0) MOV         EAX, FS:[BX+0x30]
    (1) MOV         EAX, [EAX+0xC]
    (2) MOV         EAX, [EAX+0xC]
    (3) MOV         EAX, [EAX]
    (4) MOV         EAX, [EAX]
    (5) MOV         EBP, [EAX+0x18]

This is done by going through internal structures:

(0) First, crinkler gets a pointer to the "PROCESS ENVIRONMENT BLOCK (PEB)" with the instruction MOV EAX, FS:[BX+0x30]. EAX is now pointing to the PEB.

Public Type PEB
    InheritedAddressSpace As Byte
    ReadImageFileExecOptions As Byte
    BeingDebugged As Byte
    Spare As Byte
    Mutant As Long
    SectionBaseAddress As Long
    ProcessModuleInfo As Long ' // <---- PEB_LDR_DATA
    ProcessParameters As Long ' // RTL_USER_PROCESS_PARAMETERS
    SubSystemData As Long
    ProcessHeap As Long
    ... struct continue

(1) Then it gets a pointer to the "ProcessModuleInfo/PEB_LDR_DATA" MOV EAX, [EAX+0xC]

Public Type _PEB_LDR_DATA
    Length As Integer
    Initialized As Long
    SsHandle As Long
    InLoadOrderModuleList As LIST_ENTRY  // <---- LIST_ENTRY InLoadOrderModuleList
    InMemoryOrderModuleList As LIST_ENTRY
    InInitOrderModuleList As LIST_ENTRY
    EntryInProgress As Long
End Type

(2) Then it gets a pointer to the next "InLoadOrderModuleList/LIST_ENTRY" with MOV EAX, [EAX+0xC].

Public Type LIST_ENTRY
    Flink As LIST_ENTRY
    Blink As LIST_ENTRY
End Type

(3) and (4) Then it navigates through the LIST_ENTRY linked list with MOV EAX, [EAX]. This is done 2 times. The first time, we get a pointer to NTDLL.DLL, the second time a pointer to KERNEL32.DLL. Each LIST_ENTRY is in fact followed by the structure LDR_MODULE:

Public Type LDR_MODULE
    InLoadOrderModuleList As LIST_ENTRY
    InMemoryOrderModuleList As LIST_ENTRY
    InInitOrderModuleList As LIST_ENTRY
    BaseAddress As Long
    EntryPoint As Long
    SizeOfImage As Long
    FullDllName As UNICODE_STRING
    BaseDllName As UNICODE_STRING
    Flags As Long
    LoadCount As Integer
    TlsIndex As Integer
    HashTableEntry As LIST_ENTRY
    TimeDateStamp As Long
    LoadedImports As Long
    EntryActivationContext As Long ' // ACTIVATION_CONTEXT
    PatchInformation As Long
End Type

Then, from the BaseAddress of the kernel32.dll module, crinkler goes to the section where the functions are already loaded in memory. From there, the first hashed function stored by crinkler is the LoadLibrary function. After this, crinkler is able to load all the dependent DLLs and navigate through their import tables, recomputing the hash of every function name of the dependent DLLs and trying to match it against the hashes stored in the PE header. If a match is found, the function entry point is stored.

This way, crinkler is able to call some OS functions stored in kernel32.dll without even linking explicitly against those DLLs, as they are automatically loaded whenever a DLL is loaded. This gives crinkler a way to import all the functions used by an intro with a custom import loader.
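The offsets used by steps (0) to (5) can be checked with a small C++ model. This is a sketch under assumptions: the struct names are hypothetical, only the leading fields of each structure are declared, pointers are stored as 32-bit integers so the layout is the same on any host, and the walk runs over a fake memory image rather than a real process.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct ListEntry32 { std::uint32_t Flink; std::uint32_t Blink; };

struct Peb32 {
    std::uint8_t  InheritedAddressSpace;
    std::uint8_t  ReadImageFileExecOptions;
    std::uint8_t  BeingDebugged;
    std::uint8_t  Spare;
    std::uint32_t Mutant;
    std::uint32_t SectionBaseAddress;
    std::uint32_t ProcessModuleInfo;      // pointer to PebLdrData32
};

struct PebLdrData32 {
    std::uint32_t Length;
    std::uint32_t Initialized;            // assumed padded to 4 bytes
    std::uint32_t SsHandle;
    ListEntry32   InLoadOrderModuleList;
};

struct LdrModule32 {
    ListEntry32   InLoadOrderModuleList;
    ListEntry32   InMemoryOrderModuleList;
    ListEntry32   InInitOrderModuleList;
    std::uint32_t BaseAddress;
};

constexpr std::uint32_t kLdrOffset  = offsetof(Peb32, ProcessModuleInfo);
constexpr std::uint32_t kListOffset = offsetof(PebLdrData32, InLoadOrderModuleList);
constexpr std::uint32_t kBaseOffset = offsetof(LdrModule32, BaseAddress);
static_assert(kLdrOffset == 0x0C, "step (1) reads [EAX+0xC]");
static_assert(kListOffset == 0x0C, "step (2) reads [EAX+0xC]");
static_assert(kBaseOffset == 0x18, "step (5) reads [EAX+0x18]");

using FakeMemory = std::vector<std::uint8_t>;

// Little-endian 32-bit accessors over the fake memory image.
inline std::uint32_t read32(const FakeMemory& mem, std::uint32_t addr) {
    assert(static_cast<std::size_t>(addr) + 4 <= mem.size());
    return mem[addr] | (mem[addr + 1] << 8) | (mem[addr + 2] << 16) |
           (static_cast<std::uint32_t>(mem[addr + 3]) << 24);
}

inline void write32(FakeMemory& mem, std::uint32_t addr, std::uint32_t value) {
    assert(static_cast<std::size_t>(addr) + 4 <= mem.size());
    for (int i = 0; i < 4; ++i) mem[addr + i] = static_cast<std::uint8_t>(value >> (8 * i));
}

// Same chain of loads as the assembly listing, starting from the PEB address.
inline std::uint32_t find_kernel32_base(const FakeMemory& mem, std::uint32_t peb) {
    std::uint32_t p = read32(mem, peb + kLdrOffset);  // (1) loader data
    p = read32(mem, p + kListOffset);                 // (2) first module (the exe)
    p = read32(mem, p);                               // (3) second module (ntdll)
    p = read32(mem, p);                               // (4) third module (kernel32)
    return read32(mem, p + kBaseOffset);              // (5) its BaseAddress
}
```

The static_assert lines only confirm that the declared layouts match the 0xC, 0xC and 0x18 displacements of the listing; they say nothing about other Windows versions or 64-bit processes.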

Compression results

So finally, you may ask: how good is crinkler at compressing? How does it compare to other compression methods? What does the entropy look like in a crinklerized exe?

I'll take the example of Ergon exe. You can already find a detailed analysis for this particular exe.

Comparison with other compression methods

In order to make a fair comparison between crinkler and other compressors, I have used the data that are actually compressed by crinkler after the reordering of code and data (This is done by unpacking a crinklerized ergon.exe and extracting only the compressed data). This comparison is accurate in that all compressors are using exactly the same data.

In order also to be fair with crinkler, the size of 3652 is not taking into account the PE header + the crinkler decompressor code (which in total is 432 bytes for crinkler).

To perform this comparison, I have only used 7z which has at least 3 interesting methods to test against :

Standard Deflate Zip
PPMd with 256 MB of dictionary
LZMA with 256 MB of dictionary

I have also included a comparison with a more advanced packing method from Matt Mahoney's resources, Paq8l, which is one of the versions of the PAQ methods, using neural networks and several context modeling methods.

Program	Compression Method	Size in bytes	Ratio vs Crinkler
none	uncompressed	9796
crinkler	ctx-model 256 MB	3652	+0.00%
7z	deflate 32 KB	4526	+23.93%
7z	PPMd 256 MB	4334	+18.67%
7z	LZMA 256 MB	4380	+19.93%
Paq8l	dyn-ctx-model 256 MB	3521	-3.59%

As you can see, crinkler is far more efficient than any of the "standard" compression methods (Zip, PPMd, LZMA). I'm not even talking about the fact that a true comparison would include the decompressor size, so the ratio would certainly be worse for all the standard methods!

Paq8l is of course slightly better... but if you take into account that the Paq8l decompressor is itself an exe of 37 KB... compared to the 220 bytes of crinkler... you should now understand how efficient crinkler is in its own domain! (remember? 4k!)

Entropy

In order to measure the entropy of crinkler, I have developed a very small program in C# that displays the entropy of an exe, from green (low entropy, fewer bits necessary to encode this information) to red (high entropy, more bits necessary to encode this information).
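The C# tool itself is not shown here. As a rough idea of what such a measure can look like, here is a C++ sketch that computes the Shannon entropy of a block of bytes, in bits per byte. Using a byte histogram over a block is an assumption about the method, and the function name is hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Shannon entropy of a block, in bits per byte: 0 for constant data,
// 8 when all 256 byte values are equally frequent.
inline double entropy_bits_per_byte(const std::vector<std::uint8_t>& data) {
    assert(!data.empty());
    std::size_t histogram[256] = {};
    for (std::uint8_t b : data) ++histogram[b];

    double entropy = 0.0;
    for (std::size_t count : histogram) {
        if (count == 0) continue;  // unused values contribute nothing
        const double p = static_cast<double>(count) / static_cast<double>(data.size());
        entropy -= p * std::log2(p);
    }
    return entropy;
}
```

Applied to small consecutive blocks of an executable, values close to 0 would map to green and values close to 8 to red.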

I have done this on 3 different ergon executables:

The uncompressed ergon.exe (28 KB). It is the standard output of a binary exe with MSVC++ 2008.
The raw-crinklerized ergon.exe extracted code and data section, but not compressed (9796 bytes)
The final crinklerized ergon.exe file (4070 bytes)

Ergon standard exe entropy

Ergon code and data crinklerized, uncompressed reordered data

Ergon executable crinklerized

As expected, the entropy is fairly massive in a crinklerized exe. Compare this with the waste of information in a standard Windows executable. You can also appreciate how important the reordering and packing of data (without compression) performed by crinkler is.

Some notes about the x86 crinkler decompressor asm code

I have often talked about how much the crinkler decompressor is truly a piece of x86 art. It is hard to describe all the techniques used here: there are lots of standard x86 optimizations and some really nice tricks. Most notably:

1) Using all the registers.
2) Intensively using the stack to save/restore all the registers with the x86 pushad/popad instructions. This is done, for example, (1 + number_of_models) times per bit. If you have 15 models, there will be a total of 16 pushad/popad instructions for a single bit to be decoded! You may wonder why there are so many pushes. It's the only way to efficiently use all the registers (rule #1) without having to store particular registers in a buffer. Of course, push/pop instructions are also used in several other places in the code.
3) As a result of 1) and 2), apart from the hash dictionary, no intermediate structures are used to perform the context modeling calculation.
4) Deferred conditional jumps: usually, when you perform some conditional test in x86, it is immediately followed by a conditional jump (like cmp eax, 0; jne go_for_bla). In crinkler, a conditional test is sometimes done and only used several instructions later (for example cmp eax,0; push eax; mov eax, 5; jne go_for_bla <---- this is using the result of the cmp eax,0 comparison). It makes the code a LOT harder to read. Sometimes, the condition is even used after a direct jump! This is probably the part of crinkler's decompressor that impressed me the most. It is of course something quite common if you are programming heavily size-optimized x86 asm code... but you need to know which instructions do not modify the CPU flags in order to achieve this kind of optimization!

Final words

I would like to apologize for the lack of charts and pictures to explain a little bit how things are working. This article is probably still obscure for a casual reader, and should be considered a draft version. This was a quick and dirty post. I have wanted to write this for a long time, so here it is, not as perfect as it should be, but it may be improved in future versions!

As you can see, crinkler is really worth looking at. The effort put into making it so efficient is impressive, and there is almost no doubt that there won't be any other crinkler competitor for a long time! At least for a 4k executable. Above 4k, I'm quite confident that there are still lots of areas that could be improved, and kkrunchy is probably far from being the ultimate packer under 64k... Still, if you want a packer, you need to code it, and that's not so trivial!

Official release of SharpDX 1.0

2010-12-01T00:17:00.006+11:00

After three months of intense development, I'm really excited to announce the availability of SharpDX 1.0, a new platform independent .Net managed DirectX API, directly generated from DirectX SDK headers.

This first version can be considered stable. The Direct3D10 / Direct3D10.1 API has been entirely tested on a large 3D engine that was previously using SlimDX (thanks patapom!). Migration was quite straightforward, with only minor changes to the engine's code.

The key features and benefits of this new API are:

API is generated from DirectX SDK headers : meaning a complete and reliable API and an easy support for future API.
Full support for the following DirectX API:

Direct3D10
Direct3D10.1
Direct3D11
Direct2D1 (including custom rendering, tessellation callbacks)
DirectWrite (including custom client callbacks)
D3DCompiler
DXGI
DXGI 1.1
DirectSound
XAudio2
XAPO
An integrated math API directly ported from SlimMath

Pure managed .NET API, platform independent : assemblies are compiled with AnyCpu target. You can run your code on a x64 or a x86 machine with the same assemblies, without recompiling your project.
Lightweight individual assemblies : a core assembly - SharpDX - containing common classes and an assembly for each subgroup API (Direct3D10, Direct3D11, DXGI, D3DCompiler...etc.). Assemblies are also lightweight.
C++/CLI Speed : the framework is using a genuine way to avoid any C++/CLI while still achieving comparable performance.
API naming convention mostly compatible with SlimDX API.
Raw DirectX object life management : No overhead of ObjectTable or RCW mechanism, the API is using direct native management with classic COM method "Release".
Easily mergeable / obfuscatable : If you need to obfuscate SharpDX assemblies, they are easily obfuscatable due to the fact that the framework is not using any mixed assemblies. You can also merge SharpDX assemblies into a single exe using a tool like ILMerge.

You will also find a growing collection of samples in the Samples Gallery of SharpDX. Most notably with some additional support for Direct2D1 and DirectWrite client callbacks.

Instead of providing a monolithic assembly, SharpDX is providing lightweight individual and interdependent assemblies. All SharpDX assemblies are dependent from the core SharpDX assembly. You just need to add the required assemblies to your project, without embedding the whole DirectX API stack. Here is a chart that explains SharpDX assembly dependencies:

Next versions will provide support for DirectInput, XInput, X3DAudio, XACT3.

About performance

Someone asked me how SharpDX compares to SlimDX in terms of performance. Here is a micro-benchmark on two methods, ID3D10Device1::GetFeatureLevel (alias Device.FeatureLevel) and ID3D10Device::CheckCounterInfo (alias Device.GetCounterCapabilities).

The test consists of 100,000,000 calls to each method (inside a for loop, with (10 calls to device.FeatureLevel) * 10,000,000 times), repeated 10 times and averaged. The whole test was run two times.

Method	SlimDX	SharpDX	SharpDX vs SlimDX
device.FeatureLevel	3700	3650	1.37%
device.GetCounterCapabilities()	4684	4259	9.98%

For FeatureLevel, the test was sometimes around +/-0.5%.
For GetCounterCapabilities(), the main difference between the SharpDX and SlimDX implementations is that SlimDX performs a copy from the native struct to the .Net struct, while SharpDX directly passes a pointer to the .Net struct.

This test is of course a micro-benchmark and doesn't reflect real-world usage. Some parts of the API could be in favor of SlimDX, but I'm pretty confident that SharpDX is much more consistent in the way structures are passed to native functions, avoiding as much as possible the marshaling of structures that don't need any custom marshaling (unlike SlimDX, which most of the time performs a marshaling between .Net and native structures, even though they are binary compatible).

Next?

Finally, I'm going to be able to use this project to make some demos! The next target is to develop an XNA-like framework based on SharpDX.Direct3D11.

Stay tuned!

SharpDX, a new managed .Net DirectX API available

2010-11-18T21:06:00.007+11:00

If you have followed my previous work on a new .NET API for Direct3D 11, you know that I proposed this solution to the SlimDX team for the v2 of their framework, joined their team around one month ago, and have been actively working to widen the coverage of the DirectX API. I have been able to extend the coverage to almost the whole API, and to develop Direct2D samples as well as XAudio2 and XAPO samples with it. But due to some incompatible directions that the SlimDX team wanted to follow, I have decided to also release my work under a separate project called SharpDX. Now, you may wonder why I'm releasing this new API under a separate project from SlimDX?

Well, I have been working really hard on this since the beginning of September, and I explained why in my previous post about Direct3D 11. I have checked in lots of code under the v2 branch of SlimDX, while having lots of discussions with the team (mostly Josh, who is mostly responsible for v2) on their devel mailing list. The reason I'm leaving the SlimDX team is that it was in fact not clear to me that I would not be part of the decisions about the v2 directions, although I was bringing a whole solution (by "whole", I mean a large proof of concept, not something robust or finished). At some point, Josh told me that Promit, Mike and himself, co-founders of SlimDX, were the technical leaders of this project and that they would have the last word on the direction as well as on the decisions about the v2 API.

Unfortunately, I was not expecting to work on such terms with them, considering that I had already made 100% of the whole engineering prototype for the next API. Over the last few days, we had lots of small technical discussions, but for some of them I clearly didn't agree with the decisions that were taken, whatever arguments I tried to give. This is a bit of a disappointment for me, but well, that's the life of open source projects. This is their project and they have other plans for it. So I have decided to release the project on my own as SharpDX, although you will see that the code is currently exactly the same as on the v2 branch of SlimDX (of course, because until yesterday I was working on the SlimDX v2 branch).

But things are going to change for both projects: SlimDX is taking the robust way (with which I agree), but with some decisions that I don't agree with (in terms of implementation and direction). Although it may sound weird, SharpDX is not intended to compete with SlimDX v2: they clearly have a different scope (supporting for example Direct3D 9, which I don't really care about), a different target, a different view on exposing the API, and a large existing community. So SharpDX is primarily intended for my own work on demomaking. Nothing more. I'm releasing it because SlimDX v2 is not going to be available soon, even as an alpha version. On my side, I consider that the current state of the SharpDX API (although far from being as clean as it should be) is usable, and I'm going to use it on my own while improving the generator and parser to make the code safer and more robust.

So, I did lots of work to bring new API into this system, including :

Direct3D 10
Direct3D 10.1
Direct3D 11
Direct2D 1
DirectWrite
DXGI
DXGI 1.1
D3DCompiler
DirectSound
XAudio2
XAPO

I have also been working on some nice samples, for example one using Direct2D and Direct3D 10, including the usage of the Direct2D tessellate API, in order to see how well it works compared to the gluTessellation methods that are most commonly used. You will find that the code to do such a thing is extremely simple in SharpDX:

using System;
using System.Drawing;
using SharpDX.Direct2D1;
using SharpDX.Samples;

namespace TessellateApp
{
    /// <summary>
    /// Direct2D1 Tessellate Demo.
    /// </summary>
    public class Program : Direct2D1DemoApp, TessellationSink
    {
        EllipseGeometry Ellipse { get; set; }
        PathGeometry TesselatedGeometry { get; set; }
        GeometrySink GeometrySink { get; set; }

        protected override void Initialize(DemoConfiguration demoConfiguration)
        {
            base.Initialize(demoConfiguration);

            // Create an ellipse
            Ellipse = new EllipseGeometry(Factory2D,
                                          new Ellipse(new PointF(demoConfiguration.Width/2, demoConfiguration.Height/2), demoConfiguration.Width/2 - 100,
                                                      demoConfiguration.Height/2 - 100));

            // Populate a PathGeometry from Ellipse tessellation 
            TesselatedGeometry = new PathGeometry(Factory2D);
            GeometrySink = TesselatedGeometry.Open();
            // Force RoundLineJoin otherwise the tesselated looks buggy at line joins
            GeometrySink.SetSegmentFlags(PathSegment.ForceRoundLineJoin); 

            // Tesselate the ellipse to our TessellationSink
            Ellipse.Tessellate(1, this);

            // Close the GeometrySink
            GeometrySink.Close();
        }


        protected override void Draw(DemoTime time)
        {
            base.Draw(time);

            // Draw the outline of the tessellated geometry
            RenderTarget2D.DrawGeometry(TesselatedGeometry, SceneColorBrush, 1, null);
        }

        void TessellationSink.AddTriangles(Triangle[] triangles)
        {
            // Add Tessellated triangles to the opened GeometrySink
            foreach (var triangle in triangles)
            {
                GeometrySink.BeginFigure(triangle.Point1, FigureBegin.Filled);
                GeometrySink.AddLine(triangle.Point2);
                GeometrySink.AddLine(triangle.Point3);
                GeometrySink.EndFigure(FigureEnd.Closed);                
            }
        }

        void TessellationSink.Close()
        {            
        }

        [STAThread]
        static void Main(string[] args)
        {
            Program program = new Program();
            program.Run(new DemoConfiguration("SharpDX Direct2D1 Tessellate Demo"));
        }
    }
}

This simple example produces the following output:

which is pretty cool, considering the amount of code (although the Direct3D 10 and D2D initialization part would make the code larger). I found this to be much simpler than the gluTessellation API.

You will find also some other samples, like the XAudio2 ones, generating a synthesized sound with the usage of the reverb, and even some custom XAPO sound processors!

You can grab those samples from the SharpDX code repository (there is a SharpDXBinAndSamples.zip with a working solution containing all the samples I have developed so far, along with the MiniTris sample from SlimDX).

Hacking Direct2D to use directly Direct3D 11 instead of Direct3D 10.1 API

2010-11-03T10:46:00.009+11:00

Disclaimer about this hack: This hack was nothing more than a proof of concept and I *really* don't have time to dig into any kind of bugs related to it.

[Edit] 13 Jan 2011: After Windows Update KB2454826, this hack was not working anymore. I have patched the sample to make it work again. Of course, you shouldn't consider this hack for any kind of production use. Use the standard DXGI shared sync keyed mutex instead. This hack is just for fun! [/Edit]

If you know Direct3D 11 and Direct2D (they were released almost at the same time), you already know that there is a huge drawback to using Direct2D: it in fact only works with the Direct3D 10.1 API (although it works with older hardware thanks to the new feature level capability of the API).

From a coder's point of view, it is really disappointing that such a good API doesn't rely on the latest Direct3D API... especially when you know that the Direct3D 11 API is really close to the Direct3D 10.1 API... In the end, more work is required from a developer who would like to work with Direct3D 11, as it no longer has any text API, for example, meaning that in D3D11 you have to do it yourself. That isn't a huge task in itself if you go for the easy solution of a precalculated font texture generated by some GDI+ calls or whatever, but still... this is annoying, especially when you need to display some information/FPS on the screen and you can't wait to build a nice font-texture-based system...

I'm not being completely fair about Direct2D interoperability with Direct3D 11: there is in fact a well-known solution, proposed by one guy from the DirectX team, that implies the use of a DXGI mutex to synchronize a surface shared between D3D10.1 and D3D11. I was expecting this issue to be solved in some DirectX SDK release this year, but it seems that there is no plan to release an update for Direct2D in the near future (see my question in the comments and the answer...)... WP7 and XNA are probably getting much more attention here...

So last week, I took some time on the Direct2D API and found that it's in fact fairly easy to hack Direct2D and redirect all the D3D10.1 API calls to a real Direct3D 11 instance... and this is pretty cool news! Here is the story of this little hack...

How Direct2D is accessing your already instantiated D3D10.1 device?

In order to use Direct2D with a renderable D3D10 texture2D, you need to query the IDXGISurface from your ID3D10Texture2D object, something like this:

IDXGISurface* surface = NULL;
ID3D10Texture2D* texture2D = NULL;

// Create a Texture2D (or use SwapChain backbuffer)
d3d10Device->CreateTexture2D(&texture2DDesc, 0, &texture2D);

// Query the DXGI Surface associated with the D3D10.1 Texture2D
texture2D->QueryInterface(__uuidof(IDXGISurface), (void**)&surface);

// Create a D2D Render target from the D3D10 Texture2D through the associated DXGISurface
d2dFactory->CreateDxgiSurfaceRenderTarget(
        surface,
        &props,
        &d2dRenderTarget
        );

So starting from this CreateDxgiSurfaceRenderTarget call, Direct2D is somehow able to get back your D3D10.1 instance and is able to use it to submit drawcalls / create textures... etc. In order to find how Direct2D is getting an instance of ID3D10Device1, I have first implemented a Proxy IDXGISurface that was responsible to embed the real DXGI Surface and delegate all the calls for it...while being able to track down how Direct2D is getting back this ID3D10Device1 :

After the surface enters the CreateDxgiSurfaceRenderTarget, Direct2D is querying the IDXGIDevice through the GetDevice method on the IDXGISurface
From the IDXGIDevice, Direct2D is calling QueryInterface with the IID of the ID3D10Device interface (surprisingly not the ID3D10Device1)

And bingo! You are able to give your own implementation of the ID3D10Device to Direct2D... and to redirect all the D3D10 calls to a Direct3D 11 device/context with a simple proxy implementing the ID3D10Device1 methods!
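The shape of this interception can be sketched with plain C++ interfaces. The types below are hypothetical stand-ins and not the real COM interfaces (no IUnknown, no reference counting); they only show how a proxy surface forwards everything to the real surface except GetDevice, where it hands out a replacement device and counts the calls it sees.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for a device interface.
struct IDevice {
    virtual ~IDevice() = default;
    virtual std::string Api() const = 0;
};

// Hypothetical stand-in for a surface interface.
struct ISurface {
    virtual ~ISurface() = default;
    virtual IDevice* GetDevice() = 0;
    virtual int Width() const = 0;
};

struct D3D10Device : IDevice {
    std::string Api() const override { return "D3D10.1"; }
};

// A device exposing the old interface but backed by the newer API.
struct D3D11BackedDevice : IDevice {
    std::string Api() const override { return "D3D11"; }
};

class RealSurface : public ISurface {
public:
    RealSurface(IDevice* device, int width) : device_(device), width_(width) {}
    IDevice* GetDevice() override { return device_; }
    int Width() const override { return width_; }
private:
    IDevice* device_;
    int width_;
};

// Forwards every call to the wrapped surface, except GetDevice.
class ProxySurface : public ISurface {
public:
    ProxySurface(ISurface* real, IDevice* replacement)
        : real_(real), replacement_(replacement) {
        assert(real_ != nullptr && replacement_ != nullptr);
    }
    IDevice* GetDevice() override { ++get_device_calls; return replacement_; }
    int Width() const override { return real_->Width(); }

    int get_device_calls = 0;  // lets us observe how the consumer uses the surface
private:
    ISurface* real_;
    IDevice* replacement_;
};
```

Any code that only holds an ISurface pointer cannot tell the proxy from the real surface, which is what makes it possible both to trace the calls and to substitute the device.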

Interoperability between D3D10.1 and D3D11 API

Migrating from the D3D10/D3D10.1 API to the D3D11 API is quite straightforward and even has a dedicated paper on MSDN. For the purpose of this quick hack, I didn't implement proxies for the whole D3D10 API... I have instead focused my work on how the D3D10 API is used by D2D and on which methods/structures actually used are not binary compatible between D3D10 and D3D11.

In the end, I have developed 5 proxies:

a Proxy for IDXGISurface interface, in order to hack the GetDevice method and return my own proxy for IDXGIDevice
a Proxy for IDXGIDevice interface in order to hack the QueryInterface method and return my own proxy for ID3D10Device1
a Proxy for the ID3D10Device1 interface
a Proxy for the ID3D10Texture2D interface
a Proxy for the ID3D10Buffer interface

For the ID3D10Device1 interface, most of the methods redirect the calls directly to the device (ID3D11Device) or the context (ID3D11DeviceContext). I didn't bother to implement proxies for most of the parameters, because even if they are not always binary compatible, the returned objects are only used as references and are never called directly. Take for example the proxy implementation of VSGetShader (which is used by Direct2D to save the D3D10 pipeline state):

virtual void STDMETHODCALLTYPE VSGetShader( 
    /* [annotation] */ 
    __out  ID3D10VertexShader **ppVertexShader) { 
        context->VSGetShader((ID3D11VertexShader**)ppVertexShader, 0, 0);
}

A real proxy would have to wrap the ID3D11VertexShader inside an ID3D10VertexShader proxy... but because Direct2D (and this is not a surprise) only uses VSGetShader to later call VSSetShader (in order to restore the saved states, or to set its own vertex/pixel shaders), it doesn't call any method on the ID3D10VertexShader instance... meaning that we can give it back an ID3D11VertexShader directly, without performing any costly conversion.

In fact, most of the ID3D10Device1 proxy methods are like the previous one, a simple redirection to a D3D11 Device or DeviceContext... easy!

I was only forced to implement custom proxies for some incompatible structures... or for returned object instances that are actually used by Direct2D (like ID3D10Buffer and ID3D10Texture2D).

For example, the ID3D10Device::CreateBuffer proxy method is implemented like this:

virtual HRESULT STDMETHODCALLTYPE CreateBuffer( 
 /* [annotation] */ 
 __in  const D3D10_BUFFER_DESC *pDesc,
 /* [annotation] */ 
 __in_opt  const D3D10_SUBRESOURCE_DATA *pInitialData,
 /* [annotation] */ 
 __out_opt  ID3D10Buffer **ppBuffer) {  
  D3D11_BUFFER_DESC desc11;

  *((D3D10_BUFFER_DESC*)&desc11) = *pDesc;
  // StructureByteStride field is new in D3D11
  desc11.StructureByteStride = 0;

  // Returns our ID3D10Buffer proxy instead of the real one
  ProxyID3D10Buffer* buffer = new ProxyID3D10Buffer();
  buffer->device = this;
  *ppBuffer = buffer;
  HRESULT result = device()->CreateBuffer(&desc11, (D3D11_SUBRESOURCE_DATA*)pInitialData, (ID3D11Buffer**)&buffer->backend);

  CHECK_RETURN(result);

  //   return S_OK; 
}
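
The cast-and-assign on the descriptor only works because the D3D11 structure starts with exactly the same fields as the D3D10 one. A rough sketch of that assumption, using hypothetical mirror structs (BufferDesc10, BufferDesc11 and to_desc11 are names made up for this example, and the Usage enum is simplified to an integer), could check the layout at compile time and copy field by field:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical mirrors of the two descriptors; the real ones live in
// d3d10.h / d3d11.h and use an enum for Usage.
struct BufferDesc10 {
    std::uint32_t ByteWidth;
    std::uint32_t Usage;
    std::uint32_t BindFlags;
    std::uint32_t CPUAccessFlags;
    std::uint32_t MiscFlags;
};

struct BufferDesc11 {
    std::uint32_t ByteWidth;
    std::uint32_t Usage;
    std::uint32_t BindFlags;
    std::uint32_t CPUAccessFlags;
    std::uint32_t MiscFlags;
    std::uint32_t StructureByteStride;  // the only field added in D3D11
};

// The prefix trick is only valid while the shared fields keep their offsets.
static_assert(offsetof(BufferDesc11, MiscFlags) == offsetof(BufferDesc10, MiscFlags),
              "shared fields must sit at the same offsets");
static_assert(offsetof(BufferDesc11, StructureByteStride) == sizeof(BufferDesc10),
              "the new field must come right after the D3D10 fields");

// Field-by-field conversion: more verbose than the cast, but independent of
// any aliasing assumption.
inline BufferDesc11 to_desc11(const BufferDesc10& desc10) {
    BufferDesc11 desc11{};  // zero-initializes StructureByteStride
    desc11.ByteWidth = desc10.ByteWidth;
    desc11.Usage = desc10.Usage;
    desc11.BindFlags = desc10.BindFlags;
    desc11.CPUAccessFlags = desc10.CPUAccessFlags;
    desc11.MiscFlags = desc10.MiscFlags;
    assert(desc11.StructureByteStride == 0);  // nothing in D3D10 maps to it
    return desc11;
}
```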

There were also a few problems with 2 pairs of incompatible structures, D3D10_VIEWPORT/D3D11_VIEWPORT (D3D11 is using floats instead of ints!) and D3D10_BLEND_DESC/D3D11_BLEND_DESC... but the proxy methods were easy to implement:

virtual void STDMETHODCALLTYPE RSSetViewports( 
 /* [annotation] */ 
 __in_range(0, D3D10_VIEWPORT_AND_SCISSORRECT_OBJECT_COUNT_PER_PIPELINE)  UINT NumViewports,
 /* [annotation] */ 
 __in_ecount_opt(NumViewports)  const D3D10_VIEWPORT *pViewports) {

  // Perform conversion between D3D10_VIEWPORT and D3D11_VIEWPORT
  D3D11_VIEWPORT viewports[16];
  for(int i = 0; i < NumViewports; i++) {
   viewports[i].TopLeftX = pViewports[i].TopLeftX;
   viewports[i].TopLeftY = pViewports[i].TopLeftY;
   viewports[i].Width = pViewports[i].Width;
   viewports[i].Height = pViewports[i].Height;
   viewports[i].MinDepth = pViewports[i].MinDepth;
   viewports[i].MaxDepth = pViewports[i].MaxDepth;
  }
  context->RSSetViewports(NumViewports, (D3D11_VIEWPORT*)viewports);
}

Even though I haven't performed any timing measurements, the cost of those proxy methods should be almost unnoticeable... and probably much more lightweight than using mutex synchronization between D3D10 and D3D11 devices!

Plug-in the proxies

In the end, I have managed to put those proxies in a single .h/.cpp with an easy API to plug in the proxy. The call sequence before passing the DXGISurface to Direct2D should then look like this:

d3d11Device->CreateTexture2D(&offlineTextureDesc, 0, &texture2D);

// Create a Proxy DXGISurface from Texture2D compatible with Direct2D
IDXGISurface* surface = Code4kCreateD3D10CompatibleSurface(d3d11Device, d3d11DeviceContext, texture2D);

d2dFactory->CreateDxgiSurfaceRenderTarget(
    surface,
    &props,
    &d2dRenderTarget
    );

And that's all! You will find a project with the sources attached. Feel free to test it and let me know if you encounter any issues with it. Also, the code is far from being 100% safe/robust... it's a quick hack. For example, I have not carefully checked that my proxies behave well with AddRef/Release... but that should be fine.

So far, it seems to work well with the whole Direct2D API... I have even been able to use DirectWrite with Direct2D... using Direct3D 11, without any problem. There is only one issue: PIX won't be able to debug Direct2D over Direct3D 11... because it seems that Direct2D is performing some additional method calls (D3D10CreateStateBlocks) that are incompatible with the lightweight proxies I have developed... To support this fully, it would be necessary to implement proxies for all the interfaces returned by ID3D10Device1... But this is a sooo laborious task that by the time it is done, we can expect to have Direct2D fully working with Direct3D 11, provided by the DirectX Team itself!

Also, from this little experience, I can safely say that it shouldn't take more than one day for one guy from the Direct2D team to patch the existing Direct2D code in order to use Direct3D 11... as it is much easier to do this in the original code than going down the proxy road as I did! ;)

You can grab the VC++ 2010 project from here : D2D1ToD3D11.7z

This sample only saves a "test.png" image using the Direct2D API over Direct3D 11.

Implementing an unmanaged C++ interface callback in C#/.Net

2010-10-26T02:28:00.004+11:00

Ever wanted to implement a C++ interface callback in a managed C# application? Well, although it's not so hard, this is a solution that you will hardly find on the Internet... the most common answer you will get is that it's not possible, or that you should use C++/CLI to achieve it... In fact, in C#, you can only turn a delegate into a C function pointer through Marshal.GetFunctionPointerForDelegate, but you won't find anything like Marshal.GetInterfacePointerFromInterface. You may wonder why I need such a thing?

In my previous post about implementing a new fully managed DirectX API, I forgot to mention the case of interface callbacks. There are not so many cases in the Direct3D 11 API where you need to implement a callback. You will more likely find use-cases in audio APIs like XAudio2, but in Direct3D 11, afaik, you will only find 3 interfaces that are used for callbacks:

ID3DInclude which is used by D3DCompiler API in order to provide a callback for includes while using preprocessor or compiler API (see for example D3DCompile).
ID3DX11DataLoader and ID3DX11DataProcessor, which are used by some D3DX functions in order to perform asynchronous loading/processing of texture resources. The nice thing about C# is that those interfaces are useless, as it is much easier to implement this kind of asynchronous loading/processing directly in C#.

So I'm going to take the example of ID3DInclude, and how it has been successfully implemented for SharpDX.

Memory layout of a C++ object implementing pure virtual methods

If you know how a C++ interface with pure virtual methods is laid out in memory, it's fairly easy to imagine how to hack C# to provide such a thing, but if you don't, here is a quick summary:

For example, the ID3DInclude C++ interface is declared like this:

// Interface declaration
DECLARE_INTERFACE(ID3DInclude)
{
    STDMETHOD(Open)(THIS_ D3D_INCLUDE_TYPE IncludeType, LPCSTR pFileName, LPCVOID pParentData, LPCVOID *ppData, UINT *pBytes) PURE;
    STDMETHOD(Close)(THIS_ LPCVOID pData) PURE;
};

DECLARE_INTERFACE is a Windows macro that is defined in ObjBase.h and will expand the previous declaration in C++ like this:

struct ID3DInclude {
 virtual HRESULT __stdcall Open(D3D_INCLUDE_TYPE IncludeType, LPCSTR pFileName, LPCVOID pParentData, LPCVOID *ppData, UINT *pBytes) = 0;

 virtual HRESULT __stdcall Close(LPCVOID pData) = 0;
};

Implementing and using this interface in C++ is straightforward:

struct MyIncludeCallback : public ID3DInclude {
 virtual HRESULT __stdcall Open(D3D_INCLUDE_TYPE IncludeType, LPCSTR pFileName, LPCVOID pParentData, LPCVOID *ppData, UINT *pBytes) {
     /// code for Open callback
     return S_OK;
 }

 virtual HRESULT __stdcall Close(LPCVOID pData) {
     /// code for Close callback
     return S_OK;
 }
}; 

// Usage
ID3DInclude* include = new MyIncludeCallback();

// Compile a shader and use our Include provider
D3DCompile(..., include, ...);

The hack here is to clearly understand how an instance of ID3DInclude is laid out in memory through the Virtual Method Table (VTBL)... Oh, it's really funny to see that the Wikipedia article doesn't use any visual table to represent a virtual table... ok, let's remedy that. If you look at the memory address of an instantiated object, you will find an indirect pointer:

Fig 1. Virtual Method Table layout in memory

So from the pointer to a C++ object implementing pure virtual methods, you will find that the first value is a pointer to a VTBL which is shared among the same type of object (here MyIncludeCallback).

Then in the VTBL, the first value is a pointer to the Open() method implementation in memory. The second to the Close() method.

Given the calling convention, what would the declaration of this Open() function look like if we had to implement it in pure C?

HRESULT __stdcall MyOpenCallbackFunction(void* thisObject, D3D_INCLUDE_TYPE IncludeType, LPCSTR pFileName, LPCVOID pParentData, LPCVOID *ppData, UINT *pBytes) {
     /// code for Open callback
     return S_OK;
}

Simply add a "this object" as the 1st parameter of the callback function (which represents a pointer to the MyIncludeCallback instance in memory) and you have a callback at the function level!

You should now understand how we can easily hack this to provide a C++ interface callback in C#.
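
Before moving to C#, here is a small sketch of the same idea written by hand in C++: an "interface" built from a plain struct holding a pointer to a table of function pointers. Everything in it is hypothetical and simplified (IncludeObject, IncludeVtbl, MyOpen, MyClose, read_include_size; the parameter lists are shortened and the calling convention is left at the compiler default); it is only meant to show what the native caller actually does with the object pointer.

```cpp
#include <cassert>
#include <cstring>

struct IncludeObject;

// The VTBL: one function pointer per method, each taking "this" explicitly.
struct IncludeVtbl {
    int (*Open)(IncludeObject* self, const char* fileName,
                const void** data, unsigned* bytes);
    int (*Close)(IncludeObject* self, const void* data);
};

// The object: its first (and here only) member is the pointer to the VTBL.
struct IncludeObject {
    const IncludeVtbl* vtbl;
};

static const char kFakeContent[] = "float4 color;";

// Callbacks implemented as free functions with an explicit "this" parameter.
static int MyOpen(IncludeObject*, const char* fileName,
                  const void** data, unsigned* bytes) {
    assert(fileName != nullptr);
    *data = kFakeContent;
    *bytes = static_cast<unsigned>(std::strlen(kFakeContent));
    return 0;  // plays the role of S_OK
}

static int MyClose(IncludeObject*, const void*) {
    return 0;
}

static const IncludeVtbl kMyVtbl = { &MyOpen, &MyClose };

// What a native consumer does: object -> VTBL -> slot -> call, passing the
// object itself as the first argument.
inline unsigned read_include_size(IncludeObject* include, const char* fileName) {
    const void* data = nullptr;
    unsigned bytes = 0;
    if (include->vtbl->Open(include, fileName, &data, &bytes) != 0) {
        return 0;
    }
    include->vtbl->Close(include, data);
    return bytes;
}
```

An object is then nothing more than `IncludeObject include = { &kMyVtbl };`, and the caller cannot tell the difference with a compiler-generated one as long as the slot order and the calling convention match.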

Translation to the C#/.Net world

The solution is fairly simple. In order to pass a C++ interface callback implemented in C# to an unmanaged function, we need to replicate how the unmanaged world is going to call our functions and the interface layout it expects to find in memory.

First, we need to define the ID3DInclude interface in pure C#:

public partial interface Include
{
    /// <summary> 
    /// A user-implemented method for opening and reading the contents of a shader #include file. 
    /// </summary> 
    /// <param name="type">A <see cref="SlimDX2.D3DCompiler.IncludeType"/>-typed value that indicates the location of the #include file. </param>
    /// <param name="fileName">Name of the #include file.</param>
    /// <param name="parentStream">Pointer to the container that includes the #include file.</param>
    /// <param name="stream">Stream that is associated with fileName to be read. This reference remains valid until <see cref="SlimDX2.D3DCompiler.Include.Close"/> is called.</param>
    /// <unmanaged>HRESULT Open([None] D3D_INCLUDE_TYPE IncludeType,[None] const char* pFileName,[None] LPCVOID pParentData,[None] LPCVOID* ppData,[None] UINT* pBytes)</unmanaged>
    //SlimDX2.Result Open(SlimDX2.D3DCompiler.IncludeType includeType, string fileNameRef, IntPtr pParentData, IntPtr dataRef, IntPtr bytesRef);
    void Open(IncludeType type, string fileName, Stream parentStream, out Stream stream);

    /// <summary> 
    /// A user-implemented method for closing a shader #include file. 
    /// </summary> 
    /// <remarks> 
    /// If <see cref="SlimDX2.D3DCompiler.Include.Open"/> was successful, Close is guaranteed to be called before the API using the <see cref="SlimDX2.D3DCompiler.Include"/> interface returns. 
    /// </remarks> 
    /// <param name="stream">This is a reference that was returned by the corresponding <see cref="SlimDX2.D3DCompiler.Include.Open"/> call.</param>
    /// <unmanaged>HRESULT Close([None] LPCVOID pData)</unmanaged>
    void Close(Stream stream);
}

Clearly, this is not exactly what we have in C++... but this is how we would use it from C#... through a Stream. An implementation of this interface would provide a Stream for a particular file to include (most of the time, that could be as simple as stream = new FileStream(fileName, FileMode.Open)).

This interface is public in the C#/.Net API... but internally we are going to use a wrapper of this interface that manually creates the object layout in memory as well as the VTBL. This is done in this simple constructor:

/// <summary>
/// Internal Include Callback
/// </summary>
internal class IncludeCallback
{
    public IntPtr NativePointer;
    private Include _callback;
    // The delegates are kept in fields so that the GC does not collect them
    // while unmanaged code still holds their function pointers
    private OpenCallBack _openCallBack;
    private CloseCallBack _closeCallBack;

    public IncludeCallback(Include callback)
    {
        _callback = callback;
        // Allocate object layout in memory 
        // - 1 pointer to VTBL table
        // - following that the VTBL itself - with 2 function pointers for Open and Close methods
        NativePointer = Marshal.AllocHGlobal(IntPtr.Size * 3);

        // Write pointer to vtbl
        IntPtr vtblPtr = IntPtr.Add(NativePointer, IntPtr.Size);
        Marshal.WriteIntPtr(NativePointer, vtblPtr);
        _openCallBack = new OpenCallBack(Open);
        Marshal.WriteIntPtr(vtblPtr, Marshal.GetFunctionPointerForDelegate(_openCallBack));
        _closeCallBack = new CloseCallBack(Close);
        Marshal.WriteIntPtr(IntPtr.Add(vtblPtr, IntPtr.Size), Marshal.GetFunctionPointerForDelegate(_closeCallBack));
    }

You can clearly see from the previous code that we are allocating a block of unmanaged memory that will hold the object's VTBL pointer and the VTBL itself... Because we don't need to make 2 allocations (one for the object's vtbl_ptr/data, one for the vtbl), we are laying out the VTBL just after the object itself, like this:
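
The same single-block layout can be sketched in C++ with three pointer-sized slots. This is only an illustration of the memory layout, not SharpDX code: the Callback signature is reduced to the "this" pointer, and OpenStub, CloseStub, build_object and call_slot are hypothetical names. It also assumes, as on mainstream platforms, that a function pointer fits in a std::uintptr_t.

```cpp
#include <cassert>
#include <cstdint>

// Simplified callback signature: only the "this" pointer is passed.
using Callback = int (*)(void* self);

static int OpenStub(void*)  { return 1; }
static int CloseStub(void*) { return 2; }

// Fills a block of three pointer-sized slots:
//   slot 0: pointer to the VTBL (which starts at slot 1)
//   slot 1: VTBL entry 0 -> Open
//   slot 2: VTBL entry 1 -> Close
// The returned pointer is what would be handed to native code.
inline void* build_object(std::uintptr_t (&block)[3]) {
    static_assert(sizeof(std::uintptr_t) == sizeof(Callback),
                  "function pointers must fit in a pointer-sized slot");
    block[0] = reinterpret_cast<std::uintptr_t>(&block[1]);
    block[1] = reinterpret_cast<std::uintptr_t>(&OpenStub);
    block[2] = reinterpret_cast<std::uintptr_t>(&CloseStub);
    return block;
}

// What a native caller does: read the VTBL pointer stored at the start of the
// object, pick an entry, and call it with the object as first argument.
inline int call_slot(void* object, int index) {
    assert(index == 0 || index == 1);
    const std::uintptr_t* slots = static_cast<const std::uintptr_t*>(object);
    const std::uintptr_t* vtbl = reinterpret_cast<const std::uintptr_t*>(slots[0]);
    Callback callback = reinterpret_cast<Callback>(vtbl[index]);
    return callback(object);
}
```

Slot 0 plays the role of the first Marshal.WriteIntPtr in the constructor, slots 1 and 2 the role of the two function pointers obtained from the delegates.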

The declarations of the C# delegates then follow directly from the C++ declaration:

[UnmanagedFunctionPointer(CallingConvention.StdCall)]
private delegate SlimDX2.Result OpenCallBack(IntPtr thisPtr, SlimDX2.D3DCompiler.IncludeType includeType, IntPtr fileNameRef, IntPtr pParentData, ref IntPtr dataRef, ref int bytesRef);

[UnmanagedFunctionPointer(CallingConvention.StdCall)]
private delegate SlimDX2.Result CloseCallBack(IntPtr thisPtr, IntPtr pData);

You just have to implement the Open and Close method in the wrapper and redirect the calls to the managed Include callback, et voila!

Then, when calling an unmanaged function that requires this callback, you just have to wrap an Include instance with the callback like this:

Include myIncludeInstance = ... new ...;

IncludeCallback callback = new IncludeCallback(myIncludeInstance);

// callback.NativePointer is a pointer to the object/vtbl allocated structure
D3D.Compile(..., callback.NativePointer, ...);

Of course, the IncludeCallback is not visible from the public API but is used internally. From a public interface POV, here is how you would use it:

using System;
using System.IO;
using SlimDX2.D3DCompiler;

namespace TestCallback
{
    class Program
    {
        class MyIncludeCallBack : Include
        {
            public void Open(IncludeType type, string fileName, Stream parentStream, out Stream stream)
            {
                stream = new FileStream(fileName, FileMode.Open);
            }

            public void Close(Stream stream)
            {
                stream.Close();
            }
        }

        static void Main(string[] args)
        {
            var include = new MyIncludeCallBack();
            string value = ShaderBytecode.PreprocessFromFile("test.fx", null, include); 
            Console.WriteLine(value);
        }
    }
}

You can have a look at the complete source code here.

High performance memcpy gotchas in C#

2010-10-23T21:47:00.008+11:00

(Edit 8 Jan 2011: Update protocol test with Buffer.BlockCopy)
(Edit 11 Oct 2012: Please vote for the x86 cpblk deficiency on Microsoft Connect)
Following my last post about an interesting use of the "cpblk" IL instruction as an unmanaged memcpy replacement, I have to admit that I didn't take the time to carefully verify that performance is actually better. Well, I was probably too optimistic... so I have made some tests and the results are very surprising, not at all what I expected...

The memcpy protocol test in C#

When dealing with 3D calculations, large texture buffers, audio synthesis or anything else that requires a memcpy and interaction with the unmanaged world, you will most likely end up calling an unmanaged function like this one:

[DllImport("msvcrt.dll", EntryPoint = "memcpy", CallingConvention = CallingConvention.Cdecl, SetLastError = false), SuppressUnmanagedCodeSecurity]
public static unsafe extern void* CopyMemory(void* dest, void* src, ulong count);

In this test, I'm going to compare this implementation with 5 challengers:

The cpblk IL instruction
A handmade memcpy function
Array.Copy, although it's not really relevant because it doesn't have the same scope: Array.Copy only works on managed arrays, while memcpy is used to copy portions of data between managed-unmanaged as well as unmanaged-unmanaged memory.
Marshal.Copy, same remark as for Array.Copy
Buffer.BlockCopy, which works on managed arrays but copies blocks whose size is expressed in bytes.

The test performs a series of memcpy with different block sizes: from 4 bytes to 2 MB. The interesting part is to run this test in both x86 and x64 mode. Both tests are running on the same Windows 7 x64 OS, on the same machine, an Intel Core i5 750 (2.66 GHz). The CLR used for this is the Runtime v4.0.30319.

The naive handmade memcpy is nothing more than this code (not meant to be the best implementation ever, but at least safe for any buffer size):

static unsafe void CustomCopy(void * dest, void* src, int count)
{
    int block;

    block = count >> 3;

    long* pDest = (long*)dest;
    long* pSrc = (long*)src;

    for (int i = 0; i < block; i++)
    {
        *pDest = *pSrc; pDest++; pSrc++;
    }
    dest = pDest;
    src = pSrc;
    count = count - (block << 3);

    if (count > 0)
    {
        byte* pDestB = (byte*) dest;
        byte* pSrcB = (byte*) src;
        for (int i = 0; i < count; i++)
        {
            *pDestB = *pSrcB; pDestB++; pSrcB++;
        }
    }
}
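
For readers who want to reproduce this kind of measurement outside of .NET, the protocol (copy a block of a given size many times, divide the bytes copied by the elapsed time) could be sketched in C++ as follows. This is not the harness used for the numbers in this post: measure_memcpy_mb_per_s is a hypothetical helper, and 1 MB is assumed to be 2^20 bytes.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstring>
#include <vector>

// Copies `blockSize` bytes `iterations` times with std::memcpy and returns the
// measured throughput in MB/s. Returns 0 when the clock is too coarse to time
// the workload.
inline double measure_memcpy_mb_per_s(std::size_t blockSize, std::size_t iterations) {
    assert(blockSize > 0 && iterations > 0);
    std::vector<unsigned char> src(blockSize, 0x5A);
    std::vector<unsigned char> dst(blockSize, 0x00);

    const auto start = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < iterations; ++i) {
        std::memcpy(dst.data(), src.data(), blockSize);
    }
    const auto stop = std::chrono::steady_clock::now();

    // Reading the destination back checks the copy and keeps it observable.
    assert(std::memcmp(dst.data(), src.data(), blockSize) == 0);

    const double seconds = std::chrono::duration<double>(stop - start).count();
    if (seconds <= 0.0) {
        return 0.0;
    }
    const double megabytes = static_cast<double>(blockSize) *
                             static_cast<double>(iterations) / (1024.0 * 1024.0);
    return megabytes / seconds;
}
```

Calling it in a loop with block sizes doubling from 4 bytes to 2 MB gives one throughput value per row, in the same shape as the tables below. An optimizing compiler may remove some of the repeated copies, so the numbers should be read with care.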

Results

For the x86 architecture, results are expressed as a throughput in MB/s (higher is better); the block size is in bytes:

BlockSize	x86-cpblk	x86-memcpy	x86-CustomCopy	x86-Array.Copy	x86-Marshal.Copy	x86-BlockCopy
4	146	458	470	85	81	150
8	294	843	1122	168	167	298
16	587	1628	1904	306	327	577
32	950	1876	3184	631	558	1079
64	1451	3316	4295	1205	1059	1981
128	2245	5161	4848	2176	1933	3386
256	4353	7032	5333	3699	3386	5333
512	8205	13617	5517	5663	6666	7441
1024	13617	20000	6666	7710	12075	9275
2048	18823	24615	7191	9142	16842	9552
4096	2922	7529	5663	10491	7032	11034
8192	2990	7804	5714	11228	7441	11636
16384	2857	7901	5614	9142	7619	10322
32768	2379	6736	5333	8101	6666	8205
65536	2379	6808	5470	8205	6808	8205
131072	2509	17777	5818	8101	17777	8101
262144	2500	11636	5423	7032	11428	7111
524288	2539	11428	5423	7111	11428	7111
1048576	2539	11428	5470	7032	11428	7111
2097152	2529	11428	5333	7032	11034	6881

For the x64 architecture:

BlockSize	x64-cpblk	x64-memcpy	x64-CustomCopy	x64-Array.Copy	x64-Marshal.Copy	x64-BlockCopy
4	583	346	599	99	111	219
8	1509	770	1876	212	224	469
16	2689	1451	3316	417	422	903
32	4705	2666	5000	802	864	1739
64	8205	4812	7272	1568	1748	3350
128	13333	8101	9014	3004	3184	6037
256	18823	11428	10000	5470	5245	8648
512	22068	16000	10491	9014	9552	13913
1024	22857	19393	7356	13333	13617	16842
2048	23703	21333	7710	17297	17777	20645
4096	23703	22068	7804	19393	20000	21333
8192	23703	22857	7619	22068	22068	22857
16384	23703	22857	7804	17297	21333	18285
32768	16410	16410	7710	12800	16000	12800
65536	13061	14883	7710	13061	14545	13061
131072	14222	13913	7710	12800	13617	12800
262144	5000	5039	7032	7901	5000	7804
524288	5079	5000	7356	8205	5079	7804
1048576	4885	4885	7272	7441	4671	7529
2097152	5039	5079	7272	7619	5000	7710

Graph comparison only for cpblk, memcpy and CustomCopy:

Don't be afraid of the performance drop for most of the implementations... It's mostly due to cache misses and to copying across different 4k pages.

Conclusion

Don't trust your .NET VM, check your code on both x86 and x64. It's interesting to see how differently the same task is implemented inside the CLR (see Marshal.Copy vs Array.Copy vs Buffer.BlockCopy).

The most surprising result here is the poor performance of the cpblk IL instruction in x86 mode compared to the best one in x64, which is... cpblk. So to summarize:

On x86, you had better use a memcpy function
On x64, you had better use cpblk, which performs better at small sizes (up to twice as fast as memcpy) and roughly on par with it at large sizes.

You may wonder why the x86 version is so unoptimized? This is because the x86 CLR generates an x86 instruction that performs the copy on a PER BYTE basis (rep movsb for x86 folks), even if you are moving a large memory chunk of 1 MB! In comparison, memcpy as implemented in MSVCRT is able to use SSE instructions that batch the copy through large 128-bit registers (with an optimized path to avoid polluting the CPU cache). The x64 CLR seems to use such a correctly implemented memcpy, but the x86 CLR one is just poorly implemented. Please vote for this bug described on Microsoft Connect.

One important consequence of this shows up when you are developing in C++/CLI and calling memcpy from a managed function... It will end up as a cpblk copy... which is almost the worst case on x86 platforms... so be careful if you are dealing with this kind of issue. To avoid this, you have to force the compiler to use the function from MSVCRTxx.dll.

Of course, the memcpy is platform dependent, which would not be an option for all...

Also, I didn't perform this test on a CLR 2 runtime... we could be surprised as well... I should also compare against a pure C++ memcpy using the optimized SSE2 version that is shipped with later msvcrt.

You can download the VS2010 project from here

A new managed .NET/C# Direct3D 11 API generated from DirectX SDK headers

2010-10-19T21:36:00.203+11:00

I have been quite busy since the end of August, personally because I'm proud to announce the birth of my daughter! (and her older brother is, somewhat, asking for a lot more attention since ;) ) and also because I have been working hard on an exciting new project based on .NET and Direct3D.

What is it? Yet Another Triangle App? Nope, this is in fact an entirely new .NET API for Direct3D11, DXGI, D3DCompiler that is fully managed, without using any mixed C++/CLI assemblies, but with performance similar to a true C++/CLI API (like SlimDX). But the main characteristic and most exciting thing about this new wrapper is that the whole marshal/interop code is fully generated from the DirectX SDK headers, including the MSDN documentation.

The current key features and benefits of this approach are:

API is generated from DirectX SDK headers : the mapping is able to perform "complex transformation", extracting all relevant information like enumerations, structures, interfaces, functions, macro definitions, guids from the C++ source headers. For example, the mapping process is able to generate properties for interfaces or inner group interfaces like the ones you have in SlimDX : meaning that instead of having a "device.IASetInputLayout" you are able to write "device.InputAssembler.InputLayout = ...".
Full support of Direct3D 11, DXGI 1.0/1.1, D3DCompiler API : thanks to the fully auto-generated process, the actual coverage is 100%. For now, I have limited the generated code to those libraries, but that could be extended to other APIs quite easily (like XAudio2, Direct2D, DirectWrite... etc.).
Pure managed .NET API : assemblies are compiled with AnyCpu target. You can run your code on a x64 or a x86 machine with the same assemblies.
API Extensibility : the generated code is in C#, all the types are marked "partial" and are easily extensible with new helper methods. The code generator is able to hide some methods/types internally in order to use them in helper methods and to hide them from the public API.
C++/CLI Speed : the framework is using an original way to avoid any C++/CLI while still achieving comparable performance.
Separate assemblies : a core assembly containing common classes and an assembly for each subgroup API (Direct3D, DXGI, D3DCompiler)
Lightweight assemblies : generated assemblies are lightweight, 300 KB in total, 70 KB compressed in an archive (similar assemblies in C++/CLI would be closer to 1 MB, one for each architecture, and depend on MSVCRT10)
API naming convention very close to the SlimDX API (making it 100% identical would just require specifying the correct mapping names while generating the code)
Raw DirectX object life management : No overhead of ObjectTable or RCW mechanism, the API is using direct native management with classic COM method "Release". Currently, instead of calling Dispose, you should call Release (and call AddRef if you are duplicating references, like in C++). I might evaluate how to safely integrate Dispose method call.
Easily obfuscatable : due to the fact that the framework is not using any mixed assemblies.
DirectX SDK Documentation integrated in the .NET xml comments : The whole API is also generated with the MSDN documentation. Meaning that you have exactly the same documentation for DirectX and for this API (this is working even for method parameters, remarks, enum items...etc.). Reference to other types inside the documentation are correctly linked to the .NET API.
Prototype for a partial support of the Effects11 API in full managed .NET.
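
For readers less familiar with the raw COM lifetime rules mentioned in the list above (call AddRef when duplicating a reference, Release when done with it), here is a minimal C++ sketch of the mechanism. RefCounted is a hypothetical class, not a real COM interface, and unlike real COM objects its counter is not thread-safe.

```cpp
#include <cassert>

// Hypothetical minimal COM-like object; real COM objects implement IUnknown.
class RefCounted {
public:
    // The flag lets the caller observe when the object destroys itself.
    explicit RefCounted(bool* destroyedFlag) : destroyed_(destroyedFlag) {}

    unsigned long AddRef() { return ++count_; }

    unsigned long Release() {
        assert(count_ > 0);
        const unsigned long remaining = --count_;
        if (remaining == 0) {
            *destroyed_ = true;
            delete this;  // the object frees itself with the last reference
        }
        return remaining;
    }

private:
    ~RefCounted() = default;   // only Release may destroy the object
    unsigned long count_ = 1;  // the creator owns the first reference
    bool* destroyed_;
};
```

Each holder of the pointer calls Release exactly once; the object disappears when the last holder does so, without any Dispose-like mechanism involved.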

If you have been working with SlimDX, some of the features here could sound familiar and you may wonder why another DirectX .NET API when there is a great project like SlimDX? Before going further into the details of this wrapper and how things are working in the background, I'm going to explain why this wrapper could be interesting.

I'm also currently not in a position to release it, for the reason that I don't want to compete with SlimDX. I want to see if the SlimDX Team would be interested in working together on this system, a kind of joint-venture. There are still lots of things to do, improving the mapping, making it more reliable (the whole code here has been written in a rush over the last month...) but I strongly believe that this could be a good starting point for SlimDX 2, but I might be wrong... also, SlimDX could think about another road map... So this is a message to the SlimDX Team : Promit, Josh, Mike, I would be glad to hear some comments from you about this wrapper (and if you want, I could send you the generated API so that you could look at it and test it!)

[Updated 30 November 2010]
This wrapper is now available from SharpDX. Check this post.
[/Updated]

This post is going to be quite long, so if you are not interested in all the internals, you could jump to the sample code at the end.

An attempt at a next-gen SlimDX

First of all, is it related to 4k or 64k intros? (a usual question here, mostly a question for myself :D) Well, while I'm still working to make things smaller, even in .NET, I would like to work on a demo based on .NET (but with lots of procedurally generated textures and music). I have been evaluating both XNA and SlimDX, and in September, I even worked on an XNA-like API over SlimDX / Direct3D 11 that was working great, simplifying the code a lot, while still benefiting from the new D3D11 API (Geometry shaders, Compute Shaders...etc.). I will talk later about this "Demo" layer API.

As a demo maker for tiny executables, even in .NET, I found that working with SlimDX was not the best option : even after stripping the code and recompiling SlimDX to keep only DirectX11/DXGI&co, I had a roughly 1 MB dll (one for each architecture) + a dependency on MSVCRT10, which is a bit annoying. Even if I would like to work on a demo (with less of a size constraint), I didn't want to have a 100 KB exe and 1 MB of compressed external dlls...

Also, I read some of Josh's thoughts about SlimDX 2 : I was convinced about the need for separated assemblies and simplified object life management. But I was not convinced by the need to use "interfaces" for the new API, and not really happy about still having some platform specific mixed-assemblies in order to correctly support 32/64 bit architectures (with a simple delay loading).

What is SlimDX 2 supposed to address over SlimDX?

Making object life management closer to the real thing (no Dispose but raw Release instead)
Multiple assemblies
Working on the API more with C# than in C++/CLI
Support automatic platform architecture switching (running transparently an executable on a x86 and x64 machine without recompiling anything).

Recall that around August I was working a bit on parsing the SDK headers, based on Boost::Wave V2.0. My concern was that I had developed a SlimDX-like interface in C++ for the Ergon demo, but I found the process to be very laborious, although very straightforward, while staying in the same language as DirectX... Thinking more about it, and because I wanted to do more work in 3D and C# (damn it, this language is SOOO cool and powerful compared to C++)... I found that it would be a great opportunity to see whether it's possible to extract enough information from the SDK headers in order to generate a Direct3D 11 .NET/C# API.

And everything went surprisingly fast : extracting all the code information from the SDK C++ header files was in fact quite easy to code, in a few days... and generating the code was quite easy too (I have to admit that I have strong experience in this kind of process, and did similar work, around ten years ago, in Java, delivering an innovative Java/COM bridge layer for the company I was working for at that time, much safer than the Sun Java/COM layer, which was buggy, and much more powerful, supporting early binding, inheritance, documentation... etc).

In fact, with this generation process, I have been able to address almost all the issues that were expected to be solved in SlimDX 2, and moreover, it goes a bit further because the process is automated and it supports the x86/x64 platforms without requiring any mixed assemblies.

In the following sections, I'm going to explain in depth the architecture, features, internals and mapping rules used to generate this new .Net wrapper (which currently has the "SharpDX" code name).

Overview

In order to generate Managed .NET API for DirectX from the SDK headers, the process is composed of 3 main steps:

Convert from the DirectX SDK C++ Headers to an intermediate format called "XIDL" which is a mix of XML and "IDL". This first part is responsible for reverse engineering the headers, extracting all existing and useful information (more on this in the following section), and producing a kind of IDL (Interface Definition Language). In fact, if I had access to the IDL used internally at Microsoft, it wouldn't have been necessary to write this whole part, but sadly, the DirectX 11 IDL is not available, although you can clearly verify from D3D11.h that this file is generated from an IDL. This module is also responsible for accessing the MSDN website, crawling the needed documentation, and associating it with all the language elements (structures, structure fields, enums, enum items, interfaces, interface methods, method parameters...etc.). Once a piece of documentation has been retrieved, it's stored on the disk and is not retrieved again the next time the conversion process is run.
Convert from the XIDL file to several C# files. This part is responsible for translating C++ definitions to C# definitions from a set of mapping rules. The mapping is as complex as identifying which include would map to which assembly/namespace, which type could be moved to an assembly/namespace, how to rename the types, functions, fields, parameters, how to add missing information from the XIDL file...etc. The current mapping rules are expressed in less than 600 lines of C# code... There is also a trick here not described in the picture. This process is also generating a small interop assembly which is only used at compile time, dynamically generated at runtime and responsible for filling the gap between what is possible in C# and what you can do in C++/CLI (there are lots of small useful IL bytecode instructions generated in C++/CLI that are not accessible from C#, this assembly is here for that....more on this in the Convert to XIDL section).
Integrate the generated files in several Visual Studio projects and a global solution. Each project is generating an assembly. It is where you can add custom code that could not be generated (like Vector3 math functions, or general framework objects like a ComObject). The generated code is also fully marked with "partial" class, one of the cool things of C# : you can have multiple files contributing to the same class declaration... making it easy to have generated code alongside custom handmade code.

Revert DirectX IDL from headers

Unfortunately, I have not found a workable C preprocessor written in .NET, and this part has been a bit laborious to get working. The good thing is that I have found Boost Wave 2.0 in C++. The bad thing is that this library, written in a heavy boost-STL-templatizer philosophy, was really hard to get working inside a C++/CLI DLL. Well, the principle was to embed Boost Wave in a managed DLL, in order to use it from C#... after several attempts, I was not able to build it with C++/CLI .NET 4.0. So I ended up with a small COM wrapper dll around BoostWave, and a thin wrapper in .NET calling this dll. Compiling Boost-Wave was also sometimes a nightmare : I tried to implement my own stream provider for Wave... but ran into a linker error that was freezing VS2010 for 5s just to display the error (several KB of a single cascaded template error)... I found somewhere in the Wave release that it was in fact not supported... but wow, templates are supposed to make life easier... but the way they are used here gives a really bad feeling... (and I'm not a beginner in C++ templates...)

Anyway, after succeeding to wrap BoostWave API, I had a bunch of tokens to process. I started to wrote a handwritten C/C++ parser, which is targeted to read well-formed DirectX headers and nothing else. It was quite tricky sometimes, the code is far from being failsafe, but I succeed to parse correctly most of the DirectX headers. During the mapping to C#, I was able to find a couple of errors in the parser that were easy to fix.

In the end, this parser is able to extract from the headers:

Enumerations, Structures, Interfaces, Functions, Typedefs
Macro definitions
GUIDs
Include dependencies

All the data is stored in a C# model that is serialized to XML using the WCF data contract attributes (DataContract, DataMember), which makes the code really easy to write and not very intrusive, and lets you serialize to and deserialize from XML. For example, a CppType is defined like this:

using System.Runtime.Serialization;
using System.Text;

namespace SharpDX.Tools.XIDL
{
    [DataContract]
    public class CppType : CppElement
    {
        [DataMember(Order = 0)]
        public string Type { get; set; }
        [DataMember(Order = 1)]
        public string Specifier { get; set; }
        [DataMember(Order = 2)]
        public bool Const { get; set; }
        [DataMember(Order = 3)]
        public bool IsArray { get; set; }
        [DataMember(Order = 4)]
        public string ArrayDimension { get; set; }
    }
}

The model is really lightweight, no fancy methods and easy to navigate in.
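
To give an idea of what this data contract approach looks like in practice, here is a minimal round-trip sketch. It is not the actual SharpDX tooling code: SampleCppType and XidlSerializationSketch are hypothetical names, and only two of the fields above are kept. It relies on the standard DataContractSerializer class.

```csharp
using System.IO;
using System.Runtime.Serialization;
using System.Text;
using System.Xml;

// Hypothetical stand-in for the real model class; only two fields are kept.
[DataContract(Namespace = "")]
public class SampleCppType
{
    [DataMember(Order = 0)]
    public string Type { get; set; }

    [DataMember(Order = 1)]
    public string Specifier { get; set; }
}

public static class XidlSerializationSketch
{
    // Serialize the model to an indented XML string.
    public static string ToXml(SampleCppType value)
    {
        var serializer = new DataContractSerializer(typeof(SampleCppType));
        var settings = new XmlWriterSettings { Indent = true, OmitXmlDeclaration = true };
        var builder = new StringBuilder();
        using (var writer = XmlWriter.Create(builder, settings))
        {
            serializer.WriteObject(writer, value);
        }
        return builder.ToString();
    }

    // Read the model back from the XML produced by ToXml.
    public static SampleCppType FromXml(string xml)
    {
        var serializer = new DataContractSerializer(typeof(SampleCppType));
        using (var reader = XmlReader.Create(new StringReader(xml)))
        {
            return (SampleCppType)serializer.ReadObject(reader);
        }
    }
}
```

Calling ToXml on an instance and passing the result to FromXml should give back an object with the same field values, which is all the model classes need in order to be saved and reloaded.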

This process is also responsible for getting the documentation of each C++ item (enumerations, structures, interfaces, functions). The documentation is fetched from MSDN while generating all the types. That was also a bit tricky to parse, but in the end the class is very small (less than 200 lines of C# code). Downloaded documentation is stored on disk and reused when the parsing is run again.

The generated XML model takes around 1.7 MB for the DXGI, D3D11, D3DX11 and D3DCompiler includes and looks like this:

      <Interfaces>
        <CppInterface>
          <Name>ID3D11DeviceChild</Name>
          <Description>A device-child interface accesses data used by a device.</Description>
          <Remarks i:nil="true" />
          <Parent>IUnknown</Parent>
          <Methods>
            <CppMethod>
              <Name>GetDevice</Name>
              <Description>Get a pointer to the device that created this interface.</Description>
              <Remarks>Any returned interfaces will have their reference count incremented by one, so be sure to call ::release() on the returned pointer(s) before they are freed or else you will have a memory leak.</Remarks>
              <ReturnType>
                <Name i:nil="true" />
                <Description>void Returns nothing.</Description>
                <Remarks i:nil="true" />
                <Type>void</Type>
                <Specifier></Specifier>
                <Const>false</Const>
                <IsArray>false</IsArray>
                <ArrayDimension i:nil="true" />
              </ReturnType>
              <CallingConvention>StdCall</CallingConvention>
              <Offset>3</Offset>
              <Parameters>
                <CppParameter>
                  <Name>ppDevice</Name>
                  <Description>Address of a pointer to a device (see {{ID3D11Device}}).</Description>
                  <Remarks i:nil="true" />
                  <Type>ID3D11Device</Type>
                  <Specifier>**</Specifier>
                  <Const>false</Const>
                  <IsArray>false</IsArray>
                  <ArrayDimension i:nil="true" />
                  <Attribute>Out</Attribute>
                </CppParameter>
              </Parameters>
            </CppMethod>

One of the most important things in the DirectX headers, required to develop a reliable code generator, is the presence of Windows-specific C++ annotations: all method parameters are prefixed by macros such as __out, __in, __out_opt, __out_buffer, etc. These annotations are similar to C# attributes and explain how to interpret each parameter. In the previous XML, the method GetDevice returns an ID3D11Device through an [Out] parameter. The [Out] attribute is extremely important here, as it tells us exactly how to use the parameter. The same goes for a pointer that is in fact a buffer: with the annotations, you know that there is an array of elements behind the pointer.

However, I discovered that some functions/methods lack some of these attributes. Fortunately, the next process (the mapping from XIDL to C#) is able to add missing information like this.

As I said, the current implementation is far from failsafe and would probably require more testing on other header files. At least, the process works correctly on a subset of the DirectX headers.

Generate C# from IDL

This part of the process was a lot more time consuming. I started with enums, which were quite straightforward to manage. Structures required a bit more work, as some of them cannot be marshaled easily and need custom marshaling. Interface methods were the most difficult part: handling all the parameter cases correctly was not easy.

The process of generating the C# code is done in 3 steps:

Read the XIDL model and prepare it for mapping: remove types, add information to some methods.
Generate a C# model from the XIDL model and a set of mapping rules.
Generate C# files from the C# model. I used the T4 "Text Template Transformation Toolkit" engine as a text templatizer. It is part of VS2010 and is really easy to use, integrated in VS2010 with a third-party syntax highlighting plugin.

This step is also responsible for generating an interop assembly, which emits some .NET IL bytecode directly through System.Reflection.Emit. This interop assembly is the trick that avoids the use of a C++/CLI mixed assembly.

Preamble) How to avoid the usage of C++/CLI in C#

If you look at some generated C++/CLI code with Reflector, you will see that most of the code is in fact pure IL bytecode, even when there is a call to a native function or native method...

The trick here is that there are a couple of IL instructions that are used internally by C# but not exposed to the language.

1) The instruction "calli"

This instruction calls an unmanaged function directly, without going through the pinvoke/interop layer (in fact, pinvoke ends up calling "calli" too, but performs a much more complex marshaling of the parameters, structures...).

What I needed was a way to call an unmanaged function/method without going through the pinvoke layer, and "calli" is exactly here for this. Now, suppose that we could generate, at compile time and at runtime, a small assembly responsible for handling those calli functions: we would no longer have to use C++/CLI for this.

For example, suppose that I want to call a C++ method of an interface which takes an integer as a parameter, something like :

interface IDevice : IUnknown {
    void Draw(int count);
}

I only need a function in C# that is able to call this method directly, without going through the pinvoke layer, given a pointer to the C++ IDevice object and the offset of the method in the vtbl (the offset is expressed in bytes, here for an x86 architecture):

class Interop {
    public static unsafe void CalliVoid(void* thisObject, int vtblOffset, int arg0);
}

// A pointer to a native IDevice object
void* ptrToIDevice = ...;

// A Call to the method Draw, number 3 in the vtbl order (starting at 0 to 2 for IUnknown methods)
Interop.CalliVoid(ptrToIDevice, /* 3 * sizeof(void* in x86) */ 3 * 4 , /* count */4 );

The IL bytecode of this method for an x64 architecture would typically look like this in C++/CLI:

.method public hidebysig static void CalliVoid(void* arg0, int32 arg1, int32 arg2) cil managed
{
    .maxstack 4
    L_0000: ldarg.0      // Load (0) this arg (1st parameter for native method)
    L_0001: ldarg.2      // Load (1) count arg
    L_0002: ldarg.1      // Offset in vtbl
    L_0003: conv.i       // Convert to native int
    L_0004: dup          // Duplicate the offset
    L_0005: add          // Offset = offset * 2 (only for x64 architecture)
    L_0006: ldarg.0      // Load this arg again
    L_0007: ldind.i      // Load vtbl pointer
    L_0008: add          // pVtbl = pVtbl + offset
    L_0009: ldind.i      // Load function pointer from the vtbl
    L_000a: calli method unmanaged stdcall void *(void*, int32)
    L_000f: ret 
}

This kind of code will be automatically inlined by the JIT (which, according to the SSCLI/Rotor source code, inlines functions that take less than 25 bytes of bytecode).

If you look at a C++/CLI assembly, you will see lots of "calli" instructions.

So in the end, how is this trick used? Because the generator knows all the methods of all the interfaces, it is able to generate the set of all call signatures needed to call unmanaged objects. In fact, the XIDLToCSharp generator is responsible for generating an assembly containing all the interop methods (around 66 methods using Calli):

public class Interop
{
    private Interop();
    public static unsafe float CalliFloat(void* arg0, int arg1, void* arg2);
    public static unsafe int CalliInt(void* arg0, int arg1);
    public static unsafe int CalliInt(void* arg0, int arg1, int arg2);
    public static unsafe int CalliInt(void* arg0, int arg1, void* arg2);
    public static unsafe int CalliInt(void* arg0, int arg1, long arg2);
    public static unsafe int CalliInt(void* arg0, int arg1, int arg2, int arg3);
    public static unsafe int CalliInt(void* arg0, int arg1, long arg2, int arg3);
    public static unsafe int CalliInt(void* arg0, int arg1, void* arg2, int arg3);
    public static unsafe int CalliInt(void* arg0, int arg1, void* arg2, void* arg3);
    public static unsafe int CalliInt(void* arg0, int arg1, int arg2, void* arg3);
    public static unsafe int CalliInt(void* arg0, int arg1, IntPtr arg2, void* arg3);
    public static unsafe int CalliInt(void* arg0, int arg1, IntPtr arg2, int arg3);
    public static unsafe int CalliInt(void* arg0, int arg1, int arg2, void* arg3, int arg4);
    public static unsafe int CalliInt(void* arg0, int arg1, int arg2, void* arg3, void* arg4);
    public static unsafe int CalliInt(void* arg0, int arg1, void* arg2, int arg3, void* arg4);
    public static unsafe int CalliInt(void* arg0, int arg1, int arg2, int arg3, void* arg4);
    public static unsafe int CalliInt(void* arg0, int arg1, void* arg2, void* arg3, void* arg4);
    public static unsafe int CalliInt(void* arg0, int arg1, IntPtr arg2, void* arg3, void* arg4);
    public static unsafe int CalliInt(void* arg0, int arg1, void* arg2, void* arg3, int arg4);
    public static unsafe int CalliInt(void* arg0, int arg1, int arg2, int arg3, void* arg4, void* arg5);
    public static unsafe int CalliInt(void* arg0, int arg1, void* arg2, void* arg3, int arg4, int arg5);
    //
    // ...[stripping Calli x methods here]...
    //
    public static unsafe void CalliVoid(void* arg0, int arg1, int arg2, void* arg3, void* arg4, int arg5, int arg6, void* arg7);
    public static unsafe void CalliVoid(void* arg0, int arg1, void* arg2, float arg3, float arg4, float arg5, float arg6, void* arg7);
    public static unsafe void CalliVoid(void* arg0, int arg1, int arg2, void* arg3, void* arg4, int arg5, int arg6, void* arg7, void* arg8);
    public static unsafe void CalliVoid(void* arg0, int arg1, void* arg2, int arg3, int arg4, int arg5, int arg6, void* arg7, int arg8, void* arg9);
    public static unsafe void* Read<T>(void* pSrc, ref T data) where T: struct;
    public static unsafe void* Read<T>(void* pSrc, T[] data, int offset, int count) where T: struct;
    public static unsafe void* Write<T>(void* pDest, ref T data) where T: struct;
    public static unsafe void* Write<T>(void* pDest, T[] data, int offset, int count) where T: struct;
    public static void memcpy(void* pDest, void* pSrc, int Count);
}

This assembly is used at compile time but is not distributed at runtime. Instead, it is dynamically generated at runtime in order to support the difference in bytecode between x86 and x64 (in the calli example, we need to multiply the offset into the vtbl by 2, because the size of a pointer on x64 is 8 bytes).
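
The pointer arithmetic behind this can be sketched in plain C#, without any calli. The VtblSketch class below is hypothetical and only illustrates the lookup: it uses IntPtr.Size so that the same code gives 4-byte slots on x86 and 8-byte slots on x64, and Marshal.ReadIntPtr to follow the two indirections (object to vtbl, vtbl to function pointer). It does not call the function it finds.

```csharp
using System;
using System.Runtime.InteropServices;

public static class VtblSketch
{
    // Byte offset of a vtbl slot: 4 bytes per slot on x86, 8 bytes on x64.
    public static int SlotOffset(int slot)
    {
        return slot * IntPtr.Size;
    }

    // Same lookup as the IL shown earlier: load the vtbl pointer stored at
    // the start of the object, add the slot offset, then load the function
    // pointer stored at that address.
    public static IntPtr GetMethodPointer(IntPtr thisObject, int slot)
    {
        IntPtr vtbl = Marshal.ReadIntPtr(thisObject);
        return Marshal.ReadIntPtr(vtbl, SlotOffset(slot));
    }
}
```

For the IDevice example, GetMethodPointer(ptrToIDevice, 3) would return the address of Draw, since slots 0 to 2 are taken by the IUnknown methods.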

2) The instruction "sizeof" for generic

Although Calli is the real trick that makes it possible to call unmanaged methods from managed code without using pinvoke, I found a couple of other IL bytecode instructions that are necessary to get the same features as in C++/CLI.

The other one is sizeof for generics. C# does have a sizeof operator, but while trying to replicate the DataStream class from SlimDX in pure C#, I was not able to write this kind of code:

public class DataStream
{
    // Unmarshal a struct from a memory location
    public T Read<T>() where T: struct {
        T myStruct = default(T);
        memcpy(&myStruct, &m_buffer, sizeof(T));
        return myStruct;
    }
}

In fact, in C#, sizeof does not work on a generic type parameter, even if you constrain it to be a struct. Because C# cannot constrain the struct to contain only blittable fields (I mean, it could, but it doesn't try to), taking the size of a generic struct is not allowed. That was annoying, but since it works well with a pure IL instruction and I was already generating the Interop assembly, I was free to add whatever methods with custom bytecode were needed to fill the gap.

In the end, the interop code to read a generic struct from a memory location looks like this :

// This method is reading a T struct from pSrc and returning the address : pSrc + sizeof(T)
.method public hidebysig static void* Read<valuetype .ctor T>(void* pSrc, !!T& data) cil managed
{
    .maxstack 3
    .locals init (
        [0] int32 num,
        [1] !!T* pinned localPtr)
    L_0000: ldarg.1 
    L_0001: stloc.1 
    L_0002: ldloc.1 
    L_0003: ldarg.0 
    L_0004: sizeof !!T
    L_000a: conv.i4 
    L_000b: stloc.0 
    L_000c: ldloc.0 
    L_000d: unaligned 1        // Mandatory for x64 architecture
    L_0010: nop 
    L_0011: nop 
    L_0012: nop 
    L_0013: cpblk              // Memcpy
    L_0015: ldloc.0 
    L_0016: conv.i 
    L_0017: ldarg.0 
    L_0018: add 
    L_0019: ret 
}

3) The instruction "cpblk", memcpy in IL

In the previous function, you can see the use of the "cpblk" bytecode instruction. In fact, when you look at a C++/CLI method using memcpy, it does not use the memcpy from the C CRT but directly the IL instruction performing the same task. This IL instruction is faster than using any kind of interop, so I made it available to C# through the Interop assembly.
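
As a rough illustration of emitting these two instructions from C#, here is a sketch based on System.Reflection.Emit.DynamicMethod. It is an assumption about how one could do it, not the code of the generated Interop assembly: IlHelpersSketch, SizeOf and CopyBlock are hypothetical names, and a DynamicMethod is used here instead of a full assembly to keep the example short.

```csharp
using System;
using System.Reflection.Emit;

public static class IlHelpersSketch
{
    // Emits "sizeof T; ret" and runs it once for the given struct type.
    public static int SizeOf<T>() where T : struct
    {
        var method = new DynamicMethod("SizeOfT", typeof(int), Type.EmptyTypes,
                                       typeof(IlHelpersSketch).Module, true);
        ILGenerator il = method.GetILGenerator();
        il.Emit(OpCodes.Sizeof, typeof(T));
        il.Emit(OpCodes.Ret);
        return ((Func<int>)method.CreateDelegate(typeof(Func<int>)))();
    }

    // A memcpy-like delegate built around the cpblk instruction:
    // CopyBlock(destination, source, byteCount).
    public static readonly Action<IntPtr, IntPtr, int> CopyBlock = BuildCopyBlock();

    private static Action<IntPtr, IntPtr, int> BuildCopyBlock()
    {
        var method = new DynamicMethod("CopyBlock", typeof(void),
                                       new[] { typeof(IntPtr), typeof(IntPtr), typeof(int) },
                                       typeof(IlHelpersSketch).Module, true);
        ILGenerator il = method.GetILGenerator();
        il.Emit(OpCodes.Ldarg_0);   // destination address
        il.Emit(OpCodes.Ldarg_1);   // source address
        il.Emit(OpCodes.Ldarg_2);   // number of bytes to copy
        il.Emit(OpCodes.Cpblk);
        il.Emit(OpCodes.Ret);
        return (Action<IntPtr, IntPtr, int>)method.CreateDelegate(
            typeof(Action<IntPtr, IntPtr, int>));
    }
}
```

A real implementation would cache the result of SizeOf per type instead of building a new dynamic method on every call; the sketch skips that to stay readable.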

I) Prepare XIDL model for mapping

So the 1st step in the XIDLToCSharp process is to prepare the XIDL model to be more mapping friendly. This step is essentially responsible for:

Adding missing C++ attribute information (In, InOut, Buffer) to some method parameters
Replacing the type of some method parameters: for example, in DirectX there are lots of parameters taking flags that are in fact an already declared enum, but for some unknown reason the method is declared with an "int" instead of the enum.
Removing some types. For example, D3D_PRIMITIVE_TOPOLOGY holds a bunch of D3D11 and D3D10 enums duplicating the D3D_PRIMITIVE enums, so I'm removing them.
Adding some tags directly on the XIDL model in order to ease the next mapping process: those tags are used, for example, for tagging the C# visibility of a method, or for forcing a method not to be interpreted as a "property".

// Read the XIDL model
CppIncludeGroup group = CppIncludeGroup.Read("directx_idl.xml");

group.Modify<CppParameter>("^D3DX11.*?::pDefines", Modifiers.ParameterAttribute(CppAttribute.In | CppAttribute.Buffer | CppAttribute.Optional));

// Modify device Flags for D3D11CreateDevice to use D3D11_CREATE_DEVICE_FLAG
group.Modify<CppParameter>("^D3D11CreateDevice.*?::Flags$", Modifiers.Type("D3D11_CREATE_DEVICE_FLAG"));

// ppFactory on CreateDXGIFactory.* should be Attribute.Out
group.Modify<CppParameter>("^CreateDXGIFactory.*?::ppFactory$", Modifiers.ParameterAttribute(CppAttribute.Out));

// pDefines is an array of Macro (and not just In)
group.Modify<CppParameter>("^D3DCompile::pDefines", Modifiers.ParameterAttribute(CppAttribute.In | CppAttribute.Buffer | CppAttribute.Optional));
group.Modify<CppParameter>("^D3DPreprocess::pDefines", Modifiers.ParameterAttribute(CppAttribute.In | CppAttribute.Buffer | CppAttribute.Optional));

// SwapChain description is mandatory In and not optional
group.Modify<CppParameter>("^D3D11CreateDeviceAndSwapChain::pSwapChainDesc", Modifiers.ParameterAttribute(CppAttribute.In));

// Remove all enums ending with _FORCE_DWORD, FORCE_UINT
group.Modify<CppEnumItem>("^.*_FORCE_DWORD$", Modifiers.Remove);
group.Modify<CppEnumItem>("^.*_FORCE_UINT$", Modifiers.Remove);

You can see that the pre-mapping (and the mapping) makes intensive use of regular expressions for matching names, which is a very convenient way to perform XPath-like queries over the model.
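
The idea of selecting model elements by matching a regular expression against an "Owner::Name" path can be sketched as follows. SketchParameter and SketchGroup are hypothetical, much smaller than the real XIDL model, and the modifier is a plain delegate rather than the Modifiers helpers used above.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical, heavily reduced version of a parameter in the model.
public class SketchParameter
{
    public string FullName { get; set; }   // "Owner::ParameterName"
    public string Type { get; set; }
}

public class SketchGroup
{
    public List<SketchParameter> Parameters { get; } = new List<SketchParameter>();

    // Apply the modifier to every parameter whose full name matches the
    // pattern, and return how many parameters were modified.
    public int Modify(string pattern, Action<SketchParameter> modifier)
    {
        var regex = new Regex(pattern);
        int count = 0;
        foreach (SketchParameter parameter in Parameters)
        {
            if (regex.IsMatch(parameter.FullName))
            {
                modifier(parameter);
                count++;
            }
        }
        return count;
    }
}
```

With parameters named "D3D11CreateDevice::Flags" and "D3D11CreateDeviceAndSwapChain::Flags" in the group, a single call with the pattern "^D3D11CreateDevice.*?::Flags$" changes both, which is what makes this style of rule compact.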

II) Generate C# model from XIDL and mapping rules

This process takes the pre-processed XIDL and generates a C# model (a subset of the C# model in memory), adding mapping information and preparing things to make it easier to use from the T4 templatizer engine.

In order to generate the C# model from DirectX, the generator needs a couple of mapping rules.

1) Mapping an include to an assembly / namespace

This rule defines the default dispatching of types to an assembly / namespace. It associates a source header include (the name of the .h, without the extension) with an assembly and a namespace.

// Namespace mapping 

  // Map dxgi include to assembly SharpDX.DXGI, namespace SharpDX.DXGI
  gen.MapIncludeToNamespace("dxgi", "SharpDX.DXGI");
  gen.MapIncludeToNamespace("dxgiformat", "SharpDX.DXGI");
  gen.MapIncludeToNamespace("dxgitype", "SharpDX.DXGI");

  // Map D3DCommon include to assembly SharpDX, namespace SharpDX.Direct3D
  gen.MapIncludeToNamespace("d3dcommon", "SharpDX.Direct3D", "SharpDX");

  gen.MapIncludeToNamespace("d3d11", "SharpDX.Direct3D11");
  gen.MapIncludeToNamespace("d3dx11", "SharpDX.Direct3D11");
  gen.MapIncludeToNamespace("d3dx11core", "SharpDX.Direct3D11");
  gen.MapIncludeToNamespace("d3dx11tex", "SharpDX.Direct3D11");
  gen.MapIncludeToNamespace("d3dx11async", "SharpDX.Direct3D11");
  gen.MapIncludeToNamespace("d3d11shader", "SharpDX.D3DCompiler");
  gen.MapIncludeToNamespace("d3dcompiler", "SharpDX.D3DCompiler");

2) Mapping a particular type to an assembly / namespace

It is also necessary to override the default include to assembly/namespace dispatching for some particular types. This rule does exactly that.

gen.MapTypeToNamespace("^D3D_PRIMITIVE$", "SharpDX.D3DCompiler");
gen.MapTypeToNamespace("^D3D_CBUFFER_TYPE$", "SharpDX.D3DCompiler");
gen.MapTypeToNamespace("^D3D_RESOURCE_RETURN_TYPE$", "SharpDX.D3DCompiler");
gen.MapTypeToNamespace("^D3D_SHADER_CBUFFER_FLAGS$", "SharpDX.D3DCompiler");
gen.MapTypeToNamespace("^D3D_SHADER_INPUT_TYPE$", "SharpDX.D3DCompiler");
gen.MapTypeToNamespace("^D3D_SHADER_VARIABLE_CLASS$", "SharpDX.D3DCompiler");
gen.MapTypeToNamespace("^D3D_SHADER_VARIABLE_FLAGS$", "SharpDX.D3DCompiler");
gen.MapTypeToNamespace("^D3D_SHADER_VARIABLE_TYPE$", "SharpDX.D3DCompiler");
gen.MapTypeToNamespace("^D3D_TESSELLATOR_DOMAIN$", "SharpDX.D3DCompiler");
gen.MapTypeToNamespace("^D3D_TESSELLATOR_PARTITIONING$", "SharpDX.D3DCompiler");
gen.MapTypeToNamespace("^D3D_TESSELLATOR_OUTPUT_PRIMITIVE$", "SharpDX.D3DCompiler");
gen.MapTypeToNamespace("^D3D_SHADER_INPUT_FLAGS$", "SharpDX.D3DCompiler");
gen.MapTypeToNamespace("^D3D_NAME$", "SharpDX.D3DCompiler");
gen.MapTypeToNamespace("^D3D_REGISTER_COMPONENT_TYPE$", "SharpDX.D3DCompiler");

The previous code is instructing the generator to move some D3D types to the SharpDX.D3DCompiler namespace (and assembly). Those types are in fact more related to Shader reflection and are associated with the D3DCompiler assembly (I took the same design choice from SlimDX, although we could think about another mapping).

3) Mapping a C++ type to a custom C# type

It is sometimes necessary to map a C++ type to a non-generated C# type. For example, the C++ "RECT" structure is not strictly equivalent to System.Drawing.Rectangle (the RECT struct uses the Left, Top, Right, Bottom fields, instead of Left, Top, Width, Height for System.Drawing.Rectangle). This rule makes it possible to define a custom mapping. SharpDX.Rectangle is not generated by the generator but is defined in the SharpDX assembly project (last part).

var rectType = new CSharpStruct();
 rectType.Name = "SharpDX.Rectangle";
 rectType.SizeOf = 4*4;
 gen.MapCppTypeToCSharpType("RECT", rectType); //"SharpDX.Rectangle", 4 * 4, false, true);
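
To make the layout constraint concrete, here is a sketch of what a RECT-compatible struct can look like in C#. RectSketch is a hypothetical stand-in and not the real SharpDX.Rectangle: the four int fields are declared in the same order as the native RECT, which is where the 4*4 size given to the generator comes from, and Width and Height are computed on top of them.

```csharp
using System.Runtime.InteropServices;

// Hypothetical RECT-compatible struct; not the real SharpDX.Rectangle.
[StructLayout(LayoutKind.Sequential)]
public struct RectSketch
{
    // Same field order as the native RECT: left, top, right, bottom.
    public int Left;
    public int Top;
    public int Right;
    public int Bottom;

    // Convenience properties derived from the native fields.
    public int Width { get { return Right - Left; } }
    public int Height { get { return Bottom - Top; } }
}
```

Because the fields are blittable and sequential, such a struct can be passed to native code as is, without any marshaling step.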

4) Mapping a C++ name to a C# name

The renaming rules are quite rich. XIDLToCSharp provides a default renaming mechanism that respects the CamelCase convention, but there are some exceptions that need to be addressed. For example:

// Rename DXGI_MODE_ROTATION to DisplayModeRotation
  gen.RenameType(@"^DXGI_MODE_ROTATION$","DisplayModeRotation");
  gen.RenameType(@"^DXGI_MODE_SCALING$", "DisplayModeScaling");
  gen.RenameType(@"^DXGI_MODE_SCANLINE_ORDER$", "DisplayModeScanlineOrder");

  // Use regular expression to take the part of some names...
  gen.RenameType(@"^D3D_SVC_(.*)", "$1");
  gen.RenameType(@"^D3D_SVF_(.*)", "$1");
  gen.RenameType(@"^D3D_SVT_(.*)", "$1");
  gen.RenameType(@"^D3D_SIF_(.*)", "$1");
  gen.RenameType(@"^D3D_SIT_(.*)", "$1");
  gen.RenameType(@"^D3D_CT_(.*)", "$1");

For structures and enums that use the "_" underscore to separate name subparts, you can let XIDLToCSharp rename each subpart correctly, while still being able to specify how a given subpart should be renamed:

// Expand sub part between underscore
 gen.RenameTypePart("^DESC$", "Description");
 gen.RenameTypePart("^CBUFFER$", "ConstantBuffer");
 gen.RenameTypePart("^TBUFFER$", "TextureBuffer");
 gen.RenameTypePart("^BUFFEREX$", "ExtendedBuffer");
 gen.RenameTypePart("^FUNC$", "Function");
 gen.RenameTypePart("^FLAG$", "Flags");
 gen.RenameTypePart("^SRV$", "ShaderResourceView");
 gen.RenameTypePart("^DSV$", "DepthStencilView");
 gen.RenameTypePart("^RTV$", "RenderTargetView");
 gen.RenameTypePart("^UAV$", "UnorderedAccessView");
 gen.RenameTypePart("^TEXTURE1D$", "Texture1D");
 gen.RenameTypePart("^TEXTURE2D$", "Texture2D");
 gen.RenameTypePart("^TEXTURE3D$", "Texture3D");

With these rules, for example, for a struct named "BLABLA_DESC", the DESC part will be expanded to "Description", resulting in the C# name "BlablaDescription".
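
The subpart renaming can be sketched in a few lines. This is a simplified assumption of how such a rule could work, not the XIDLToCSharp implementation: RenameSketch and its override table are hypothetical, the overrides are exact matches instead of regular expressions, and no prefix such as D3D11_ is stripped.

```csharp
using System.Collections.Generic;
using System.Text;

public static class RenameSketch
{
    // Hypothetical overrides, in the spirit of the RenameTypePart rules.
    private static readonly Dictionary<string, string> PartOverrides =
        new Dictionary<string, string>
        {
            { "DESC", "Description" },
            { "CBUFFER", "ConstantBuffer" },
            { "FLAG", "Flags" },
        };

    // Split the C++ name on underscores and rename each subpart.
    public static string ToCSharpName(string cppName)
    {
        var result = new StringBuilder();
        foreach (string part in cppName.Split('_'))
        {
            if (part.Length == 0) continue;
            string replacement;
            if (PartOverrides.TryGetValue(part, out replacement))
            {
                result.Append(replacement);
            }
            else
            {
                // Default rule: first letter upper case, the rest lower case.
                result.Append(char.ToUpperInvariant(part[0]));
                result.Append(part.Substring(1).ToLowerInvariant());
            }
        }
        return result.ToString();
    }
}
```

With this table, RenameSketch.ToCSharpName("BLABLA_DESC") gives "BlablaDescription".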

5) Change Field type mapping in C#

Again, there are lots of enums in DirectX that are not used in the structures. For example, if you take D3D11_BUFFER_DESC, all enum fields are declared as int instead of using their respective enums.

This mapping rule changes the destination type of a field:

gen.ChangeStructFieldTypeToNative("D3D11_BUFFER_DESC", "BindFlags", "D3D11_BIND_FLAG");
 gen.ChangeStructFieldTypeToNative("D3D11_BUFFER_DESC", "CPUAccessFlags", "D3D11_CPU_ACCESS_FLAG");
 gen.ChangeStructFieldTypeToNative("D3D11_BUFFER_DESC", "MiscFlags", "D3D11_RESOURCE_MISC_FLAG");

6) Generate enums from C++ macros, improving enums

Again, the DirectX SDK is not consistent with enums. Some enums are in fact defined as a set of macro definitions, which makes the IntelliSense experience nonexistent...

XIDLToCSharp is able to create an enum from a set of macro definitions:

// Create enums from macro definitions
 // Create the D3DCOMPILE_SHADER_FLAGS C++ type from the D3DCOMPILE_.* macros
 gen.CreateEnumFromMacros(@"^D3DCOMPILE_[^E][^F].*", "D3DCOMPILE_SHADER_FLAGS");
 gen.CreateEnumFromMacros(@"^D3DCOMPILE_EFFECT_.*", "D3DCOMPILE_EFFECT_FLAGS");
 gen.CreateEnumFromMacros(@"^D3D_DISASM_.*", "D3DCOMPILE_DISASM_FLAGS");

There are also some tiny things to adjust to existing enums, like adding a "None=0" enum item for some flags.
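
One detail is worth checking when grouping macros with patterns like these: the character classes in "^D3DCOMPILE_[^E][^F].*" reject every name whose first character after the prefix is E (or whose second is F), not only the D3DCOMPILE_EFFECT_ ones, so a macro such as D3DCOMPILE_ENABLE_STRICTNESS is left out as well. If that is not intended, a negative lookahead is a possible alternative. The sketch below only filters a list of names; MacroGroupingSketch is a hypothetical helper, not part of XIDLToCSharp.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class MacroGroupingSketch
{
    // Return the macro names selected by the pattern, in input order.
    public static List<string> Select(IEnumerable<string> macroNames, string pattern)
    {
        var regex = new Regex(pattern);
        return macroNames.Where(name => regex.IsMatch(name)).ToList();
    }
}
```

Given D3DCOMPILE_DEBUG, D3DCOMPILE_ENABLE_STRICTNESS and D3DCOMPILE_EFFECT_CHILD_EFFECT, the original pattern keeps only D3DCOMPILE_DEBUG, while "^D3DCOMPILE_(?!EFFECT_).*" keeps the first two and still excludes the effect flag.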

7) Move interface methods to inner interfaces in C#

If you have been using Direct3D 11, you have noticed that all the methods of each stage are prefixed with the stage abbreviation, which makes the ID3D11DeviceContext interface, for example, quite ugly to use, ending up in code like this:

deviceContext.IASetInputLayout(inputlayout);

SlimDX did something really nice: for each pipeline stage (IA for InputAssembler, VS for VertexShader), they created a property accessor to an interface that exposes the methods of this stage, resulting in improved readability and a much better IntelliSense experience.

deviceContext.InputAssembler.InputLayout = inputlayout;

In XIDLToCSharp, there is a rule to handle such a case, and it is as simple as writing this:

// Map all IA* methods to the internal interface InputAssemblerStage with the accessor property InputAssembler, using the method name $1 (extracted from the regexp)
 gen.MoveMethodsToInnerInterface("ID3D11DeviceContext::IA(.*)", "InputAssemblerStage", "InputAssembler", "$1");
 gen.MoveMethodsToInnerInterface("ID3D11DeviceContext::VS(.*)", "VertexShaderStage", "VertexShader", "$1");
 gen.MoveMethodsToInnerInterface("ID3D11DeviceContext::PS(.*)", "PixelShaderStage", "PixelShader", "$1");
 gen.MoveMethodsToInnerInterface("ID3D11DeviceContext::GS(.*)", "GeometryShaderStage", "GeometryShader", "$1");
 gen.MoveMethodsToInnerInterface("ID3D11DeviceContext::SO(.*)", "StreamOutputStage", "StreamOutput", "$1");
 gen.MoveMethodsToInnerInterface("ID3D11DeviceContext::DS(.*)", "DomainShaderStage", "DomainShader", "$1");
 gen.MoveMethodsToInnerInterface("ID3D11DeviceContext::HS(.*)", "HullShaderStage", "HullShader", "$1");
 gen.MoveMethodsToInnerInterface("ID3D11DeviceContext::RS(.*)", "RasterizerStage", "Rasterizer", "$1");
 gen.MoveMethodsToInnerInterface("ID3D11DeviceContext::OM(.*)", "OutputMergerStage", "OutputMerger", "$1");
 gen.MoveMethodsToInnerInterface("ID3D11DeviceContext::CS(.*)", "ComputeShaderStage", "ComputeShader", "$1");
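
The name extraction done by the "$1" argument can be sketched with the standard Regex API. InnerInterfaceSketch is a hypothetical helper, and anchoring the pattern with ^ and $ is an assumption made for this example, not something confirmed by the rules above.

```csharp
using System.Text.RegularExpressions;

public static class InnerInterfaceSketch
{
    // Returns the method name to use inside the inner interface, or null
    // when the full "Interface::Method" name does not match the rule.
    public static string MapMethodName(string fullName, string pattern, string replacement)
    {
        Match match = Regex.Match(fullName, "^" + pattern + "$");
        return match.Success ? match.Result(replacement) : null;
    }
}
```

For example, MapMethodName("ID3D11DeviceContext::IASetInputLayout", "ID3D11DeviceContext::IA(.*)", "$1") yields "SetInputLayout", the name the method would carry inside InputAssemblerStage before any property detection.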

8) Dispatch method to function group

DirectX C++ functions are mapped to a set of function groups and an associated DLL. For example, it is possible to specify that all D3D11.* functions will map to a class D3D11 containing all the associated methods.

// Function group
  var d3dCommonFunctionGroup = gen.CreateFunctionGroup("SharpDX", "SharpDX.Direct3D", "D3DCommon");
  var dxgiFunctionGroup = gen.CreateFunctionGroup("SharpDX.DXGI", "SharpDX.DXGI", "DXGI");
  var d3dFunctionGroup = gen.CreateFunctionGroup("SharpDX.D3DCompiler", "SharpDX.D3DCompiler", "D3D");
  var d3d11FunctionGroup = gen.CreateFunctionGroup("SharpDX.Direct3D11", "SharpDX.Direct3D11", "D3D11");
  var d3dx11FunctionGroup = gen.CreateFunctionGroup("SharpDX.Direct3D11", "SharpDX.Direct3D11", "D3DX11");

  // Map All D3D11 functions to D3D11 Function Group
  gen.MapFunctionToFunctionGroup(@"^D3D11.*", "d3d11.dll", d3d11FunctionGroup);

  // Map All D3DX11 functions to D3DX11 Function Group
  gen.MapFunctionToFunctionGroup(@"^D3DX11.*", group.Find<CppMacroDefinition>("D3DX11_DLL_A").FirstOrDefault().StripStringValue, d3dx11FunctionGroup);

  // Map D3DCreateBlob to the D3DCommon Function Group
  string d3dCompilerDll =
      group.Find<CppMacroDefinition>("D3DCOMPILER_DLL_A").FirstOrDefault().StripStringValue;
  gen.MapFunctionToFunctionGroup(@"^D3DCreateBlob$", d3dCompilerDll, d3dCommonFunctionGroup);

If a DLL has a versioned name (like D3DXX_xx.dll or D3DCompiler_xx.dll), we retrieve the DLL name directly from a macro!

Generate C# code from C# model and adding custom classes

Once the internal C# model is built, we call the T4 text template engine for each group of types: Enumerations, Structures, Interfaces, Functions. Those classes are then integrated into several VS projects, with some custom code added and some non-generated core classes.

The generated C# interop code

This means that for each assembly and each namespace, Enumerations.cs, Structures.cs, Interfaces.cs and Functions.cs files are generated.

For each kind of type, a custom mapping is done:

For enums, the mapping is straightforward, resulting in an almost one-to-one mapping.
For structures, the mapping is quite straightforward, resulting in an almost one-to-one mapping for most of the types. There are, however, a couple of cases where the mapping needs to generate some marshaling code, essentially when there is a bool in the struct, a string pointer, or a fixed array of structs inside a struct.

For example, one of the most complex mappings for a structure is generated like this:

/// <summary> 
/// Describes the blend state. 
/// </summary> 
/// <remarks> 
/// These are the default values for blend state:
/// AlphaToCoverageEnable = FALSE, IndependentBlendEnable = FALSE, RenderTarget[0].BlendEnable = FALSE, RenderTarget[0].SrcBlend = D3D11_BLEND_ONE, RenderTarget[0].DestBlend = D3D11_BLEND_ZERO, RenderTarget[0].BlendOp = D3D11_BLEND_OP_ADD, RenderTarget[0].SrcBlendAlpha = D3D11_BLEND_ONE, RenderTarget[0].DestBlendAlpha = D3D11_BLEND_ZERO, RenderTarget[0].BlendOpAlpha = D3D11_BLEND_OP_ADD, RenderTarget[0].RenderTargetWriteMask = D3D11_COLOR_WRITE_ENABLE_ALL.
/// Note that D3D11_BLEND_DESC is identical to {{D3D10_BLEND_DESC1}}. If the driver type is set to <see cref="SharpDX.Direct3D.DriverType.Hardware"/>, the feature level is set to less than or equal to <see cref="SharpDX.Direct3D.FeatureLevel.Level_9_3"/>, and the pixel format of the render target is set to <see cref="SharpDX.DXGI.Format.R8G8B8A8_UNorm_SRgb"/>, DXGI_FORMAT_B8G8R8A8_UNORM_SRGB, or DXGI_FORMAT_B8G8R8X8_UNORM_SRGB, the display device performs the blend in standard RGB (sRGB) space and not in linear space. However, if the feature level is set to greater than D3D_FEATURE_LEVEL_9_3, the display device performs the blend in linear space. 
/// </remarks> 
/// <unmanaged>D3D11_BLEND_DESC</unmanaged>
public  partial struct BlendDescription { 
    
    /// <summary> 
    /// Determines whether or not to use alpha-to-coverage as a multisampling technique when setting a pixel to a rendertarget. 
    /// </summary> 
    /// <unmanaged>BOOL AlphaToCoverageEnable</unmanaged>
    public bool AlphaToCoverageEnable { 
        get { 
            return (_AlphaToCoverageEnable!=0)?true:false; 
        }
        set { 
            _AlphaToCoverageEnable = value?1:0;
        }
    }
    internal int _AlphaToCoverageEnable;
    
    /// <summary> 
    /// Set to TRUE to enable independent blending in simultaneous render targets.  If set to FALSE, only the RenderTarget[0] members are used. RenderTarget[1..7] are ignored. 
    /// </summary> 
    /// <unmanaged>BOOL IndependentBlendEnable</unmanaged>
    public bool IndependentBlendEnable { 
        get { 
            return (_IndependentBlendEnable!=0)?true:false; 
        }
        set { 
            _IndependentBlendEnable = value?1:0;
        }
    }
    internal int _IndependentBlendEnable;
    
    /// <summary> 
    /// An array of render-target-blend descriptions (see <see cref="SharpDX.Direct3D11.RenderTargetBlendDescription"/>); these correspond to the eight rendertargets  that can be set to the output-merger stage at one time. 
    /// </summary> 
    /// <unmanaged>D3D11_RENDER_TARGET_BLEND_DESC RenderTarget[8]</unmanaged>
    public SharpDX.Direct3D11.RenderTargetBlendDescription[] RenderTarget { 
        get { 
            if (_RenderTarget == null) {
                _RenderTarget = new SharpDX.Direct3D11.RenderTargetBlendDescription[8];
            }
            return _RenderTarget; 
        }
    }
    internal SharpDX.Direct3D11.RenderTargetBlendDescription[] _RenderTarget;

    // Internal native struct used for marshalling
    [StructLayout(LayoutKind.Sequential, Pack = 0 )]
    internal unsafe partial struct __Native { 
        public int _AlphaToCoverageEnable;
        public int _IndependentBlendEnable;
        public SharpDX.Direct3D11.RenderTargetBlendDescription RenderTarget;
        SharpDX.Direct3D11.RenderTargetBlendDescription __RenderTarget1;
        SharpDX.Direct3D11.RenderTargetBlendDescription __RenderTarget2;
        SharpDX.Direct3D11.RenderTargetBlendDescription __RenderTarget3;
        SharpDX.Direct3D11.RenderTargetBlendDescription __RenderTarget4;
        SharpDX.Direct3D11.RenderTargetBlendDescription __RenderTarget5;
        SharpDX.Direct3D11.RenderTargetBlendDescription __RenderTarget6;
        SharpDX.Direct3D11.RenderTargetBlendDescription __RenderTarget7;
    // Method to free native struct
        internal unsafe void __MarshalFree()
        {
        }
    }

    // Method to marshal from native to managed struct
    internal unsafe void __MarshalFrom(ref __Native @ref)
    {            
        this._AlphaToCoverageEnable = @ref._AlphaToCoverageEnable;
        this._IndependentBlendEnable = @ref._IndependentBlendEnable;
        fixed (void* __to = &this.RenderTarget[0]) fixed (void* __from = &@ref.RenderTarget) SharpDX.Utilities.CopyMemory((IntPtr) __to, (IntPtr) __from, 8*sizeof ( SharpDX.Direct3D11.RenderTargetBlendDescription));
    }
    // Method to marshal from managed struct to native
    internal unsafe void __MarshalTo(ref __Native @ref)
    {
        @ref._AlphaToCoverageEnable = this._AlphaToCoverageEnable;
        @ref._IndependentBlendEnable = this._IndependentBlendEnable;
        fixed (void* __to = &@ref.RenderTarget) fixed (void* __from = &this.RenderTarget[0]) SharpDX.Utilities.CopyMemory((IntPtr) __to, (IntPtr) __from, 8*sizeof ( SharpDX.Direct3D11.RenderTargetBlendDescription));

    }
}

For interfaces the mapping is quite complex, because it is necessary to handle lots of different cases:

Optional structures in input
Optional parameters
Output of an array of interfaces
Custom marshalling (for example, with the previous BlendDescription structure)
Generating properties for methods that are property eligible
...etc.

For example, the method using the BlendDescription is like this:

/// <summary> 
/// Create a blend-state object that encapsules blend state for the output-merger stage. 
/// </summary> 
/// <remarks> 
/// An application can create up to 4096 unique blend-state objects. For each object created, the runtime checks to see if a previous object  has the same state. If such a previous object exists, the runtime will return a pointer to previous instance instead of creating a duplicate object. 
/// </remarks> 
/// <param name="blendStateDescRef">Pointer to a blend-state description (see <see cref="SharpDX.Direct3D11.BlendDescription"/>).</param>
/// <param name="blendStateRef">Address of a pointer to the blend-state object created (see <see cref="SharpDX.Direct3D11.BlendState"/>).</param>
/// <returns>This method returns E_OUTOFMEMORY if there is insufficient memory to create the blend-state object.   See {{Direct3D 11 Return Codes}} for other possible return values.</returns>
/// <unmanaged>HRESULT CreateBlendState([In] const D3D11_BLEND_DESC* pBlendStateDesc,[Out, Optional] ID3D11BlendState** ppBlendState)</unmanaged>
public SharpDX.Result CreateBlendState(ref SharpDX.Direct3D11.BlendDescription blendStateDescRef, out SharpDX.Direct3D11.BlendState blendStateRef){
    unsafe {
        SharpDX.Direct3D11.BlendDescription.__Native blendStateDescRef_ = new SharpDX.Direct3D11.BlendDescription.__Native();
        blendStateDescRef.__MarshalTo(ref blendStateDescRef_);
        IntPtr blendStateRef_ = IntPtr.Zero;
        SharpDX.Result __result__;
        __result__= (SharpDX.Result)SharpDX.Interop.CalliInt(_nativePointer, 20 * 4, &blendStateDescRef_, &blendStateRef_);
        // Free the intermediate native struct (__MarshalFree is declared on __Native)
        blendStateDescRef_.__MarshalFree();
        blendStateRef = (blendStateRef_ == IntPtr.Zero)?null:new SharpDX.Direct3D11.BlendState(blendStateRef_);
        __result__.CheckError();
        return __result__;
    }
}

In the previous example, you can see that the input BlendDescription structure is in fact marshalled to an intermediate native structure suitable for unmanaged code (the internal __Native struct of BlendDescription). The marshalling code is also responsible for freeing the native struct (if there were any allocations, as for strings).

The marshalling has some nice optimizations, such as choosing between passing a struct by value or by reference: all the methods in C++ take a pointer to a struct (for getting and setting), but with the marshaller we can decide whether a struct is passed by value or by ref. Currently, the generator calculates the size of the value type. If the value type is 16 bytes or smaller, it is passed by value; otherwise it is passed by ref.
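
The size rule above can be sketched as a tiny decision function. This is only an illustration written in C++, not the generator's actual code: the names ChoosePassMode, PassModeFor, Color4 and Matrix4x4 are hypothetical, and only the 16-byte threshold comes from the text.

```cpp
#include <cstddef>

// Threshold quoted in the text: value types of 16 bytes or less go by value.
constexpr std::size_t kMaxByValueSize = 16;

enum class PassMode { ByValue, ByReference };

// Decide how a struct parameter is exposed, from its size in bytes.
constexpr PassMode ChoosePassMode(std::size_t sizeInBytes) {
    return sizeInBytes <= kMaxByValueSize ? PassMode::ByValue
                                          : PassMode::ByReference;
}

// Convenience wrapper that takes the size from the type itself.
template <typename T>
constexpr PassMode PassModeFor() {
    return ChoosePassMode(sizeof(T));
}

// Hypothetical example types (not taken from the generated API).
struct Color4 { float r, g, b, a; };   // 4 floats = 16 bytes
struct Matrix4x4 { float m[16]; };     // 16 floats = 64 bytes

static_assert(PassModeFor<Color4>() == PassMode::ByValue, "small struct");
static_assert(PassModeFor<Matrix4x4>() == PassMode::ByReference, "large struct");
```

With such a rule, a color-sized struct would surface as a plain value parameter, while a matrix-sized one would surface as a ref parameter.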

A more standard interface with simple marshalling looks like this (note, for example, the integrated GUID, the properties auto-generated from methods, and the methods that are hidden from the public API):

/// <summary> 
/// This interface is used to return arbitrary length data. 
/// </summary> 
/// <unmanaged>ID3D10Blob</unmanaged>
[Guid("8ba5fb08-5195-40e2-ac58-0d989c3a0102")]
public partial class Blob : SharpDX.ComObject {

    public Blob(IntPtr basePtr) : base(basePtr) {
    }
    
    
    /// <summary> 
    /// Get a pointer to the data. 
    /// </summary> 
    /// <unmanaged>void* GetBufferPointer()</unmanaged>
    public IntPtr BufferPointer {
            get { return GetBufferPointer(); }
    }
    
    /// <summary> 
    /// Get the size. 
    /// </summary> 
    /// <unmanaged>SIZE_T GetBufferSize()</unmanaged>
    public SharpDX.Size BufferSize {
            get { return GetBufferSize(); }
    }
    
    /// <summary> 
    /// Get a pointer to the data. 
    /// </summary> 
    /// <returns>Returns a pointer.</returns>
    /// <unmanaged>void* GetBufferPointer()</unmanaged>
    internal IntPtr GetBufferPointer() {
        unsafe {
            IntPtr __result__;
            __result__= (IntPtr)SharpDX.Interop.CalliPtr(_nativePointer, 3 * 4);
            return __result__;
        }
    }
    
    /// <summary> 
    /// Get the size. 
    /// </summary> 
    /// <returns>The size of the data, in bytes.</returns>
    /// <unmanaged>SIZE_T GetBufferSize()</unmanaged>
    internal SharpDX.Size GetBufferSize() {
        unsafe {
            SharpDX.Size __result__;
            __result__= (SharpDX.Size)SharpDX.Interop.CalliPtr(_nativePointer, 4 * 4);
            return __result__;
        }
    }
}
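
The `3 * 4` and `4 * 4` arguments passed to CalliPtr above read like a byte offset into the COM vtable: a slot index multiplied by a 4-byte pointer size. That reading is an assumption (it presumes a 32-bit process, with slots 0 to 2 taken by the three IUnknown methods). A minimal C++ sketch of the idea, with a hypothetical FakeBlob type and CallSlot helper that are not part of SharpDX:

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical COM-like object: its first field points to a table of
// function pointers, as a native interface pointer would.
struct FakeBlob;
using Slot = std::size_t (*)(FakeBlob*);

struct FakeBlob {
    const Slot* vtable;      // plays the role of what _nativePointer leads to
    std::size_t bufferSize;  // some state owned by the object
};

inline std::size_t Unused(FakeBlob*) { return 0; }
inline std::size_t GetBufferSize(FakeBlob* self) { return self->bufferSize; }

// Slots 0..2 stand in for QueryInterface/AddRef/Release,
// slot 3 for GetBufferPointer, slot 4 for GetBufferSize.
static const Slot kBlobVtable[] = { Unused, Unused, Unused, Unused, GetBufferSize };

// Call the method stored at a byte offset in the vtable.
inline std::size_t CallSlot(FakeBlob* obj, std::size_t byteOffset) {
    assert(byteOffset % sizeof(Slot) == 0);  // must land on a slot boundary
    Slot method = obj->vtable[byteOffset / sizeof(Slot)];
    return method(obj);
}
```

Here the offset would be written `4 * sizeof(Slot)` rather than a hard-coded `4 * 4`, so the same call works whatever the pointer size is.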

For functions, the mapping is quite straightforward, because we are relying on plain P/Invoke interop. This was the easiest choice and the easiest to generate. The P/Invoke calls are still hidden, though, in order to perform some parameter transformation, mostly to support the custom COM object model that is generated.

A function call is generated like this:

/// <unmanaged>HRESULT D3D11CreateDevice([In, Optional] IDXGIAdapter* pAdapter,[None] D3D_DRIVER_TYPE DriverType,[None] HMODULE Software,[None] D3D11_CREATE_DEVICE_FLAG Flags,[In, Buffer, Optional] const D3D_FEATURE_LEVEL* pFeatureLevels,[None] UINT FeatureLevels,[None] UINT SDKVersion,[Out,Optional] ID3D11Device** ppDevice,[Out, Optional] D3D_FEATURE_LEVEL* pFeatureLevel,[Out, Optional] ID3D11DeviceContext** ppImmediateContext)</unmanaged>
public static SharpDX.Result CreateDevice(SharpDX.DXGI.Adapter adapterRef, SharpDX.Direct3D.DriverType driverType, IntPtr software, SharpDX.Direct3D11.DeviceCreationFlags flags, SharpDX.Direct3D.FeatureLevel[] featureLevelsRef, int featureLevels, int sDKVersion, out SharpDX.Direct3D11.Device deviceRef, out SharpDX.Direct3D.FeatureLevel featureLevelRef, out SharpDX.Direct3D11.DeviceContext immediateContextRef) {
    unsafe {
        IntPtr deviceRef_ = IntPtr.Zero;
        IntPtr immediateContextRef_ = IntPtr.Zero;
        SharpDX.Result __result__;
        __result__= (SharpDX.Result)D3D11CreateDevice_((adapterRef == null)?IntPtr.Zero:adapterRef.NativePointer,  driverType,  software,  flags, featureLevelsRef,  featureLevels,  sDKVersion, out deviceRef_, out featureLevelRef, out immediateContextRef_);
        deviceRef = (deviceRef_ == IntPtr.Zero)?null:new SharpDX.Direct3D11.Device(deviceRef_);
        immediateContextRef = (immediateContextRef_ == IntPtr.Zero)?null:new SharpDX.Direct3D11.DeviceContext(immediateContextRef_);
        __result__.CheckError();
        return __result__;
    }
}

/// <summary>Native Interop Function</summary>
/// <unmanaged>HRESULT D3D11CreateDevice([In, Optional] IDXGIAdapter* pAdapter,[None] D3D_DRIVER_TYPE DriverType,[None] HMODULE Software,[None] D3D11_CREATE_DEVICE_FLAG Flags,[In, Buffer, Optional] const D3D_FEATURE_LEVEL* pFeatureLevels,[None] UINT FeatureLevels,[None] UINT SDKVersion,[Out,Optional] ID3D11Device** ppDevice,[Out, Optional] D3D_FEATURE_LEVEL* pFeatureLevel,[Out, Optional] ID3D11DeviceContext** ppImmediateContext)</unmanaged>
[DllImport("d3d11.dll", EntryPoint = "D3D11CreateDevice", CallingConvention = CallingConvention.StdCall, PreserveSig = true), SuppressUnmanagedCodeSecurityAttribute]
private extern static SharpDX.Result D3D11CreateDevice_(IntPtr adapterRef, SharpDX.Direct3D.DriverType driverType, IntPtr software, SharpDX.Direct3D11.DeviceCreationFlags flags, SharpDX.Direct3D.FeatureLevel[] featureLevelsRef, int featureLevels, int sDKVersion, out IntPtr deviceRef, out SharpDX.Direct3D.FeatureLevel featureLevelRef, out IntPtr immediateContextRef);

Extend the model in C#

All those classes are then integrated in a VS solution with 4 assemblies:

A core assembly that contains non-generated code (ComObject, DataStream, Vectors, Utilities...) and common enumerations and structs for Direct3D (structures that are usually shared between D3D10, D3D10.1 and D3D11).
An assembly for DXGI that has a dependency on the core assembly
An assembly for D3DCompiler that has a dependency on the core assembly
An assembly for D3D11 that has a dependency on the core, DXGI and D3DCompiler

In order to quickly develop this new wrapper, I have taken large portions of code from SlimDX, using the same design philosophy, mainly the Slim.Math assembly in order to have all the vectors and math functions ready to use. The only difference is that I have moved the Vector*/Matrix classes to the main core, while still leaving higher level math classes in a separate Math assembly (BoundingSphere, Plane, intersection calculation... etc.)

You may have noticed that all the generated classes are tagged with the C# keyword "partial", making extensions quite easy to integrate.

Why do we need extensions? Well, the Direct3D 11 API is sometimes not easy to use, and there are a couple of redundancies that don't map well to C#. For example, many methods take an array of structures plus the size of this array. In C#, you would just pass the array, and the size would be inferred from it. This is not strictly equivalent to C++, because there you could pass an array larger than the number of elements you effectively want to use, but this is the most common way the API is going to be used... so...
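
To make the trade-off concrete, here is a minimal sketch in C++ of a raw "pointer plus count" call and a friendlier overload that infers the count from the container. The Viewport struct and both functions are hypothetical stand-ins, not the real Direct3D or SharpDX signatures.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical plain-data struct, standing in for any D3D description type.
struct Viewport { float x, y, width, height; };

// Raw-style call: the caller supplies both the pointer and the element count.
// For illustration it just returns how many elements it would consume.
inline std::size_t SetViewportsRaw(std::size_t count, const Viewport* viewports) {
    return viewports ? count : 0;
}

// Friendlier overload: the count is inferred from the container itself.
inline std::size_t SetViewports(const std::vector<Viewport>& viewports) {
    return SetViewportsRaw(viewports.size(), viewports.data());
}
```

The raw form can still pass only the first few elements of a larger array, which is exactly the flexibility the inferred-size overload gives up.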

For example, to create a DXGI Factory, you would have to call CreateDXGIFactory... Because we don't need to expose those functions directly, the CreateDXGIFactory functions are tagged with the internal keyword and I have added a new constructor to the DXGI Factory like this:

using System;
using System.Runtime.InteropServices;

namespace SharpDX.DXGI
{
    public partial class Factory
    {
        /// <summary>
        /// Default Constructor for Factory
        /// </summary>
        public Factory() : base(IntPtr.Zero)
        {
            IntPtr factoryPtr;
            DXGI.CreateDXGIFactory(GetType().GUID, out factoryPtr);
            NativePointer = factoryPtr;
        }
    }
}

Finally, in an assembly project, you have:

Generated classes: Enumerations.cs, Structures.cs, Interfaces.cs, Functions.cs
Extension classes: They are placed in a subdirectory Extension, with the filename of the extended class, e.g. Factory.cs
Non-generated classes: For example, VertexBufferBinding, which is used by a custom SetVertexBuffers in order to set strides, offsets and buffers in a more friendly way, like:

context.InputAssembler.SetVertexBuffers(0, new VertexBufferBinding(vertices, 32, 0));

Example of ported SlimDX MiniTri sample

Here is a port of the MiniTri D3D11 sample to this new API. You can verify that the API is really close to the SlimDX experience...

using System;
using SharpDX;
using SharpDX.Direct3D;
using SharpDX.Direct3D11;
using SharpDX.DXGI;
using SharpDX.Windows;
using SharpDX.D3DCompiler;
using Buffer = SharpDX.Direct3D11.Buffer;
using Device = SharpDX.Direct3D11.Device;

namespace MiniTri
{
    /// <summary>
    /// SharpDX port of SlimDX-MiniTri Direct3D 11 Sample
    /// </summary>
    static class Program
    {
        [STAThread]
        static void Main()
        {
            var form = new RenderForm("SharpDX - MiniTri Direct3D 11 Sample");

            // SwapChain description
            var desc = new SwapChainDescription()
            {
                BufferCount = 1,
                BufferDescription =  new ModeDescription(form.ClientSize.Width, form.ClientSize.Height, new Rational(60, 1), Format.R8G8B8A8_UNorm),
                Windowed = true,
                OutputWindow = form.Handle,
                SampleDescription = new SampleDescription(1,0),
                SwapEffect = SwapEffect.Discard,
                BufferUsage = Usage.RenderTargetOutput
            };
                                                    

            // Create Device and SwapChain
            Device device;
            SwapChain swapChain;
            Device.CreateWithSwapChain(DriverType.Hardware, DeviceCreationFlags.Debug, desc, out device, out swapChain);            
            var context = device.ImmediateContext;
            
            // Ignore all windows events
            Factory factory = swapChain.GetParent<Factory>();
            factory.MakeWindowAssociation(form.Handle, WindowAssociationFlags.IgnoreAll);

            // New RenderTargetView from the backbuffer
            Texture2D backBuffer = Texture2D.FromSwapChain<Texture2D>(swapChain, 0);
            var renderView = new RenderTargetView(device, backBuffer);

            // Compile Vertex and Pixel shaders
            var vertexShaderByteCode = ShaderBytecode.CompileFromFile("MiniTri.fx", "VS", "vs_4_0", ShaderFlags.None, EffectFlags.None);
            var vertexShader = new VertexShader(device, vertexShaderByteCode);

            var pixelShaderByteCode = ShaderBytecode.CompileFromFile("MiniTri.fx", "PS", "ps_4_0", ShaderFlags.None, EffectFlags.None);
            var pixelShader = new PixelShader(device, pixelShaderByteCode);

            // Layout from VertexShader input signature
            var layout = new InputLayout(device, ShaderSignature.GetInputSignature(vertexShaderByteCode), new[] {
                new InputElement("POSITION", 0, Format.R32G32B32A32_Float, 0, 0),
                new InputElement("COLOR", 0, Format.R32G32B32A32_Float, 16, 0) 
            });

            // Write vertex data to a datastream
            var stream = new DataStream(32 * 3, true, true);
            stream.WriteRange(new[] {
                new Vector4(0.0f, 0.5f, 0.5f, 1.0f), new Vector4(1.0f, 0.0f, 0.0f, 1.0f),
                new Vector4(0.5f, -0.5f, 0.5f, 1.0f), new Vector4(0.0f, 1.0f, 0.0f, 1.0f),
                new Vector4(-0.5f, -0.5f, 0.5f, 1.0f), new Vector4(0.0f, 0.0f, 1.0f, 1.0f)
            });
            stream.Position = 0;

            // Instantiate Vertex buffer from vertex data
            var vertices = new Buffer(device, stream, new BufferDescription()
            {
                BindFlags = BindFlags.VertexBuffer,
                CPUAccessFlags = CpuAccessFlags.None,
                MiscFlags = ResourceOptionFlags.None,
                SizeInBytes = 32 * 3,
                Usage = ResourceUsage.Default,
                StructureByteStride = 0
            });
            stream.Release();

            // Prepare All the stages
            context.InputAssembler.InputLayout = layout;
            context.InputAssembler.PrimitiveTopology = PrimitiveTopology.Trianglelist;
            context.InputAssembler.SetVertexBuffers(0, new VertexBufferBinding(vertices, 32, 0));
            context.VertexShader.Set(vertexShader);
            context.Rasterizer.SetViewports(new Viewport(0, 0, form.ClientSize.Width, form.ClientSize.Height, 0.0f, 1.0f));
            context.PixelShader.Set(pixelShader);
            context.OutputMerger.SetTargets(renderView);

            // Main loop
            MessagePump.Run(form, () =>
            {
                context.ClearRenderTargetView(renderView, new Color4(1.0f, 0.0f, 0.0f, 0.0f));
                context.Draw(3, 0);
                swapChain.Present(0, PresentFlags.None);
            });

            // Release all resources
            vertexShaderByteCode.Release();
            vertexShader.Release();
            pixelShaderByteCode.Release();
            pixelShader.Release();
            vertices.Release();
            layout.Release();
            renderView.Release();
            backBuffer.Release();
            context.ClearState();
            context.Flush();
            device.Release();
            context.Release();
            swapChain.Release();
            factory.Release();
        }
    }
}

Next?

Wow, this was not supposed to be such a long post! I have gone a bit deep into the internals of the generator and it may not be interesting for a general audience, but at least I have taken some time to put this down on paper, to clarify things.

However, I have not detailed everything. For example, you have probably noticed from the previous example that I'm still not using the D3D11 Effects11 API. Well, the problem is that Microsoft has removed the Effects API from D3D11. Why? Probably because the code is hiding too much about how you could interact properly (and more efficiently) with the D3D11 API. But this is one decision I don't fully agree with: look at XNA 4.0, where they have removed the direct use of VertexShader and PixelShader in favor of Effects classes... In one API, they are no longer supporting it; in another, they are making it the only and mandatory one... Some could argue that XNA doesn't have the same target... but still, from a software design perspective, I'm quite doubtful.

The great news is that, looking at the C++ Effects11 sample, I have been able to port the most interesting part: decoding an Effect bytecode to extract useful information, like constant buffers, techniques, stages, shader bytecodes...etc. I'm not going to support the whole fx_5_0 profile, because I'm usually using only a subset of it: for example, I don't find it practical to declare sampler states, blending...etc. in the shader, and I prefer to have them instantiated from the C# code. On the other hand, I like a lot the way the Effects library encapsulates constant buffer and shader resource view binding to shader stages. This is one of the most laborious things to do if you are going with the raw Direct3D 11 interface. So if I could have an Effect framework supporting at least techniques, passes and proper automatic constant buffer and SRV bindings, I would be very happy. This part will deserve another post!

Also, working more with SlimDX and this new API wrapper, I have been working on an XNA-like API on top of the Direct3D 11 API, and it was in fact really easy to achieve (of course, without the content pipeline, which is the true benefit of XNA). Why do we need such a higher-level API? Well, Direct3D 11 is really powerful with its buffer/resource management, but the fact is that it's much more verbose. But think about it: when you use a Texture2D, you will need most of the time a ShaderResourceView on it... If you want a Texture2D as a render target, you will probably need a RenderTargetView, and because it's a render target, you will probably use it as a ShaderResourceView for another pass... So in the end, there are lots of things that can be handled in the background, even if you are using a Direct3D 11 API. The nice thing about this kind of API is that you can play with some geometry or compute shaders, while still having the pleasure of working with a high-level API. This will also probably be part of a post!

So, what's next? I just finished the mapping and the port of MiniTri yesterday. The current wrapper is probably not yet fully usable and doesn't have the same level of API richness as SlimDX. There are still lots of small extensions to add to make the coding experience better than with a somewhat raw D3D11 API. Within the next days, I'm going to play much more with this new wrapper and see how far it can go...

(note: 1st draft version of this document)

Making of Ergon 4K PC Intro

2010-08-26T00:28:00.021+11:00

You are not going to discover any fantastic trick here, and the intro itself is not an outstanding coding performance, but I always enjoy reading the making-of of other intros, so it's time to take some time to put this down on paper!

What is Ergon? It's a small 4k intro (meaning a 4096-byte executable) that was released at the 2010 Breakpoint demoparty (if you can't run it on your hardware, you can still watch it on youtube), and which, surprisingly, was able to finish in 3rd place! I did the coding and design, and also worked on the music with my friend ulrick.

That was a great experience, even if I didn't expect to work on this production at the beginning of the year... but at the end of January, when BP2010 was announced and supposed to be the last one, I was motivated to go there and, why not, release a 4k intro! One month and a half later, the demo was almost ready... wow, 3 weeks before the party, the first time I finished something so far ahead of an event! But yep, I was able to work on it part time during the week (and at night of course)... When I started on it, though, I had no idea where this project would take me... or even which 3D API I should use for this intro!

OpenGL, DirectX 9, 10 or 11?

At FRequency, xt95 is mainly working in OpenGL, mostly due to the fact that he is a Linux user. All our previous intros were done using OpenGL, and although I did provide some help on some intros and bought OpenGL books a few years ago... I'm not a huge fan of the OpenGL C API. Most importantly, from my short experience with it, I was always able to strip down DirectX code size better than OpenGL code... At that time, I was also working a bit more with the DirectX API... I even bought an ATI 5770 earlier to be able to play with the D3D11 Compute Shader API... I'm also mostly a Windows user... DirectX has documentation very well integrated in Visual Studio, a good SDK with lots of samples inside, a cleaner API (more true of the recent D3D10/D3D11), some cool tools like PIX to debug shaders... and I also thought that programming with DirectX on Windows might reduce the risk of incompatibilities between NVidia and ATI graphics cards (although I found that, at least with D3D9, this is not always true...).

So ok, DirectX was selected... but which version? I started my first implementation with D3D10. I know that the code is much more verbose than D3D9 and OpenGL 2.0, but I wanted to practice this somewhat "new" API a bit more than by just reading a book about it. I was also interested in plugging some text into the demo and tried an integration with the latest Direct2D/DirectWrite API.

Everything went well at the beginning with the D3D10 API. The code was clean, thanks to the thin layer I developed around DirectX to make the coding experience much closer to what I am used to in C# with SlimDX, for example. The resulting C++ code was something like this:

  // Set VertexBuffer for InputAssembler Stage
  device.InputAssembler.SetVertexBuffers(screen.vertexBuffer, sizeof(VertexDataOffline));

  // Set TriangleStrip PrimitiveTopology for InputAssembler Stage
  device.InputAssembler.SetPrimitiveTopology(PrimitiveTopology::TriangleStrip);

  // Set VertexShader for the current Pass
  device.VertexShader.Set(effect.vertexShader);

Very pleasant to develop with it, but because I wanted to test D2D1, I switched to D3D10.1 which can be configured to run on D3D10 hardware (with the feature level thing)... So I also started to slightly wrap up the Direct2D API and was able to produce very easily some really nice text... but wow... the code was a bit too large for a 4k (but would be perfect for a 64k).

Then during this experiment phase, I tried the D3D11 API with the Compute Shader thing... and found that the code is much more compact than D3D10 if you are performing some kind of... for example, raymarching... I didn't compare code size, but I suspect the code to be able to compete with its D3D9 counterpart (although, there is a downside in D3D11 : if you can afford a real D3D11 hardware, a compute shader can directly render to the screen buffer... otherwise, using the D3D11 Compute shader with features level 10, you have to copy the result from one resource to another... which might hit the size benefit...).

I was happy to see that the switch to D3D11 was easy, with some continuity from D3D10 in the API "look & feel"... However, I was disappointed to learn that working with D3D11 and D2D1 together was not straightforward, because D2D1 is only compatible with the D3D10.1 API (which you can run with feature levels 9.0 to 10), forcing you to initialize and maintain two devices (one for D3D10.1 and one for D3D11) and to play with DXGI shared resources between the devices... wow, lots of work, lots of code... and of course, out of the question for a 4k...

So I tried... plain old D3D9... and that was of course much more compact in size than its D3D10 counterpart... So for around two weeks in February, I played with those various APIs while implementing some basic scenes for the intro. I just had a bad surprise when releasing the intro, because lots of people were not able to run it: weird, because I was able to test it on several NVidia cards and at least my ATI 5770... I expected D3D9 to be a bit less sensitive to that than OpenGL... but I was wrong.

Raymarching optimization

I decided to go for an intro using the raymarching algorithm, which was more likely to deliver "fat" content in a tiny amount of code. That said, raymarching was already a bit "retired" after the fantastic intros released earlier in 2009 (Elevated - not really a raymarching intro but so impressive!, Sult, Rudebox, Muon-Baryon...etc). But I didn't have enough time to explore a new effect and was not even confident of finding anything interesting at that time... so... ok, raymarching.

So for one week, after building a 1st scene, I spent my time trying to optimize the raymarching algo. There was an instructive thread on pouet about this: "So, what do distance field equations look like? And how do we solve them?". I tried to implement some tricks like...

Generate a grid in the vertex shader (with 4x4 pixel cells for example) to precompute a rough view of the scene, storing the minimal distance to step before hitting a surface... then let the pixel shader fetch those interpolated distances (multiplied by a small reduction factor like .9f) and perform fine-grained raymarching with fewer iterations
Generate a pre-rendered 3D volume of the scene at a much lower density (like 96x96x96) and use this map to navigate in the distance fields while still performing some "sphere tracing" refinement if needed
Use some kind of level of detail on the scene: for example, instead of having a texture lookup (for the "bump mapping") at each step of the raymarching, let the raymarcher use a simplified analytical surface and switch to the more detailed one for the last step

Well, I have to admit that none of those techniques was really clever... and the results matched this lack of cleverness! None of them provided a significant speed optimization compared to the code size hit they generated.

So after one week of optimization, well, I just went back to a basic raymarcher algo. The shader was developed under Visual C++, integrated in the project (thanks to NShader syntax highlighting). I wrote a small C# tool to strip the shader comments and remove unnecessary spaces... integrated in the build (pre-build events in VC++). It's really enjoyable to work with this toolchain.
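
For readers who have never written one, a basic raymarcher ("sphere tracing") fits in a few lines. The sketch below is a CPU-side C++ illustration and not the Ergon shader: the scene (a unit sphere at the origin), the step limit and the epsilon are all assumptions chosen for the example.

```cpp
#include <cassert>
#include <cmath>

struct Vec3 { float x, y, z; };

inline Vec3 operator+(Vec3 a, Vec3 b) { return {a.x + b.x, a.y + b.y, a.z + b.z}; }
inline Vec3 operator*(Vec3 v, float s) { return {v.x * s, v.y * s, v.z * s}; }
inline float Length(Vec3 v) { return std::sqrt(v.x * v.x + v.y * v.y + v.z * v.z); }

// Hypothetical scene: signed distance to a unit sphere centered at the origin.
inline float SceneDistance(Vec3 p) { return Length(p) - 1.0f; }

// Sphere tracing: advance along the ray by the distance to the closest
// surface, which can never overshoot it. Returns the hit distance along the
// ray, or -1 when nothing is hit. `dir` is expected to be normalized.
inline float RayMarch(Vec3 origin, Vec3 dir, int maxSteps = 64,
                      float epsilon = 1e-3f, float maxDistance = 100.0f) {
    assert(maxSteps > 0);
    float t = 0.0f;
    for (int i = 0; i < maxSteps; ++i) {
        float d = SceneDistance(origin + dir * t);
        if (d < epsilon) return t;     // close enough: count it as a hit
        t += d;                        // safe step
        if (t > maxDistance) break;    // the ray left the scene
    }
    return -1.0f;
}
```

A ray starting at (0, 0, -3) and looking down +z reaches the sphere after travelling 2 units, while a ray shifted 5 units up misses it.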

Scenes design

For the scenes, I decided to use the same kind of technique used in the Rudebox 4k intro: leveraging more on the geometry and lights, but not on the materials. That is what made Rudebox a success, and I was motivated to build some complex CSG with boolean operations on basic elements (box, sphere...etc.). The nice thing about this approach is that it avoids using any kind of if/then/else inside the ISO surface for determining the material... just letting the lights be properly set in the scene might do the work. Yep, indeed, Rudebox is for example a scene with, say, a white material for all the objects. What makes the difference is the position of the lights in the scene, their intensity...etc. Ergon used the same trick here.
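
With distance fields, CSG boolean operations are commonly expressed with min and max. The C++ sketch below shows these standard operators on a sphere and a box; it illustrates the general technique and is not the Ergon scene code, and all function names are chosen for the example.

```cpp
#include <algorithm>
#include <cmath>

// Signed distance from point (x, y, z) to a sphere centered at the origin.
inline float SdSphere(float x, float y, float z, float radius) {
    return std::sqrt(x * x + y * y + z * z) - radius;
}

// Signed distance to an axis-aligned cube centered at the origin,
// with half-extent h on every axis.
inline float SdBox(float x, float y, float z, float h) {
    float qx = std::fabs(x) - h, qy = std::fabs(y) - h, qz = std::fabs(z) - h;
    float ox = std::max(qx, 0.0f), oy = std::max(qy, 0.0f), oz = std::max(qz, 0.0f);
    float outside = std::sqrt(ox * ox + oy * oy + oz * oz);
    float inside = std::min(std::max(qx, std::max(qy, qz)), 0.0f);
    return outside + inside;
}

// Boolean operators on two distance values a and b.
inline float OpUnion(float a, float b) { return std::min(a, b); }
inline float OpIntersection(float a, float b) { return std::max(a, b); }
inline float OpSubtract(float a, float b) { return std::max(a, -b); }  // a minus b
```

Carving a cube of half-extent 0.5 out of a unit sphere leaves the origin outside the result (distance 0.5) while the point (0.75, 0, 0) stays inside it. No branch on a material id is needed anywhere, which is the property the paragraph above relies on.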

I spent around two to three weeks building the scenes. I ended up with 4 scenes, each one quite cool on its own, with a consistent design among them. One of the scenes was using the fonts to render a wall of text in raymarching.

Because I'm not sure that I will be able to use those scenes, well, I'm going to post their screenshot here!

The 1st scene I developed during my D3D9/D3D10/D3D11 API experiments was a massive tentacle model coming out of a black hole. All the tentacles were moving around a weird cut sphere, with a central "eye"... I was quite happy with this scene, which had a unique design. From the beginning, I wanted to add some post-processing, to enhance the visuals and to make them a bit different from other raymarching scenes... So I went with a simple post-processing that was performing some patterns on the pixels, adding a radial blur to produce some kind of "ghost rays" coming out from the scene, making the corners darker, and adding a small flickering that gets stronger towards the corners. Well, this piece of code alone was already taking as much room as a scene on its own, but that was the price to pay for a genuine ambiance, so...

The colors and theming were almost settled from the beginning... I'm a huge fan of warm colors!

The 2nd scene was coupling font rendering with the raymarcher... a kind of flying flag, with the FRequency logo appearing from left to right with a light on it... (I will probably release those effects on pouet just for the record...). That was also a fresh use of raymarching... I hadn't seen anything like this in the latest 4k productions, so I was expecting to insert this text in the 4k, as it's not so common... The code to use the d3d font was not too fat... so I was still confident of being able to use those 2 scenes.

After that, I was looking for some nasty objects... so for the 3rd scene, I tried to randomly play with some weird functions and ended up with a kind of "raptor" creature... I also wanted to use a weird generated texture I had found a few months ago, which was perfect for it.

Finally, I wanted to use the texture to make a kind of lava sea with a moving snake on it... that was the last scene I coded (and of course, 2 others scenes that are too ugly to show here! :) ).

We also started at that time, in February, to work on the music and, as I explained in my earlier posts, we used the 4klang synth for the intro. But with all those scenes and a music prototype, the "crinklered" compressed exe was more around 5 KB... even if the shader code was already optimized in size, using some kind of preprocessor templating (like in Rudebox or Receptor). The intro was of course lacking a clear direction, there were no transitions between the scenes... and most importantly, it was not possible to fit all those scenes in 4k, while expecting the music to grow a little bit more in the final exe...

The story of the Worm-Lava texture

Last year, around November, while I was playing with several Perlin-like noises, I found an interesting variation using Perlin noise and the marble-cosine effect that was able to represent some kind of worms, quite freaking ugly in some way, but that was a unique texture effect!

(Click to enlarge, lots of details in it!)

This texture was originally developed in C#, but the code was quite straightforward to port to a texture shader... Yep, it's probably an old D3D9 trick to use the function D3DXFillTextureTX to directly fill a texture from a shader with a single line of code... Why use this? Because it was the only way to get a noise() function accessible from a shader without having to implement it... As weird as it may sound, the HLSL Perlin noise() function is not accessible outside a texture shader. A huge drawback of this method is that the shader is not a real GPU shader, but is instead computed on the CPU... which explains why the ergon intro takes so long to generate the texture at the beginning (with a 1280x720 texture resolution for example).

So what does the texture shader that generates this texture look like?

// -------------------------------------------------------------------------
// worm noise function
// -------------------------------------------------------------------------
#define ty(x,y) (pow(.5+sin((x)*y*6.2831)/2,2)-.5)
#define t2(x,y) ty(y+2*ty(x+2*noise(float3(cos((x)/3)+x,y,(x)*.1)),.3),.7)
#define tx(x,y,a,d) ((t2(x, y) * (a - x) * (d - y) + t2(x - a, y) * x * (d - y) + t2(x, y - d) * (a - x) * y + t2(x - a, y - d) * x * y) / (a * d))

float4 x( float2 x : position, float2 y : psize) : color {
   float a=0,d=64;
   // Modified FBM functions to generate a blob texture
   for(;d>=2;d/=2)
       a += abs(tx(x.x*d,x.y*d,d,d)/d);
   return a*2;   
}

The tx macro basically applies tiling to the noise.
The core t2 and ty macros are the ones that generate this "worm-noise". It is in fact a tricky variation of the usual cosine Perlin noise. Instead of having something like cos(x + noise(x,y)), I have something like special_sin( y + special_sin( x + noise(cos(x/3)+x,y), power1), power2), with a special_sin function like ((1 + sin(x*power*2*PI))/2) ^ 2

Also, don't be afraid... this formula didn't come out of my head like this... it clearly came after lots of permutations from the original function, with lots of run/stop/change_parameters steps! :D
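To make the nesting easier to follow, here is a rough C++ transcription of the ty and t2 macros. It is only a sketch: placeholder_noise is a made-up stand-in (HLSL's noise() is Perlin noise and is not reproduced here), and the function names special_sin and worm_noise are chosen for illustration, so the output will not match the real texture.

```cpp
#include <cassert>
#include <cmath>

// Counterpart of the ty macro: a squared sine, re-centred to [-0.5, 0.5].
double special_sin(double x, double power) {
    assert(power > 0.0);           // the shader only uses .3 and .7
    const double two_pi = 6.2831;  // same truncated constant as the shader
    double s = 0.5 + std::sin(x * power * two_pi) / 2.0;  // in [0, 1]
    return s * s - 0.5;
}

// ASSUMPTION: placeholder for HLSL noise(float3). This is NOT Perlin noise,
// just a smooth bounded function so that the sketch compiles and runs.
double placeholder_noise(double x, double y, double z) {
    return 0.5 * std::sin(1.7 * x + std::sin(2.3 * y + z));
}

// Counterpart of the t2 macro: the noise perturbs x, the result perturbs y.
double worm_noise(double x, double y) {
    double n = placeholder_noise(std::cos(x / 3.0) + x, y, x * 0.1);
    double inner = special_sin(x + 2.0 * n, 0.3);
    return special_sin(y + 2.0 * inner, 0.7);
}
```

The tiling (tx) and the FBM-style loop over octaves are left out on purpose; they wrap worm_noise without changing its shape.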

Music and synchronization

It took some time to build the music theme and to be satisfied with it... At the beginning, I let ulrick make a first version of the music... But because I had a clear view of the design and direction, I was expecting a very specific progression in the tune and even in the chords used... That was really annoying for ulrick (excuse me, my friend!), as I was very intrusive in the composition process... At some point, I ended up making a 2-pattern example of what I wanted in terms of chords and musical ambiance... and ulrick was kind enough to take this sample pattern and clever enough to add some of the intro's musical feeling to it. He will be able to talk about this better than me, so I'll ask him if he can insert a small explanation here!

ulrick here: « working with @lx on this prod was a very enjoyable job. I started a music which @lx did not like very much, it did not reflect the feelings that @lx wanted to give through the Ergon. He thus composed a few patterns using a very emotional musical scale. I entered into the music very easily and added my own stuffs. For the anecdote, I added a second scale to the music to allow for a clearer transition between the first and second parts of the Ergon. After doing so, we realized that our music actually used the chromatic scale on E »

The synchronization was the last part of the work in the demo. I first used the default synchronization mechanism from 4klang... but I was lacking some features: for example, if the demo was running slowly, I needed to know exactly where I was... Using plain 4klang sync, I was missing some events on slow hardware, which even prevented the intro from switching between the scenes, because the switching event was missed by the rendering loop!

So I did my own small synchronization based on the regular events of the snare and a reduced view of the sample patterns for these particular events. This is the only part of the intro that was developed in x86 assembler in order to keep it as small as possible.

The whole code was something like this :

static float const_time = 0.001f;
  static int SAMPLES_PER_DRUMS = SAMPLES_PER_TICK*16;
  static int SAMPLES_PER_DROP_DRUMS = SAMPLES_PER_TICK*4;
  static int SMOOTHSTEP_FACTOR = 3;

  static unsigned char drum_flags[96] = {
        // pattern n°   time        z.z sequence
   1,1,1,1,   // pattern 0 0         0 0
   1,1,1,1,   // pattern 1 7,384615385 4 1
   0,0,0,0,   // pattern 2 14,76923077 8 2
   0,0,0,0,   // pattern 3 22,15384615 12 3
   0,0,0,0,   // pattern 4 29,53846154 16 4
   0,0,0,0,   // pattern 5 36,92307692 20 5
   0,0,0,0,   // pattern 6 44,30769231 24 6
   0,0,0,0,   // pattern 7 51,69230769 28 7
   0,0,0,1,   // pattern 8 59,07692308 32 8
   0,0,0,1,   // pattern 8 66,46153846 36 9
   1,1,1,1,   // pattern 9 73,84615385 40 10
   1,1,1,1,   // pattern 9 81,23076923 44 11
   1,1,1,1,   // pattern 10 88,61538462 48 12
   0,0,0,0,   // pattern 11 96         52 13
   0,0,0,0,   // pattern 2 103,3846154 56 14
   0,0,0,0,   // pattern 3 110,7692308 60 15
   0,0,0,0,   // pattern 4 118,1538462 64 16
   0,0,0,0,   // pattern 5 125,5384615 68 17
   0,0,0,0,   // pattern 6 132,9230769 72 18
   0,0,0,0,   // pattern 7 140,3076923 76 19
   0,0,0,1,   // pattern 8 147,6923077 80 20
   1,1,1,1,   // pattern 12 155,0769231 84 21
   1,1,1,1,   // pattern 13 162,4615385 88 22
  };          

  // Calculate time, synchro step and boom shader variables
  __asm {
   fild dword ptr [time]      // st0 : time
   fmul dword ptr [const_time]     // st0 = st0 * 0.001f
   fstp dword ptr [shaderVar.x]    // shaderVar.x = time * 0.001f
   mov eax, dword ptr [MMTime.u.sample]
   cdq
   sub eax, SAMPLES_PER_TICK*8
   jae not_first_drum
   xor eax,eax
not_first_drum:
   idiv dword ptr [SAMPLES_PER_DRUMS]   // eax = drumStep , edx = remainder step
   mov dword ptr [drum_step], eax
   fild dword ptr [drum_step]
   fstp dword ptr [shaderVar.z]    // shaderVar.z = drumStep
   
not_end: cmp byte ptr [eax + drum_flags],0
   jne no_boom

   mov eax, SAMPLES_PER_TICK*4
   sub eax,edx
   jae boom_ok
   xor eax,eax
boom_ok:
   mov dword ptr [shaderVar.y],eax
   fild dword ptr [shaderVar.y]
   fidiv dword ptr [SAMPLES_PER_DROP_DRUMS] // st0 : boom
   fild dword ptr [SMOOTHSTEP_FACTOR]   // st0: 3, st1-4 = boom
   fsub st(0),st(1)       // st0 : 3 - boom , st1-3 = boom
   fsub st(0),st(1)       // st0 : 3 - boom*2, st1-2 = boom
   fmul st(0),st(1)       // st0 : boom * (3-boom*2), st1 = boom
   fmulp st(1),st(0)
   fstp dword ptr [shaderVar.y]
no_boom:
  };

That was smaller than what I was able to do with pure 4klang sync... with the drawback that the sync was probably too simplistic... but I couldn't afford more code for the sync... so...
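For readers who don't read x87 code fluently, the assembler block above does roughly the following. This is a hedged C++ reading of it, not the code shipped in the intro: the value of kSamplesPerTick is an assumption for illustration (the real one comes from the 4klang export), the SyncVars struct and compute_sync name are invented, and the bounds check does not exist in the original.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

constexpr int kSamplesPerTick = 5088;  // ASSUMED value, illustration only
constexpr int kSamplesPerDrum = kSamplesPerTick * 16;     // SAMPLES_PER_DRUMS
constexpr int kSamplesPerDropDrum = kSamplesPerTick * 4;  // SAMPLES_PER_DROP_DRUMS

struct SyncVars {
    int drum_step;  // goes to shaderVar.z
    float boom;     // goes to shaderVar.y, in [0, 1]
};

SyncVars compute_sync(int sample, const std::vector<unsigned char>& drum_flags) {
    assert(sample >= 0);
    // The first drum hits 8 ticks after the start; clamp earlier samples to 0.
    int shifted = std::max(sample - kSamplesPerTick * 8, 0);
    int step = shifted / kSamplesPerDrum;       // idiv quotient
    int remainder = shifted % kSamplesPerDrum;  // idiv remainder

    SyncVars out{step, 0.0f};
    // A non-zero flag suppresses the boom for that step. The assembler leaves
    // shaderVar.y untouched in that case; returning 0 here is a simplification.
    bool in_range = step < static_cast<int>(drum_flags.size());
    if (in_range && drum_flags[step] == 0) {
        // Linear decay from 1 to 0 over the 4 ticks following the drum hit...
        int left = std::max(kSamplesPerDropDrum - remainder, 0);
        float b = static_cast<float>(left) / kSamplesPerDropDrum;
        // ...shaped by the smoothstep polynomial b*b*(3 - 2*b).
        out.boom = b * b * (3.0f - 2.0f * b);
    }
    return out;
}
```

Because everything is derived from the current sample position, a slow frame can never "miss" an event: the step and the boom are recomputed from scratch on every frame.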

Final mixing

Once the music was almost finished, I spent a couple of days working on the transitions, sync and camera movements. Because it was not possible to fit the 4 scenes, I had to mix scene 3 (the raptor) and scene 4 (the snake and the lava sea), and find a way to put a transition through a "central brain". Ulrick wanted to put a different music style for the transition, and I was not confident about it... until I put the transition in action, letting the brain collapse while the space under it was digging all around... and the music fitted very well! Cool!

I also used a single big shader for the whole intro, with some if (time < x) then scene_1 else scene_2... etc. I didn't expect to do this, because this kind of branching hurts performance in the pixel shader... But I was really running out of space here, and the only solution was in fact to use a single shader with some repetitive code. Here is an excerpt from the shader code: you can see how scene and camera management is done, as well as the lights. This part compressed quite well thanks to its repetitive pattern.

// -------------------------------------------------------------------------
// t3
// Helper function to rotate a vector. Usage :
// t3(mypoint.xz, .7);  <= rotate mypoint around Y axis with .7 radians
// -------------------------------------------------------------------------
float2 t3(inout float2 x,float y){
 return x=x*cos(y)+sin(y)*float2(-x.y,x.x);
}

// -------------------------------------------------------------------------
// v : main raymarching function
// -------------------------------------------------------------------------
float4 v(float2 x:texcoord):color{ 
 float a=1,b=0,c=0,d=0,e=0,f=0,i; 
 float3 n,o,p,q,r,s,t=0,y;
 int w;
 r=normalize(float3(x.x*1.25,-x.y,1));     // ray
 x = float2(.001,0);          // epsilon factor
 
 // Scene management
 if (z.z<39) {
  w = (z.z<10)?0:(z.z>26)?3+int(fmod(z.z,5)):int(fmod(z.z,3));

  //w=4;
  if (w==0) { p=float3(12,5+30*smoothstep(16,0,z.x),0);t3(r.yz,1.1*smoothstep(16,0,z.x));t3(r.xz,1.54); }
  if (w==1) { p=float3(-13,4,-8);t3(r.yz,.2);t3(r.xz,-.5);t3(r.xy,sin(z.x/3)/3); }
  if (w==2) { p=float3(0,8.5,-5);t3(r.yz,.2);t3(r.xy,sin(z.x/3)/5); }
  if (w==3) {
   p=float3(13+sin(z.x/5)*3,10+3*sin(z.x/2),0); 
   t3(r.yz, sin(z.x/5)*.6);
   t3(r.xz, 1.54+z.x/5);
   t3(r.xy, cos(z.x/10)/3);
   t3(p.xz,z.x/5);
  }

  if (w == 4) {
   p=float3(30+sin(z.x/5)*3,8,0);
   t3(r.yz, sin(z.x/5)/5);
   t3(r.xz, 1.54+z.x/3);
   t3(r.xy, sin(z.x/10)/3);
   t3(p.xz,z.x/3);
  } 

  if (w > 4) {
   p=float3(4.5,25+10*sin(z.x/3),0);
   t3(r.yz, 1.54*sin(z.x/5));
   t3(r.xz, .7+z.x/2);
   t3(r.xy, sin(z.x/10)/3);
   t3(p.xz,z.x/2);
  }  
 } else if (z.z<52) {
  p=float3(20,20,0);
  t3(r.yz, .9);
  t3(r.xz, 1.54+z.x/4);  
  t3(p.xz,z.x/4);
 } else if (z.z<81) {
  w = int(fmod(z.z,3));
  if (w==0 ) {
   p=float3(40+sin(z.x/5)*3,8,0);
   t3(r.yz, sin(z.x/5)/5);
   t3(r.xz, 1.54+z.x/3);
   t3(r.xy, sin(z.x/10)/3);
   t3(p.xz,z.x/3);
  }
  if (w==1 ) {
   p=float3(-10,30,0);
   t3(r.yz, 1.1);
   t3(r.xz, z.x/4);  
  }
  if (w==2 ) {
   p=float3(25+sin(z.x/5)*3,10+3*sin(z.x/2),0); 
   t3(r.yz, sin(z.x/5)/2);
   t3(r.xz, 1.54+z.x/5);
   t3(r.xy, cos(z.x/10)/3);
   t3(p.xz,z.x/5);   
  }
 } else { 
  p=float3(0,4,8);
  t3(r.yz,sin(z.x/5)/5);
  t3(r.xy,cos(z.x/4)/2);
  t3(r.xz,-1.54+smoothstep(0,4,z.x-155)*(z.x-155)/3);
 }

 
 // Boom effect on camera
 p.x+=z.y*sin(111*z.x)/4;
 
 // Lights 
 static float4 l[6] = {{.7,.2,0,2},{.7,0,0,3},{.02,.05,.2,7},
 {(4+10*step(24,z.z))*cos(z.x/5),-5,(4+10*step(24,z.z))*sin(z.x/5),0},
 {-30+5*sin(z.x/2),8,6+10*sin(z.x/2),0},
 {25,25,10,0}
 };
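The camera selection at the top of this excerpt is easier to see outside of the size-optimized shader. Below is a small C++ mirror of it, as a sketch only: section_for_step and camera_for_step are illustrative names, the section numbers 0 to 3 are just labels for the four top-level branches, and drum_step plays the role of z.z (the drum step computed by the sync code).

```cpp
#include <cassert>
#include <cmath>

// Which top-level branch of the shader is active: 0, 1, 2 or 3.
int section_for_step(float drum_step) {
    assert(drum_step >= 0.0f);
    if (drum_step < 39.0f) return 0;
    if (drum_step < 52.0f) return 1;
    if (drum_step < 81.0f) return 2;
    return 3;
}

// Camera index w inside the first section, mirroring
// w = (z.z<10)?0:(z.z>26)?3+int(fmod(z.z,5)):int(fmod(z.z,3));
int camera_for_step(float drum_step) {
    assert(drum_step >= 0.0f);
    if (drum_step < 10.0f) return 0;  // opening camera
    if (drum_step > 26.0f) return 3 + static_cast<int>(std::fmod(drum_step, 5.0f));
    return static_cast<int>(std::fmod(drum_step, 3.0f));  // cycle cameras 0..2
}
```

Since the drum step only changes on a drum event, the camera cuts land on the beat without any extra sync data.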

Compression statistics

Final compression results are given in the following table:

So to summarize, total exe size is 4070 bytes, and is composed of :

Synth code + music data is taking around 35% of the total exe size = 1461 bytes
Shader code is taking 36% = 1467 bytes
Main code + non shader data is 14% = 549 bytes
PE + crinkler decoder + crinkler import is 15% = 593 bytes
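As a quick sanity check on these figures, the four parts do add up to the 4070 bytes, and the percentages can be recomputed with a few lines of C++. The Part struct and the function names are only illustrative; the byte counts are the ones listed above.

```cpp
#include <array>
#include <cassert>

struct Part {
    const char* name;
    int bytes;
};

// Byte counts taken from the list above.
constexpr std::array<Part, 4> kParts{{
    {"synth code + music data", 1461},
    {"shader code", 1467},
    {"main code + non shader data", 549},
    {"PE + crinkler decoder + crinkler import", 593},
}};

int total_bytes() {
    int total = 0;
    for (const Part& part : kParts) total += part.bytes;
    return total;
}

// Share of one part in the final executable, in percent.
double percent_of_total(int bytes) {
    int total = total_bytes();
    assert(total > 0);
    return 100.0 * bytes / total;
}
```

With one decimal, this gives about 35.9%, 36.0%, 13.5% and 14.6%, so the synth and the shader each take a bit more than a third of the exe.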

The intro was finished around 13 March 2010, well ahead of BP2010. So that was damn cool... I spent the rest of my time until BP2010 trying to develop a procedural 4k gfx, using D3D11 compute shaders, raymarching and a Global Illumination algorithm... but the results (algo finished during the party) disappointed me... And when I saw the fantastic Burj Babil by Psycho, I realized he was right to use a plain raymarcher without any complicated true light management... a good "basic" raymarching algo with some fine-tuned tone mapping was much more relevant here!

Anyway, my GI experiment on the compute shader will probably deserve an article here.

I really enjoyed making this demo and seeing that ergon was able to make it into the top 3... after seeing BP2009, I was not expecting the intro to be in the top 3 at all!... although I know that the competition this year was far easier than at the previous BP!

Anyway, that was nice to work with my friend ulrick... and to contribute to the demoscene with this prod. I hope that I will be able to keep on working on the demos like this... I still have lots of things to learn, and that's cool!

Import and Export 3D Collada files with C#/.NET

2010-08-25T09:06:00.011+11:00

Looking at what kind of 3D file format I could work with, I know that Collada is a well established format, supported by several 3D modeling tools, with a public specification and an XML Schema grammar description, very versatile - and thus very verbose. Over the last few years, I saw a couple of articles on, for example, "how to import them in the XNA content pipeline" or about Skinning Animation with Collada and XNA, with some brute force code, using DOM or XPath to navigate around the Collada elements.

Now, looking at the opportunity to use this format and to build a small 3D demo framework in C# around SlimDx, I tried to find a full implementation of a Collada loader, derived from the official xsd specification... but was disappointed to learn that most of the attempts failed to use the specification with an automatic tool like xsd.exe from Microsoft. If you don't know what xsd.exe is, it's simply a tool to work with XML schemas: generate schemas from a DLL assembly, generate C# classes from an xsd schema, etc. It is very useful when you want to use an object model described in xsd directly from code. I will explain later why it is more convenient to use, and what you can do with it that you cannot achieve with the same efficiency using raw DOM/XPath access.

I had already used the xsd tool in the past for the NRenoiseTools project and found it quite powerful and simple, and was finally quite happy with it... But why was the Collada xsd not working with this tool?

Patching the Collada xsd

First, I downloaded the Collada xsd spec from the Khronos Group and ran it through the tool... too bad, there was indeed an error preventing xsd from working on it:

Error: Error generating classes for schema 'COLLADASchema_141'.
- Group 'glsl_param_type' from targetNamespace='http://www.collada.org/2005/11/COLLADASchema' has invalid definition: Circular group reference.

This error was quite old and even had a bug submitted to Connect: "xsd.exe fails with COLLADA schema. Prints circular reference problem". Well, the problem is that, looking more deeply at the xsd schema, glsl_param_type doesn't make any circular group reference... weird...

Anyway, because this was just an error on the GLSL profile part of Collada spec, I removed this part, as this is not so much used... and did the same for CG and GLES profiles that had the same error.

Bingo! The xsd.exe tool was able to generate a (large) C# source file. I found it so easy that I was wondering why people had had so much pain with it in the past. Well, I ran a simple program to load a sample Collada DAE file... and got a deep exception:



Member 'Text' cannot be encoded using the XmlText attribute

A few clicks away on the Internet, I found someone with exactly the same error... coming from this code:

/// <remarks/>
[System.Xml.Serialization.XmlTextAttribute()]
public double[] Text {
    get {
        return this.textField;
    }
    set {
        this.textField = value;
    }
}

XmlTextAttribute specifies that the "Text" property should be serialized inside the content of the xml element... but unfortunately, the XmlText attribute doesn't work on arrays of primitives!

Someone suggested several options to him, and the simplest among them was to use a plain string to serialize the content instead of an array... This is quite a common trick if you are familiar with xml serialization in .NET (and also with WCF DataContract xml serialization). So I went this way... It was quite easy, because the file had fewer than 10 occurrences to patch, so I patched them manually... with code like the following:

/// <remarks />
[XmlText]
public string _Text_
{
    get { return COLLADA.ConvertFromArray(Values); }

    set { Values = COLLADA.ConvertDoubleArray(value); }
}

/// <remarks />
[XmlIgnore]
public double[] Values
{
    get { return textField; }
    set { textField = value; }
}

I put an XmlIgnore on the renamed "Values" property that uses the double[] and added a string property that performs a two-way conversion to those values (while adding the ConvertFromArray and ConvertDoubleArray functions at the end of the xsd generated file).
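The two conversion helpers are not shown in this post, but the idea is simple: the element content is a whitespace-separated list of numbers. Here is the same two-way conversion sketched in C++ (the language used for the other sketches on this blog). The names parse_doubles and format_doubles are hypothetical and this is not the code of ConvertDoubleArray/ConvertFromArray.

```cpp
#include <cassert>
#include <cstddef>
#include <iomanip>
#include <limits>
#include <sstream>
#include <string>
#include <vector>

// Text content of the element -> array of doubles.
std::vector<double> parse_doubles(const std::string& text) {
    std::istringstream in(text);
    std::vector<double> values;
    double value;
    while (in >> value) values.push_back(value);
    assert(in.eof());  // a malformed token would stop the loop before the end
    return values;
}

// Array of doubles -> text content, values separated by a single space.
std::string format_doubles(const std::vector<double>& values) {
    std::ostringstream out;
    // max_digits10 keeps enough digits for a lossless round trip.
    out << std::setprecision(std::numeric_limits<double>::max_digits10);
    for (std::size_t i = 0; i < values.size(); ++i) {
        if (i != 0) out << ' ';
        out << values[i];
    }
    return out.str();
}
```

Whatever the language, the important property is that writing and reading back gives exactly the same values, otherwise an import/export cycle would slowly alter the model.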

And... It was fully working!

Using Collada model from C#

With the generated classes, it is much easier to safely read the document and to access Collada elements, with IntelliSense completion to help you in this laborious task. I have also added just 2 methods to load and save dae files directly from a stream or a file. The code iterating over Collada elements looks something like this (dummy code):

// Load the Collada model
COLLADA model = COLLADA.Load(inputFileName);

// Iterate on libraries
foreach (var item in model.Items)
{
    var geometries = item as library_geometries;
    if (geometries == null)
        continue;
    
    // Iterate on geometry in library_geometries
    foreach (var geom in geometries.geometry)
    {
        var mesh = geom.Item as mesh;
        if (mesh == null)
            continue;
        
        // Dump source[] for geom
        foreach (var source in mesh.source)
        {
            var float_array = source.Item as float_array;
            if (float_array == null)
                continue;
        
            Console.Write("Geometry {0} source {1} : ",geom.id, source.id);
            foreach (var mesh_source_value in float_array.Values)
                Console.Write("{0} ",mesh_source_value);
            Console.WriteLine();
        }
    
        // Dump Items[] for geom
        foreach (var meshItem in mesh.Items)
        {
        
            if (meshItem is vertices)
            {
                var vertices = meshItem as vertices;
                var inputs = vertices.input;
                foreach (var input in inputs)
                    Console.WriteLine("\t Semantic {0} Source {1}", input.semantic, input.source);                                
            }
            else if (meshItem is triangles)
            {
                var triangles = meshItem as triangles;
                var inputs = triangles.input;
                foreach (var input in inputs)
                    Console.WriteLine("\t Semantic {0} Source {1} Offset {2}",     input.semantic, input.source, input.offset);
                Console.WriteLine("\t Indices {0}", triangles.p);
            }
        }
    }
}

// Save the model
model.Save(inputFileName + ".test.dae");

One thing that could be of interest is that not only can you easily load a Collada dae file... but you can export one as well! I did a couple of experiments to verify that importing and exporting a Collada file produces the same file, and it seems to work like a charm... meaning that if you want to produce some procedural Collada models to load them back in a 3D modeling tool, this is quite straightforward! But anyway, my main concern was to have a solid Collada loader that is compliant with the spec and performs most of the tedious field conversions for me.

Of course, having such a loader in C# is just a very small part of the whole picture in order to create a full importer supporting most of the Collada features for a custom renderer... but that's probably the least exciting part of developing such an importer, so having this C# Collada model will probably be helpful.

Note: You can download the C# Collada model here. This is only a single C# source file that you can add directly to your project!

The model is stored inside the namespace Collada141 (in order to support multiple incompatible versions of the Collada spec), and the root class (as specified in the xsd) is the COLLADA class, which also contains the two added Load/Save methods.

Also, a nice thing about the model generated by xsd.exe is that it allows you to extend the object model outside the csharp file. All the classes are declared partial, so it's quite easy to add some helper methods directly to the Collada object model without touching the generated file.

Let me know if you are using it!

Democoding, tools coding and coding scattering

2010-08-13T07:58:00.005+11:00

Not many posts here for a while... So I'm just going to recap some of the coding work I have done so far... you will notice that it's going in lots of directions, depending on opportunities and ideas, sometimes not related to democoding at all... not really ideal when you want to release something! ;)

So, here are some of the directions I have been working in so far...

C# and XNA

I tried to work more with C# and XNA... looking for an opportunity to code a demo in C#... I even started a post about it a few months ago, but left it in a draft state. XNA is really great, but I had some bad experiences with it... I was able to use it without requiring a full install, but while playing with model loading, I hit a weird bug called the black model bug. Anyway, I might come back to C# for DirectX stuff... SlimDx, for example, is really helpful for that.

A 4k/64k softsynth

I have coded a synth dedicated to 4k/64k coding. Although, right now, I only have the VST and GUI fully working under Renoise.. but not yet the asm 4k player! ;)

The main idea was to build a FM8/DX7-like synth, with exactly the same output quality (excluding some fancy stuff like the arpeggiator...). The synth was developed in C# using vstnet, but should be considered more as a prototype in this language... because the asm code generated by the JIT is not really good when it comes to floating point calculation... anyway, it was really good to develop on this platform, being able to prototype the whole thing in a few days (and of course, many more days to add rich GUI interaction!).

I still have to add a sound library file manager and the importer for DX7 patches... Yes, you read that right... my main concern is to provide as many ready-to-use patches as possible for ulrick (our musician at FRequency)... Decoding a DX7 patch is well documented around the net... but the more complex part was to make it decode like the FM8 does... and that was tricky... Right now, all the transform functions are in an Excel spreadsheet, but I have to code them in C# now!

You may wonder why I develop the synth in C# if the main target is to code the player in x86 asm. Well, for practical reasons: I needed to quickly experiment with the versatility of the sounds of this synth, and I'm much more familiar with .NET winforms to easily build a complex GUI. Still, I have done the whole synth with the 4k limitation in mind... especially regarding data representation and complexity of the player routine.

For example, for the 4k mode of this synth, waveforms are strictly restricted to only one: sin! No noise, no sawtooth, no square... what? A synth without those waveforms?... but yeah... When I looked back at the DX7 synth implementation, I realized that it was using only a pure "sin"... but with the complex FM routing mechanism + the feedback on the operators, the DX7 is able to produce a large variety of sounds ranging from strings, bells, bass... to drumkits, and so on...

I also did a couple of effects, mainly a versatile variable delay line to implement Chorus/Flanger/Reverb.
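To give an idea of why a single sine waveform is enough, here is the textbook two-operator case: one sine (the modulator) is added to the phase of another sine (the carrier). This is a generic illustration and not code from the synth described here: there is no envelope, no feedback and no routing matrix, and render_fm and its parameters are names chosen for the example.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Renders `count` samples of a 2-operator FM voice.
//   carrier_hz : pitch of the note
//   ratio      : modulator frequency = carrier_hz * ratio
//   index      : modulation depth; 0 gives back a pure sine
std::vector<float> render_fm(double carrier_hz, double ratio, double index,
                             double sample_rate, int count) {
    assert(sample_rate > 0.0 && count >= 0);
    const double two_pi = 6.283185307179586;
    std::vector<float> out(static_cast<std::size_t>(count));
    for (int n = 0; n < count; ++n) {
        double t = n / sample_rate;
        double modulator = std::sin(two_pi * carrier_hz * ratio * t);
        // The modulator output is added to the carrier phase.
        out[n] = static_cast<float>(
            std::sin(two_pi * carrier_hz * t + index * modulator));
    }
    return out;
}
```

Raising the index adds more and more sidebands around the carrier, and non-integer ratios give inharmonic, bell-like spectra; chaining or feeding back several such operators is what produces the variety mentioned above.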

So basically, I should end up with a synth with two modes :
- 4k mode : only 6 oscillators per instrument, only sin oscillators, simple ADSR envelope, full FM8-like routing for operators, fixed key scaling/velocity scaling/envelope scaling. Effects per instrument/global with a minimum delay line + optional filters. And last but not least, polyphony : that's probably the thing I miss the most in 4k synths nowadays...
- 64k mode : up to 8 oscillators per instrument, all FM8 oscillators+filters+WaveShaping+RingModulation operators, 64-step FM8-like envelope, dynamic key scaling/velocity scaling/envelope scaling. More effects, with better quality, 2 parallel + serial effect lines per instrument. Additional effects channel to route instruments to the same effects chain. Modulation matrix.

The 4k mode is in fact restricting the use of the 64k mode, more at the GUI level. I'm currently targeting only the 4k mode, while designing the synth to make it ready to support 64k mode features.

What's next? Well, finish the C# part (file manager and dx7 import) and start the x86 asm player... I just hope to be under 700 compressed bytes for the 4k player (while the 64k mode will be written in C++, with a more comfortable limit of around 5 KB of compressed code)... but hey, until it's coded... it's pure speculation!... And as you can see, the journey is far from finished! ;)

Context modeling Compression update

During this summer, I came back to the compression experiment I did last year... The current status is pretty much on hold... The compressor is quite good, sometimes better than crinkler for 4k... but the prototype of the decompressor (not working, not tested...) takes over 100 bytes more than crinkler's... So in the end, I know that I would be off by 30 to 100 bytes compared to crinkler... and this is not motivating me to finish the decompressor and to get it really running.

The basic idea was to take the standard context modeling approach from Matt Mahoney (also known as PAQ compression; Matt did a fantastic job with his research and open source compressors, by the way), using a dynamic neural network with an order of 8 (8 bytes of context history), with the same mask selection approach as crinkler + some new context filtering at the bit level... In the end, the decompressor uses the FPU to decode the whole thing... as it needs ln2() and pow2() functions... So during the summer, I thought of using another logistic activation function to get rid of the FPU: the standard sigmoid used in the neural network, in base 2, is 1/(1+2^-x), so I found something similar with y = (x / (1 + |x|) + 1) / 2 from David Elliott (some references here). I didn't have any computer at that time to test it, so I spent a few days doing some math optimization on it, while calculating the logit function (the inverse of this logistic function).

I came back home very excited to test this method... but I was really disappointed... the function had a very bad impact on the compression ratio, around 20% worse, so in the end it was completely useless!

If by next year I'm not able to release anything from this... I will put all this work in open source, at least for educational purposes... someone will certainly be cleverer than me on this and tweak the code size down!
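For reference, here are the two logistic functions and their inverses written out in C++. This is only a sketch of the formulas quoted above with illustrative function names; the FPU-free fixed-point versions that a real decompressor would need are not shown. The inverse of the second function follows from s = 2y - 1 = x / (1 + |x|), which gives x = s / (1 - |s|).

```cpp
#include <cassert>
#include <cmath>

// Standard sigmoid in base 2: needs a pow2(), hence the FPU.
double sigmoid2(double x) {
    return 1.0 / (1.0 + std::exp2(-x));
}

// Its inverse (logit in base 2): needs a log2().
double logit2(double y) {
    assert(y > 0.0 && y < 1.0);
    return std::log2(y / (1.0 - y));
}

// Elliott-style squashing function, rescaled to (0, 1):
// only an absolute value, additions and one division.
double elliott(double x) {
    return (x / (1.0 + std::fabs(x)) + 1.0) / 2.0;
}

// Its inverse: with s = 2y - 1, x = s / (1 - |s|).
double elliott_inverse(double y) {
    assert(y > 0.0 && y < 1.0);
    double s = 2.0 * y - 1.0;
    return s / (1.0 - std::fabs(s));
}
```

Both functions map 0 to 0.5 and are monotonic, but the second one approaches 0 and 1 much more slowly (for example, x = 1 gives 0.75), so probabilities are squashed differently than with the true sigmoid.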

A SlimDx-like DirectX wrapper in C++

Recall that for the ergon intro, I had been working with a very thin layer around DirectX to wrap enums/interfaces/structures/functions. I did that around D3D10, a bit of D3D11, and a bit of D3D9 (which was the one I used for ergon). The goal was to achieve a C#-like DirectX interface in C++. While the code was written almost entirely manually, I was wondering if I could generate it directly from the DirectX header files...

So for the last few days, I have been working a bit on this... I'm using boost::wave as the preprocessor library... and I have to admit that the C++ guys from boost lost their minds with templates... It's amazing how they made something simple so complex with templates... I wanted to use this in a C++/CLI managed .NET extension to ease my development in C#, but I ended up with a template error at the link stage... an incredible error with a line full of concatenated templates, even freezing Visual Studio when I wanted to see the errors in the error list!

Templates are really nice when they are not used too intensively... but when everything is templatized in your code, it becomes very hard to use a library fluently, and it's sometimes impossible to understand a template error when it is more than 100 lines full of cascading template types!

Anyway, I was able to plug boost::wave into a native dll and call it from a C# library... the next step is to see how much I can get from the DirectX header files to extract a form of IDL (Interface Definition Language). If I cannot get something relevant in the next week, I might postpone this task until I don't have anything more important to do! The good thing is that, for the D3D11 headers for example, you can see that those files were auto-generated from a mysterious d3d11.idl file used internally at Microsoft (although it would have been easier to get this file directly!)... so it means that the whole header is quite easy to parse, as the syntax is quite systematic.

Ok, this is probably not linked to intros... or probably only to 64k... and I'm not sure I will be able to finish it (much like rmasm)... And this kind of work is keeping me away from working directly with DirectX, experimenting with rendering techniques and so on... Well, I also have to admit that for the past few years I have been more attracted to building tools that enhance coding productivity (not necessarily only mine)... I don't like to do too many things manually... so every time there is an opportunity to automate a process, I can't refrain from making it automatic! :D

AsmHighlighter and NShader next update

Following my bad appetite for tools, I need to make some updates to AsmHighlighter and NShader, to add some missing keywords, patch a bug, support the new VS2010 version... whatever... When you release this kind of open source project, well, you have to maintain it, even if you don't use it much... because other people are using it and are asking for improvements... that's the other side of the picture...

So because I have to maintain those 2 projects, and they in fact share more than 95% of the same code, I have decided to merge them into a single one... which will be available soon on codeplex as well. That will be easier to maintain, ending up with only one project to update.

The main features people are asking for are the ability to add keywords easily and to map file extensions to the syntax highlighting system... So I'm going to generalize the design of the two projects to make them more configurable... hopefully, this will cover the main feature requests...

An application for Windows Phone 7... meh?

Yep... I have to admit that I'm really excited by the upcoming Windows Phone 7 metro interface... I'm quite fed up with my iPhone's look and feel... and because the development environment is so easy with C#, I have decided to code an application for it. I'm starting with a chromatic tuner for guitar/piano/violin... etc. and it's working quite well, even if I have only been able to test it in the emulator. While developing this application, I have learned some cool things about pitch detection algorithms and so on...

I hope to finish the application around September, and to be able to test it on real hardware when WP7 is officially launched... and before putting this application on the Windows Marketplace.

If this works well, I would look into developing other applications, like porting the softsynth I did in C# to this platform... We will see... and definitely, this last part is completely unrelated to democoding!

What's next?

Well, I have to prioritize my work for the next months:

Merge AsmHighlighter and NShader into a single project.
Play a bit for one week with DirectX headers to see if I can extract some IDL-like information
Finish the 4k mode of the softsynth... and develop the x86 asm player
Finish the WP7 application

I also still have an article to write about the making of ergon. There is not much to say about it, but it could be interesting to write those things down...

I also need to work on some new DirectX effects... I have played a bit with hardware instancing and compute shaders (with raymarching and global illumination for a 4k procedural compo that didn't make it to BP2010, because the results were not impressive enough, and too slow to calculate...)... I would really like to explore SSAO things with plain polygons... but I didn't take time for that... so yep, practicing more graphics coding should be at the top of my list... instead of all those time consuming and - sometimes useful - tools!

Playing a MP3 in c++ using plain Windows API

2010-05-22T09:16:00.010+11:00

While playing an MP3 is quite common in a demo, I have seen that most demos use third-party DLLs like Bass or FMod to perform this simple task under Windows. But if we want to get rid of this dependency, how can we achieve this with the plain Windows API? What are the requirements for a practical MP3 player for a demo?

Surprisingly, I was not able to find a simple code sample on the Internet that explains how to play an MP3 with the Windows API without using the too-simple Windows Media Player API. Why is WMP not enough (not even talking about MCI, the Media Control Interface, which is even more basic than WMP)?

Well, it lacks one important feature: it can only play from a URL, so it's not possible, for example, to pack the song in an archive and play it from a memory location (although that's not a huge deal if you release the song alongside your demo). Also, I have never tested the timing returned by WMP (probably using IWMPControls3 getPositionTimeCode) and I'm not really sure it can provide a reliable sync (at least, if you intend to use sync... but hey, can a demo without any sync still be called a demo? :)

So I started looking for pieces of code around the net, but they covered only part of the problem. The starting point was to rely on the Audio Compression Manager (ACM) API, which provides format conversion, for example from MP3 to PCM. Fortunately, I found the code of a guy who was kind enough to post a whole MP3 file converter using ACM. In the meantime, I found that Mark Heath, the author of NAudio, had posted a few days earlier a solution to convert an MP3 to WAV using NAudio. Looking at his code, he was also using ACM, but he also reported some difficulties in implementing a reliable MP3Frame/ID3Tag decoder to extract the sample rate, bitrate, channels, etc. I didn't want to use this kind of heavy code and was looking for a lighter and reliable solution: most people were talking about using the Windows Media Format SDK to get all this information from the file. The starting point is the WMCreateSyncReader function. Through it, you are able to retrieve part of the MP3Frame information as well as the ID3Tag.
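To give an idea of what such an MP3 frame decoder has to do, here is a minimal sketch that reads the sample rate, bitrate and channel count from a single 4-byte frame header. It is not the approach used in this post (which relies on WMF instead), and it makes strong assumptions: MPEG-1 Layer III only, the header already sits at the start of the buffer, no ID3 tag skipping, no VBR handling. `Mp3FrameInfo` and `ParseMpeg1Layer3Header` are hypothetical names.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical result type for this sketch (not part of MP3Player.h).
struct Mp3FrameInfo {
    bool valid;                // false if the header is not a supported MPEG-1 Layer III header
    std::uint32_t bitrateKbps; // e.g. 128
    std::uint32_t sampleRate;  // e.g. 44100
    std::uint32_t channels;    // 1 (mono) or 2 (stereo, joint stereo, dual channel)
};

// Parses the 4-byte frame header located at data[0..3].
// Assumption: MPEG-1 Layer III only; free-format and reserved values are rejected.
inline Mp3FrameInfo ParseMpeg1Layer3Header(const std::uint8_t* data, std::size_t size) {
    // Bitrate table for MPEG-1 Layer III (index 0 = free format, 15 = invalid).
    static const std::uint32_t kBitrates[16] = {
        0, 32, 40, 48, 56, 64, 80, 96, 112, 128, 160, 192, 224, 256, 320, 0 };
    // Sample rate table for MPEG-1 (index 3 = reserved).
    static const std::uint32_t kSampleRates[4] = { 44100, 48000, 32000, 0 };

    Mp3FrameInfo info = { false, 0, 0, 0 };
    if (data == nullptr || size < 4) return info;

    // Frame sync: the first 11 bits must all be set.
    if (data[0] != 0xFF || (data[1] & 0xE0) != 0xE0) return info;

    const unsigned version = (data[1] >> 3) & 0x3; // 3 = MPEG-1
    const unsigned layer   = (data[1] >> 1) & 0x3; // 1 = Layer III
    if (version != 3 || layer != 1) return info;

    const unsigned bitrateIndex    = (data[2] >> 4) & 0xF;
    const unsigned sampleRateIndex = (data[2] >> 2) & 0x3;
    const unsigned channelMode     = (data[3] >> 6) & 0x3; // 3 = mono
    if (kBitrates[bitrateIndex] == 0 || kSampleRates[sampleRateIndex] == 0) return info;

    info.valid = true;
    info.bitrateKbps = kBitrates[bitrateIndex];
    info.sampleRate = kSampleRates[sampleRateIndex];
    info.channels = (channelMode == 3) ? 1u : 2u;
    return info;
}
```

A real decoder also has to skip ID3 tags, resynchronize on corrupted data and handle the other MPEG versions, which is where the "heavy code" comes from.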

Finally, I came up with a patchwork solution:

using SyncReader from WMF to extract song duration.
using ACM to decode the mp3 to pcm
using plain old waveOut functions to perform sound playback and retrieve sound playback position.

Everything is inside a single .h of less than 300 lines, including comments. I don't really know if it's the best way to play an MP3 from a file or from memory with the Windows API while still providing reliable timing. I have only tested it against a couple of MP3s, so it may still have some bugs... but at least it works quite well and the code is pretty small. For example, the code expects the input MP3 to have a 44100 Hz sample rate... if not, it will probably fail... although with WMF it's quite easy to extract the sample rate (I'm not using it in the sample code provided here, as I was not sure about the result :) )

Also, the code does not decode and play the song in real time; instead, it decodes in a single pass and then plays the decoded buffer. This requires the full PCM song to be allocated, which can be around 20 MB to 50 MB depending on the length of your song (it's easy to calculate: durationInSecondsOfTheSong * 4 * 44100, so a 3 min song requires about 30 MB). This is probably not the best solution, but it's not a huge task to transform this code to do real-time decoding/playback. The downside is that it will take some CPU in your demo. So in the end, it's just a tradeoff between memory and CPU depending on your needs!
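The memory estimate can be sketched as a small helper. This is only an illustration, assuming the output format hard-coded in the player (44100 Hz, stereo, 16 bits per sample, so 4 bytes per sample frame); `PcmBufferBytes` is a hypothetical name that does not exist in MP3Player.h.

```cpp
#include <cassert>
#include <cstdint>

// Size in bytes of a fully decoded PCM buffer.
// The defaults mirror the pcmFormat used by the player:
// 44100 Hz, 2 channels, 16 bits per sample.
inline std::uint64_t PcmBufferBytes(double durationInSeconds,
                                    std::uint32_t sampleRate = 44100,
                                    std::uint32_t channels = 2,
                                    std::uint32_t bitsPerSample = 16) {
    assert(durationInSeconds >= 0.0);
    // One sample frame holds one sample per channel.
    const std::uint32_t bytesPerFrame = channels * (bitsPerSample / 8);
    // Truncate to whole sample frames so the size stays block-aligned.
    const std::uint64_t frames =
        static_cast<std::uint64_t>(durationInSeconds * sampleRate);
    return frames * bytesPerFrame;
}
```

For a 3 min song, `PcmBufferBytes(180.0)` gives 31,752,000 bytes, which is the "about 30 MB" mentioned above.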

/* ----------------------------------------------------------------------
 * MP3Player.h C++ class using plain Windows API
 *
 * Author: @lx/Alexandre Mutel,  blog: http://code4k.blogspot.com
 * The software is provided "as is", without warranty of any kind.
 * ----------------------------------------------------------------------*/
#pragma once
#include <windows.h>
#include <stdio.h>
#include <assert.h>
#include <mmreg.h>
#include <msacm.h>
#include <wmsdk.h>

#pragma comment(lib, "msacm32.lib") 
#pragma comment(lib, "wmvcore.lib") 
#pragma comment(lib, "winmm.lib") 
#pragma intrinsic(memset,memcpy,memcmp)

#ifdef _DEBUG
#define mp3Assert(function) assert((function) == 0)
#else
//#define mp3Assert(function) if ( (function) != 0 ) { MessageBoxA(NULL,"Error in [ " #function "]", "Error",MB_OK); ExitProcess(0); }
#define mp3Assert(function) (function)
#endif

/*
 * MP3Player class.
 * Usage : 
 *   MP3Player player;
 *   player.OpenFromFile("your.mp3");
 *   player.Play();
 *   Sleep((DWORD)((player.GetDuration()+1)*1000)); // Sleep expects milliseconds
 *   player.Close();
 */
class MP3Player {
private:
 HWAVEOUT hWaveOut;
 DWORD bufferLength;
 double durationInSecond;
 BYTE* soundBuffer;
public:

 /*
  * OpenFromFile : loads a MP3 file and converts it internally to a PCM format, ready for sound playback.
  */
 HRESULT OpenFromFile(TCHAR* inputFileName){
  // Open the mp3 file
  HANDLE hFile = CreateFile(inputFileName, // file to open
         GENERIC_READ,
         FILE_SHARE_READ, // share for reading
         NULL, // no security
         OPEN_EXISTING, // existing file only
         FILE_ATTRIBUTE_NORMAL, // normal file
         NULL); // no template file
  assert( hFile != INVALID_HANDLE_VALUE);

  // Get FileSize
  DWORD fileSize = GetFileSize(hFile, NULL);
  assert( fileSize != INVALID_FILE_SIZE);

  // Alloc buffer for file
  BYTE* mp3Buffer = (BYTE*)LocalAlloc(LPTR, fileSize);

  // Read file and fill mp3Buffer
  DWORD bytesRead;
  DWORD resultReadFile = ReadFile( hFile, mp3Buffer, fileSize, &bytesRead, NULL);
  assert(resultReadFile != 0);
  assert( bytesRead == fileSize);

  // Close File
  CloseHandle(hFile);

  // Open and convert MP3
  HRESULT hr = OpenFromMemory(mp3Buffer, fileSize);

  // Free mp3Buffer
  LocalFree(mp3Buffer);

  return hr;
 }

 /*
  * OpenFromMemory : loads a MP3 from memory and converts it internally to a PCM format, ready for sound playback.
  */
 HRESULT OpenFromMemory(BYTE* mp3InputBuffer, DWORD mp3InputBufferSize){
  IWMSyncReader* wmSyncReader;
  IWMHeaderInfo* wmHeaderInfo;
  IWMProfile* wmProfile;
  IWMStreamConfig* wmStreamConfig;
  IWMMediaProps* wmMediaProperties;
  WORD wmStreamNum = 0;
  WMT_ATTR_DATATYPE wmAttrDataType;
  DWORD durationInSecondInt;
  QWORD durationInNano;
  DWORD sizeMediaType;
  DWORD maxFormatSize = 0;
  HACMSTREAM acmMp3stream = NULL;
  HGLOBAL mp3HGlobal;
  IStream* mp3Stream;

  // Define output format
  static WAVEFORMATEX pcmFormat = {
   WAVE_FORMAT_PCM, // WORD        wFormatTag;         /* format type */
   2,     // WORD        nChannels;          /* number of channels (i.e. mono, stereo...) */
   44100,    // DWORD       nSamplesPerSec;     /* sample rate */
   4 * 44100,   // DWORD       nAvgBytesPerSec;    /* for buffer estimation */
   4,     // WORD        nBlockAlign;        /* block size of data */
   16,     // WORD        wBitsPerSample;     /* number of bits per sample of mono data */
   0,     // WORD        cbSize;             /* the count in bytes of the size of */
  };

  const DWORD MP3_BLOCK_SIZE = 522;

  // Define input format
  static MPEGLAYER3WAVEFORMAT mp3Format = {
   {
    WAVE_FORMAT_MPEGLAYER3,   // WORD        wFormatTag;         /* format type */
     2,        // WORD        nChannels;          /* number of channels (i.e. mono, stereo...) */
     44100,       // DWORD       nSamplesPerSec;     /* sample rate */
     128 * (1024 / 8),    // DWORD       nAvgBytesPerSec;    not really used but must be one of 64, 96, 112, 128, 160kbps
     1,        // WORD        nBlockAlign;        /* block size of data */
     0,        // WORD        wBitsPerSample;     /* number of bits per sample of mono data */
     MPEGLAYER3_WFX_EXTRA_BYTES,  // WORD        cbSize;        
   },
   MPEGLAYER3_ID_MPEG,      // WORD          wID;
   MPEGLAYER3_FLAG_PADDING_OFF,   // DWORD         fdwFlags;
   MP3_BLOCK_SIZE,       // WORD          nBlockSize;
   1,          // WORD          nFramesPerBlock;
   1393,         // WORD          nCodecDelay;
  };

  // -----------------------------------------------------------------------------------
  // Extract and verify mp3 info : duration, type = mp3, sampleRate = 44100, channels = 2
  // -----------------------------------------------------------------------------------

  // Initialize COM
  CoInitialize(0);

  // Create SyncReader
  mp3Assert( WMCreateSyncReader(  NULL, WMT_RIGHT_PLAYBACK , &wmSyncReader ) );

  // Alloc With global and create IStream
  mp3HGlobal = GlobalAlloc(GPTR, mp3InputBufferSize);
  assert(mp3HGlobal != 0);
  void* mp3HGlobalBuffer = GlobalLock(mp3HGlobal);
  memcpy(mp3HGlobalBuffer, mp3InputBuffer, mp3InputBufferSize);
  GlobalUnlock(mp3HGlobal);
  mp3Assert( CreateStreamOnHGlobal(mp3HGlobal, FALSE, &mp3Stream) );

  // Open MP3 Stream
  mp3Assert( wmSyncReader->OpenStream(mp3Stream) );

  // Get HeaderInfo interface
  mp3Assert( wmSyncReader->QueryInterface(&wmHeaderInfo) );

  // Retrieve mp3 song duration ("Duration" is expressed in 100-nanosecond units) and convert it to seconds
  WORD lengthDataType = sizeof(QWORD);
  mp3Assert( wmHeaderInfo->GetAttributeByName(&wmStreamNum, L"Duration", &wmAttrDataType, (BYTE*)&durationInNano, &lengthDataType ) );
  durationInSecond = ((double)durationInNano)/10000000.0;
  durationInSecondInt = (int)(durationInNano/10000000)+1;

  // Sequence of call to get the MediaType
  // WAVEFORMATEX for mp3 can then be extract from MediaType
  mp3Assert( wmSyncReader->QueryInterface(&wmProfile) );
  mp3Assert( wmProfile->GetStream(0, &wmStreamConfig) );
  mp3Assert( wmStreamConfig->QueryInterface(&wmMediaProperties) );

  // Retrieve sizeof MediaType
  mp3Assert( wmMediaProperties->GetMediaType(NULL, &sizeMediaType) );

  // Retrieve MediaType
  WM_MEDIA_TYPE* mediaType = (WM_MEDIA_TYPE*)LocalAlloc(LPTR,sizeMediaType); 
  mp3Assert( wmMediaProperties->GetMediaType(mediaType, &sizeMediaType) );

  // Check that MediaType is audio
  assert(mediaType->majortype == WMMEDIATYPE_Audio);
  // assert(mediaType->pbFormat == WMFORMAT_WaveFormatEx);

  // Check that input is mp3
  WAVEFORMATEX* inputFormat = (WAVEFORMATEX*)mediaType->pbFormat;
  assert( inputFormat->wFormatTag == WAVE_FORMAT_MPEGLAYER3);
  assert( inputFormat->nSamplesPerSec == 44100);
  assert( inputFormat->nChannels == 2);

  // Release COM interface
  // wmSyncReader->Close();
  wmMediaProperties->Release();
  wmStreamConfig->Release();
  wmProfile->Release();
  wmHeaderInfo->Release();
  wmSyncReader->Release();

  // Free allocated mem
  LocalFree(mediaType);

  // -----------------------------------------------------------------------------------
  // Convert mp3 to pcm using acm driver
  // The following code is mainly inspired from http://david.weekly.org/code/mp3acm.html
  // -----------------------------------------------------------------------------------

  // Get maximum FormatSize for all acm
  mp3Assert( acmMetrics( NULL, ACM_METRIC_MAX_SIZE_FORMAT, &maxFormatSize ) );

  // Allocate PCM output sound buffer
  bufferLength = (DWORD)(durationInSecond * pcmFormat.nAvgBytesPerSec);
  soundBuffer = (BYTE*)LocalAlloc(LPTR, bufferLength);

  acmMp3stream = NULL;
  switch ( acmStreamOpen( &acmMp3stream,    // Open an ACM conversion stream
   NULL,                       // Query all ACM drivers
   (LPWAVEFORMATEX)&mp3Format, // input format :  mp3
   &pcmFormat,                 // output format : pcm
   NULL,                       // No filters
   0,                          // No async callback
   0,                          // No data for callback
   0                           // No flags
   ) 
   ) {
      case MMSYSERR_NOERROR:
       break; // success!
      case MMSYSERR_INVALPARAM:
       assert( !"Invalid parameters passed to acmStreamOpen" );
       return E_FAIL;
      case ACMERR_NOTPOSSIBLE:
       assert( !"No ACM filter found capable of decoding MP3" );
       return E_FAIL;
      default:
       assert( !"Some error opening ACM decoding stream!" );
       return E_FAIL;
  }

  // Determine output decompressed buffer size
  unsigned long rawbufsize = 0;
  mp3Assert( acmStreamSize( acmMp3stream, MP3_BLOCK_SIZE, &rawbufsize, ACM_STREAMSIZEF_SOURCE ) );
  assert( rawbufsize > 0 );

  // allocate our I/O buffers
  static BYTE mp3BlockBuffer[MP3_BLOCK_SIZE];
  //LPBYTE mp3BlockBuffer = (LPBYTE) LocalAlloc( LPTR, MP3_BLOCK_SIZE );
  LPBYTE rawbuf = (LPBYTE) LocalAlloc( LPTR, rawbufsize );

  // prepare the decoder
  static ACMSTREAMHEADER mp3streamHead;
  // memset( &mp3streamHead, 0, sizeof(ACMSTREAMHEADER ) );
  mp3streamHead.cbStruct = sizeof(ACMSTREAMHEADER );
  mp3streamHead.pbSrc = mp3BlockBuffer;
  mp3streamHead.cbSrcLength = MP3_BLOCK_SIZE;
  mp3streamHead.pbDst = rawbuf;
  mp3streamHead.cbDstLength = rawbufsize;
  mp3Assert( acmStreamPrepareHeader( acmMp3stream, &mp3streamHead, 0 ) );

  BYTE* currentOutput = soundBuffer;
  DWORD totalDecompressedSize = 0;

  static ULARGE_INTEGER newPosition;
  static LARGE_INTEGER seekValue;
  mp3Assert( mp3Stream->Seek(seekValue, STREAM_SEEK_SET, &newPosition) );

  while(1) {
   // suck in some MP3 data
   ULONG count;
   mp3Assert( mp3Stream->Read(mp3BlockBuffer, MP3_BLOCK_SIZE, &count) );
   if( count != MP3_BLOCK_SIZE )
    break;

   // convert the data
   mp3Assert( acmStreamConvert( acmMp3stream, &mp3streamHead, ACM_STREAMCONVERTF_BLOCKALIGN ) );

   // append the decoded PCM to the output sound buffer
   //count = fwrite( rawbuf, 1, mp3streamHead.cbDstLengthUsed, fpOut );
   memcpy(currentOutput, rawbuf, mp3streamHead.cbDstLengthUsed);
   totalDecompressedSize += mp3streamHead.cbDstLengthUsed;
   currentOutput += mp3streamHead.cbDstLengthUsed;
  };

  mp3Assert( acmStreamUnprepareHeader( acmMp3stream, &mp3streamHead, 0 ) );
  LocalFree(rawbuf);
  mp3Assert( acmStreamClose( acmMp3stream, 0 ) );

  // Release allocated memory
  mp3Stream->Release();
  GlobalFree(mp3HGlobal);
  return S_OK;
 }

 /*
  * Close : close the current MP3Player, stop playback and free allocated memory
  */
 void __inline Close() {
  // Reset before close (otherwise, waveOutClose will not work on playing buffer)
  waveOutReset(hWaveOut);
  // Close the waveOut
  waveOutClose(hWaveOut);
  // Free allocated memory
  LocalFree(soundBuffer);
 }
 
 /*
  * GetDuration : return the music duration in seconds
  */
 double __inline GetDuration() {
  return durationInSecond;
 }

 /*
  * GetPosition : return the current position of the sound playback in seconds (used for sync)
  */
 double GetPosition() {
  static MMTIME MMTime = { TIME_SAMPLES, 0};
  waveOutGetPosition(hWaveOut, &MMTime, sizeof(MMTIME));
  return ((double)MMTime.u.sample)/( 44100.0);
 }

 /*
  * Play : play the previously opened mp3
  */
 void Play() {
  static WAVEHDR WaveHDR = { (LPSTR)soundBuffer,  bufferLength };

  // Define output format
  static WAVEFORMATEX pcmFormat = {
   WAVE_FORMAT_PCM, // WORD        wFormatTag;         /* format type */
   2,     // WORD        nChannels;          /* number of channels (i.e. mono, stereo...) */
   44100,    // DWORD       nSamplesPerSec;     /* sample rate */
   4 * 44100,   // DWORD       nAvgBytesPerSec;    /* for buffer estimation */
   4,     // WORD        nBlockAlign;        /* block size of data */
   16,     // WORD        wBitsPerSample;     /* number of bits per sample of mono data */
   0,     // WORD        cbSize;             /* the count in bytes of the size of */
  };

  mp3Assert( waveOutOpen( &hWaveOut, WAVE_MAPPER, &pcmFormat, NULL, 0, CALLBACK_NULL ) );
  mp3Assert( waveOutPrepareHeader( hWaveOut, &WaveHDR, sizeof(WaveHDR) ) );
  mp3Assert( waveOutWrite  ( hWaveOut, &WaveHDR, sizeof(WaveHDR) ) );
 }
};

#pragma function(memset,memcpy,memcmp)

The usage is then pretty simple :

MP3Player player;

 // Open the mp3 from a file...
 player.OpenFromFile("your.mp3");
 // or From a memory location!
 player.OpenFromMemory(ptrToMP3Song, bytesLength);

 player.Play();

 while (...) {
   // Do your demo sync here, using the playback position
   ...
   double playerPositionInSeconds = player.GetPosition();
   ...
 }
 player.Close();
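
As an illustration of what the sync part of that loop could look like, here is a minimal sketch that maps the playback position to a scene index. The scene table and the names `SamplesToSeconds` and `SceneAt` are hypothetical and not part of MP3Player.h; the only thing taken from the player is the 44100 Hz sample rate used by GetPosition.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Same conversion as GetPosition(): a sample position at 44100 Hz to seconds.
inline double SamplesToSeconds(std::uint32_t samplePosition) {
    return static_cast<double>(samplePosition) / 44100.0;
}

// sceneStarts holds the start time of each scene in seconds, in ascending order
// (hypothetical demo timeline). Returns the index of the scene that is active
// at positionInSeconds; positions before the first start map to scene 0.
inline std::size_t SceneAt(const double* sceneStarts, std::size_t sceneCount,
                           double positionInSeconds) {
    assert(sceneStarts != nullptr && sceneCount > 0);
    std::size_t scene = 0;
    for (std::size_t i = 1; i < sceneCount; ++i) {
        if (positionInSeconds >= sceneStarts[i]) {
            scene = i;
        }
    }
    return scene;
}
```

Inside the loop, something like `SceneAt(sceneStarts, sceneCount, player.GetPosition())` would then select what to render, so the visuals follow the audio clock rather than a separate timer.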

And that's all! Hope someone will find this useful!

You can download a Visual Studio project using the MP3Player.h class.

NShader 1.1, hlsl, glsl, cg syntax coloring for Visual Studio 2008 & 2010

2010-05-20T20:49:00.004+11:00

I have recently released NShader 1.1 which adds support for Visual Studio 2010 as well as bugfixes for hlsl/glsl syntax highlighting.

While this plugin is quite cool for adding basic syntax highlighting for shader languages, it lacks IntelliSense/completion/error markers to improve the editing experience. I didn't have time to add such functionality in this release as... I don't really have much time to dedicate to this project... and well, I have so much to learn from actually practicing shader languages a lot more that I'm fine with this basic syntax highlighting! ;) Is it a huge task to add IntelliSense? It depends, but concretely, I need to implement a full grammar (lexer and parser) for each shading language in order to provide reliable IntelliSense. Of course, a very basic IntelliSense would be feasible without this, but I would rather not ship an annoying/unreliable IntelliSense popup.

I did some research on existing lexers for shading languages and, surprisingly, this is not something you can find easily. For HLSL, for example, afaik there is no BNF grammar published by Microsoft, so if you want to do it yourself, you need to go through the whole HLSL reference documentation and compile a BNF yourself... and that's something I can't afford in my spare time. One could argue that there is some starter code available on the net (O3D from Google has an ANTLR parser/lexer, and there is a relatively simpler one from Christian Schladetsch). I agree, but well... it would still take a bit more time to patch them, add support for SM5.0, handle preprocessor directives correctly, and so on... After that, I would need to integrate it through the language service API, which is not the worst part. Anyway, if someone is motivated to help me on this, we could come up with something. I will also follow whether Intelishade is able to resurrect in an open source way... a joint venture would be interesting.

Also, what's my feedback on migrating a VS2008 language service to VS2010? Well, it was pretty straightforward! I followed the SDK instructions on "Migrating a Legacy Language Service", but it did not fully work as expected. In fact, the only remaining problem was that the VS2010 VSIX installer didn't automatically register the NShader language service. I was forced to manually add the pkgdef file (containing the registry updates for the language service) to the VSIX archive. While I was working on the migration to VS2010, I had a look at the new extensibility framework and was surprised to see that it is by far much easier to implement against in VS2010. Although I didn't take the time to migrate NShader to this new framework, it seems to be pretty easy... Another nice thing is that they provide a compatibility layer for legacy language services, so I didn't bother with the new API. But if I had to write a new plugin for VS, I would definitely use the new API, although it would only work with VS2010+ versions...

One small recurrent disappointment: Visual Studio still doesn't allow plugins in the Express editions. From a "commercial point of view", I understand this restriction, although for the thousands (millions maybe?) of people using the Express editions, this is a huge lack of functionality. I'm sure that allowing community plugins in the Express editions would in fact improve Visual Studio adoption a lot more.

My next post should be about the making of Ergon at BP2010. I have a couple of things to share about it, but I'm feeling quite lazy about writing this post at the moment... but it's on the way! ;)

Work in progress, 4k intro for Breakpoint 2010

2010-03-09T09:38:00.001+11:00

Hey, it has been almost one month since I posted a message about coding a 4k intro for BP 2010. So what's going on?

Well, I'm happy to be close to the end, fighting for the last available bytes, working on the music with ulrick, and running and rerunning the intro several times to work on the sync. We are almost done now, but it's far from what I was expecting to release! I have developed around 4 to 5 scenes but was forced to use only 2 of them. I will publish a detailed breakdown of the compression ratio for each part of this intro after releasing it.

We finally went with 4klang for the music. This synth is very well designed and very versatile. The only drawback is the total size of code + data, which is around 1.4k to 1.5k. That's a lot, and it didn't help me to fit in more scenes. I would expect a synth's code + data to be around 1k to 1.1k max, of course with less flexibility and probably a sound that would not be as rich as 4klang's, but still something nice. I started to implement a small softsynth player based on my previous work, but I have suspended this laborious task, as I knew it would take me much more time to plug everything into a VST (although that's quite easy when you do it through .NET), design a simple GUI and file formats, develop a cool sound bank, test it, debug it, etc. That option was too risky, so I have postponed this work until after BP.

Anyway, from what I have seen while starting to code this softsynth, it's possible to get down to around 600 bytes for the player... and probably 400 bytes for the music + the sound bank... but I will only be able to confirm this when I'm done with it; it's only a projection. The fact is that a stack-based synth is powerful, allowing for complex sound design (and synchronization/modulation, nicely done in 4klang), but if you look closer, you will see that most of the synth parts are common to almost all sounds: while a stack-based synth allows everything, pragmatically, a sound follows some classic design rules: a collection of cascaded/combined oscillators/noise generators and a few insert effects (the most common ones being filters and stereo delay/reverb). Those rules translate directly to a stack-based system, in a pattern that you will recognize immediately among different instruments (and this pattern will repeat). This "static" pattern is probably more efficient for a 4k, both for the code size and the data, allowing for example to "bitify" the data organization. Crinkler uses exactly this kind of static pattern (I still have to write an article about it): instead of having a generic context modeling compressor, it uses a kind of semi-dynamic/static context modeling that is, in the end, much more powerful than its generic equivalent (from what I have seen, a generic context modeling decompressor code is around 2% to 4% larger for a 4k, not a huge deal, but 2% to 4% is around 80 to 160 bytes for a 4k... and that's a lot for a 4k). Of course, this is not only a question of a static algorithm vs. a more generic one... it's also a question of being able to produce well-done, size-optimized x86 assembler code (and Crinkler, just for this, is a masterpiece).
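To make the stack-based vs. static comparison a bit more concrete, here is a toy sketch. It is neither 4klang's actual unit format nor my softsynth: the opcodes, `Unit`, `RunStack` and `RunFixed` are hypothetical, and a real synth would of course handle envelopes, filters and state per voice. The generic version interprets an instrument stored as data (a list of units working on a small signal stack), while the fixed version hard-wires the same oscillators-then-gain chain and only needs its parameters as data.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical unit opcodes for a toy stack-based instrument.
enum Op { OP_SAW, OP_SQUARE, OP_ADD, OP_GAIN };

struct Unit {
    Op op;
    float param; // only used by OP_GAIN in this sketch
};

// Basic oscillators, phase in [0, 1).
inline float Saw(float phase)    { return 2.0f * phase - 1.0f; }
inline float Square(float phase) { return phase < 0.5f ? 1.0f : -1.0f; }

// Generic version: interprets the unit list on a small signal stack.
inline float RunStack(const Unit* units, std::size_t count, float phase) {
    float stack[8];
    std::size_t top = 0;
    for (std::size_t i = 0; i < count; ++i) {
        switch (units[i].op) {
        case OP_SAW:    assert(top < 8);  stack[top++] = Saw(phase); break;
        case OP_SQUARE: assert(top < 8);  stack[top++] = Square(phase); break;
        case OP_ADD:    assert(top >= 2); stack[top - 2] += stack[top - 1]; --top; break;
        case OP_GAIN:   assert(top >= 1); stack[top - 1] *= units[i].param; break;
        }
    }
    assert(top == 1); // a well-formed instrument leaves exactly one signal
    return stack[0];
}

// Fixed pipeline: the same chain hard-wired, only the gain is data.
inline float RunFixed(float phase, float gain) {
    return (Saw(phase) + Square(phase)) * gain;
}
```

With the instrument `{OP_SAW, OP_SQUARE, OP_ADD, OP_GAIN 0.5}`, both functions produce the same sample; the difference is that the fixed version needs no interpreter loop and stores one parameter instead of four units.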

So, 400 bytes would have allowed me to add a scene, add some nice text... :) But OK, we had to move on! I'm not going to complain about 4klang when rudebox was so impressive using exactly the same synth. It's possible to code something cool and at the same time benefit from 4klang's sound quality, so I found this an acceptable compromise... and a nice challenge!

Anyway, see you soon at BP 2010!

Coding a 4k intro for Breakpoint 2010

2010-02-13T02:12:00.001+11:00

I'm going to be quite busy until the Breakpoint 2010 demoparty, as I want to release a small 4k intro for this great event! A few weeks ago I hadn't planned anything and was slowly working on d3d10/d3d11, on some effects for a 16k/64k intro to release later this year... but BP2010 is supposed to be the last one and I don't want to miss this experience. It will be my first time there... and probably my last, so I have decided to go to this party and try to contribute to it, yep!

I started to work with d3d10, as I wanted to add some nice Direct2D text layout over 2-3 raymarching scenes... but I found that Direct2D is a bit too costly for a 4k, especially if you want to avoid the basic white logo... so I had to switch back to plain d3d9 and forget about cool text. I will use this Direct2D technique for a bigger intro. Due to my lack of investment in all the d3d APIs until now, I had to learn the API from the ground up... and it took me a while... To ease the d3d10/d3d11 coding experience, I have developed a lightweight C++ wrapper around the d3d10/11 APIs, in almost exactly the same way (naming conventions, enums) as SlimDX has done the job, and I'm really happy with it. The d3d10/d3d11 API is very clean, but due to some verbose API constraints (ugly enums sometimes mixed with #defines, an HRESULT to check on every function return, a famous Windows programming philosophy), it's really worth wrapping this API in something that transparently hides all those things and renames and rearranges enums/methods/interfaces. Currently, this SlimDX-like wrapper is working great: it is much easier to program with, much easier to read, and it generates almost exactly the same code as straight d3d10/d3d11 code, with the ability to remove the HRESULT checks and so on... I will probably release this wrapper (a single .h with a bunch of inline methods) on CodePlex later this year, as it should help people like me who prefer a syntax much closer to the C# coding experience than to C/C++.

During my small d3d11 incursion, I also discovered that the DirectX Effect framework (fx syntax files, with techniques and passes) is no longer available as part of the D3DX runtime! Yep, you need to go to the Utility directory in the DirectX SDK to find that you can compile this framework yourself... It means that for all intros, you can forget about the DirectX Effect framework and program a much lighter effect framework instead. One good thing about this change is that I had to better understand the d3d10 philosophy, with constant buffer access and so on... In fact, it's much easier to work directly with constant buffers, and surprisingly, it gives smaller code. As soon as I'm done with the 4k intro for BP2010, I will publish a small post about this along with the SlimDX-like wrapper.
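As a rough idea of what working directly with a constant buffer means on the C++ side, here is a minimal sketch of a CPU-side struct that mirrors a shader cbuffer. The struct and its fields are hypothetical (this is not the code of the intro); the two rules it checks are the ones to keep in mind: d3d10/d3d11 constant buffer sizes must be a multiple of 16 bytes, and in the shader a variable does not straddle a 16-byte register boundary.

```cpp
#include <cstddef>

// Hypothetical per-frame constants, laid out to match a shader cbuffer
// declared in the same order (float4x4, float3, float, float2).
struct SceneConstants {
    float viewProjection[16]; // 4x4 matrix: 64 bytes, 4 registers
    float cameraPosition[3];  // float3: first 12 bytes of a register
    float time;               // fits in the last 4 bytes of that register
    float resolution[2];      // float2: starts a new register
    float padding[2];         // explicit padding up to a multiple of 16 bytes
};

// Compile-time checks of the layout assumptions above.
static_assert(sizeof(SceneConstants) == 96, "unexpected struct size");
static_assert(sizeof(SceneConstants) % 16 == 0,
              "constant buffer size must be a multiple of 16 bytes");
static_assert(offsetof(SceneConstants, resolution) % 16 == 0,
              "resolution must start on a 16-byte register boundary");

// Small helper for sizes computed at runtime.
inline bool IsValidConstantBufferSize(std::size_t bytes) {
    return bytes > 0 && bytes % 16 == 0;
}
```

The whole struct is then uploaded in one go, which replaces the per-variable lookups by name of the Effect framework.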

Last thing: I didn't have time to finish my softsynth, because I was targeting a 16k/64k with a more complex synth... so we should go with gopher's great 4klang synth... but I have two problems with it: 1) ulrick (my old friend, main FRequency musician) is unable to use it under Renoise. It burns his notebook's CPU and we don't know why, as it's a pretty standard Intel Core 2 Duo processor... we checked lots of 4klang/Renoise/system parameters without any success. 2) 4klang is great, but its total code is often close to 900 bytes... I know that some of the top 4k softsynths are around 500 to 700 bytes, so I'm not completely sure that I will use 4klang. The other idea is to take part of the work I already did for the bigger synth (which is developed in x86 assembler) and try to plumb it into a fixed pipeline (rather than a stack-based one like 4klang)... I suspect that I could save a substantial number of bytes, but I'm not sure and I need to check it... and it's going to be really hard to make it in time for BP... so we'll see...

Currently, I have only coded one scene for the 4k intro, and I'm quite happy with it... but that doesn't make a full intro! I need to add at least 3 scenes and work on the transitions, the overall design, the sync with the synth, and so on... even for a 4k, that's a lot of work, especially when you consider that this is my first prod on my own (I mean, my first prod for the PC, after the 3 prods I released 20 years ago on the Amiga! ;) ) but it's possible to do it in less than 2 months, so I'll try to do my best!