Wednesday, December 29, 2010

Crinkler secrets, 4k intro executable compressor at its best

(Edit 5 Jan 2011: New Compression results section and small crinkler x86 decompressor analysis)

If you are not familiar with 4k intros, you may wonder how things are organized at the executable level to achieve this kind of packing-performance. Probably the most important and essential aspect of 4k-64k intros is the compressor, and surprisingly, 4k intros have been well equipped for the past five years, as Crinkler is the best compressor developed so far for this category. It has been created by Blueberry (Loonies) and Mentor (tbc), two of the greatest demomakers around.

Last year, I started to learn a bit more about the compression technique used in Crinkler. It started from some pouet's comments that intrigued me, like "crinkler needs several hundred of mega-bytes to compress/decompress a 4k intros" (wow) or "when you want to compress an executable, It can take hours, depending on the compressor parameters"... I observed also bad comrpession result, while trying to convert some part of C++ code to asm code using crinkler... With this silly question, I realized that in order to achieve better compression ratio, you better need a code that is comrpession friendly but is not necessarily smaller. Or in other term, the smaller asm code is not always the best candidate for better compression under crinkler... so right, I needed to understand how crinkler was working in order to code crinkler-friendly code...

I just had a basic knowledge about compression, probably the last book I bought about compression was more than 15 years ago to make a presentation about jpeg compression for a physics courses (that was a way to talk about computer related things in a non-computer course!)... I remember that I didn't go further in the book, and stopped just before arithmetic encoding. Too bad, that's exactly one part of crinkler's compression technique, and has been widely used for the past few years (and studied for the past 40 years!), especially in compressors like H.264!

So wow, It took me a substantial amount of time to jump again on the compressor's train and to read all those complicated-statistical articles to understand how things are working... but that was worth it! In the same time, I spent a bit of my time to dissect crinkler's decompressor, extract the code decompressor in order to comment it and to compare its implementation with my little-own-test in this field... I had a great time to do this, although, in the end, I found that whatever I could do, under 4k, Crinkler is probably the best compressor ever.

You will find here an attempt to explain a little bit more what's behind Crinkler. I'm far from being a compressor expert, so if you are familiar with context-modeling, this post may sounds a bit light, but I'm sure It could be of some interest for people like me, that are discovering things like this and want to understand how they make 4k intros possible!


Crinkler main principles


If you want a bit more information, you should have a look at the "manual.txt" file in the crinkler's archive. You will find here lots of valuable information ranging from why this project was created to what kind of options you can setup for crinkler. There is also an old but still accurate and worth to look at powerpoint presentation from the author themselves that is available here.

First of all, you will find that crinkler is not strictly speaking an executable compressor but is rather an integrated linker-compressor. In fact, in the intro dev tool chain, It's used as part of the building process and is used inplace of your traditional linker.... while crinkler has the ability to compress its output. Why crinkler is better suited at this place? Most notably because at the linker level, crinkler has access to portions of your code, your data, and is able to move them around in order to achieve better compression. Though, for this choice, I'm not completely sure, but this could be also implemented as a standard exe compressor, relying on relocation tables in the PE sections of the executable and a good disassembler like beaengine in order to move the code around and update references... So, crinkler, cr-linker, compressor-linker, is a linker with an integrated compressor.

Secondly, crinkler is using a compression method that is far more aggressive and efficient than any old dictionary-coder-LZ methods : it's called context modeling coupled with an arithmetic coder. As mentioned in the crinkler's manual, the best place I found to learn about this was Matt Mahoney resource website. This is definitely the place to start when you want to play with context modeling, as there are lots of sourcecode, previous version of PAQ program, from which you can learn gradually how to build such a compressor (more particularly in earlier version of the program, when the design was still simple to handle). Building a context-modelling based compressor/decompressor is almost accessible from any developer, but one of the strength of crinkler is its decompressor size : around 210-220 bytes, which makes it probably the most efficient and smaller context-modelling decompressor in the world. We will see also that crinkler made one of the simplest choice for a context-modelling compressor, using a semi-static model in order to achieve better compression for 4k of datas, resulting in a less complex decompressor code as well.

Lastly, crinkler is optimizing the usage of the exe-PE file (which is the Windows Portable Executable format, the binary format of the a windows executable file, official description is available here). Mostly by removing the standard import table and dll loading in favor of a custom loader that exploit internal windows structure as well as storing function hashing in the header of the PE files to recover dll functions.

Compression method


Arithmetic coding


The whole compression problem in crinkler can be summarized like this: what is the probability of the next bit to compress/decompress to be 1? The better is the probability (meaning by matching the expecting result bit), the better is the compression ratio. Hence, Crinkler needs to be a little bit psychic?!

First of all, you probably wonder why probability is important here. This is mainly due to one compression technique called arithmetic coding. I won't go into the detail here and encourage the reader to read about the wikipedia article and related links. The main principle of arithmetic coding is its ability to encode into a single number a set of symbols for which you know their probability to occur. The higher the probability is for a known symbol, the lower the number of bits will be required to encode its compressed counterpart.

At the bit level, things are getting even simpler, since the symbols are only 1 or 0. So if you can provide a probability for the next bit (even if this probability is completely wrong), you are able to encode it through an arithmetic coder.

A simple binary arithmetic coder interface could look like this:
/// Simple ArithmeticCoder interface
class ArithmeticCoder {

   /// Decode a bit for a given probability.
   /// Decode returns the decoded bit 1 or 0
   int Decode(Bitstream inputStream, double probabilityForNextBit);

   /// Encode a bit (nextBit) with a given probability
   void Encode(Bitstream outputStream, int nextBit, double probabilityForNextBit);
}

And a simple usage of this ArithmeticCoder could look like this:
// Initialize variables
Bitstream inputCompressedStream = ...;
Bitstream outputStream = ...;
ArithmeticCoder coder;
Context context = ...;

// Simple decoder implem using an arithmetic coder
for(int i = 0; i < numberOfBitsToDecode; i++) { 
    // Made usage of our psychic alias Context class
    double nextProbability = context.ComputeProbability(); 

    // Decode the next bit from the compressed stream, based on this 
    // probability 
    int nextBit = coder.Decode( inputCompressedStream, nextProbability); 

    // Update the psychic and tell him, how much wrong or right he was! 
    context.UpdateModel( nextBit, nextProbability); 

    // Output the decoded bit 
    outputStream.Write(nextBit); 
}

So a Binary Arithmetic Coder is able to compress a stream of bits, if you are able to tell him what's the probability for the next bit in the stream. Its usage is fairly simple, although their implementations are often really tricky and sometimes quite obscure (a real arithmetic implementation should face lots of small problems : renormalization, underflow, overflow...etc.).

Working at the bit level here wouldn't have been possible 20 years ago, as It requires a tremendous amount of CPU (and memory for the psychic-context) in order to calculate/encode a single bit, but with nowadays computer power, It's less a problem... Lots of implem are working at the byte level for better performance, some of them can work at the bit level while still batching the decoding/encoding results at the byte level. Crinkler doesn't care about this and is working at the bit level, making the arithmetic decoder in less than 20 x86 ASM instructions.

The C++ pseudo-code for an arithmetic decoder is like this:

int ArithmeticCoder::Decode(Bitstream inputStream, double nextProbability) {
    int output = 0; // the decoded symbol

    // renormalization
    while (range < 0x80000000) {
        range <<= 1; 
        value <<= 1;
        value += inputStream.GetNextBit();
    }

    unsigned int subRange = (range * nextProbability);
    range = range - subRange;
    if (value >= range) { // we have the symbol 1
        value = value - range;
        range = subRange;
        output++;     // output = 1
    }

return output;
}

This is almost exactly what is used in crinkler, but this done in only 18 asm instructions! The crinkler arithmetic coder is using a 33 bit precision. The decoder only needs to handle up to 0x80000000 limit renormalization while the encoder needs to work on 64 bit to handle the 33 bit precision. This is much more convenient to work at this precision for the decoder, as it is able to easily detect renormalization (0x80000000 is in fact a negative number. The loop could have been formulated like while (range >= 0), and this is how it is done in asm).

So the arithmetic coder is the basic component used in crinkler. You will find plenty of arithmetic coder examples on Internet. Even if you don't fully understand the theory behind them, you can use them quite easily. I found for example an interesting project called flavor, which provides a tool to produce some arithmetic coders code based on a formal description (For example, a 32bit precision arithmetic coder description in flavor), pretty handy to understand how things are translated from different coder behaviors.

But, ok, the real brain here is not the arithmetic coder... but the psychic-context (the Context class above) which is responsible to provide a probability and to update its model based on the previous expectation. This is where a compressor is making the difference.

Context modeling - Context mixing


This is one great point about using an arithmetic coder: they can be decoupled from the component responsible to provide the probability for the next symbol. This component is called a context-modeling.

What is the context? It is whatever data can help your context-modeler to evaluate the probability for the next symbol to occur. Thus, the most obvious data for a compressor-decompressor is to use previous decoded data to update its internal probability table.

Suppose you have the following sequence of 8 bytes 0x7FFFFFFF,0xFFFFFFFF that is already decoded. What will be the next bit? It is certainly to be a 1, and you could bet on it as high as 98% of probability.

So this is not a surprise that using history of data is the key point for the context modeler to predict next bit (and well, we have to admit that our computer-psychic is not as good as he claims, as he needs to know the past to predict the future!).

Now that we know that to produce a probability for the next bit, we need to use historic data, how crinkler is using them? Crinkler is in fact maintaining a table of probability, up to 8 bytes + the current bits already read before the next bit. In the context-modeling jargon, it's often called the order (before context modeling, there was technique developped like PPM  for Partial Predition Matching and DMC for dynamic markov compression). But crinkler is using not only the last x bytes (up to 8), but sparse mode (as it is mentioned in PAQ compressors), a combination of the last 8 bytes + the current bits already read. Crinkler calls this a model: It is stored into a single byte :
  • The 0x00 model says that It doesn't use any previous bytes other than the current bits being read.
  • The 0x80 model says that it is using the previous byte + the current bits being read.
  • The 0x81 model says that is is using the previous byte and the -8th byte + the current bits being read.
  • The 0xFF model says that all 8 previous bytes are used
You probably don't see yet how this is used. We are going to take a simple case here: Use the previous byte to predict the next bit (called the model 0x80).

Suppose the sequence of datas :

0xFF, 0x80, 0xFF, 0x85, 0xFF, 0x88, 0xFF, ???nextBit???
         (0)         (1)         (2)    (3) | => decoder position

  • At position 0, we know that 0xFF is followed by bit 1 (0x80 <=> 10000000b). So n0 = 0, n1 = 1 (n0 denotes the number of 0 that follows 0xFF, n1 denotes the number of 1 that usually follows 0xFF)
  • At position 1, we know that 0xFF is still followed by bit 1: n0 = 0, n1 = 2
  • At position 2, n0 = 0, n1 = 3
  • At position 3, we have n0 = 0, n1 = 3, making the probability for one p(1) = (n1 + eps) / ( n0+eps + n1+eps). eps for epsilon, lets take 0.01. We have p(1) = (2+0.01)/(0+0.01 + 2+0.01) = 99,50%

So we have the probability of 99,50% at position (3) that the next bit is a 1.

The principle here is simple: For each model and an historic value, we associate n0 and n1, the number of bits found for bit 0 (n0) and bit 1 (n1). Updating those n0/n1 counters needs to be done carefully : a naive approach would be to increment according values when a particular training bit is found... but there is more chance that recent values are more relevant than olders.... Matt Mahoney explained this in The PAQ1 Data Compression Program, 2002. (Describes PAQ1), and describes how to efficiently update those counters for a non-stationary source of data :
  • If the training bit is y (0 or 1) then increment ny (n0 or n1).
  • If n(1-y) > 2, then set n(1-y) = n(1-y) / 2 + 1 (rounding down if odd).

Suppose for example that n0 = 3 and n1 = 4 and we have a new bit 1. Then n0 will be = n0/2 + 1 = 3/2+1=2 and n1 = n1 + 1 = 5

Now, we know how to produce a single probability for a single model... but working with a single model (for exemple, only the previous byte) wouldn't be enough to evaluate correctly the next bit. Instead, we need a way to combine different models (different selection of historic data). This is called context-mixing, and this is the real power of context modeling: whatever is your method to collect and calculate a probability, you can, at some point, mix severals estimator to calculate a single probability.

There are several ways to mix those probabilities. In the pure context-modeling jargon,  the model is the way you mix probabilities and for each model, you have a weight :
  • static: you determine the weights whatever the data are.
  • semi-static: you perform a 1st pass over the data to compress to determine the weights for each model, and them a 2nd pass with the best weights
  • adaptive: weights are updated dynamically as new bits are discovered.

Crinkler is using a semi-static context-mixing but is somewhat also "semi-adaptive", because It is using different weights for the code of your exe, and the data of your exe, as they have a different binary layout.

So how this is mixed-up? Crinkler needs to determine the best context-models (the combination of historic data) that It will use, assign for each of those context a weight. The weight is then used to calculate the final probability.


For each selected historic model (i) with an associated model weight wi, and ni0/ni1 bit counters, the final probability p(1) is calculated like this :

p(1) = Sum(  wi * ni1 / (ni0 + ni1))  / Sum ( wi )

This is exactly what is done in the code above for context.ComputeProbability();, and this is exactly what crinkler is doing.

In the end, crinkler is selecting a list of models for each type of section in your exe: a set of models for the code section, a set of models for the data section.

How many models crinkler is selecting? It depends on your data. For example, for ergon intro,crinklers is selecting the following models:

For the code section:
           0    1    2    3    4    5    6    7    8    9   10   11   12   13 
Model  {0x00,0x20,0x60,0x40,0x80,0x90,0x58,0x4a,0xc0,0xa8,0xa2,0xc5,0x9e,0xed,}
Weight {   0,   0,   0,   1,   2,   2,   2,   2,   3,   3,   3,   4,   6,   6,}

For the data section:
           0    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16   17   18   19 
Model  {0x40,0x60,0x44,0x22,0x08,0x84,0x07,0x00,0xa0,0x80,0x98,0x54,0xc0,0xe0,0x91,0xba,0xf0,0xad,0xc3,0xcd,}
Weight {   0,   0,   0,   0,   0,   0,   0,   1,   1,   2,   2,   2,   3,   3,   3,   4,   4,   4,   4,   5,}
(note that in crinkler, the final weight used to multiply n1/n0+n1 is by 2^w, and not wi itself).

Wow, does it means that crinkler needs to store those datas in your exe. (14 bytes + 20 bytes) * 2 = 68 bytes? Well, crinkler authors are smarter than this! In fact the models are stored, but weights are only store in a single int (32 bits for each section). Yep, a single int to stored those weights? Indeed: if you look at those weights, they are increasing, sometimes they are equal... So they found a clever way to store a compact representation of those weights in a 32 bit form. Starting with a weight of 1, the 32bit weight is shifted by one bit to the left : If this is 0, than the currentWeight doesn't change, if bit is 1, than currentWeight is incremented by 1 : (in this pseudo-code, shift is done to the right)

int currentWeight = 1;
int compactWeight = ....;
foreach (model in models) {
  if ( compactWeight & 1 )
    currentWeigh++;
  compactWeight =  compactWeight >> 1;

//  ... used currentWeight for current model
}

This way, crinkler is able to store a compact form of pairs (model/weight) for each type of data in your executable (code or pure data).

Model selection


Model selection is one of the key process of crinkler. For a particular set of datas, what is the best selection of models? You start with 256 models (all the combinations of the 8 previous bytes) and you need to determine the best selection of models. You have to take into account that each time you are using a model, you need to use 1 byte in your final executable to store this model. Model selection is part of crinkler compressor but is not part of crinkler decompressor. The decompressor just need to know the list of the final models used to compress the data, but doesn't care about intermediate results. On the other hand, the compressor needs to test every combination of model, and find an appropriate weight for each model.

I have tested several methods in my test code and try to recover the method used in crinkler, without achieving comparable compression ratio... I tried some brute force algo without any success... The selection algorithm is probably a bit clever than the one I have tested, and would probably require to layout mathematics/statistics formulas/combination to select an accurate method.

Finally, blueberry has given their method (thanks!)

"To answer your question about the model selection process, it is actually not very clever. We step through the models in bit-mirrored numerical order (i.e. 00, 80, 40, C0, 20 etc.) and for each step do the following:

- Check if compression improves by adding the model to the current set of models (taking into account the one extra byte to store the model).

- If so, add the model, and then step through every model in the current set and remove it if compression improves by doing so.

The difference between FAST and SLOW compression is that SLOW optimizes the model weights for every comparison between model sets, whereas FAST uses a heuristic for the model weights (number of bits set in the model mask).
"


On the other hand, I tried a fully adaptive context modelling approach, using dynamic weight calculation explained by Matt Mahoney with neural networks and stretch/squash functions (look at PAQ on wikipedia). It was really promising, as I was able to achieve sometimes better compression ratio than crinkler... but at the cost of a decompressor 100 bytes heavier... and even I was able to save 30 to 60 bytes for the compressed data, I was still off by 40-70 bytes... so under 4k, this approach was definitely not as efficient as a semi-static approach chosen by crinkler.

Storing probabilities


If you have correctly followed the previous model selection, crinkler is now working with a set of models (selection of history data), for each bit that is found, each model probabilities must be updated...

But think about it: for example, if to predict the following bit, we are using the probabilities for the 8 previous bytes, it means that for every combination of 8 bytes already found in the decoded data, we would have a pair of n0/n1 counters?

That would mean that we could have the folowing probabilities to update for the context 0xFF (8 previous bytes):
- "00 00 00 00 c0 00 00 50 00" => some n0/n1
- "00 00 70 00 00 00 00 F2 01" => another n0/n1
- "00 00 00 40 00 00 00 30 02" => another n0/n1
...etc.

and if we have other models like 0x80 (previous byte), or 0xC0 (the last 2 previous bytes), we would have also different counters for them:

// For model 0x80
- "00" => some n0/n1
- "01" => another n0/n1
- "02" => yet another n0/n1
...

// For model 0xC0
- "50 00" => some bis n0/n1
- "F2 01" => another bis n0/n1
- "30 02" => yet another bis n0/n1
...

From the previous model context, I have slightly over simplified the fact that not only the previous bytes is used, but also the current bits being read. In fact, when we are using for example the model 0x80 (using the previous byte), the context of the historic data is composed not only by the previous byte, but also by the bits being read on the current octet. This implies obviously that for every bit read, there is a different context. Suppose we have the sequence 0x75, 0x86 (in binary 10000110b), the position of the encoded bits is just after the 0x75 value and that we are using the previous byte + the bits currently read:

First, we start on a byte boundary
- 0x75 with 0 bit (we start with 0) is followed by bit 1 (the 8 of 0x85). The context is 0x75 + 0 bit read
- We read one more bit, we have a new context :  0x75 + bit 1. This context is followed by a 0
- We read one more bit, we have a new context :  0x75 + bit 10. This context is followed by a 0.
...
- We read one more bit, we have a new context :  0x75 + bit 1000011, that is followed by a 0 (and we are ending on a byte boundary).

Reading 0x75 followed by 0x86, with a model using only the previous byte, we finally have 8 context with their own n0/n1 to store in the probability table.

As you can see, It is obvious that It's difficult to store all context found (.i.e for each single bit decoded, there is a different context of historic bytes) and their respective exact probability counters, without exploding the RAM. Moreover if you think about the number of models that are used by crinkler: 14 types of different historic previous bytes selection for ergon's code!

This kind of problem is often handled using a hashtable while handling collisions. This is what is done in some of the PAQ compressors. Crinkler is also using an hashtable to store counter probabilities, with the association context_history_of_bytes = > (n0/n1), but It is not handling collision in order to keep minimal the size of the decompressor. As usual, the hash function used by crinkler is really tiny while still giving really good results.

So instead of having the association between  context_history_of_bytes => n0/n1, we are using a hashing function, hash(context_history_of_bytes) => n0/n1. Then, the dictionary that is storing all those associations needs to be correctly dimensioned, large enough, to store as much as possible associations found while decoding/encoding the data.

Like in PAQ compressors, crinkler is using one byte for each counter, meaning that n0 and n1 together are taking 16 bit, 2 bytes. So if you instruct crinkler to use a hashtable of 100Mo, It will be possible to store 50 millions of different keys, meaning different historic context of bytes and their respective probability counters. There is a little remark about crinkler and the byte counter: in PAQ compressors, limits are handled, meaning that if a counter is going above 255, It will stuck to 255... but crinkler made the choice to not test the limits in order to keep the code smaller (although, that would take less than 6 bytes to test the limit). What is the impact of this choice? Well, if you know crinkler, you are aware that crinkler doesn't handle large section of "zeros" or whatever empty initialized data. This is just because the probabilities are looping from 255 to 0, meaning that you jump from a 100% probability (probably accurate) to almost a 0% probability (probably wrong)  every 256 bytes. Is this really hurting the compression? Well, It would hurt a lot if crinkler was used for larger executable, but in a 4k, It's not hurting so much (although, It could hurt if you really have large portions of initialized data). Also, not all the context are reseted at the same time (a 8 byte context will not probably reset as often as a 1 byte context), so it means that final probability calculation is still accurate... while there is a probability that is reseted, other models with their own probabilities are still counting there... so this is not a huge issue.

What happens also if the hash for a different context is giving the same value? Well, the model is then updating the wrong probability counters. If the hashtable is too small the probability counters may really be too much disturbed and they would provide a less accurate final probability. But if the hashtable is large enough, collisions are less likely to happen.

Thus, it is quite common to use a hashtable as large as 256 to 512Mo if you want, although 256Mo is often enough, but the larger is your hashtable, the less are collisions, the more accurate is your probability. Recall from the beginning of this post, and you should understand now why "crinkler can take several hundreds of megabytes to decompress"... simply because of this hashtable that store all the probabilities for the next bit for all models combination used.

If you are familiar with crinkler, you already know the option to find a best possible hashsize for an initial hashtable size and a number of tries (hashtries option). This part is responsible to test different size of hashtable (like starting from 100Mo, and reducing the size by 2 bytes 30 times, and test the final compression) and test final compression result. This is a way to empirically reduce collision effects by selecting the hashsize that is giving the better compression ratio (meaning less collisions in the hash). Although this option is only able to help you save a couple of bytes, no more.


Data reordering and type of data


Reordering or organizing differently the data to have a better compression is one of the common technique in compression methods. Sometimes for example, Its better to store deltas of values than to store values themselves...etc.

Crinkler is using this principle to perform data reordering. At the linker level, crinkler has access to portion of datas and code, and is able to move those portions around in order to achieve a better compression ratio. This is really easy to understand : suppose that you have a series initialized zero values in your data section. If those values are interleaved with non zero values, the counter probabilities will switch from "there are plenty of zero there" to "ooops, there are some other datas"... and the final probability will balance between 90% to 20%. Grouping data that are similar is a way to improve the overall probability correctness.

This part is the most time consuming, as It needs to move and arrange all portions of your executable around, and test which arrangement is giving the best compression result. But It's paying to use this option, as you may be able to save 100 bytes in the end just with this option.

One thing that is also related to data reordering is the way crinkler is handling separately the binary code and the data of your executable. Why?, because their binary representation is different, leading to a completely different set of probabilities. If you look at the selected models for ergon, you will find that code and data models are quite different. Crinkler is using this to achieve better performance here. In fact, crinkler is compressing completely separately the code and the datas. Code has its own models and weights, Data another set of models and weights. What does it means internally? Crinkler is using a set of model and weights to decode the code section of your exectuable. Once finished, It will erase the probability counters stored in the hashtable-dictionary, and go to the data section, with new models and weights. Reseting all counters to 0 in the middle of decompressing is improving compression by a factor of 2-4%, which is quite impressive and valuable for a 4k (around 100 to 150 bytes).

I found that even with an adaptive model (with a neural networks dynamically updating the weights), It is still worth to reset the probabilities between code and data decompression. In fact, reseting the probabilities is an empirical way to instruct the context modeling that datas are so different that It's better to start from scratch with new probability counters. If you think about it, an improved demo compressor (for larger exectuable, for example under 64k) could be clever to detect those portions of datas that are enough different that It would be better to reset the dictionary than to keep it as it is.

There is just one last thing about weights handling in crinkler. When decoding/encoding, It seems that crinkler is artificially increasing the weights for the first discovered bit. This little trick is improving compression ratio by about 1 to 2% which is not bad. Having higher weights at the beginning enable to have a better response of the compressor/decompressor, even If it doesn't still have enough data to compute a correct probability. Increasing the weights is helping the compression ratio at cold start.

Crinkler is also able to transform the x86 code for the executable part to improve compression ratio. This technique is widely used and consist of replacing relative jump (conditionnal, function calls...etc.) to absolute jump, leading to a better compression ratio.

Custom DLL LoadLibrary and PE file optimization


In order to strip down the size of an executable, It's necessary to exploit as much as possible the organization of a PE file.

First thing that crinkler is using is that lots of part in a PE files are not used at all. If you want to know how a windows executable PE files can be reduced, I suggest you read Tiny PE article, which is a good way to understand what is actually used by a PE loader. Unlike the Tiny PE sample, where the author is moving the PE header to the dos header, crinkler made the choice to use this unused place to store hash values that are used to reference DLL functions used.

This trick is called import by hashing and is quite common in intro's compressor. Probably what make crinkler a little bit more advanced is that to perform the "GetProcAddress" (which is responsible to get the pointer to a function from a function name), crinkler is navigating inside internal windows process structure in order to directly get the address of the functions from the in-memory import table. Indeed, you won't find any import section table in a crinklerized executable. Everything is re-discovered through internal windows structures. Those structures are not officially documented but you can find some valuable information around, most notably here.

If you look at crinkler's code stored in the crinkler import section, which is the code injected just before the intros start, in order to load all dll functions, you will find those cryptics calls like this:
//
    (0) MOV         EAX, FS:[BX+0x30]
    (1) MOV         EAX, [EAX+0xC]
    (2) MOV         EAX, [EAX+0xC]
    (3) MOV         EAX, [EAX]
    (4) MOV         EAX, [EAX]
    (5) MOV         EBP, [EAX+0x18]


This is done by going through internal structures:
  • (0) first crinklers gets a pointer to the "PROCESS ENVIRONMENT BLOCK (PEB)" with the instruction  MOV EAX, FS:[BX+0x30]. EAX is now pointing to the PEB 
Public Type PEB 
InheritedAddressSpace As Byte
    ReadImageFileExecOptions As Byte
    BeingDebugged As Byte
    Spare As Byte
    Mutant As Long
    SectionBaseAddress As Long
    ProcessModuleInfo As Long // <---- PEB_LDR_DATA
    ProcessParameters As Long // RTL_USER_PROCESS_PARAMETERS
    SubSystemData As Long
    ProcessHeap As Long
    ... struct continue

  • (1) Then it gets a pointer to the "ProcessModuleInfo/PEB_LDR_DATA" MOV EAX, [EAX+0xC]
Public Type _PEB_LDR_DATA
    Length As Integer
    Initialized As Long
    SsHandle As Long
    InLoadOrderModuleList As LIST_ENTRY  // <---- LIST_ENTRY InLoadOrderModuleList
    InMemoryOrderModuleList As LIST_ENTRY
    InInitOrderModuleList As LIST_ENTRY
    EntryInProgress As Long
End Type

  • (2) Then it gets a pointer to get a pointer to the next "InLoadOrderModuleList/LIST_ENTRY" MOV EAX, [EAX+0xC].
Public Type LIST_ENTRY    Flink As LIST_ENTRY
    Blink As LIST_ENTRY
End Type

  • (3) and (4) Then it navigates through the LIST_ENTRY linked list MOV EAX, [EAX]. This is done 2 times. First time, we get a pointer to the NTDLL.dll, second with get a pointer to the KERNEL.DLL. Each LIST_ENTRY is in fact followed by the structure LDR_MODULE :

Public Type LDR_MODULE
    InLoadOrderModuleList As LIST_ENTRY
    InMemoryOrderModuleList As LIST_ENTRY
    InInitOrderModuleList As LIST_ENTRY
    BaseAddress As Long
    EntryPoint As Long
    SizeOfImage As Long
    FullDllName As UNICODE_STRING
    BaseDllName As UNICODE_STRING
    Flags As Long
    LoadCount As Integer
    TlsIndex As Integer
    HashTableEntry As LIST_ENTRY
    TimeDateStamp As Long
    LoadedImports As Long
    EntryActivationContext As Long ‘ // ACTIVATION_CONTEXT
    PatchInformation As Long
End Type

Then from the BaseAddress of the Kernel.dll module, crinkler is going to the section where functions are already loaded in memory. From there, the first hashed function that is stored by crinkler is LoadLibrary function. After this, crinkler is able to load all the depend dll and navigate through the import tables, recomputing the hash for all functions names for dependent dlls, and is trying to match the hash stored in the PE header. If a match is found, then the function entry point is stored.

This way, crinkler is able to call some OS functions stored in the Kernel.DLL, without even linking explicitly to those DLL, as they are automatically loaded whenever a DLL is loaded. Thus achieving a way to import all functions used by an intro with a custom import loader.

Compression results


So finally, you may ask, how much crinkler is good at compressing? How does it compare to other compression method? How does look like the entropy in a crinklerized exe?

I'll take the example of Ergon exe. You can already find a detailed analysis for this particular exe.

Comparison with other compression methods


In order to make a fair comparison between crinkler and other compressors, I have used the data that are actually compressed by crinkler after the reordering of code and data (This is done by unpacking a crinklerized ergon.exe and extracting only the compressed data). This comparison is accurate in that all compressors are using exactly the same data.

In order also to be fair with crinkler, the size of 3652 is not taking into account the PE header + the crinkler decompressor code (which in total is 432 bytes for crinkler).

To perform this comparison, I have only used 7z which has at least 3 interesting methods to test against :
  • Standard Deflate Zip
  • PPMd with 256Mo of dictionary
  • LZMA with 256Mo of dictionary
I have also included a comparison with a more advanced packing method from Matt Mahoney resource, Paq8l which is one of the version of PAQ methods, using neural networks and several context modeling methods.

Program Compression Method Size in bytes Ratio vs Crinkler
none uncompressed 9796
crinkler ctx-model 256Mo 3652 +0,00%
7z deflate 32Ko 4526 +23,93%
7z PPMd 256Mo 4334 +18,67%
7z LZMA 256Mo 4380 +19,93%
Paq8l dyn-ctx-model 256Mo 3521 -3,59%

As you can see, crinkler is far more efficient than any of the "standard" compression method (Zip, PPMd, LZMA). I'm not even talking about the fact that a true comparison would be to include the decompressor size, so the ratio should certainly be worse for all standard methods!

Paq8l is of course slightly better... but if you take into account that Paq8l decompressor is itself an exe of 37Ko... compare to the 220 byte of crinkler... you should understand now how much crinkler is highly efficient in its own domain! (remember? 4k!)

Entropy


In order to measure the entropy of crinkler, I have developed a very small program in C# that is displaying the entropy of an exe. From green color (low entropy, less bits necessary to encode this information) to red color (high entropy, more bits necessary to encode this information).

I have done this on 3 different ergon executable :
  • The uncompressed ergon.exe (28Ko). It is the standard output of a binary exe with MSVC++ 2008.
  • The raw-crinklerized ergon.exe extracted code and data section, but not compressed (9796 bytes)
  • The final crinklerized ergon.exe file (4070 bytes)
Ergon standard exe entropy
Ergon code and data crinklerized, uncompressed reordered data
Ergon executable crinklerized
As expected, the entropy is fairly massive in a crinklerized exe. Compare with the waste of information in a standard windows executable. Also, you can appreciate how much is important the reordering and packing of data (no compression) that is perform by crinkler.

Some notes about the x86 crinkler decompressor asm code


I have often talked about how much crinkler decompressor is truly a piece of x86 art.  It is hard to describe the technique used here, there are lots of x86 standard optimization and some really nice trick. Most notably:
  1. using all the registers
  2. using intensively the stack to save/restore all the registers with pushad/popad x86. This is for example done (1 + number_of_model) per bit. If you have 15 models, there will be a total of 16 pushad/popad instructions for a single bit to be decoded! You may wonder why making so many pushes? Its the only way to efficiently use all the registers (rule #1) without having to store particular registers in a buffer. Of course, push/pop instruction is also used at several places in the code as well.
  3. As a result of 1) and 2), apart from the hash dictionnary, no intermediate structure are used to perform the context modeling calculation.
  4. Deferred conditional jump: Usually, when you perform some conditional testing with x86, this is often immediately followed by a conditional jump (like cmp eax, 0; jne go_for_bla). In crinkler, sometimes, a conditionnal test is done, and is used several instruction laters. (for example. cmp eax,0; push eax; mov eax, 5; jne go_for_bla <---- this is using the result of cmp eax,0 comparison). It makes the code to read a LOT harder. Sometimes, the conditional is even used after a direct jump! This is probably one part of crinkler's decompressor that impressed me the most. This is of course something quite common if you are programming heavily optimized-size x86 asm code... you need to know of course which instructions is not modifying CPU flags in order to achieve this kind of optimization!

Final words


I would like to apologize for the lack of charts, pictures to explain a little bit how things are working.  This article is probably still obscure for a casual reader, and should be considered as a draft version. This was a quick and dirty post. I wanted to write this for a long time, so here it is, not perfect as it should be, but this may be improved in future versions!

As you can see, crinkler is really worth to look at. The effort to make it so efficient is impressive and there is almost no doubt that there won't be any other crinkler competitor for a long time! At least for a 4k executable. Above 4k, I'm quite confident that there are still lots of area that could be improved, and probably kkrunchy is far from being the ultimate packer under 64k... Still, if you want a packer, you need to code it, and that's not so trivial!

19 comments:

  1. Nice article, I've been meaning to write one like this for a while too!

    ReplyDelete
  2. Nice write up. Got a headache now but still. well done... ;-)

    ReplyDelete
  3. Wow, best artikel ever read about compressing methode !!!

    ReplyDelete
  4. Great article, and amazingly accurate! Quite impressive work you have done there dissecting all the little tidbits of Crinkler's header and import code. Must have been loads of fun. ;)

    I only spotted three minor errors:

    - To get the actual model weight used, rather than adding one, it is exponentiated, i.e. for weight w, the counter values for that model are multiplied by 2^w.

    - Since a particular context can only occur once per byte, the counters wrap once every 256 bytes, rather than 256 bits (for areas where all bytes have the same value).

    - The import code does not call GetProcAddress as such. Rather, it digs into internal DLL structures to imitate the functionality of GetProcAddress. If we were to use GetProcAddress for this, we would run into a chicken-and-egg problem in getting the address of GetProcAddress in the first place. This is also the reason Crinkler does not support forwarded RVAs, which is a different way of storing the function reference internally in the DLL.

    To answer your question about the model selection process, it is actually not very clever. We step through the models in bit-mirrored numerical order (i.e. 00, 80, 40, C0, 20 etc.) and for each step do the following:

    - Check if compression improves by adding the model to the current set of models (taking into account the one extra byte to store the model).

    - If so, add the model, and then step through every model in the current set and remove it if compression improves by doing so.

    The difference between FAST and SLOW compression is that SLOW optimizes the model weights for every comparison between model sets, whereas FAST uses a heuristic for the model weights (number of bits set in the model mask).

    ReplyDelete
  5. Thanks all for your kind comments and happy to see that It was interesting although It is still lacking lots of explanation, diagrams to be easier to read!

    Hey Blueberry! glad to see you around and to learn that this post is not completely wrong! :)

    Thanks for spotting these errors, I have fixed them! For the weights, indeed, I now recall the 2^w, but I didn't look at the code when I wrote this article... and probably, one and a half year later, I have oversimplified the way it was calculated!

    Woops, of course for the counter wrapping... there is a context for every single bit that is read... so yep, It's every 256 byte that the probability is reseted!

    GetProcAddress, oops, again, you are right, you are mimicking GetProcAddress but you are not calling it. The only function that is called explicitly from the import code is the LoadLibrary function which was recovered from the handwritten GetProcAddress.

    And thanks for the model selection explanation! Weird, this is almost the method I tried while using dynamic neural networks, removing other models and testing compression once a model was selected... but I did it probably wrong... need to check this.

    ReplyDelete
  6. Great article! Didn't know this blog - and subscribed!

    ReplyDelete
  7. @lx: sorry for being a noob, but how would you use crinkler with .net executables? I tried but crinkler complains about "unsupported file type"

    ReplyDelete
  8. @Jason: crinkler is a linker-compressor and is only working directly with plain native .obj generated from a c/c++/asm compiler.

    Thus, It won't work with .net executables. Although you could use crinkler (or better kkrunchy, as crinkler is targetted to 4Ko exe and won't work for any compressed exe > 64k) with .net executables, but it requires to develop a mini-native exe acting as a CLR host that will load the executable assembly (embedded into the native exe in whatever mode, resources, plain property... etc.). The drawback of this solution is that you won't be able to run in anycpu but only in x86 platform.

    It's better to rely on pure .Net solution for .Net assemblies. Unfortunately, none of the current existing solutions are good packers (netz, sixxpack, exepack, mpress) compare to context modeling technique. I have developed already a .net exe compressor which is using context-modeling, with far better compression ratio than other methods, but It's not released yet (I will try to release it during this year)

    ReplyDelete
  9. ahh okay, i see. yes I was wondering about how it would compare with other zip based packers, especially currious about how it would work with the metadata stored in the .net assembly, but I guess it doesnt ;)

    yah i'd be very interested to see how your content modeling .net compressor works!!

    ReplyDelete
  10. >> I was wondering about how it would compare with other zip based packers
    From my test, It's a lot more efficient. On the test I did on a .net executables of 64Ko:
    - mpress 32Ko
    - zip-method 50Ko
    - context-modeling-method 20Ko

    ReplyDelete
  11. Superb article, ever since I first encountered Crinkler I have wanted to learn more about how it works. Thanks for taking the time to collect all this information and sharing it.

    ReplyDelete
  12. Fantastic article, really well written, researched, and presented. It really made my day to read this, and your analysis/explanation of PAQ-style encoders is great for those who haven't been exposed to them.

    ReplyDelete
  13. Fantastic article :) I would've appreciated some references for further review and possibly some "bug fixing" (things like mentioning how Crinkler uses GetProcAddress is wrong, for example), but aside from that, this is ace! Thanks for the read :)

    ReplyDelete
  14. Amazing article. It was extremely well worded and yet absolutely fantastic with a lot of details. Kudos for the thorough and detailed research.

    ReplyDelete
  15. Thanks Ferris and Decipher!

    Ferris, could you elaborate on missing references, what would be helpful? (I put mainly Mahoney that is the main reference here for context-modeling with neural networks). Also, what is exactly wrong in description of crinkler simulating GetProcAddress? (I though It was incomplete but not wrong! :) For example, I'm not detailing how crinkler is navigating inside the IAT module in-memory, after getting the address of the DLL module... but apart from that, I should not be wrong)

    ReplyDelete
  16. What you had was correct about moving through the loaded .dll structures :) I was referring to the use of GetProcAddress in the first place - one of the benefits of import by hash is the complete bypass of this function! With the actual hash and moving through the .dll header, you're actually emulating GetProcAddress' functionality altogether :) .

    Another correction (if I might add) is that Crinkler actually DOES embed the PE header inside the MZ header, just a few bytes after where the TinyPE example does. If you look at a Crinkler .exe's entry point, it's between the "MZ" and "PE" symbols near the very beginning of the file. But you're completely right about it using most of this space to store hash :) .

    And about references, I'm not accusing you of plagiarism or anything, I was just left more curious about some of the things (particularly arithmetic encoding and context modelling) and failed to notice many of the links you've included. Apologies :) .

    I'd like to reiterate, though, how awesome I find this article :) . Seriously, well done.

    ReplyDelete
  17. ah, apologies again, it seems you said exactly what I just wrote about emulating GetProcAddress. Sorry again :)

    ReplyDelete
  18. Oh ok, thanks for your feedback!

    I agree that GetProcAddress part and PE file optim is far from being complete. I wrote this part in a couple of minutes just before releasing the article... and was a bit too lazy to explain more in-depth how things are tightly organized! So you are right about mentioning the PE/MZ header collapsing, as this is the first thing you can see when you inspect a crinklerized PE.

    About references, yeah, links are spread into the document and I didn't assemble them at the bottom of this post. The thing also is that I worked on dissecting crinkler and coding my own context-modeling compressor around mid-2009, while I wrote this article in dec-2010... in the meantime, I forgot lots of details that would have been even more helpful here! Also, I worked a lot more on trying several context-modeling compression/technique than to dissect crinkler (that was just done in a few days)... all this work on context-modeling for developing small compressors is worth an article in itself, as context-modeling compressor is quite a fascinating method!

    ReplyDelete