I had this curious thought the other day: what’s the byte value distribution in binary files, such as an executable? Take for example /bin/echo on OS X 10.11.1.

What we want to find out more precisely is: how likely does a certain byte value appear in a file? And how does it look like when we visualize that distribution?

This is going to be a field day thanks to a few handy classes and methods in the Python 3 standard library. Let’s get started then.

Reading the file

While reading binary files, we need to make sure that we open a file in binary mode. You can achieve this by passing the "rb" flag to the open function. This flag tells Python and the operating system to open the file read-only and to interpret the contents as raw bytes. The r stands for read-only and the b stands for binary.

The b flag keeps the Python runtime from trying to decode the contents of the file and instead gives them to as raw bytes. If we don’t do this, Python throws a UTF-8 decoding error. The binary file /bin/echo contains bytes that Python can’t decode into valid UTF-8 characters.

This is one of the biggest changes in Python 3, compared to Python 2. The way the two handle byte strings and Unicode strings can be a challenge when porting code from Python 2 to 3. Yet, it makes things a lot easier once you’re done porting. Byte strings, which are either of type bytes or bytearray in Python 3, have a different intent than UTF-8 strings, which are of type str.

Byte strings let you process binary data, such as executables, serialized files and more. UTF-8 strings are for handling natural languages. This could be an HTML template using German with lots of umlauts, or a regular expression looking through texts written in Japanese.

We also use the open() context manager. You can open files and Python closes and garbage collects file descriptors immediately after leaving the context manager. This makes exception handling easier as well. The context manager closes file descriptors even in when the inner code raises an exception.

Once the inner code raises an exception, the context manager first frees all resources before re-raising this exception. This means that you can have your cake and eat it too. Python cleanly handles resource access and you can have clean exception handling, too.

Note also that byte-wise access

with open("/bin/echo", "rb") as fd:
    contents = fd.read()

A lot of thought has gone into just two lines of Python code. That’s what I like about the language. It’s well designed for these purposes.

Counting bytes

Here, we use the Counter class in the collections module. The Counter class counts the occurrence of unique elements in an iterable object, such as the byte contents of a file. Let’s get going.

from collections import Counter
c = Counter(contents)

This creates a Counter instance and it stores the absolute frequency of the same items over the whole contents bytes list. The Counter class supports basic operations like retrieving the most common elements or subtracting one Counter instance from another.

You can retrieve the most common byte by calling the most_common function like in the following snippet:

for value, frequency in c.most_common(10):
    print("0x{:02x}: {}".format(value, frequency))

0x00: 12309
0x01: 215
0x30: 149
0x65: 134
0x74: 130
0x20: 127
0x69: 108
0x03: 100
0x72: 97
0x06: 94

It’s interesting to see that the 0x00 byte is the most frequent value. I am not knowledgeable on the topic of executable file formats, but I suspect it’s due to the padding of the file’s segments.

Getting to the probability distribution

We not only want to calculate the absolute distribution, meaning, how many times a certain byte value occurs. We’re also interested in calculating the probability distribution for all occurring bytes. With the probability distribution we know what the likelihood is that the next byte we read from /bin/echo has a certain value.

The function probability_distribution that I define here can calculate this probability distribution for us. I went ahead and created a helper method. This helper method uses a Counter object and the file length to adjust every byte value count by the number of bytes in the file. Inside of the helper method there is a generator function that loops over a Counter instance and runs the calculation.

We need to wrap the generator inside of a dict, so that the Counter doesn’t count the frequency of value-frequency tuples. These tuples are all unique.

Instead it should apply a dict to itself and let us use all the extra features that the Counter class offers.

def probability_distribution(content):
    def _helper():
        absolute_distribution = Counter(content)
        length = len(content)
        for value, frequency in absolute_distribution.items():
            yield int(value), float(frequency) / length
    return Counter(dict(_helper()))

c = probability_distribution(contents)

Now, if we want to see the most common elements, we can do the same as before:

print("Probability Distribution")
for value, frequency in c.most_common(10):
    print("0x{:02x}: {:.04f}".format(value, frequency))

Probability Distribution
0x00: 0.6826
0x01: 0.0119
0x30: 0.0083
0x65: 0.0074
0x74: 0.0072
0x20: 0.0070
0x69: 0.0060
0x03: 0.0055
0x72: 0.0054
0x06: 0.0052

The advantage with this way of displaying a value distribution is that we can compare it to other files more directly. The absolute distribution can change depending on the file length, but the relative distribution of certain values might not. This applies to padding bytes as well.

For example, if we run our script on /bin/pwd instead, we see the following probability distribution:

Probability Distribution
0x00: 0.6911
0x01: 0.0120
0x30: 0.0081
0x74: 0.0078
0x65: 0.0074
0x20: 0.0072
0x5f: 0.0060
0x69: 0.0058
0x70: 0.0054
0x03: 0.0053

We can see that the first 6 byte values appear in /bin/echo as well, in a slightly different ranking.

For the great finale, we visualize the distribution in the terminal.

max_prob = c.most_common(1)[0][1]
for value, frequency in c.most_common():
    print("0x{:02x}: {}".format(value, "█" * int(frequency * 80/max_prob)))

0x00: ████████████████████████████████████████████████████████████████████████████████
0x01: █
0x30:
0x74:
0x65:
0x20:
0x5f:
...

That’s not too interesting. If we ignore the 0 byte, things look a lot more interesting.

c = probability_distribution(list(filter(bool, contents)))

max_prob = c.most_common(1)[0][1]

for value, frequency in c.most_common():
    print("0x{:02x}: {}".format(value, "█" * int(frequency * 80/max_prob)))

0x01: ████████████████████████████████████████████████████████████████████████████████
0x30: █████████████████████████████████████████████████████
0x74: ███████████████████████████████████████████████████
0x65: █████████████████████████████████████████████████
0x20: ███████████████████████████████████████████████
0x5f: ███████████████████████████████████████
0x69: ██████████████████████████████████████
0x70: ███████████████████████████████████
0x03: ███████████████████████████████████
0x06: ███████████████████████████████████
0x72: ████████████████████████████████
0xff: █████████████████████████████
0x04: ████████████████████████████
0x63: ████████████████████████████
0x02: ██████████████████████████
0x61: ██████████████████████████
0x6e: █████████████████████████
0x6f: ████████████████████████
0x31: ██████████████████████
...

That looks interesting. Note that in both cases we’ve scaled the values, and show the value with the highest probability using 80 bar characters.

Byte Histogram in Python 3

Reading the file

Counting bytes

Getting to the probability distribution

Tags