I had this curious thought the other day: what is the byte value distribution
in binary files, such as an executable? Take for example /bin/echo
on OS X
10.11.1.
What we want to find out more precisely is: how likely does a certain byte value appear in a file? And how does it look like when we visualize that distribution?
This is going to be a field day thanks to a few handy classes and methods in the Python 3 standard library. Let’s get started then.
Reading the file
While reading binary files, we need to make sure that we open a file in binary
mode. This can be accomplished by passing the "rb"
(open for reading and
enable binary mode) flag to the open function.
This keeps the Python runtime from decoding the contents of the file and
instead allows us to access the raw bytes. Otherwise we would get a UTF-8
decoding error, as our binary file /bin/echo
contains bytes that cannot be
decoded into valid UTF-8 characters.
This is one of the biggest changes in Python 3, compared to Python 2. The way
byte strings and unicode strings are being handled can be a challenge when
porting code from Python 2 to 3, but makes things a lot easier once it’s done.
Byte strings, which are either of type bytes
or bytearray
in Python 3, have
a different intent than UTF-8 strings, which are of type str
. While byte
strings are, as their name says, for handling binary data, such as executables,
serialized files and more, UTF-8 strings are for handling natural languages.
This could be an HTML template using German with lots of umlauts, or a regular
expression looking through texts written in Japanese.
We also make use of the open()
context manager, which is a good habit for
reasons. It allows open files and their file descriptors to be garbage
collected immediately after leaving the context manager and it makes exception
handling easier, as the context manager closes file descriptors even in the
presence of exceptions and passes the exceptions on after all resources have
been freed.
Note also that byte-wise access
with open("/bin/echo", "rb") as fd:
contents = fd.read()
A lot of thought has gone into just two lines of Python code. That’s what I like about the language. It’s well designed for these purposes.
Counting bytes
Now, we are going to make use of the
Counter
class in the collections
module. The Counter
class allows us to easily
count the occurrence of individual elements in an iterable
object, such as
the byte contents of a file. Let’s get going.
from collections import Counter
c = Counter(contents)
What we have now is a Counter
instance, that stores the absolute frequency of
the same items over the whole contents
bytes list. The Counter
class
supports basic operations like retrieving the most common elements or
subtracting one Counter
instance from another. The most common byte can be
retrieved like so
for value, frequency in c.most_common(10):
print("0x{:02x}: {}".format(value, frequency))
0x00: 12309
0x01: 215
0x30: 149
0x65: 134
0x74: 130
0x20: 127
0x69: 108
0x03: 100
0x72: 97
0x06: 94
It’s interesting to see that the 0x00
byte is the most frequent value. I am
not knowledgeable on the topic of executable file formats, but I suspect it’s
due to the padding of the file’s segments.
Getting to the probability distribution
What we want to calculate is not only the absolute distribution, that is, how
many times a certain byte value occurs, but also the probability distribution.
With the probability distribution we know instead what the likelihood is that
the next byte we read from /bin/echo
has a certain value.
To get to a probability distribution, we need to do some calculations. I went
ahead and created a helper method, that uses a Counter
object and the length
of the file contents to adjust every byte value count by the number of bytes in
the file. Inside of the helper method there is a generator function that loops
over a Counter
instance and applies the calculation.
We need to wrap the generator inside of a dict, so that the Counter
does not
count the frequency of value-frequency tuples. These tuples are all unique.
Instead it should apply a dict to itself and allow us to use all the extra
features that the Counter
class offers.
def probability_distribution(content):
def _helper():
absolute_distribution = Counter(content)
length = len(content)
for value, frequency in absolute_distribution.items():
yield int(value), float(frequency) / length
return Counter(dict(_helper()))
c = probability_distribution(contents)
Now, if we want to see the most common elements, we can do the same as before:
print("Probability Distribution")
for value, frequency in c.most_common(10):
print("0x{:02x}: {:.04f}".format(value, frequency))
Probability Distribution
0x00: 0.6826
0x01: 0.0119
0x30: 0.0083
0x65: 0.0074
0x74: 0.0072
0x20: 0.0070
0x69: 0.0060
0x03: 0.0055
0x72: 0.0054
0x06: 0.0052
The advantage with this way of displaying a value distribution is that we can compare it to other files more easily compared to the absolute distribution as file lengths can vary, but the distribution of certain values might not. This could for example apply to padding bytes.
For example, if we run our script on /bin/pwd
instead, we see the following
probability distribution:
Probability Distribution
0x00: 0.6911
0x01: 0.0120
0x30: 0.0081
0x74: 0.0078
0x65: 0.0074
0x20: 0.0072
0x5f: 0.0060
0x69: 0.0058
0x70: 0.0054
0x03: 0.0053
We can see that the first 6 byte values appear in /bin/echo
as well, in a
slightly different ranking.
For the great finale, we visualize the distribution in the terminal.
max_prob = c.most_common(1)[0][1]
for value, frequency in c.most_common():
print("0x{:02x}: {}".format(value, "█" * int(frequency * 80/max_prob)))
0x00: ████████████████████████████████████████████████████████████████████████████████
0x01: █
0x30:
0x74:
0x65:
0x20:
0x5f:
...
That’s not too interesting. If we ignore the 0 byte, things look a lot more interesting.
c = probability_distribution(list(filter(bool, contents)))
max_prob = c.most_common(1)[0][1]
for value, frequency in c.most_common():
print("0x{:02x}: {}".format(value, "█" * int(frequency * 80/max_prob)))
0x01: ████████████████████████████████████████████████████████████████████████████████
0x30: █████████████████████████████████████████████████████
0x74: ███████████████████████████████████████████████████
0x65: █████████████████████████████████████████████████
0x20: ███████████████████████████████████████████████
0x5f: ███████████████████████████████████████
0x69: ██████████████████████████████████████
0x70: ███████████████████████████████████
0x03: ███████████████████████████████████
0x06: ███████████████████████████████████
0x72: ████████████████████████████████
0xff: █████████████████████████████
0x04: ████████████████████████████
0x63: ████████████████████████████
0x02: ██████████████████████████
0x61: ██████████████████████████
0x6e: █████████████████████████
0x6f: ████████████████████████
0x31: ██████████████████████
...
That looks interesting. Note that in both cases the values are scaled, so that the value with the highest probability is shown with 80 bar characters.