Memory Usage of Pandas Categoricals

Published: February 23, 2018, updated: January 4, 2025

A great way to reduce the memory usage of Pandas DataFrames is to convert columns that hold categorical variables to the categorical data type.

The Pandas documentation explains in detail how to use this data type in your columns. In this article we want to explore what memory savings we can typically expect when working with it.
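As a quick illustration of the idea, a single column of repeated strings can be converted with .astype('category'). Below is a minimal sketch with a made-up color column (the column name and values are purely illustrative):

import pandas as pd

# Hypothetical column with many repeated string values.
df = pd.DataFrame({'color': ['red', 'green', 'blue', 'red'] * 1000})

# Converting to the categorical dtype stores one small integer code per row
# plus a single copy of each distinct string.
df['color'] = df['color'].astype('category')
print(df['color'].dtype)  # category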

1. Setup

First, we import Pandas, NumPy, and Matplotlib. Furthermore, we adjust the number of rows shown when displaying Pandas DataFrames in Jupyter notebooks, and the default figure size for Matplotlib.

%matplotlib inline
import string

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

pd.set_option('display.max_rows', 6)

plt.rcParams["figure.figsize"] = 10, 5

2. Test data

Next, we create test data. Since we’re concerned with efficiently storing categorical variables, we create random strings that we load into a Pandas Series.

First, we define the function that creates n random words. Here, we simply choose random letters out of string.ascii_letters, which can be found in the Python 3 standard library. We also seed numpy.random with 1 just to get consistent results.

def random_words(n, word_length=8):
    np.random.seed(1)
    random_chars = np.random.choice(
        list(string.ascii_letters),
        (n, word_length),
    )
    return np.array([
        "".join(word)
        for word in random_chars
    ])

Let’s take a look at two example random words.

random_words(2)

Output:

array(['LRmijlfp', 'aqbmhTgz'], dtype='<U8')

We then move on to create the function that gives us a random Pandas Series of words. For this, we invoke the random_words function we created before and choose as many different words as we need to create a Series of length length with n_categories different categories.

We also allow specifying a third argument called categorical, which lets us decide whether the data should be stored as plain Python objects or as memory-efficient Pandas categorical values. This is important for the benchmarks later on.

def random_words_series(length, n_categories, categorical=True):
    np.random.seed(1)
    categories = random_words(n_categories)
    words = np.random.choice(
        categories,
        length,
    )
    return pd.Series(
        words,
        dtype='category' if categorical else object,
    )

As an example, this is how to create a Series of random words with 3 different categories and a total length of 6.

example = random_words_series(
    6, 3
)
example.to_frame()

Output:

0
0 aqbmhTgz
1 YuLsulQC
2 YuLsulQC
3 LRmijlfp
4 aqbmhTgz
5 aqbmhTgz

We can then easily see that this Series has exactly 3 categories by inspecting its .dtype attribute.

len(example.dtype.categories)

Output:

3
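Internally, a categorical Series stores one small integer code per value plus the list of unique categories. We can peek at both on the example Series from above via the .cat accessor (a short illustrative sketch):

print(example.cat.codes.values)   # one small integer per value
print(example.dtype.categories)   # the three unique words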

Now, to compare a few examples, we create Series with three different numbers of categories: 1, 10, and 100. Each Series has 10000 values. Furthermore, for comparison we create each Series twice: once with the data type set to categorical and once using plain Python objects. This gives us 6 series in total.

The 3 series whose names end in _category or _categories contain data stored as category data in Pandas. The 3 series whose names end in _object or _objects contain data stored as plain Python objects.

n = 10000

series = {
    'one_category': random_words_series(
        n, 1,
    ),
    'several_categories': random_words_series(
        n, 10,
    ),
    'many_categories': random_words_series(
        n, 100,
    ),
    'one_object': random_words_series(
        n, 1, False,
    ),
    'several_objects': random_words_series(
        n, 10, False,
    ),
    'many_objects': random_words_series(
        n, 100, False,
    ),
}

After we’ve created the 6 different series using our random_words_series function, we move on to analyzing memory usage. This allows us to find out how big the memory savings are.

3. Memory

For each example series, we analyze the memory usage using the standard Pandas .memory_usage(deep=True) method, which also accounts for the memory consumed by the Python objects inside a Series, and print out the result.

memory_usage = pd.Series(
    {k: v.memory_usage(deep=True) for k, v in series.items()},
    name='memory_usage',
    dtype='uint64',
).sort_values()
memory_usage.to_frame()

Output:

memory_usage
one_category 10225
several_categories 11050
many_categories 21700
many_objects 650080
one_object 650080
several_objects 650080

We can see that category data in Pandas uses considerably less memory than plain Python objects.
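To put a number on the savings, we can divide the object sizes by the categorical sizes (a small sketch reusing the memory_usage Series from above; the ratios follow from the table):

memory_usage['one_object'] / memory_usage['one_category']        # roughly 64x smaller
memory_usage['many_objects'] / memory_usage['many_categories']   # roughly 30x smaller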

Furthermore, we plot the memory size of each Series on a logarithmic scale.

fig, ax = plt.subplots()
ax.set_ylabel('Bytes')
memory_usage.to_frame().plot(
    kind='bar',
    logy=True,
    ax=ax,
);

Output:

Bar chart showing memory consumption for each of the series created just now

Finally, we show how memory usage behaves when the amount of different categories inside a Series grows.

4. Memory usage as function of size

For this analysis, we create 2000 series with up to 2000 different categories. As before, to allow us to understand the memory usage of category data in Pandas, we create the series once using the Pandas category data type and once just using plain Python objects.

First, we calculate the memory usage for the plain Python object series:

size_n = 2000
size = pd.Series(
    {
        i: random_words_series(
            size_n, i, False,
        ).memory_usage(deep=True)
        for i in range(1, size_n + 1)
    },
    name="Memory usage for n categories (object data type)",
).sort_values()

Right after that, we calculate the memory usage for all categorical series:

size_categorical = pd.Series(
    {
        i: random_words_series(
            size_n, i,
        ).memory_usage(deep=True)
        for i in range(1, size_n + 1)
    },
    name="Memory usage for n categories (categorical data type)",
).sort_values()

Now, we can put the sizes next to each other in a Pandas DataFrame. We immediately see that even as the number of categories approaches 2000, the memory usage never comes close to that of the Series using plain Python objects; we verify this below the table.

sizes = pd.DataFrame([size, size_categorical]).T
sizes

Output:

      Memory usage for n categories    Memory usage for n categories
      (object data type)               (categorical data type)
1     130080                           2225
2     130080                           2290
3     130080                           2355
...   ...                              ...
1998  130080                           127460
1999  130080                           127395
2000  130080                           127460
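As a quick sanity check of that claim, we can confirm that the categorical Series stays smaller than the object Series for every tested number of categories (a small sketch reusing the sizes DataFrame from above):

# Column 0 holds the object sizes, column 1 the categorical sizes.
(sizes.iloc[:, 1] < sizes.iloc[:, 0]).all()   # expected: True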

We can plot these values against each other. Since the Python object Series always contains the same amount of data regardless of the number of categories, its memory usage is constant as well. For the category data type Series we see an increase in memory usage, but it never reaches the constant line of the Python object Series in the plot.

fig, ax = plt.subplots(1)
ax.set_xlabel('Random Words')
ax.set_ylabel('Bytes')
sizes.plot(
    ax=ax
);

Output:

Chart showing memory usage for categorical and regular Series as the number of categories grows

At this point we should ask ourselves whether string data should always be stored as categorical data in Pandas. One might assume that the added indirection provides no advantage when the values are not truly categorical, yet the memory usage analysis here implies otherwise. We would need to inspect further whether the method used to determine memory usage is sound, and whether Pandas .memory_usage(deep=True) correctly takes into account all nested data. On the other hand, with a low number of categories the memory savings are significant, and there it always makes sense to switch the data type to categorical.
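One way to approach that question is a rough cross-check: a categorical Series should need roughly one integer code per row plus one copy of each distinct string. The following sketch compares such an estimate with what Pandas reports; it reuses random_words_series from above, and the estimate is only approximate since sys.getsizeof ignores some container overhead:

import sys

s_obj = random_words_series(10000, 100, categorical=False)
s_cat = s_obj.astype('category')

# Rough estimate: one integer code per row plus the unique category strings.
estimate = (
    s_cat.cat.codes.memory_usage(index=False, deep=True)
    + sum(sys.getsizeof(c) for c in s_cat.cat.categories)
)
print(estimate, s_cat.memory_usage(index=False, deep=True))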

If you are dealing with categorical data, using the Pandas category data type can reduce memory usage considerably.

I would be thrilled to hear from you! Please share your thoughts and ideas with me via email.
