1. Introduction
The other day I caught myself writing one too many df.groupby() calls in Pandas and thought to myself: isn’t there a way I can simplify and reduce the number of df.groupby() calls? What I group by hardly changes, and only the operation that follows the group by varies from IPython cell to IPython cell. Aggregates come to the rescue.
Pandas DataFrames allow you to perform many useful calculations, and among the most useful are aggregations. Using .aggregate(), a user can perform many calculations on a group by object at once. This is handy in many situations and is much faster than calculating all required values in separate steps. Finding out how flexible .aggregate() is while writing this article was a pleasant surprise.
2. Setup
First, we import pandas to create the DataFrames used in the following examples.
import pandas as pd
We want to define a DataFrame with a variety of illustrative data types.
The example that we come up with here is a list of fruits that our fictitious friends Franz, Hans and Gerhard have eaten in the last week. Furthermore, we note down whether our friends actually liked the fruit or not.
The columns are:
- index: name of the person,
- Fruit: fruit consumed,
- Satisfaction: satisfaction with the consumed fruit, and
- Weight: weight of the consumed fruit in grams.
df = pd.DataFrame(
    [
        ['Apple', 'full', 100],
        ['Orange', 'none', 200],
        ['Pear', 'full', 300],
        ['Pear', 'partial', 100],
        ['Banana', 'full', 400],
        ['Banana', 'full', 300],
    ],
    columns=[
        'Fruit',
        'Satisfaction',
        'Weight',
    ],
    index=[
        'Franz',
        'Gerhard',
        'Gerhard',
        'Hans',
        'Hans',
        'Hans',
    ],
)
df.Satisfaction = df.Satisfaction.astype('category')
df
Output:
|         | Fruit  | Satisfaction | Weight |
|---------|--------|--------------|--------|
| Franz   | Apple  | full         | 100    |
| Gerhard | Orange | none         | 200    |
| Gerhard | Pear   | full         | 300    |
| Hans    | Pear   | partial      | 100    |
| Hans    | Banana | full         | 400    |
| Hans    | Banana | full         | 300    |
Are you as excited as I am to learn the first few magic incantations of .aggregate()? Let’s move on and start working on the DataFrame.
3. .aggregate() Heaven
3.1. count
Using the count aggregation, we can count the number of rows in a group by expression. This isn’t particularly exciting, but it makes the following steps clearer.
First, we turn to our Pandas DataFrame and try to count the number of fruits that each person has consumed.
For this, we group the DataFrame by its index, as the index contains the person names. To group by the DataFrame’s index, we can use .groupby(level=0). The level=0 argument tells Pandas that we want to group by the DataFrame’s index. We could use .groupby(df.index) instead, but this way we leave it implicit and save ourselves a bit of typing. See the docs for more information on how you can invoke groupby().
df.groupby(level=0)
Output:
<pandas.core.groupby.DataFrameGroupBy object at 0x10b88ada0>
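Both spellings produce the same grouping; as a quick sanity check, here is a minimal sketch using the df defined above:
# Grouping by level=0 is equivalent to grouping by df.index here.
assert df.groupby(level=0).count().equals(df.groupby(df.index).count())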
Calling .groupby() by itself doesn’t do much. We need to perform an aggregation on it to get a meaningful result. Let’s do the actual calculation now and see what we get. We call .aggregate('count') on the DataFrameGroupBy object.
df.groupby(level=0).aggregate('count')
Output:
|         | Fruit | Satisfaction | Weight |
|---------|-------|--------------|--------|
| Franz   | 1     | 1            | 1      |
| Gerhard | 2     | 2            | 2      |
| Hans    | 3     | 3            | 3      |
An informed reader might point out that you can invoke count() directly and that it gives the same result:
df.groupby(level=0).count()
Output:
|         | Fruit | Satisfaction | Weight |
|---------|-------|--------------|--------|
| Franz   | 1     | 1            | 1      |
| Gerhard | 2     | 2            | 2      |
| Hans    | 3     | 3            | 3      |
That’s true: just using .count() on a DataFrameGroupBy object is much easier to understand. However, aggregate() allows you to do one nifty thing that you can’t achieve otherwise with such ease: it lets you perform many calculations at once and formats the results using nested columns.
Pandas gives us a lot of freedom in how exactly we want the aggregates to look.
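As a tiny preview, a minimal sketch: passing a list of aggregation names runs all of them at once and produces nested result columns, one sub-column per aggregate.
# Two aggregates at once; the result carries a two-level column index.
df.groupby(level=0).aggregate(['count', 'nunique'])
Let’s get to a slightly more complicated example to illustrate the true flexibility right away.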
3.2. Aggregator functions
If we would like to find out what the most frequent value in a column is, we need to use a custom aggregator. Pandas doesn’t include a method for this out of the box, so we can either define a function or a lambda to give us the desired result.
In our case we would like to define a lambda to run this calculation:
most_frequent = lambda s: s.value_counts().idxmax()
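Before using it in an aggregate, we can sanity-check the lambda on a single column; a minimal sketch, and Satisfaction has an unambiguous winner:
# 'full' occurs four times, so it is the most frequent value.
most_frequent(df.Satisfaction)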
We can also call most_frequent on the whole DataFrame and check the result. We use .apply() in this case to apply the function to every column of the DataFrame.
df.apply(most_frequent).to_frame()
Output:
|              | 0      |
|--------------|--------|
| Fruit        | Banana |
| Satisfaction | full   |
| Weight       | 100    |
Let’s put everything together and calculate the aggregate.
df.groupby(level=0).aggregate(lambda s: s.value_counts().idxmax())
Output:
|         | Fruit  | Satisfaction | Weight |
|---------|--------|--------------|--------|
| Franz   | Apple  | full         | 100    |
| Gerhard | Pear   | none         | 300    |
| Hans    | Banana | full         | 400    |
3.3. Combining aggregates
Let us find out:
- what the first and last fruit each person has eaten is,
- how many fruits each person has consumed in total, and
- what the most frequently given satisfaction rating for each person is.
To do this, we first have to define an .aggregate() dictionary. It contains instructions on which calculations to perform on which column. It even allows defining custom aggregators using Python functions or lambdas. The .aggregate() dictionary contains two entries corresponding to the two columns in the DataFrame. For the Fruit column, we add three desired aggregates in the form of a list:
['first', 'last', 'count']
And for the Satisfaction column, we add a named aggregator function by specifying a list containing one tuple with the most_frequent lambda that we have defined before:
[('most_frequent', most_frequent)]
Attaching a name to the aggregator is useful in our case, since it tells Pandas what to name the result column after calculating the aggregates. Otherwise, using a lambda makes Pandas call the resulting column lambda as well.
We define the full dictionary as follows:
aggregate = {
    'Fruit': [
        'first',
        'last',
        'count',
    ],
    'Satisfaction': [
        ('most_frequent', most_frequent),
    ],
}
Let’s run the actual calculation then and see what the result looks like.
df.groupby(level=0).aggregate(aggregate)
Output:
|         | Fruit: first | Fruit: last | Fruit: count | Satisfaction: most_frequent |
|---------|--------------|-------------|--------------|-----------------------------|
| Franz   | Apple        | Apple       | 1            | full                        |
| Gerhard | Orange       | Pear        | 2            | none                        |
| Hans    | Pear         | Banana      | 3            | full                        |
We can see that Pandas formats and calculates the DataFrame’s columns exactly as we’ve defined in the dictionary used for our .aggregate() call.
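As a side note, newer Pandas versions (0.25 and later) also support named aggregation through keyword arguments. The following is a minimal sketch assuming such a version; the keyword names, chosen here purely for illustration, become flat result columns:
# Named aggregation (pandas >= 0.25): each keyword maps a result
# column name to a (source column, aggregator) tuple.
df.groupby(level=0).aggregate(
    first_fruit=('Fruit', 'first'),
    last_fruit=('Fruit', 'last'),
    fruit_count=('Fruit', 'count'),
    top_satisfaction=('Satisfaction', most_frequent),
)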
4. Side note: Data types
4.1. Pandas dtypes
Note that when running aggregates, the result data type of an aggregated column can differ from that of the source column. To get back to our count() example, observe the following data types of the source DataFrame:
df.dtypes.to_frame()
Output:
0 | |
---|---|
Fruit | object |
Satisfaction | category |
Weight | int64 |
- The Fruit column contains data of the type object, which is the way Pandas stores Python str (string) data in columns,
- Satisfaction contains category data, as indicated before, and
- Weight contains int64 data.
Now, observe one more time what happens when we retrieve the count aggregate on the same DataFrame.
df.groupby(level=0).aggregate('count')
Output:
|         | Fruit | Satisfaction | Weight |
|---------|-------|--------------|--------|
| Franz   | 1     | 1            | 1      |
| Gerhard | 2     | 2            | 2      |
| Hans    | 3     | 3            | 3      |
We look at the .dtypes attribute of the .aggregate() result.
df.groupby(level=0).aggregate('count').dtypes.to_frame()
Output:
|              | 0     |
|--------------|-------|
| Fruit        | int64 |
| Satisfaction | int64 |
| Weight       | int64 |
This reveals that Pandas stores the count results as int64, unlike the original columns, which had the data types object and category. Pandas returns a different data type after performing aggregates, depending on what the result of the calculation is.
For count data, integers such as int64 (a 64-bit signed integer) are the sensible storage choice. The NumPy ndarray is the internal storage format used for most data in Pandas, except for indices. Find more info on ndarray here.
The reason why Pandas uses ndarray objects is that they are a space and time efficient way of storing fixed-length lists of numbers.
Let’s look at the underlying data type of a Pandas column by accessing the .values attribute.
type(df.Fruit.values)
Output:
numpy.ndarray
Evaluating df.Fruit.values reveals that it has the data type object. You can see this under dtype in the output.
df.Fruit.values
Output:
array(['Apple', 'Orange', 'Pear', 'Pear', 'Banana', 'Banana'], dtype=object)
NumPy and Pandas use object to store str (string) data. Unlike fixed-width integers, such as int64, NumPy can’t store string data efficiently inside a contiguous ndarray.
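For contrast, a fixed-width integer column is backed by one contiguous block of memory; a quick sketch on our df:
# The Weight column is stored as a contiguous int64 ndarray.
df.Weight.values        # array([100, 200, 300, 100, 400, 300])
df.Weight.values.dtype  # dtype('int64')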
The object-typed ndarray instance only contains a collection of pointers to objects, just as a regular Python list is a list of pointers. The space efficiency characteristics of an ndarray instance with its data type set to object are only slightly better than those of a plain list. We show this in the next subsection.
4.2. Memory profiling
You can verify that NumPy ndarrays store strings more efficiently than Python lists using a memory profiler. Luckily, a PyPI package called pympler allows us to measure the memory usage of Python objects. We also import NumPy to create NumPy arrays directly, without requiring Pandas.
from pympler import asizeof
import numpy as np
We create an ndarray containing the string hello 10**6 times.
numpy_hello = np.zeros(10 ** 6, dtype=object)
numpy_hello.fill('hello')
numpy_hello
Output:
array(['hello', 'hello', 'hello', ..., 'hello', 'hello', 'hello'], dtype=object)
We create the same data as a Python list that contains “hello” 10**6 times.
python_hello = ["hello" for _ in range(10 ** 6)]
# Print the first 3 items
python_hello[:3]
Output:
['hello', 'hello', 'hello']
Now for the evaluation: the following snippet prints the size of the NumPy ndarray in bytes using Pympler’s asizeof function:
asizeof.asizeof(numpy_hello)
Output:
8000096
For comparison, we print the size of the Python list.
asizeof.asizeof(python_hello)
Output:
8697520
print("NumPy array size in relation to Python list size: {:2.2%}".format(
asizeof.asizeof(numpy_hello) /
asizeof.asizeof(python_hello)
))
Output:
NumPy array size in relation to Python list size: 91.98%
The NumPy array only takes 91.98% of the space required by the Python list, even though the data that they store is the same.
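The savings are modest here because an object array still stores one pointer per element. With a fixed-width numeric dtype the gap grows considerably; a minimal sketch (the exact numbers will vary by platform):
# int64 ndarray (on most platforms): 8 bytes per element in one block.
numpy_ints = np.arange(10 ** 6)
# Python list: one pointer per element plus a full int object each.
python_ints = list(range(10 ** 6))
print(asizeof.asizeof(numpy_ints))
print(asizeof.asizeof(python_ints))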
About Pympler
Note that Pympler reports accurate sizes by traversing an object and summing up the sizes of all descendant attributes. This is unlike sys.getsizeof, which doesn’t perform a deep traversal of an object and its attributes, especially for user-defined classes.
You can see the difference here:
from sys import getsizeof
getsizeof(python_hello)
Output:
8697464
Here the difference is minimal, but as soon as we nest objects further, the difference between sys.getsizeof and asizeof becomes obvious:
getsizeof([[[]]])
Output:
72
asizeof.asizeof([[[]]])
Output:
208
As a bonus, we compare the previous two objects, numpy_hello and python_hello, to a tuple.
tuple_hello = tuple("hello" for _ in range(10 ** 6))
# Print the first 3 items
tuple_hello[:3]
Output:
('hello', 'hello', 'hello')
tuple_hello is smaller than the list object python_hello, but still bigger than numpy_hello:
asizeof.asizeof(tuple_hello)
Output:
8000104
From this we can conclude that, of the three options, the NumPy ndarray is the most space-efficient way of storing fixed-size array data. This concludes our short excursion into Python and NumPy memory usage.
5. More about aggregates
5.1. nunique
In Pandas, nunique counts the number of unique values in a column. We can apply this to the whole DataFrame and get a count of the unique fruit, satisfaction and weight values:
df.nunique().to_frame()
Output:
|              | 0 |
|--------------|---|
| Fruit        | 4 |
| Satisfaction | 3 |
| Weight       | 4 |
Furthermore, the method can also be applied to a group by object to retrieve the number of unique values per group. Below we see the number of unique fruits, satisfaction ratings and weights recorded for each person.
df.groupby(level=0).nunique()
Output:
|         | Fruit | Satisfaction | Weight |
|---------|-------|--------------|--------|
| Franz   | 1     | 1            | 1      |
| Gerhard | 2     | 2            | 2      |
| Hans    | 2     | 2            | 3      |
Now, nunique is also available in aggregates. The reason to use nunique in an aggregate is to retrieve many results for one groupby() expression at the same time. This can not only save us some typing, but can potentially also save us some computation time, as a group only needs to be created once and each operation can then be applied to it one after another.
df.groupby(level=0).agg('nunique')
Output:
|         | Fruit | Satisfaction | Weight |
|---------|-------|--------------|--------|
| Franz   | 1     | 1            | 1      |
| Gerhard | 2     | 2            | 2      |
| Hans    | 2     | 2            | 3      |
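To actually reap that benefit, we can combine nunique with another aggregate in a single call; a minimal sketch that foreshadows the next subsection:
# One groupby, two aggregate values per column.
df.groupby(level=0).aggregate(['nunique', 'count'])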
5.2. Using .aggregate() Results
Having run this, we now know that Franz has consumed only one kind of fruit, while Hans is the champion of trying out many types of fruit. We can compare the number of unique fruits to the total count of fruits consumed, which allows us to calculate a variety score for each person. We define a variety of 100 % as a fruit consumption pattern in which a new fruit is tried every time.
fruit_counts = df.groupby(level=0).Fruit.aggregate(['nunique', 'count'])
fruit_counts
Output:
|         | nunique | count |
|---------|---------|-------|
| Franz   | 1       | 1     |
| Gerhard | 2       | 2     |
| Hans    | 2       | 3     |
We neatly display the variety scores with a quick .apply call in which we format the resulting floats using a Python format string.
(
    fruit_counts['nunique'] / fruit_counts['count']
).apply(
    lambda v: "{:.2%}".format(v)
).to_frame('Variety')
Output:
Variety | |
---|---|
Franz | 100.00% |
Gerhard | 100.00% |
Hans | 66.67% |
5.3. All built-in aggregates
Here is a list of all the built-in aggregates that I was able to find in Pandas.
These are the aggregates that work with all data types. To save some space we limit ourselves to the Fruit column.
df.groupby(level=0).Fruit.aggregate([
    'count',
    'min',
    'max',
    'first',
    'last',
    'nunique',
]).applymap(
    lambda v: v if isinstance(v, str) else "{:d}".format(v)
)
Output:
|         | count | min    | max   | first  | last   | nunique |
|---------|-------|--------|-------|--------|--------|---------|
| Franz   | 1     | Apple  | Apple | Apple  | Apple  | 1       |
| Gerhard | 2     | Orange | Pear  | Orange | Pear   | 2       |
| Hans    | 3     | Banana | Pear  | Pear   | Banana | 2       |
Here are the aggregates that work with numerical data types. Only the Weight column is numerical, so all results are nested under Weight.
df.groupby(level=0).aggregate([
    'mean',
    'std',
    'var',
    'median',
    'prod',
    'sum',
    'mad',
    'sem',
    'skew',
    'quantile',  # 50 % quantile
]).applymap(
    lambda v: v if isinstance(v, str) else "{:.2f}".format(v)
)
Output:
|         | mean   | std    | var      | median | prod        | sum    | mad    | sem   | skew  | quantile |
|---------|--------|--------|----------|--------|-------------|--------|--------|-------|-------|----------|
| Franz   | 100.00 | nan    | nan      | 100.00 | 100.00      | 100.00 | 0.00   | nan   | nan   | 100.00   |
| Gerhard | 250.00 | 70.71  | 5000.00  | 250.00 | 60000.00    | 500.00 | 50.00  | 50.00 | nan   | 250.00   |
| Hans    | 266.67 | 152.75 | 23333.33 | 300.00 | 12000000.00 | 800.00 | 111.11 | 88.19 | -0.94 | 300.00   |
6. .aggregate() performance
Let’s put to the test the claim that combined .aggregate() calls are faster than separate calculations on a group by, and find out under which circumstances this statement holds. We would first like to find out which is faster: many aggregations in one .aggregate() call, or separate aggregation calls applied to a new group by object each time? We choose to run .aggregate() over min and max on our DataFrame.
%%timeit
df.groupby(df.index).min()
df.groupby(df.index).max()
Output:
26.1 ms ± 819 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%%timeit
df.groupby(df.index).aggregate(['min', 'max'])
Output:
9.02 ms ± 75.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
We see that calling .aggregate() performs much faster when many aggregate values are needed, compared to aggregating twice on a .groupby(). Furthermore, we can observe in the next two cells that there is hardly any difference when only one aggregate value is required. In this case, the shorter code should win the contest, since it simply requires less typing.
%%timeit
df.groupby(df.index).min()
Output:
13.6 ms ± 945 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit
df.groupby(df.index).aggregate('min')
Output:
16.5 ms ± 4.14 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
7. Conclusion
I hope I was able to show a few use cases for Pandas aggregates. It’s certainly nice to be able to save some typing and get better performance when dealing with many aggregate calculations, especially in a group by setting. I certainly could not do without .aggregate(), as it saves me a lot of time when typing out IPython notebooks.