sum()
, mean()
, median()
, min()
, and max()
, in which a single number gives insight into the nature of a potentially large dataset.
In this section, we'll explore aggregations in Pandas, from simple operations akin to what we've seen on NumPy arrays, to more sophisticated operations based on the concept of a groupby
.display
magic function that we've seen in previous sections:import numpy as np
import pandas as pd
class display(object):
"""Display HTML representation of multiple objects"""
template = """<div style="float: left; padding: 10px;">
<p style='font-family:"Courier New", Courier, monospace'>{0}</p>{1}
</div>"""
def __init__(self, *args):
self.args = args
def _repr_html_(self):
return '\n'.join(self.template.format(a, eval(a)._repr_html_())
for a in self.args)
def __repr__(self):
return '\n\n'.join(a + '\n' + repr(eval(a))
for a in self.args)
import seaborn as sns
planets = sns.load_dataset('planets')
planets.shape
planets.head()
Series
the aggregates return a single value:rng = np.random.RandomState(42)
ser = pd.Series(rng.rand(5))
ser
ser.sum()
ser.mean()
DataFrame
, by default the aggregates return results within each column:df = pd.DataFrame({'A': rng.rand(5),
'B': rng.rand(5)})
df
df.mean()
axis
argument, you can instead aggregate within each row:df.mean(axis='columns')
Series
and DataFrame
s include all of the common aggregates mentioned in Aggregations: Min, Max, and Everything In Between; in addition, there is a convenience method describe()
that computes several common aggregates for each column and returns the result.
Let's use this on the Planets data, for now dropping rows with missing values:planets.dropna().describe()
year
column that although exoplanets were discovered as far back as 1989, half of all known expolanets were not discovered until 2010 or after.
This is largely thanks to the Kepler mission, which is a space-based telescope specifically designed for finding eclipsing planets around other stars.Aggregation | Description |
---|---|
count() | Total number of items |
first() , last() | First and last item |
mean() , median() | Mean and median |
min() , max() | Minimum and maximum |
std() , var() | Standard deviation and variance |
mad() | Mean absolute deviation |
prod() | Product of all items |
sum() | Sum of all items |
DataFrame
and Series
objects.groupby
operation, which allows you to quickly and efficiently compute aggregates on subsets of data.groupby
operation.
The name "group by" comes from a command in the SQL database language, but it is perhaps more illuminative to think of it in the terms first coined by Hadley Wickham of Rstats fame: split, apply, combine.groupby
accomplishes:DataFrame
depending on the value of the specified key.GroupBy
can (often) do this in a single pass over the data, updating the sum, mean, count, min, or other aggregate for each group along the way.
The power of the GroupBy
is that it abstracts away these steps: the user need not think about how the computation is done under the hood, but rather thinks about the operation as a whole.DataFrame
:df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
'data': range(6)}, columns=['key', 'data'])
df
groupby()
method of DataFrame
s, passing the name of the desired key column:df.groupby('key')
DataFrame
s, but a DataFrameGroupBy
object.
This object is where the magic is: you can think of it as a special view of the DataFrame
, which is poised to dig into the groups but does no actual computation until the aggregation is applied.
This "lazy evaluation" approach means that common aggregates can be implemented very efficiently in a way that is almost transparent to the user.DataFrameGroupBy
object, which will perform the appropriate apply/combine steps to produce the desired result:df.groupby('key').sum()
sum()
method is just one possibility here; you can apply virtually any common Pandas or NumPy aggregation function, as well as virtually any valid DataFrame
operation, as we will see in the following discussion.GroupBy
object is a very flexible abstraction.
In many ways, you can simply treat it as if it's a collection of DataFrame
s, and it does the difficult things under the hood. Let's see some examples using the Planets data.GroupBy
are aggregate, filter, transform, and apply.
We'll discuss each of these more fully in "Aggregate, Filter, Transform, Apply", but before that let's introduce some of the other functionality that can be used with the basic GroupBy
operation.GroupBy
object supports column indexing in the same way as the DataFrame
, and returns a modified GroupBy
object.
For example:planets.groupby('method')
planets.groupby('method')['orbital_period']
Series
group from the original DataFrame
group by reference to its column name.
As with the GroupBy
object, no computation is done until we call some aggregate on the object:planets.groupby('method')['orbital_period'].median()
GroupBy
object supports direct iteration over the groups, returning each group as a Series
or DataFrame
:for (method, group) in planets.groupby('method'):
print("{0:30s} shape={1}".format(method, group.shape))
apply
functionality, which we will discuss momentarily.GroupBy
object will be passed through and called on the groups, whether they are DataFrame
or Series
objects.
For example, you can use the describe()
method of DataFrame
s to perform a set of aggregations that describe each group in the data:planets.groupby('method')['year'].describe().unstack()
GroupBy
and returned.
Again, any valid DataFrame
/Series
method can be used on the corresponding GroupBy
object, which allows for some very flexible and powerful operations!GroupBy
objects have aggregate()
, filter()
, transform()
, and apply()
methods that efficiently implement a variety of useful operations before combining the grouped data.DataFrame
:rng = np.random.RandomState(0)
df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
'data1': range(6),
'data2': rng.randint(0, 10, 6)},
columns = ['key', 'data1', 'data2'])
df
GroupBy
aggregations with sum()
, median()
, and the like, but the aggregate()
method allows for even more flexibility.
It can take a string, a function, or a list thereof, and compute all the aggregates at once.
Here is a quick example combining all these:df.groupby('key').aggregate(['min', np.median, max])
df.groupby('key').aggregate({'data1': 'min',
'data2': 'max'})
def filter_func(x):
return x['data2'].std() > 4
display('df', "df.groupby('key').std()", "df.groupby('key').filter(filter_func)")
df.groupby('key').transform(lambda x: x - x.mean())
apply()
method lets you apply an arbitrary function to the group results.
The function should take a DataFrame
, and return either a Pandas object (e.g., DataFrame
, Series
) or a scalar; the combine operation will be tailored to the type of output returned.apply()
that normalizes the first column by the sum of the second:def norm_by_data2(x):
# x is a DataFrame of group values
x['data1'] /= x['data2'].sum()
return x
display('df', "df.groupby('key').apply(norm_by_data2)")
apply()
within a GroupBy
is quite flexible: the only criterion is that the function takes a DataFrame
and returns a Pandas object or scalar; what you do in the middle is up to you!DataFrame
on a single column name.
This is just one of many options by which the groups can be defined, and we'll go through some other options for group specification here.DataFrame
. For example:L = [0, 1, 0, 1, 2, 0]
display('df', 'df.groupby(L).sum()')
df.groupby('key')
from before:display('df', "df.groupby(df['key']).sum()")
df2 = df.set_index('key')
mapping = {'A': 'vowel', 'B': 'consonant', 'C': 'consonant'}
display('df2', 'df2.groupby(mapping).sum()')
display('df2', 'df2.groupby(str.lower).mean()')
df2.groupby([str.lower, mapping]).mean()
decade = 10 * (planets['year'] // 10)
decade = decade.astype(str) + 's'
decade.name = 'decade'
planets.groupby(['method', decade])['number'].sum().unstack().fillna(0)