Series and DataFrame objects, respectively.
Often it is useful to go beyond this and store higher-dimensional data, that is, data indexed by more than one or two keys.
While older versions of Pandas provided Panel and Panel4D objects that natively handled three-dimensional and four-dimensional data (see Aside: Panel Data), both have since been removed from the library, and a far more common pattern in practice is to make use of hierarchical indexing (also known as multi-indexing) to incorporate multiple index levels within a single index.
In this way, higher-dimensional data can be compactly represented within the familiar one-dimensional Series and two-dimensional DataFrame objects.
This section covers MultiIndex objects, considerations when indexing, slicing, and computing statistics across multiply indexed data, and useful routines for converting between simple and hierarchically indexed representations of your data.

import pandas as pd
import numpy as np
Series.
For concreteness, we will consider a series of data where each point has a character and numerical key.

index = [('California', 2000), ('California', 2010),
         ('New York', 2000), ('New York', 2010),
         ('Texas', 2000), ('Texas', 2010)]
populations = [33871648, 37253956,
               18976457, 19378102,
               20851820, 25145561]
pop = pd.Series(populations, index=index)
pop
pop[('California', 2010):('Texas', 2000)]
pop[[i for i in pop.index if i[1] == 2010]]
MultiIndex type gives us the type of operations we wish to have.
We can create a multi-index from the tuples as follows:

index = pd.MultiIndex.from_tuples(index)
index
The MultiIndex contains multiple levels of indexing, in this case the state names and the years, as well as multiple labels for each data point which encode these levels.
If we reindex our series with this MultiIndex, we see the hierarchical representation of the data:

pop = pop.reindex(index)
pop
The first two columns of the Series representation show the multiple index values, while the third column shows the data.
Notice that some entries are missing in the first column: in this multi-index representation, any blank entry indicates the same value as the line above it.

pop[:, 2010]
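To make this kind of selection concrete, here is a minimal self-contained sketch. It rebuilds the population Series from the figures above and picks out the 2010 values in two ways. The use of `.loc` and `xs()` is an editorial choice for illustration, not something the text prescribes.

```python
import pandas as pd

# Population figures copied from the example above.
index = pd.MultiIndex.from_tuples(
    [('California', 2000), ('California', 2010),
     ('New York', 2000), ('New York', 2010),
     ('Texas', 2000), ('Texas', 2010)])
pop = pd.Series([33871648, 37253956,
                 18976457, 19378102,
                 20851820, 25145561], index=index)

# Take every state (first level) for the year 2010 (second level).
# The year level is dropped, leaving a Series indexed by state.
pop_2010 = pop.loc[:, 2010]

# xs() ("cross-section") expresses the same selection by level number.
pop_2010_xs = pop.xs(2010, level=1)

print(pop_2010)
```

Both forms return a singly indexed Series keyed by state, which is the operation that was awkward with plain tuple keys.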
We could easily have stored the same data using a simple DataFrame with index and column labels.
In fact, Pandas is built with this equivalence in mind. The unstack() method will quickly convert a multiply indexed Series into a conventionally indexed DataFrame:

pop_df = pop.unstack()
pop_df
The stack() method provides the opposite operation:

pop_df.stack()
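As a quick check of this equivalence, the sketch below (self-contained, reusing the population figures from above) looks up the same value in the multiply indexed Series and in the unstacked DataFrame. It then confirms that stacking brings back the same values in the same order. Variable names are illustrative.

```python
import pandas as pd

index = pd.MultiIndex.from_product(
    [['California', 'New York', 'Texas'], [2000, 2010]])
pop = pd.Series([33871648, 37253956,
                 18976457, 19378102,
                 20851820, 25145561], index=index)

# States become row labels, years become column labels.
pop_df = pop.unstack()

# The same number, reached through two different layouts.
from_series = pop.loc['Texas', 2010]
from_frame = pop_df.loc['Texas', 2010]

# Stacking folds the columns back into an inner index level.
restacked = pop_df.stack()
```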
Just as multi-indexing can represent two-dimensional data within a one-dimensional Series, we can also use it to represent data of three or more dimensions in a Series or DataFrame.
Each extra level in a multi-index represents an extra dimension of data; taking advantage of this property gives us much more flexibility in the types of data we can represent. Concretely, we might want to add another column of demographic data for each state at each year (say, population under 18); with a MultiIndex this is as easy as adding another column to the DataFrame:

pop_df = pd.DataFrame({'total': pop,
                       'under18': [9267089, 9284094,
                                   4687374, 4318033,
                                   5906301, 6879014]})
pop_df
f_u18 = pop_df['under18'] / pop_df['total']
f_u18.unstack()
A straightforward way to construct a multiply indexed Series or DataFrame is to simply pass a list of two or more index arrays to the constructor. For example:

df = pd.DataFrame(np.random.rand(4, 2),
                  index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
                  columns=['data1', 'data2'])
df
The work of creating the MultiIndex is done in the background.
Similarly, if you pass a dictionary with tuples as keys, Pandas will use a MultiIndex by default:

data = {('California', 2000): 33871648,
        ('California', 2010): 37253956,
        ('Texas', 2000): 20851820,
        ('Texas', 2010): 25145561,
        ('New York', 2000): 18976457,
        ('New York', 2010): 19378102}
pd.Series(data)
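It is sometimes useful to explicitly create a MultiIndex; we'll see a couple of these methods here.
These are the class-method constructors available on pd.MultiIndex.
For example, as we did before, you can construct the MultiIndex from a simple list of arrays giving the index values within each level:

pd.MultiIndex.from_arrays([['a', 'a', 'b', 'b'], [1, 2, 1, 2]])

# the same index, built from a list of tuples giving the index values of each point
pd.MultiIndex.from_tuples([('a', 1), ('a', 2), ('b', 1), ('b', 2)])

# the same index again, built from a Cartesian product of single indices
pd.MultiIndex.from_product([['a', 'b'], [1, 2]])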
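The three class-method constructors above describe the same index in different ways. A small sketch to confirm that they agree (variable names are illustrative):

```python
import pandas as pd

# One array of labels per level.
from_arrays = pd.MultiIndex.from_arrays([['a', 'a', 'b', 'b'],
                                         [1, 2, 1, 2]])

# One tuple of labels per data point.
from_tuples = pd.MultiIndex.from_tuples([('a', 1), ('a', 2),
                                         ('b', 1), ('b', 2)])

# Every combination of the per-level values (Cartesian product).
from_product = pd.MultiIndex.from_product([['a', 'b'], [1, 2]])

all_equal = (from_arrays.equals(from_tuples)
             and from_tuples.equals(from_product))
print(all_equal)
```

Which constructor is most convenient depends on how the labels are already stored: `from_product` is the most compact when every combination is present, while `from_tuples` handles irregular combinations.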
You can also construct a MultiIndex directly using its internal encoding by passing levels (a list of lists containing available index values for each level) and codes (a list of lists of integer positions that reference these values; this argument was called labels before Pandas 0.24):

pd.MultiIndex(levels=[['a', 'b'], [1, 2]],
              codes=[[0, 0, 1, 1], [0, 1, 0, 1]])
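To see how this internal encoding maps back to index entries, the following sketch decodes the integer positions by hand. It assumes Pandas 0.24 or later, where the argument and attribute are named `codes` (earlier releases called them `labels`). The decoding loop is purely illustrative and is not how Pandas itself materializes the index.

```python
import pandas as pd

mi = pd.MultiIndex(levels=[['a', 'b'], [1, 2]],
                   codes=[[0, 0, 1, 1], [0, 1, 0, 1]])

# mi.codes holds one integer array per level; zipping them yields,
# for each data point, its position within every level.
decoded = [
    tuple(level[code] for level, code in zip(mi.levels, row))
    for row in zip(*mi.codes)
]

print(decoded)
```

Each decoded tuple matches the corresponding entry of the index, so `levels` acts as a lookup table and `codes` as pointers into it.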
Any of these objects can be passed as the index argument when creating a Series or DataFrame, or be passed to the reindex method of an existing Series or DataFrame.
It is sometimes convenient to name the levels of the MultiIndex.
This can be accomplished by passing the names argument to any of the above MultiIndex constructors, or by setting the names attribute of the index after the fact:

pop.index.names = ['state', 'year']
pop
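Named levels can then be referred to by name rather than by position. A minimal sketch follows, passing `names` at construction time and using `get_level_values()` and `xs()`, both chosen here for illustration. The data repeats the population figures from above.

```python
import pandas as pd

index = pd.MultiIndex.from_product(
    [['California', 'New York', 'Texas'], [2000, 2010]],
    names=['state', 'year'])
pop = pd.Series([33871648, 37253956,
                 18976457, 19378102,
                 20851820, 25145561], index=index)

# Pull out the labels of a single level by its name.
years = pop.index.get_level_values('year')

# Select a cross-section by level name instead of level number.
only_2010 = pop.xs(2010, level='year')

print(pop.index.names)
```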
In a DataFrame, the rows and columns are completely symmetric, and just as the rows can have multiple levels of indices, the columns can have multiple levels as well.
Consider the following, which is a mock-up of some (somewhat realistic) medical data:

# hierarchical indices and columns
index = pd.MultiIndex.from_product([[2013, 2014], [1, 2]],
                                   names=['year', 'visit'])
columns = pd.MultiIndex.from_product([['Bob', 'Guido', 'Sue'], ['HR', 'Temp']],
                                     names=['subject', 'type'])

# mock some data
data = np.round(np.random.randn(4, 6), 1)
data[:, ::2] *= 10
data += 37

# create the DataFrame
health_data = pd.DataFrame(data, index=index, columns=columns)
health_data
Indexing the top-level column by a person's name returns a DataFrame containing just that person's information:

health_data['Guido']
Indexing and slicing on a MultiIndex is designed to be intuitive, and it helps if you think about the indices as added dimensions.
We'll first look at indexing multiply indexed Series, and then multiply indexed DataFrames.
Consider the multiply indexed Series of state populations we saw earlier:

pop
pop['California', 2000]
The MultiIndex also supports partial indexing, or indexing just one of the levels in the index.
The result is another Series, with the lower-level indices maintained:

pop['California']
Partial slicing is available as well, as long as the MultiIndex is sorted (see discussion in Sorted and Unsorted Indices):

pop.loc['California':'New York']
pop[:, 2000]
pop[pop > 22000000]
pop[['California', 'Texas']]
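The selection patterns above can be collected into one runnable sketch. It rebuilds the population Series so that it is self-contained. It uses `.loc` as the explicit label-based accessor, which is an editorial choice; the bracket forms shown above express the same selections.

```python
import pandas as pd

index = pd.MultiIndex.from_product(
    [['California', 'New York', 'Texas'], [2000, 2010]],
    names=['state', 'year'])
pop = pd.Series([33871648, 37253956,
                 18976457, 19378102,
                 20851820, 25145561], index=index)

single = pop.loc['California', 2000]            # one label per level: a scalar
partial = pop.loc['California']                 # outer level only: Series by year
by_year = pop.loc[:, 2000]                      # inner level only: Series by state
large = pop[pop > 22000000]                     # Boolean mask keeps both levels
two_states = pop.loc[['California', 'Texas']]   # list of outer labels

print(partial)
```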
A multiply indexed DataFrame behaves in a similar manner.
Consider our toy medical DataFrame from before:

health_data

Columns are primary in a DataFrame, and the syntax used for multiply indexed Series applies to the columns.
For example, we can recover Guido's heart rate data with a simple operation:

health_data['Guido', 'HR']
As in the single-index case, we can use the loc and iloc indexers introduced in Data Indexing and Selection (the older ix indexer has been removed from Pandas). For example:

health_data.iloc[:2, :2]
Each individual index in loc or iloc can be passed a tuple of multiple indices. For example:

health_data.loc[:, ('Bob', 'HR')]

health_data.loc[(:, 1), (:, 'HR')]  # SyntaxError: slice notation is not allowed inside a tuple
One workaround is to build the desired slice explicitly with Python's built-in slice() function, but a better way in this context is to use an IndexSlice object, which Pandas provides for precisely this situation.
For example:

idx = pd.IndexSlice
health_data.loc[idx[:, 1], idx[:, 'HR']]
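To show that `IndexSlice` is a convenience for writing explicit `slice()` objects, the sketch below runs both spellings and compares the results. The DataFrame is a deterministic stand-in for `health_data` (filled with consecutive integers instead of random values), so the numbers are illustrative only.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product([[2013, 2014], [1, 2]],
                                   names=['year', 'visit'])
columns = pd.MultiIndex.from_product([['Bob', 'Guido', 'Sue'], ['HR', 'Temp']],
                                     names=['subject', 'type'])
# Stand-in data: the integers 0..23 laid out row by row.
health = pd.DataFrame(np.arange(24).reshape(4, 6),
                      index=index, columns=columns)

# First visit of every year, heart rate of every subject.
idx = pd.IndexSlice
with_index_slice = health.loc[idx[:, 1], idx[:, 'HR']]

# The same selection written with explicit slice objects.
with_slice = health.loc[(slice(None), 1), (slice(None), 'HR')]

print(with_index_slice)
```

Here `slice(None)` plays the role of a bare `:`, which is why the `IndexSlice` form reads more naturally.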
There are many ways to interact with data in multiply indexed Series and DataFrames, and as with many tools in this book the best way to become familiar with them is to try them out!
We saw a brief example of rearranging multi-indices in the stack() and unstack() methods, but there are many more ways to finely control the rearrangement of data between hierarchical indices and columns, and we'll explore them here.
Many of the MultiIndex slicing operations will fail if the index is not sorted.
Let's take a look at this here.

index = pd.MultiIndex.from_product([['a', 'c', 'b'], [1, 2]])
data = pd.Series(np.random.rand(6), index=index)
data.index.names = ['char', 'int']
data
try:
    data['a':'b']
except KeyError as e:
    print(type(e))
    print(e)
Partial slices and other similar operations require the levels in the MultiIndex to be in sorted (i.e., lexicographical) order.
Pandas provides convenience routines to perform this type of sorting, such as the sort_index() method of Series and DataFrame (the older sortlevel() method has been removed from Pandas).
We'll use sort_index() here:

data = data.sort_index()
data
data['a':'b']
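The whole sequence (unsorted index, failed slice, sort, successful slice) can be reproduced with deterministic values. This sketch uses `range(6)` in place of random data and checks sortedness through the index's `is_monotonic_increasing` property. That property is an illustrative choice and is not mentioned in the text above.

```python
import pandas as pd

index = pd.MultiIndex.from_product([['a', 'c', 'b'], [1, 2]],
                                   names=['char', 'int'])
data = pd.Series(range(6), index=index)

# The outer labels run a, c, b, so the index is not in sorted order.
was_sorted = data.index.is_monotonic_increasing

# Partial slicing on the unsorted index is expected to raise a KeyError
# (or a subclass of it), as in the example above.
try:
    data.loc['a':'b']
    slicing_failed = False
except KeyError:
    slicing_failed = True

# After sorting, the same partial slice succeeds.
sorted_data = data.sort_index()
sliced = sorted_data.loc['a':'b']

print(sliced)
```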
pop.unstack(level=0)
pop.unstack(level=1)
The opposite of unstack() is stack(), which here can be used to recover the original series:

pop.unstack().stack()
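The `level` argument of `unstack()` accepts a level name as well as a position, provided the index levels have been named as earlier in this section. The sketch below is self-contained, with illustrative variable names. It shows that unstacking the two different levels gives layouts that are transposes of one another.

```python
import pandas as pd

index = pd.MultiIndex.from_product(
    [['California', 'New York', 'Texas'], [2000, 2010]],
    names=['state', 'year'])
pop = pd.Series([33871648, 37253956,
                 18976457, 19378102,
                 20851820, 25145561], index=index)

# Move the 'state' level into the columns: rows are years.
by_state = pop.unstack(level='state')

# Move the 'year' level into the columns: rows are states.
by_year = pop.unstack(level='year')

# The two layouts hold the same numbers, transposed.
same_values = (by_state.values == by_year.T.values).all()

print(by_state)
```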
Index labels can be turned into columns with the reset_index method.
Calling this on the population Series will result in a DataFrame with a state and year column holding the information that was formerly in the index.
For clarity, we can optionally specify the name of the data for the column representation:

pop_flat = pop.reset_index(name='population')
pop_flat
Conversely, it is often useful to build a MultiIndex from the column values.
This can be done with the set_index method of the DataFrame, which returns a multiply indexed DataFrame:

pop_flat.set_index(['state', 'year'])
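The two methods undo one another, which the following self-contained sketch verifies by flattening the population Series and rebuilding it. The names `flat` and `rebuilt` are illustrative.

```python
import pandas as pd

index = pd.MultiIndex.from_product(
    [['California', 'New York', 'Texas'], [2000, 2010]],
    names=['state', 'year'])
pop = pd.Series([33871648, 37253956,
                 18976457, 19378102,
                 20851820, 25145561], index=index)

# Index levels become ordinary columns; the values get a column name.
flat = pop.reset_index(name='population')

# Promote the two columns back to a MultiIndex and take the data column.
rebuilt = flat.set_index(['state', 'year'])['population']

print(flat.columns.tolist())
```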
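Pandas has built-in data aggregation methods, such as mean(), sum(), and max().
For hierarchically indexed data, older versions of Pandas let these methods take a level parameter that controls which subset of the data the aggregate is computed on; that parameter was removed in Pandas 2.0, and the same result is now obtained by grouping on the level first with groupby(level=...).

health_data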
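In current Pandas releases, `mean()` and similar methods no longer accept a `level` argument, so the sketch below computes per-level aggregates with `groupby(level=...)` instead. The DataFrame is a deterministic stand-in for `health_data` (consecutive integers rather than random values), so the numbers are illustrative only. Transposing in order to group on a column level is one possible approach, not the only one.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product([[2013, 2014], [1, 2]],
                                   names=['year', 'visit'])
columns = pd.MultiIndex.from_product([['Bob', 'Guido', 'Sue'], ['HR', 'Temp']],
                                     names=['subject', 'type'])
health = pd.DataFrame(np.arange(24).reshape(4, 6),
                      index=index, columns=columns)

# Average the two visits within each year (row level 'year').
yearly_mean = health.groupby(level='year').mean()

# Average across subjects for each measurement type (column level 'type').
# Transposing first lets the column labels be grouped like row labels.
type_mean = yearly_mean.T.groupby(level='type').mean().T

print(type_mean)
```

The first step collapses the `visit` level and the second collapses the `subject` level, leaving one row per year and one column per measurement type.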