Series and DataFrame objects, respectively.
Often it is useful to go beyond this and store higher-dimensional data, that is, data indexed by more than one or two keys.
While older versions of Pandas provided Panel and Panel4D objects that natively handled three-dimensional and four-dimensional data (see Aside: Panel Data), these objects have since been removed from the library, and a far more common pattern in practice is to make use of hierarchical indexing (also known as multi-indexing) to incorporate multiple index levels within a single index.
In this way, higher-dimensional data can be compactly represented within the familiar one-dimensional Series and two-dimensional DataFrame objects.
In this section, we'll explore the direct creation of MultiIndex objects, considerations when indexing, slicing, and computing statistics across multiply indexed data, and useful routines for converting between simple and hierarchically indexed representations of your data.
import pandas as pd
import numpy as np
Let's start by considering how we might represent two-dimensional data within a one-dimensional Series.
For concreteness, we will consider a series of data where each point has a character and numerical key.
index = [('California', 2000), ('California', 2010),
         ('New York', 2000), ('New York', 2010),
         ('Texas', 2000), ('Texas', 2010)]
populations = [33871648, 37253956,
               18976457, 19378102,
               20851820, 25145561]
pop = pd.Series(populations, index=index)
pop
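# slice between two tuple keys
pop[('California', 2010):('Texas', 2000)]
# select every 2010 entry by filtering the tuples by hand
pop[[i for i in pop.index if i[1] == 2010]]
This produces the desired result, but it is not as clean as the slicing syntax Pandas offers elsewhere. Our tuple-based indexing is essentially a rudimentary multi-index, and the Pandas MultiIndex type gives us the type of operations we wish to have.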
We can create a multi-index from the tuples as follows:
index = pd.MultiIndex.from_tuples(index)
index
Notice that the MultiIndex contains multiple levels of indexing (in this case, the state names and the years) as well as multiple labels for each data point which encode these levels.
If we reindex our series with this MultiIndex, we see the hierarchical representation of the data:
pop = pop.reindex(index)
pop
Here the first two columns of the Series representation show the multiple index values, while the third column shows the data.
Notice that some entries are missing in the first column: in this multi-index representation, any blank entry indicates the same value as the line above it.
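To make the encoding of levels and labels concrete, here is a minimal sketch that rebuilds the same index and inspects it. It assumes a recent Pandas release, where the integer labels are exposed through the codes attribute (older releases called it labels); the variable names are illustrative only.

```python
import pandas as pd

# Same (state, year) tuples as in the text
tuples = [('California', 2000), ('California', 2010),
          ('New York', 2000), ('New York', 2010),
          ('Texas', 2000), ('Texas', 2010)]
index = pd.MultiIndex.from_tuples(tuples)

# Each level stores its distinct values only once...
state_level = list(index.levels[0])
year_level = list(index.levels[1])

# ...and each data point stores an integer position into every level
state_codes = list(index.codes[0])
year_codes = list(index.codes[1])

print(state_level, state_codes)
print(year_level, year_codes)
```

The repeated state names are therefore stored as small integers, which is why the printed representation can leave repeated entries blank.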
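To access all data for which the second index is 2010, we can use the Pandas slicing notation:
pop[:, 2010]
You might notice that we could easily have stored the same data using a simple DataFrame with index and column labels.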
In fact, Pandas is built with this equivalence in mind. The unstack() method will quickly convert a multiply indexed Series into a conventionally indexed DataFrame:
pop_df = pop.unstack()
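pop_df
Naturally, the stack() method provides the opposite operation:
pop_df.stack()
Just as we were able to use multi-indexing to represent two-dimensional data within a one-dimensional Series, we can also use it to represent data of three or more dimensions in a Series or DataFrame.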
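As a small illustration of the three-dimensional case, the sketch below stores a NumPy array of shape 2 x 3 x 4 in a Series with one index level per axis. The level names i, j, and k are hypothetical, chosen only for this example.

```python
import numpy as np
import pandas as pd

# A small three-dimensional array: 2 x 3 x 4
arr = np.arange(24).reshape(2, 3, 4)

# One index level per array axis (the level names are made up for this sketch)
index = pd.MultiIndex.from_product(
    [range(2), range(3), range(4)],
    names=['i', 'j', 'k'])

# Flatten in C order so each value lines up with its (i, j, k) key
cube = pd.Series(arr.ravel(), index=index)

# Looking up a full key recovers the corresponding array element
value = cube.loc[(1, 2, 3)]
print(value, arr[1, 2, 3])

# Fixing the first level leaves a two-level Series, which unstacks to a 3 x 4 table
slab = cube.loc[1].unstack()
print(slab.shape)
```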
Each extra level in a multi-index represents an extra dimension of data; taking advantage of this property gives us much more flexibility in the types of data we can represent. Concretely, we might want to add another column of demographic data for each state at each year (say, population under 18); with a MultiIndex this is as easy as adding another column to the DataFrame:
pop_df = pd.DataFrame({'total': pop,
                       'under18': [9267089, 9284094,
                                   4687374, 4318033,
                                   5906301, 6879014]})
pop_df
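# fraction of the population under 18, by state and year
f_u18 = pop_df['under18'] / pop_df['total']
f_u18.unstack()
The most straightforward way to construct a multiply indexed Series or DataFrame is to simply pass a list of two or more index arrays to the constructor. For example: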
df = pd.DataFrame(np.random.rand(4, 2),
                  index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
                  columns=['data1', 'data2'])
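df
The work of creating the MultiIndex is done in the background.
Similarly, if you pass a dictionary with appropriate tuples as keys, Pandas will automatically recognize this and use a MultiIndex by default: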
data = {('California', 2000): 33871648,
        ('California', 2010): 37253956,
        ('Texas', 2000): 20851820,
        ('Texas', 2010): 25145561,
        ('New York', 2000): 18976457,
        ('New York', 2010): 19378102}
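pd.Series(data)
Nevertheless, it is sometimes useful to explicitly create a MultiIndex; we'll see a couple of these methods here.
For more flexibility in how the index is constructed, you can instead use the class method constructors available in pd.MultiIndex.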
For example, as we did before, you can construct the MultiIndex from a simple list of arrays giving the index values within each level:
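pd.MultiIndex.from_arrays([['a', 'a', 'b', 'b'], [1, 2, 1, 2]])
You can construct it from a list of tuples giving the multiple index values of each point:
pd.MultiIndex.from_tuples([('a', 1), ('a', 2), ('b', 1), ('b', 2)])
You can even construct it from a Cartesian product of single indices:
pd.MultiIndex.from_product([['a', 'b'], [1, 2]])
Similarly, you can construct the MultiIndex directly using its internal encoding by passing levels (a list of lists containing available index values for each level) and codes (a list of lists of integer positions that reference these levels; older Pandas versions called this argument labels):
pd.MultiIndex(levels=[['a', 'b'], [1, 2]],
              codes=[[0, 0, 1, 1], [0, 1, 0, 1]])
Any of these objects can be passed as the index argument when creating a Series or DataFrame, or be passed to the reindex method of an existing Series or DataFrame.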
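As a quick check on the class method constructors shown above, the following sketch builds the same four keys in three ways and confirms that the resulting indices are equal. The variable names and the sample values in the Series are illustrative only.

```python
import pandas as pd

# Three class-method constructors describing the same four keys
from_arrays = pd.MultiIndex.from_arrays([['a', 'a', 'b', 'b'], [1, 2, 1, 2]])
from_tuples = pd.MultiIndex.from_tuples([('a', 1), ('a', 2), ('b', 1), ('b', 2)])
from_product = pd.MultiIndex.from_product([['a', 'b'], [1, 2]])

# equals() compares the keys element by element
same = from_arrays.equals(from_tuples) and from_tuples.equals(from_product)
print(same)

# Any of them can serve directly as the index of a new Series
s = pd.Series([10, 20, 30, 40], index=from_product)
print(s.loc[('b', 1)])
```

Which constructor to use is mostly a matter of the shape your keys already have: parallel arrays, a list of tuples, or the factors of a full grid.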
Sometimes it is convenient to name the levels of the MultiIndex. This can be accomplished by passing the names argument to any of the above MultiIndex constructors, or by setting the names attribute of the index after the fact:
pop.index.names = ['state', 'year']
pop
In a DataFrame, the rows and columns are completely symmetric, and just as the rows can have multiple levels of indices, the columns can have multiple levels as well.
Consider the following, which is a mock-up of some (somewhat realistic) medical data:
# hierarchical indices and columns
index = pd.MultiIndex.from_product([[2013, 2014], [1, 2]],
                                   names=['year', 'visit'])
columns = pd.MultiIndex.from_product([['Bob', 'Guido', 'Sue'], ['HR', 'Temp']],
                                     names=['subject', 'type'])
# mock some data
data = np.round(np.random.randn(4, 6), 1)
data[:, ::2] *= 10
data += 37
# create the DataFrame
health_data = pd.DataFrame(data, index=index, columns=columns)
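health_data
Here we see where the multi-indexing for both rows and columns can come in very handy. With this in place we can, for example, index the top-level column by the person's name and get a full DataFrame containing just that person's information: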
health_data['Guido']
Indexing and slicing on a MultiIndex is designed to be intuitive, and it helps if you think about the indices as added dimensions.
We'll first look at indexing multiply indexed Series, and then multiply indexed DataFrames.
Consider the multiply indexed Series of state populations we saw earlier:
pop
We can access single elements by indexing with multiple terms:
pop['California', 2000]
The MultiIndex also supports partial indexing, or indexing just one of the levels in the index.
The result is another Series, with the lower-level indices maintained:
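pop['California']
Partial slicing is available as well, as long as the MultiIndex is sorted (see discussion in Sorted and Unsorted Indices):
pop.loc['California':'New York']
With sorted indices, partial indexing can be performed on lower levels by passing an empty slice in the first index:
pop[:, 2000]
Other types of indexing and selection work as well; for example, selection based on Boolean masks:
pop[pop > 22000000]
Selection based on fancy indexing also works:
pop[['California', 'Texas']]
A multiply indexed DataFrame behaves in a similar manner.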
Consider our toy medical DataFrame from before:
health_data
Remember that columns are primary in a DataFrame, and the syntax used for multiply indexed Series applies to the columns.
For example, we can recover Guido's heart rate data with a simple operation:
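health_data['Guido', 'HR']
Also, as with the single-index case, we can use the loc and iloc indexers introduced in Data Indexing and Selection (the older ix indexer has been removed from current Pandas versions). For example:
health_data.iloc[:2, :2]
These indexers provide an array-like view of the underlying two-dimensional data, but each individual index in loc or iloc can be passed a tuple of multiple indices. For example: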
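health_data.loc[:, ('Bob', 'HR')]
Working with slices within these index tuples is not especially convenient; trying to create a slice within a tuple will lead to a syntax error:
health_data.loc[(:, 1), (:, 'HR')]  # raises SyntaxError
You could get around this by building the desired slice explicitly using Python's built-in slice() function, but a better way in this context is to use an IndexSlice object, which Pandas provides for precisely this situation.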
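For comparison, the slice()-based workaround might look like the following minimal sketch. It builds a small stand-in frame with the same row and column structure as the mock medical data; the frame name, the helper name every, and the integer values are assumptions made only for this example.

```python
import numpy as np
import pandas as pd

# A small frame shaped like the mock medical data (values are arbitrary)
index = pd.MultiIndex.from_product([[2013, 2014], [1, 2]],
                                   names=['year', 'visit'])
columns = pd.MultiIndex.from_product([['Bob', 'Guido', 'Sue'], ['HR', 'Temp']],
                                     names=['subject', 'type'])
frame = pd.DataFrame(np.arange(24).reshape(4, 6), index=index, columns=columns)

# slice(None) plays the role of a bare ":" inside a tuple
every = slice(None)
first_visit_hr = frame.loc[(every, 1), (every, 'HR')]
print(first_visit_hr)
```

This works, but spelling out slice(None) for every level quickly becomes hard to read.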
With IndexSlice, the selection can be written as:
idx = pd.IndexSlice
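health_data.loc[idx[:, 1], idx[:, 'HR']]
There are many ways to interact with data in multiply indexed Series and DataFrames, and as with many tools in this book the best way to become familiar with them is to try them out!
One of the keys to working with multiply indexed data is knowing how to effectively transform the data. We saw a brief example of this in the stack() and unstack() methods, but there are many more ways to finely control the rearrangement of data between hierarchical indices and columns, and we'll explore them here.
Earlier, we briefly mentioned a caveat, but we should emphasize it more here: many of the MultiIndex slicing operations will fail if the index is not sorted.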
Let's take a look at this here.
index = pd.MultiIndex.from_product([['a', 'c', 'b'], [1, 2]])
data = pd.Series(np.random.rand(6), index=index)
data.index.names = ['char', 'int']
data
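If we try to take a partial slice of this index, it will result in an error:
try:
    data['a':'b']
except KeyError as e:
    print(type(e))
    print(e)
Although it is not entirely clear from the error message, this is the result of the MultiIndex not being sorted. For various reasons, partial slices and other similar operations require the levels in the MultiIndex to be in sorted (i.e., lexicographical) order.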
Pandas provides convenience routines to perform this type of sorting, chiefly the sort_index() method of Series and DataFrame (older releases also offered sortlevel(), which has since been removed).
We'll use the simplest, sort_index(), here:
data = data.sort_index()
data
With the index sorted in this way, partial slicing will work as expected:
data['a':'b']
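One defensive pattern, sketched below under the assumption that sorting the data is acceptable for your use case, is to check whether the index is already in sorted order before taking a partial slice. The sketch uses the is_monotonic_increasing property of the index and replaces the random values with integers so the result is easy to follow.

```python
import numpy as np
import pandas as pd

# Same unsorted index as in the text, with easy-to-follow integer values
index = pd.MultiIndex.from_product([['a', 'c', 'b'], [1, 2]],
                                   names=['char', 'int'])
data = pd.Series(np.arange(6), index=index)

# Check for sorted order first, and sort only when needed
was_sorted = data.index.is_monotonic_increasing
if not was_sorted:
    data = data.sort_index()

# Partial slicing on the first level is now safe
subset = data.loc['a':'b']
print(was_sorted)
print(subset)
```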
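As we saw briefly before, it is possible to convert a dataset from a stacked multi-index to a simple two-dimensional representation, optionally specifying the level to use:
pop.unstack(level=0)
pop.unstack(level=1)
The opposite of unstack() is stack(), which here can be used to recover the original series: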
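pop.unstack().stack()
Another way to rearrange hierarchical data is to turn the index labels into columns; this can be accomplished with the reset_index method.
Calling this on the population Series will result in a DataFrame with a state and year column holding the information that was formerly in the index.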
For clarity, we can optionally specify the name of the data for the column representation:
pop_flat = pop.reset_index(name='population')
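pop_flat
Often when working with data in the real world, the raw input data looks like this and it's useful to build a MultiIndex from the column values.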
This can be done with the set_index method of the DataFrame, which returns a multiply indexed DataFrame:
pop_flat.set_index(['state', 'year'])
We've previously seen that Pandas has built-in data aggregation methods, such as mean(), sum(), and max().
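For hierarchically indexed data, these aggregates can be restricted to a chosen index level, which controls which subset of the data the aggregate is computed on. Older Pandas versions accepted a level parameter on these methods for this purpose; current versions express the same thing by grouping on the level with groupby(level=...) before aggregating.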
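A level-wise aggregate might be sketched as follows. This sketch groups on an index level with groupby(level=...) and then aggregates, rather than passing level to the aggregation method directly; the frame name scores and its values are assumptions chosen so the means are easy to verify by hand.

```python
import pandas as pd

# Two years with two visits each (illustrative values)
index = pd.MultiIndex.from_product([[2013, 2014], [1, 2]],
                                   names=['year', 'visit'])
scores = pd.DataFrame({'HR': [60.0, 70.0, 80.0, 100.0],
                       'Temp': [36.0, 38.0, 37.0, 37.0]},
                      index=index)

# Average over the visits within each year by grouping on the 'year' level
by_year = scores.groupby(level='year').mean()
print(by_year)
```

The result is indexed by year alone, since the visit level has been averaged away.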
health_data