arr[2, 1]), slicing (e.g., arr[:, 1:5]), masking (e.g., arr[arr > 0]), fancy indexing (e.g., arr[0, [1, 5]]), and combinations thereof (e.g., arr[:, [1, 5]]).
Here we'll look at similar means of accessing and modifying values in Pandas Series and DataFrame objects.
If you have used the NumPy patterns, the corresponding patterns in Pandas will feel very familiar, though there are a few quirks to be aware of.
We'll start with the one-dimensional Series object, and then move on to the more complicated two-dimensional DataFrame object.
A Series object acts in many ways like a one-dimensional NumPy array, and in many ways like a standard Python dictionary.
Keeping these two overlapping analogies in mind will help us understand the patterns of data indexing and selection in these arrays.
Like a dictionary, the Series object provides a mapping from a collection of keys to a collection of values:
import pandas as pd
data = pd.Series([0.25, 0.5, 0.75, 1.0],
                 index=['a', 'b', 'c', 'd'])
data
data['b']
'a' in data
data.keys()
list(data.items())
Series objects can even be modified with a dictionary-like syntax.
Just as you can extend a dictionary by assigning to a new key, you can extend a Series by assigning to a new index value:
data['e'] = 1.25
data
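One quirk of the dictionary analogy is worth checking directly: as with a dictionary, the in operator tests the index labels, not the stored values. The following is a minimal sketch that rebuilds the data example from above; the variable names are chosen here purely for illustration.

```python
import pandas as pd

data = pd.Series([0.25, 0.5, 0.75, 1.0], index=['a', 'b', 'c', 'd'])

# Like a dict, membership tests look at the keys (index labels)...
label_found = 'a' in data
# ...so a stored value is not found this way,
value_found_in_labels = 0.25 in data
# while the underlying NumPy array can be searched for values.
value_found_in_values = 0.25 in data.values

# Assigning to a new label extends the Series, as with a dict.
data['e'] = 1.25

print(label_found, value_found_in_labels, value_found_in_values)
print(list(data.index))
```

If the goal is to test for a value rather than a label, searching data.values (or using a comparison such as data == 0.25) is the safer spelling.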
A Series builds on this dictionary-like interface and provides array-style item selection via the same basic mechanisms as NumPy arrays: slices, masking, and fancy indexing.
Examples of these are as follows:
# slicing by explicit index
data['a':'c']
# slicing by implicit integer index
data[0:2]
# masking
data[(data > 0.3) & (data < 0.8)]
# fancy indexing
data[['a', 'e']]
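Of these four mechanisms, slicing is the one most likely to surprise, because the two slice styles treat the stop value differently. As a small sketch (the names by_label and by_position are illustrative, not part of the original example), we can compare which labels each slice returns:

```python
import pandas as pd

data = pd.Series([0.25, 0.5, 0.75, 1.0, 1.25],
                 index=['a', 'b', 'c', 'd', 'e'])

# Label-based slice: the stop label 'c' is part of the result.
by_label = data['a':'c']

# Position-based slice: the stop position 2 is not part of the result.
by_position = data[0:2]

print(list(by_label.index))
print(list(by_position.index))
```

The label-based slice returns three entries and the position-based slice returns two, even though both are written with the same colon syntax.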
Note that when slicing with an explicit index (i.e., data['a':'c']), the final index is included in the slice, while when slicing with an implicit index (i.e., data[0:2]), the final index is excluded from the slice.
If a Series has an explicit integer index, an indexing operation such as data[1] will use the explicit indices, while a slicing operation like data[1:3] will use the implicit Python-style index.
data = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])
data
# explicit index when indexing
data[1]
# implicit index when slicing
data[1:3]
The loc attribute of a Series allows indexing and slicing that always references the explicit index:
data.loc[1]
data.loc[1:3]
The iloc attribute allows indexing and slicing that always references the implicit Python-style index:
data.iloc[1]
data.iloc[1:3]
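To make the contrast concrete, here is a short sketch that applies both indexers to the same integer-indexed Series and records what each one returns. The result names (loc_item, iloc_slice, and so on) are introduced here only for illustration.

```python
import pandas as pd

data = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])

# loc always uses the explicit labels; label slices include the stop label.
loc_item = data.loc[1]        # the entry labeled 1
loc_slice = data.loc[1:3]     # labels 1 through 3, inclusive

# iloc always uses integer positions; the stop position is excluded.
iloc_item = data.iloc[1]      # the entry at position 1
iloc_slice = data.iloc[1:3]   # the entries at positions 1 and 2

print(loc_item, list(loc_slice.index))
print(iloc_item, list(iloc_slice.index))
```

The same key, 1 or 1:3, selects different entries depending on the indexer, which is why spelling out loc or iloc helps when the index itself is made of integers.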
A third indexing attribute, ix, is a hybrid of the two, and for Series objects is equivalent to standard []-based indexing.
The purpose of the ix indexer will become more apparent in the context of DataFrame objects, which we will discuss in a moment.
Note that ix was deprecated in pandas 0.20 and removed in pandas 1.0, so in current versions loc and iloc are the indexers to use.
The explicit nature of loc and iloc makes them very useful in maintaining clean and readable code; especially in the case of integer indexes, I recommend using these both to make code easier to read and understand, and to prevent subtle bugs due to the mixed indexing/slicing convention.
A DataFrame acts in many ways like a two-dimensional or structured array, and in other ways like a dictionary of Series structures sharing the same index.
These analogies can be helpful to keep in mind as we explore data selection within this structure.
The first analogy is the DataFrame as a dictionary of related Series objects.
Let's return to our example of areas and populations of states:
area = pd.Series({'California': 423967, 'Texas': 695662,
                  'New York': 141297, 'Florida': 170312,
                  'Illinois': 149995})
pop = pd.Series({'California': 38332521, 'Texas': 26448193,
                 'New York': 19651127, 'Florida': 19552860,
                 'Illinois': 12882135})
data = pd.DataFrame({'area': area, 'pop': pop})
data
The individual Series that make up the columns of the DataFrame can be accessed via dictionary-style indexing of the column name:
data['area']
Equivalently, columns whose names are strings can be accessed attribute-style:
data.area
data.area is data['area']
If a column name conflicts with a method of the DataFrame, this attribute-style access is not possible.
For example, the DataFrame has a pop() method, so data.pop will point to this rather than the "pop" column:
data.pop is data['pop']
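The name clash can be inspected directly. The sketch below uses a two-state subset of the figures above and checks what each spelling actually returns; the variable names are illustrative.

```python
import pandas as pd

data = pd.DataFrame({'area': [423967, 695662],
                     'pop': [38332521, 26448193]},
                    index=['California', 'Texas'])

# 'area' does not clash with any DataFrame attribute,
# so attribute-style and dictionary-style access give the same values.
same_values = data.area.equals(data['area'])

# 'pop' clashes with the DataFrame.pop() method: the attribute is the method,
# and only dictionary-style access reaches the column.
attr_is_method = callable(data.pop)
item_is_column = isinstance(data['pop'], pd.Series)

print(same_values, attr_is_method, item_is_column)
```

Dictionary-style access works for every column name, so it is the more robust choice whenever column names are not known in advance.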
In particular, avoid column assignment via attribute (i.e., use data['pop'] = z rather than data.pop = z).
As with the Series objects discussed earlier, this dictionary-style syntax can also be used to modify the object, in this case adding a new column:
data['density'] = data['pop'] / data['area']
data
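The division that produces the new column is applied element by element, with rows matched by their index labels. A quick check of this, sketched with a two-state subset of the figures above (the names california_density and expected are illustrative):

```python
import pandas as pd

area = pd.Series({'California': 423967, 'Texas': 695662})
pop = pd.Series({'California': 38332521, 'Texas': 26448193})
data = pd.DataFrame({'area': area, 'pop': pop})

# Dictionary-style assignment adds the result as a new column.
data['density'] = data['pop'] / data['area']

# Each density value is that row's population divided by that row's area.
california_density = data['density']['California']
expected = 38332521 / 423967

print(list(data.columns))
print(california_density, expected)
```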
This is a preview of element-by-element arithmetic between Series objects; we'll dig into this further in Operating on Data in Pandas.
We can also view the DataFrame as an enhanced two-dimensional array.
We can examine the raw underlying data array using the values attribute:
data.values
Many familiar array-like operations can be performed on the DataFrame itself.
For example, we can transpose the full DataFrame to swap rows and columns:
data.T
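As a small sketch of what the transpose does to the labels as well as the values, we can compare the shape, index, and columns before and after. This uses a three-state subset of the figures above, and the name transposed is illustrative.

```python
import pandas as pd

data = pd.DataFrame({'area': [423967, 695662, 141297],
                     'pop': [38332521, 26448193, 19651127]},
                    index=['California', 'Texas', 'New York'])

transposed = data.T

# Rows and columns trade places, and the labels travel with them.
print(data.shape, transposed.shape)
print(list(transposed.index))
print(list(transposed.columns))
```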
When it comes to indexing DataFrame objects, however, it is clear that the dictionary-style indexing of columns precludes our ability to simply treat it as a NumPy array.
In particular, passing a single index to an array accesses a row:
data.values[0]
whereas passing a single index to a DataFrame accesses a column:
data['area']
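The difference shows up clearly in the shapes of the two results. A minimal sketch, again with a three-state subset of the figures above and illustrative variable names:

```python
import pandas as pd

data = pd.DataFrame({'area': [423967, 695662, 141297],
                     'pop': [38332521, 26448193, 19651127]},
                    index=['California', 'Texas', 'New York'])

# On the underlying NumPy array, a single index selects a row.
first_row = data.values[0]

# On the DataFrame, a single key selects a column.
area_column = data['area']

print(first_row.shape)     # one entry per column
print(area_column.shape)   # one entry per row
```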