Series and DataFrames are built with this type of operation in mind, and Pandas includes functions and methods that make this sort of data wrangling fast and straightforward.

Here we'll take a look at simple concatenation of Series and DataFrames with the pd.concat function; later we'll dive into more sophisticated in-memory merges and joins implemented in Pandas.

We begin with the standard imports:

import pandas as pd
import numpy as np

For convenience, we'll define a function that creates a DataFrame of a particular form that will be useful below:

def make_df(cols, ind):
    """Quickly make a DataFrame"""
    data = {c: [str(c) + str(i) for i in ind]
            for c in cols}
    return pd.DataFrame(data, ind)
# example DataFrame
make_df('ABC', range(3))
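For reference, this call returns a small frame whose cell values encode the column letter and row index, which makes it easy to see where each piece of data ends up after a concatenation. The result should look roughly like this:

#     A   B   C
# 0  A0  B0  C0
# 1  A1  B1  C1
# 2  A2  B2  C2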
In addition, we'll define a quick class that allows us to display multiple DataFrames side by side. The code makes use of the special _repr_html_ method, which IPython uses to implement its rich object display:

class display(object):
    """Display HTML representation of multiple objects"""
    template = """<div style="float: left; padding: 10px;">
    <p style='font-family:"Courier New", Courier, monospace'>{0}</p>{1}
    </div>"""

    def __init__(self, *args):
        self.args = args

    def _repr_html_(self):
        return '\n'.join(self.template.format(a, eval(a)._repr_html_())
                         for a in self.args)

    def __repr__(self):
        return '\n\n'.join(a + '\n' + repr(eval(a))
                           for a in self.args)
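Note that display evaluates the strings you pass it with eval, so the named objects must already exist in the notebook's namespace. A minimal sketch of its use, with two throwaway frames a and b chosen here just for illustration:

# a and b are placeholder frames used only to demonstrate the display helper
a = make_df('XY', range(2))
b = make_df('XY', range(2, 4))
display('a', 'b', 'pd.concat([a, b])')  # renders all three objects side by side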
Concatenation of Series and DataFrame objects is very similar to concatenation of NumPy arrays, which can be done via the np.concatenate function as discussed in The Basics of NumPy Arrays.
Recall that with it, you can combine the contents of two or more arrays into a single array:

x = [1, 2, 3]
y = [4, 5, 6]
z = [7, 8, 9]
np.concatenate([x, y, z])
It also takes an axis keyword that allows you to specify the axis along which the result will be concatenated:

x = [[1, 2],
     [3, 4]]
np.concatenate([x, x], axis=1)
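Since x here is a 2×2 array, concatenating it with itself along axis=1 places the copies side by side, while the default axis=0 would stack them vertically; the outputs should look like this:

np.concatenate([x, x], axis=1)  # array([[1, 2, 1, 2],
                                #        [3, 4, 3, 4]])
np.concatenate([x, x], axis=0)  # array([[1, 2],
                                #        [3, 4],
                                #        [1, 2],
                                #        [3, 4]])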
Pandas has a function, pd.concat(), which has a similar syntax to np.concatenate but contains a number of options that we'll discuss momentarily:

# Signature in Pandas v0.18
pd.concat(objs, axis=0, join='outer', join_axes=None, ignore_index=False,
          keys=None, levels=None, names=None, verify_integrity=False,
          copy=True)
pd.concat() can be used for a simple concatenation of Series or DataFrame objects, just as np.concatenate() can be used for simple concatenations of arrays:

ser1 = pd.Series(['A', 'B', 'C'], index=[1, 2, 3])
ser2 = pd.Series(['D', 'E', 'F'], index=[4, 5, 6])
pd.concat([ser1, ser2])
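Because pd.concat preserves the index of each input, the combined Series simply carries the two index ranges one after the other:

pd.concat([ser1, ser2])
# 1    A
# 2    B
# 3    C
# 4    D
# 5    E
# 6    F
# dtype: object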
It also works to concatenate higher-dimensional objects, such as DataFrames:

df1 = make_df('AB', [1, 2])
df2 = make_df('AB', [3, 4])
display('df1', 'df2', 'pd.concat([df1, df2])')
By default, the concatenation takes place row-wise within the DataFrame (i.e., axis=0).
Like np.concatenate, pd.concat allows specification of an axis along which concatenation will take place.
Consider the following example:

df3 = make_df('AB', [0, 1])
df4 = make_df('CD', [0, 1])
display('df3', 'df4', "pd.concat([df3, df4], axis='columns')")
We could have equivalently specified axis=1; here we've used the more intuitive axis='columns'.
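Here df3 and df4 share the index [0, 1] but have disjoint columns, so the column-wise concatenation simply lines the two frames up next to each other:

pd.concat([df3, df4], axis='columns')
#     A   B   C   D
# 0  A0  B0  C0  D0
# 1  A1  B1  C1  D1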
One important difference between np.concatenate and pd.concat is that Pandas concatenation preserves indices, even if the result will have duplicate indices!
Consider this simple example:

x = make_df('AB', [0, 1])
y = make_df('AB', [2, 3])
y.index = x.index # make duplicate indices!
display('x', 'y', 'pd.concat([x, y])')
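Notice what happens to the indices in the result: because y was given the same labels as x, the index values 0 and 1 each appear twice:

pd.concat([x, y])
#     A   B
# 0  A0  B0
# 1  A1  B1
# 0  A2  B2
# 1  A3  B3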
While repeated indices are valid within DataFrames, the outcome is often undesirable, and pd.concat() gives us a few ways to handle it.

If you'd like to simply verify that the indices in the result of pd.concat() do not overlap, you can specify the verify_integrity flag.
With this set to True, the concatenation will raise an exception if there are duplicate indices.
Here is an example, where for clarity we'll catch and print the error message:

try:
    pd.concat([x, y], verify_integrity=True)
except ValueError as e:
    print("ValueError:", e)
Alternatively, if the index itself does not matter, you can ask for it to be ignored using the ignore_index flag.
With this set to True, the concatenation will create a new integer index for the result:

display('x', 'y', 'pd.concat([x, y], ignore_index=True)')
Another option is to use the keys option to specify a label for the data sources; the result will be a hierarchically indexed DataFrame containing the data:

display('x', 'y', "pd.concat([x, y], keys=['x', 'y'])")

The result is a multiply indexed DataFrame, and we can use the tools discussed in Hierarchical Indexing to transform this data into the representation we're interested in.
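Since the keyed result is just a multiply indexed DataFrame, those tools apply directly; for example, indexing on the outer key recovers one of the original sources (a quick sketch using the x and y frames defined above):

# select all rows that came from the 'y' source
pd.concat([x, y], keys=['x', 'y']).loc['y']
#     A   B
# 0  A2  B2
# 1  A3  B3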
So far we have mainly been concatenating DataFrames with shared column names.
In practice, data from different sources might have different sets of column names, and pd.concat offers several options in this case.
Consider the concatenation of the following two DataFrames, which have some (but not all!) columns in common:

df5 = make_df('ABC', [1, 2])
df6 = make_df('BCD', [3, 4])
display('df5', 'df6', 'pd.concat([df5, df6])')
By default, the entries for which no data is available are filled with NA values.
To change this, we can specify one of several options for the join and join_axes parameters of the concatenate function.
By default, the join is a union of the input columns (join='outer'), but we can change this to an intersection of the columns using join='inner':

display('df5', 'df6',
        "pd.concat([df5, df6], join='inner')")
Another option is to directly specify the index of the remaining columns using the join_axes argument, which takes a list of index objects.
Here we'll specify that the returned columns should be the same as those of the first input:

display('df5', 'df6',
        "pd.concat([df5, df6], join_axes=[df5.columns])")
The combination of options of the pd.concat function allows a wide range of possible behaviors when joining two datasets; keep these in mind as you use these tools for your own data.

Because direct concatenation is so common, Series and DataFrame objects also have an append method that can accomplish the same thing in fewer keystrokes.
For example, rather than calling pd.concat([df1, df2]), you can simply call df1.append(df2):

display('df1', 'df2', 'df1.append(df2)')
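Keep in mind that append itself has since been deprecated and removed from pandas (as of pandas 2.0), so in current code the pd.concat spelling shown earlier is the one to reach for:

# df1.append(df2) no longer exists in pandas 2.0+; this is the equivalent call
pd.concat([df1, df2])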