Series and DataFrames are built with this type of operation in mind, and Pandas includes functions and methods that make this sort of data wrangling fast and straightforward.

Here we'll take a look at simple concatenation of Series and DataFrames with the pd.concat function; later we'll dive into more sophisticated in-memory merges and joins implemented in Pandas.

We begin with the standard imports:
import pandas as pd
import numpy as np

For convenience, we'll define a function that creates a DataFrame of a particular form that will be useful below:

def make_df(cols, ind):
    """Quickly make a DataFrame"""
    # Each cell is the column label followed by the index label, e.g. 'A0'
    data = {c: [str(c) + str(i) for i in ind]
            for c in cols}
    return pd.DataFrame(data, ind)

# example DataFrame
make_df('ABC', range(3))

In addition, we'll create a quick class that allows us to display multiple DataFrames side by side. The code makes use of the special _repr_html_ method, which IPython uses to implement its rich object display:
class display(object):
    """Display HTML representation of multiple objects"""
    template = """<div style="float: left; padding: 10px;">
    <p style='font-family:"Courier New", Courier, monospace'>{0}</p>{1}
    </div>"""

    def __init__(self, *args):
        # Each argument is a string of Python code, evaluated when displayed
        self.args = args

    def _repr_html_(self):
        return '\n'.join(self.template.format(a, eval(a)._repr_html_())
                         for a in self.args)

    def __repr__(self):
        return '\n\n'.join(a + '\n' + repr(eval(a))
                           for a in self.args)

Concatenation of Series and DataFrame objects is very similar to concatenation of NumPy arrays, which can be done via the np.concatenate function as discussed in The Basics of NumPy Arrays.
Recall that with it, you can combine the contents of two or more arrays into a single array:
x = [1, 2, 3]
y = [4, 5, 6]
z = [7, 8, 9]
np.concatenate([x, y, z])

The first argument is a list or tuple of arrays to concatenate. It also takes an axis keyword that allows you to specify the axis along which the result will be concatenated:
x = [[1, 2],
     [3, 4]]
np.concatenate([x, x], axis=1)

Simple concatenation with pd.concat

Pandas has a function, pd.concat(), which has a similar syntax to np.concatenate but contains a number of options that we'll discuss momentarily:
# Signature in Pandas v0.18
pd.concat(objs, axis=0, join='outer', join_axes=None, ignore_index=False,
          keys=None, levels=None, names=None, verify_integrity=False,
          copy=True)

pd.concat() can be used for a simple concatenation of Series or DataFrame objects, just as np.concatenate() can be used for simple concatenations of arrays:
ser1 = pd.Series(['A', 'B', 'C'], index=[1, 2, 3])
ser2 = pd.Series(['D', 'E', 'F'], index=[4, 5, 6])
pd.concat([ser1, ser2])

It also works to concatenate higher-dimensional objects, such as DataFrames:
df1 = make_df('AB', [1, 2])
df2 = make_df('AB', [3, 4])
display('df1', 'df2', 'pd.concat([df1, df2])')

By default, the concatenation takes place row-wise within the DataFrame (i.e., axis=0).
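To make the row-wise default concrete, here is a minimal self-contained sketch. The frames `top` and `bottom` are illustrative names that do not appear elsewhere in this section; they simply mirror the shape of df1 and df2. The sketch checks that omitting axis gives the same result as passing axis=0 explicitly.

```python
import pandas as pd

# Two small frames with the same columns and disjoint indices
# (hypothetical stand-ins for df1 and df2).
top = pd.DataFrame({'A': ['A1', 'A2'], 'B': ['B1', 'B2']}, index=[1, 2])
bottom = pd.DataFrame({'A': ['A3', 'A4'], 'B': ['B3', 'B4']}, index=[3, 4])

stacked = pd.concat([top, bottom])            # axis is left at its default
explicit = pd.concat([top, bottom], axis=0)   # the same operation, spelled out

print(stacked.shape)              # rows are stacked, columns are unchanged
print(list(stacked.index))        # the original index labels are kept
print(stacked.equals(explicit))
```

The rows of the second frame are placed below those of the first, and the column set stays the same because both inputs share it.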
Like np.concatenate, pd.concat allows specification of an axis along which concatenation will take place.
Consider the following example:
df3 = make_df('AB', [0, 1])
df4 = make_df('CD', [0, 1])
display('df3', 'df4', "pd.concat([df3, df4], axis='columns')")

We could have equivalently specified axis=1; here we've used the more intuitive axis='columns'.

One important difference between np.concatenate and pd.concat is that Pandas concatenation preserves indices, even if the result will have duplicate indices!
Consider this simple example:
x = make_df('AB', [0, 1])
y = make_df('AB', [2, 3])
y.index = x.index # make duplicate indices!
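display('x', 'y', 'pd.concat([x, y])')

Notice the repeated indices in the result. While this is valid within DataFrames, the outcome is often undesirable.
pd.concat() gives us a few ways to handle it.

If you'd like to simply verify that the indices in the result of pd.concat() do not overlap, you can specify the verify_integrity flag.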
With this set to True, the concatenation will raise an exception if there are duplicate indices.
Here is an example, where for clarity we'll catch and print the error message:
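try:
    pd.concat([x, y], verify_integrity=True)
except ValueError as e:
    print("ValueError:", e)

Sometimes the index itself does not matter, and you would prefer it to simply be ignored. This option can be specified using the ignore_index flag.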
With this set to True, the concatenation will create a new integer index for the result:
display('x', 'y', 'pd.concat([x, y], ignore_index=True)')

Another option is to use the keys option to specify a label for the data sources; the result will be hierarchically indexed, with one outer label per source:
display('x', 'y', "pd.concat([x, y], keys=['x', 'y'])")

The result is a multiply indexed DataFrame, and we can use the tools discussed in Hierarchical Indexing to transform this data into the representation we're interested in.

In the simple examples we just looked at, we were mainly concatenating DataFrames with shared column names.
In practice, data from different sources might have different sets of column names, and pd.concat offers several options in this case.
Consider the concatenation of the following two DataFrames, which have some (but not all!) columns in common:
df5 = make_df('ABC', [1, 2])
df6 = make_df('BCD', [3, 4])
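display('df5', 'df6', 'pd.concat([df5, df6])')

By default, the entries for which no data is available are filled with NA values. To change this, we can specify one of several options for the join and join_axes parameters of the pd.concat function.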
By default, the join is a union of the input columns (join='outer'), but we can change this to an intersection of the columns using join='inner':
display('df5', 'df6',
        "pd.concat([df5, df6], join='inner')")

Another option is to directly specify the index of the remaining columns using the join_axes argument, which takes a list of index objects.
Here we'll specify that the returned columns should be the same as those of the first input:
display('df5', 'df6',
        "pd.concat([df5, df6], join_axes=[df5.columns])")

The combination of options of the pd.concat function allows a wide range of possible behaviors when joining two datasets; keep these in mind as you use these tools for your own data.

The append() method

Because direct array concatenation is so common, Series and DataFrame objects have an append method that can accomplish the same thing in fewer keystrokes.
For example, rather than calling pd.concat([df1, df2]), you can simply call df1.append(df2):
display('df1', 'df2', 'df1.append(df2)')
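The equivalence between the two spellings can be checked with a short sketch. One caveat that the text above predates: DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so the sketch below only calls it when the installed version still provides it and otherwise relies on pd.concat alone. The frames `df_a` and `df_b` are illustrative stand-ins for df1 and df2.

```python
import pandas as pd

# Hypothetical stand-ins for df1 and df2 from the examples above.
df_a = pd.DataFrame({'A': ['A1', 'A2'], 'B': ['B1', 'B2']}, index=[1, 2])
df_b = pd.DataFrame({'A': ['A3', 'A4'], 'B': ['B3', 'B4']}, index=[3, 4])

# pd.concat works across pandas versions.
combined = pd.concat([df_a, df_b])

# Only call the method where it exists (it may emit a FutureWarning
# on versions that have deprecated it but not yet removed it).
if hasattr(pd.DataFrame, 'append'):
    appended = df_a.append(df_b)
    print(appended.equals(combined))

print(combined)
```

Either way, the inputs are left untouched: both spellings return a new object rather than modifying df_a in place.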