arr[0]
), slices (e.g., arr[:5]
), and Boolean masks (e.g., arr[arr > 0]
).
In this section, we'll look at another style of array indexing, known as fancy indexing.
Fancy indexing is like the simple indexing we've already seen, but we pass arrays of indices in place of single scalars.
This allows us to very quickly access and modify complicated subsets of an array's values.import numpy as np
rand = np.random.RandomState(42)
x = rand.randint(100, size=10)
print(x)
[x[3], x[7], x[2]]
ind = [3, 7, 4]
x[ind]
ind = np.array([[3, 7],
[4, 5]])
x[ind]
X = np.arange(12).reshape((3, 4))
X
row = np.array([0, 1, 2])
col = np.array([2, 1, 3])
X[row, col]
X[0, 2]
, the second is X[1, 1]
, and the third is X[2, 3]
.
The pairing of indices in fancy indexing follows all the broadcasting rules that were mentioned in Computation on Arrays: Broadcasting.
So, for example, if we combine a column vector and a row vector within the indices, we get a two-dimensional result:X[row[:, np.newaxis], col]
row[:, np.newaxis] * col
print(X)
X[2, [2, 0, 1]]
X[1:, [2, 0, 1]]
mask = np.array([1, 0, 1, 0], dtype=bool)
X[row[:, np.newaxis], mask]
mean = [0, 0]
cov = [[1, 2],
[2, 5]]
X = rand.multivariate_normal(mean, cov, 100)
X.shape
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn; seaborn.set() # for plot styling
plt.scatter(X[:, 0], X[:, 1]);
indices = np.random.choice(X.shape[0], 20, replace=False)
indices
selection = X[indices] # fancy indexing here
selection.shape
plt.scatter(X[:, 0], X[:, 1], alpha=0.3)
plt.scatter(selection[:, 0], selection[:, 1],
facecolor='none', s=200);
x = np.arange(10)
i = np.array([2, 1, 8, 4])
x[i] = 99
print(x)
x[i] -= 10
print(x)
x = np.zeros(10)
x[[0, 0]] = [4, 6]
print(x)
x[0] = 4
, followed by x[0] = 6
.
The result, of course, is that x[0]
contains the value 6.i = [2, 3, 3, 4, 4, 4]
x[i] += 1
x
x[3]
would contain the value 2, and x[4]
would contain the value 3, as this is how many times each index is repeated. Why is this not the case?
Conceptually, this is because x[i] += 1
is meant as a shorthand of x[i] = x[i] + 1
. x[i] + 1
is evaluated, and then the result is assigned to the indices in x.
With this in mind, it is not the augmentation that happens multiple times, but the assignment, which leads to the rather nonintuitive results.at()
method of ufuncs (available since NumPy 1.8), and do the following:x = np.zeros(10)
np.add.at(x, i, 1)
print(x)
at()
method does an in-place application of the given operator at the specified indices (here, i
) with the specified value (here, 1).
Another method that is similar in spirit is the reduceat()
method of ufuncs, which you can read about in the NumPy documentation.ufunc.at
like this:np.random.seed(42)
x = np.random.randn(100)
# compute a histogram by hand
bins = np.linspace(-5, 5, 20)
counts = np.zeros_like(bins)
# find the appropriate bin for each x
i = np.searchsorted(bins, x)
# add 1 to each of these bins
np.add.at(counts, i, 1)
# plot the results
plt.plot(bins, counts, linestyle='steps');
plt.hist()
routine, which does the same in a single line:plt.hist(x, bins, histtype='step');
matplotlib
uses the np.histogram
function, which does a very similar computation to what we did before. Let's compare the two here:print("NumPy routine:")
%timeit counts, edges = np.histogram(x, bins)
print("Custom routine:")
%timeit np.add.at(counts, np.searchsorted(bins, x), 1)
np.histogram
source code (you can do this in IPython by typing np.histogram??
), you'll see that it's quite a bit more involved than the simple search-and-count that we've done; this is because NumPy's algorithm is more flexible, and particularly is designed for better performance when the number of data points becomes large:x = np.random.randn(1000000)
print("NumPy routine:")
%timeit counts, edges = np.histogram(x, bins)
print("Custom routine:")
%timeit np.add.at(counts, np.searchsorted(bins, x), 1)