For example, consider the Iris dataset. We can load this dataset in the form of a Pandas DataFrame using the seaborn library:

import seaborn as sns
iris = sns.load_dataset('iris')
iris.head()
Here each row of the data refers to a single observed sample, and the number of rows is n_samples; likewise, each column refers to a quantitative piece of information describing each sample, and the number of columns is n_features. This layout means the information can be thought of as a two-dimensional numerical array or matrix, which we will call the features matrix; by convention it is stored in a variable named X. The features matrix is assumed to be two-dimensional, with shape [n_samples, n_features], and is most often contained in a NumPy array or a Pandas DataFrame, though some Scikit-Learn models also accept SciPy sparse matrices.

In addition to the features matrix X, we also generally work with a label or target array, which by convention we will usually call y.
The target array is usually one-dimensional, with length n_samples, and is generally contained in a NumPy array or Pandas Series.
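To make these shape conventions concrete, here is a minimal sketch with synthetic data (the array sizes are arbitrary, chosen purely for illustration and not tied to any real dataset):

import numpy as np

rng = np.random.RandomState(0)
X = rng.rand(150, 4)          # features matrix: shape [n_samples, n_features]
y = rng.randint(0, 3, 150)    # target array: length n_samples

print(X.shape)   # (150, 4)
print(y.shape)   # (150,)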
The target array may have continuous numerical values, or discrete classes/labels.
While some Scikit-Learn estimators do handle multiple target values in the form of a two-dimensional, [n_samples, n_targets]
target array, we will primarily be working with the common case of a one-dimensional target array.

In the Iris data, for example, the species column would be considered the target array. With this target array in mind, we can use Seaborn to conveniently visualize the data:

%matplotlib inline
import seaborn as sns; sns.set()
sns.pairplot(iris, hue='species', height=1.5);
For use in Scikit-Learn, we will extract the features matrix and target array from the DataFrame, which we can do using some of the Pandas DataFrame operations discussed in Chapter 3:

X_iris = iris.drop('species', axis=1)
X_iris.shape
y_iris = iris['species']
y_iris.shape
In Scikit-Learn, only algorithms are represented by Python classes; datasets are represented in standard formats (NumPy arrays, Pandas DataFrames, SciPy sparse matrices) and parameter names use standard Python strings.

Most commonly, the steps in using the Scikit-Learn estimator API are as follows:

1. Choose a class of model by importing the appropriate estimator class from Scikit-Learn.
2. Choose model hyperparameters by instantiating this class with desired values.
3. Arrange data into a features matrix and target vector, following the discussion above.
4. Fit the model to your data by calling the fit() method of the model instance.
5. Apply the model to new data: for supervised learning, we often predict labels for unknown data using the predict() method; for unsupervised learning, we often transform or infer properties of the data using the transform() or predict() method.

As an example of this process, let's consider a simple linear regression, fit to the following data:

import matplotlib.pyplot as plt
import numpy as np
rng = np.random.RandomState(42)
x = 10 * rng.rand(50)
y = 2 * x - 1 + rng.randn(50)
plt.scatter(x, y);
With this data in place, we can walk through the recipe:

1. Choose a class of model. To compute a simple linear regression, we can import the linear regression class:

from sklearn.linear_model import LinearRegression

Note that other, more general linear regression models exist as well; you can read more about them in the sklearn.linear_model module documentation.

2. Choose model hyperparameters. For our example, we can instantiate the LinearRegression class and specify that we would like to fit the intercept using the fit_intercept hyperparameter:

model = LinearRegression(fit_intercept=True)
model
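Keep in mind that when the model is instantiated, the only action is the storing of these hyperparameter values; no data has been touched yet. If you want to inspect the stored hyperparameters, every Scikit-Learn estimator provides a get_params() method (a quick aside, not part of the original walkthrough):

print(model.get_params())   # e.g. {'copy_X': True, 'fit_intercept': True, ...}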
3. Arrange data into a features matrix and target vector. Our target variable y is already in the correct form (a length-n_samples array), but we need to massage the data x to make it a matrix of size [n_samples, n_features]. In this case, this amounts to a simple reshaping of the one-dimensional array:

X = x[:, np.newaxis]
X.shape
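As an aside (not part of the original recipe), the same result can be obtained with reshape, where -1 tells NumPy to infer the number of samples from the data:

X_alt = x.reshape(-1, 1)   # equivalent to x[:, np.newaxis]
print(X_alt.shape)         # (50, 1)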
4. Fit the model to your data. Now we can apply our model to the data, using the fit() method of the model:

model.fit(X, y)
This fit() command causes a number of model-dependent internal computations to take place, and the results of these computations are stored in model-specific attributes that the user can explore. In Scikit-Learn, by convention all model parameters that were learned during the fit() process have trailing underscores; for example, in this linear model we have the following:

model.coef_
model.intercept_
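Since the data above was generated from the line y = 2x - 1 plus noise, the fitted slope and intercept should land close to 2 and -1 respectively; a quick check (a sketch, not part of the original walkthrough):

print(model.coef_[0])    # expect a value close to 2
print(model.intercept_)  # expect a value close to -1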
5. Predict labels for unknown data. Once the model is trained, the main task of supervised machine learning is to evaluate it on new data that was not part of the training set. In Scikit-Learn, this can be done using the predict() method. For the sake of this example, our "new data" will be a grid of x values, and we will ask what y values the model predicts:

xfit = np.linspace(-1, 11)
As before, we need to coerce these x values into a [n_samples, n_features] features matrix, after which we can feed it to the model:

Xfit = xfit[:, np.newaxis]
yfit = model.predict(Xfit)
Finally, let's visualize the results by plotting first the raw data, and then the model fit:

plt.scatter(x, y)
plt.plot(xfit, yfit);
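For a quick numerical summary of the fit, every Scikit-Learn regressor also provides a score() method; for LinearRegression it returns the R² coefficient of determination (an aside beyond the original example):

r2 = model.score(X, y)   # R² of the fit on the training data
print(r2)                # should be close to 1 for this nearly-linear data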
To evaluate the model on data it has not seen before, we will split the Iris data into a training set and a testing set; this is most conveniently done with the train_test_split utility function:

from sklearn.model_selection import train_test_split
Xtrain, Xtest, ytrain, ytest = train_test_split(X_iris, y_iris,
                                                random_state=1)
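By default, train_test_split holds out 25% of the samples for testing; a quick shape check (an aside, not in the original):

print(Xtrain.shape, Xtest.shape)   # roughly a 75%/25% split of the 150 samples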
With the data arranged, we can follow our recipe to predict the labels:

from sklearn.naive_bayes import GaussianNB       # 1. choose model class
model = GaussianNB() # 2. instantiate model
model.fit(Xtrain, ytrain) # 3. fit model to data
y_model = model.predict(Xtest) # 4. predict on new data
Finally, we can use the accuracy_score utility to see the fraction of predicted labels that match their true value:

from sklearn.metrics import accuracy_score
accuracy_score(ytest, y_model)
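A single accuracy number hides which species get confused with which; a confusion matrix breaks this down (a supplementary sketch, not part of the original recipe):

from sklearn.metrics import confusion_matrix
print(confusion_matrix(ytest, y_model))   # rows: true species, columns: predicted species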
As an example of an unsupervised learning problem, let's reduce the dimensionality of the Iris data so as to more easily visualize it. Here we will use principal component analysis (PCA), a fast linear dimensionality reduction technique, and ask for two components. Following the sequence of steps outlined earlier:

from sklearn.decomposition import PCA            # 1. Choose the model class
model = PCA(n_components=2) # 2. Instantiate the model with hyperparameters
model.fit(X_iris) # 3. Fit to data. Notice y is not specified!
X_2D = model.transform(X_iris) # 4. Transform the data to two dimensions
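As with supervised models, attributes learned during fit() carry a trailing underscore; for PCA, explained_variance_ratio_ reports the fraction of the total variance captured by each component (a quick check beyond the original steps):

print(model.explained_variance_ratio_)   # together the two components capture most of the variance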
Now let's plot the results. A quick way to do this is to insert the results into the original Iris DataFrame, and use Seaborn's lmplot to show them:

iris['PCA1'] = X_2D[:, 0]
iris['PCA2'] = X_2D[:, 1]
sns.lmplot(x='PCA1', y='PCA2', hue='species', data=iris, fit_reg=False);
As another unsupervised example, let's look at clustering the Iris data: finding distinct groups without reference to any labels. Here we will use a Gaussian mixture model, which models the data as a collection of Gaussian blobs:

from sklearn.mixture import GaussianMixture      # 1. Choose the model class
model = GaussianMixture(n_components=3,
                        covariance_type='full')  # 2. Instantiate the model with hyperparameters
model.fit(X_iris)                                # 3. Fit to data. Notice y is not specified!
y_gmm = model.predict(X_iris)                    # 4. Determine cluster labels
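Because the Gaussian mixture is a probabilistic model, it can also report soft cluster assignments via predict_proba (an aside not in the original recipe):

probs = model.predict_proba(X_iris)   # shape (n_samples, 3): per-cluster probabilities
print(probs[:5].round(3))             # responsibilities for the first five samples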
As before, we will add the cluster label to the Iris DataFrame and use Seaborn to plot the results:

iris['cluster'] = y_gmm
sns.lmplot(x='PCA1', y='PCA2', data=iris, hue='species',
           col='cluster', fit_reg=False);
To demonstrate these principles on a more interesting problem, let's consider one piece of the optical character recognition problem: the identification of hand-written digits. We will use Scikit-Learn's set of pre-formatted digits, which is built into the library:

from sklearn.datasets import load_digits
digits = load_digits()
digits.images.shape
The images data is a three-dimensional array: 1,797 samples, each consisting of an 8×8 grid of pixels. Let's visualize the first hundred of these:

import matplotlib.pyplot as plt

fig, axes = plt.subplots(10, 10, figsize=(8, 8),
                         subplot_kw={'xticks': [], 'yticks': []},
                         gridspec_kw=dict(hspace=0.1, wspace=0.1))

for i, ax in enumerate(axes.flat):
    ax.imshow(digits.images[i], cmap='binary', interpolation='nearest')
    ax.text(0.05, 0.05, str(digits.target[i]),
            transform=ax.transAxes, color='green')
To work with this data within Scikit-Learn, we need a two-dimensional [n_samples, n_features] representation.
We can accomplish this by treating each pixel in the image as a feature: that is, by flattening out the pixel arrays so that we have a length-64 array of pixel values representing each digit.
Additionally, we need the target array, which gives the previously determined label for each digit.
These two quantities are built into the digits dataset under the data and target attributes, respectively:

X = digits.data
X.shape
y = digits.target
y.shape
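In fact, digits.data is nothing more than digits.images with each 8×8 image flattened to a length-64 row; a quick consistency check (a sketch, not in the original text):

import numpy as np
assert np.array_equal(digits.data, digits.images.reshape(len(digits.images), -1))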
We'd like to visualize our points within the 64-dimensional parameter space, but it's difficult to effectively visualize points in such a high-dimensional space. Instead, we'll reduce the dimensions to 2 using an unsupervised method; here, we'll make use of a manifold learning algorithm called Isomap:

from sklearn.manifold import Isomap
iso = Isomap(n_components=2)
iso.fit(digits.data)
data_projected = iso.transform(digits.data)
data_projected.shape
We see that the projected data is now two-dimensional. Let's plot it to see whether we can learn anything from its structure:

plt.scatter(data_projected[:, 0], data_projected[:, 1], c=digits.target,
            edgecolor='none', alpha=0.5,
            cmap=plt.get_cmap('Spectral', 10))
plt.colorbar(label='digit label', ticks=range(10))
plt.clim(-0.5, 9.5);
Let's apply a classification algorithm to the digits. As with the Iris data, we will split the data into training and testing sets, and fit a Gaussian naive Bayes model:

Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, random_state=0)
from sklearn.naive_bayes import GaussianNB
model = GaussianNB()
model.fit(Xtrain, ytrain)
y_model = model.predict(Xtest)
With the predictions in hand, we can gauge the model's accuracy by comparing the true values of the test set to the predictions:

from sklearn.metrics import accuracy_score
accuracy_score(ytest, y_model)
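As with the Iris example, the accuracy alone does not tell us where the model goes wrong; a confusion-matrix heatmap (a sketch using Seaborn, in the spirit of the plots above) shows which digits are mistaken for which:

from sklearn.metrics import confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt

mat = confusion_matrix(ytest, y_model)
sns.heatmap(mat, square=True, annot=True, fmt='d', cbar=False)
plt.xlabel('predicted value')
plt.ylabel('true value');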