The previous sections outline the fundamental ideas of machine learning, but all of the examples so far assume that you have numerical data in a tidy [n_samples, n_features] format. In the real world, data rarely comes in such a form. With this in mind, one of the more important steps in using machine learning in practice is feature engineering: that is, taking whatever information you have about your problem and turning it into numbers that you can use to build your feature matrix.

One common type of non-numerical data is categorical data. For example, imagine you are exploring some data on housing prices, and along with numerical features like 'price' and 'rooms', you also have 'neighborhood' information:

data = [
    {'price': 850000, 'rooms': 4, 'neighborhood': 'Queen Anne'},
    {'price': 700000, 'rooms': 3, 'neighborhood': 'Fremont'},
    {'price': 650000, 'rooms': 3, 'neighborhood': 'Wallingford'},
    {'price': 600000, 'rooms': 2, 'neighborhood': 'Fremont'}
]
You might be tempted to encode this data with a straightforward numerical mapping:

{'Queen Anne': 1, 'Fremont': 2, 'Wallingford': 3};

It turns out that this is not generally a useful approach in Scikit-Learn: the package's models make the fundamental assumption that numerical features reflect algebraic quantities, so such a mapping would imply, for example, that Queen Anne < Fremont < Wallingford, which does not make much sense.

In this case, one proven technique is one-hot encoding, which effectively creates extra columns indicating the presence or absence of a category with a value of 1 or 0, respectively. When your data comes as a list of dictionaries, Scikit-Learn's DictVectorizer will do this for you:

from sklearn.feature_extraction import DictVectorizer
vec = DictVectorizer(sparse=False, dtype=int)
vec.fit_transform(data)
vec.get_feature_names_out()
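Notice that the neighborhood column has been expanded into three separate columns representing the three labels, and each row has a 1 in the column associated with its neighborhood. If the data already lives in a DataFrame, a similar encoding is possible in pandas; a minimal alternative sketch (get_dummies is a pandas tool, not part of the original example):

import pandas as pd

# Equivalent one-hot encoding via pandas, using the data list from above
df = pd.DataFrame(data)
pd.get_dummies(df, columns=['neighborhood'], dtype=int)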
There is one clear disadvantage of this approach: if your categories have many possible values, this can greatly increase the size of your dataset. However, because the encoded data contains mostly zeros, a sparse output can be a very efficient solution:

vec = DictVectorizer(sparse=True, dtype=int)
vec.fit_transform(data)
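For categorical data stored in plain arrays rather than dictionaries, sklearn.preprocessing.OneHotEncoder performs the same expansion; a minimal sketch (sparse_output=False assumes Scikit-Learn 1.2 or later; older releases spell it sparse=False):

import numpy as np
from sklearn.preprocessing import OneHotEncoder

# One-hot encode the neighborhood labels from a 2D array of strings
neighborhood = np.array([['Queen Anne'], ['Fremont'], ['Wallingford'], ['Fremont']])
enc = OneHotEncoder(sparse_output=False, dtype=int)
enc.fit_transform(neighborhood)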
sklearn.preprocessing.OneHotEncoder (sketched above) and sklearn.feature_extraction.FeatureHasher are two additional tools that Scikit-Learn includes to support this type of encoding.

Another common need in feature engineering is to convert text to a set of representative numerical values. One of the simplest methods is to encode data by word counts: take each snippet of text, count the occurrences of each word within it, and put the results in a table. For example, consider the following set of three phrases:

sample = ['problem of evil',
          'evil queen',
          'horizon problem']
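Tabulating these counts by hand is possible but tedious; as an illustration of the bookkeeping involved, a short sketch with Python's standard library:

from collections import Counter

# Hand-rolled word counts, one dictionary per phrase;
# this is exactly what CountVectorizer automates below
[Counter(phrase.split()) for phrase in sample]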
Scikit-Learn's CountVectorizer automates both the vocabulary building and the counting:

from sklearn.feature_extraction.text import CountVectorizer
vec = CountVectorizer()
X = vec.fit_transform(sample)
X
The result is a sparse matrix recording the number of times each word appears; it is easier to inspect if we convert it to a DataFrame with labeled columns:

import pandas as pd
pd.DataFrame(X.toarray(), columns=vec.get_feature_names_out())
There are some issues with this approach, however: raw word counts put too much weight on words that appear very frequently, which can be suboptimal in some classification algorithms. One approach to fix this is known as term frequency-inverse document frequency (TF-IDF), which weights the word counts by a measure of how often they appear across the documents. The syntax is similar to the previous example:

from sklearn.feature_extraction.text import TfidfVectorizer
vec = TfidfVectorizer()
X = vec.fit_transform(sample)
pd.DataFrame(X.toarray(), columns=vec.get_feature_names_out())
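To make the weighting concrete, the row for 'evil queen' can be reproduced by hand; this sketch assumes TfidfVectorizer's defaults (smooth_idf=True, sublinear_tf=False, l2 row normalization):

import numpy as np

# Reproduce the 'evil queen' row under TfidfVectorizer defaults:
# idf(t) = ln((1 + n_docs) / (1 + df(t))) + 1, rows then l2-normalized
n_docs = 3
idf_evil = np.log((1 + n_docs) / (1 + 2)) + 1   # 'evil' appears in 2 documents
idf_queen = np.log((1 + n_docs) / (1 + 1)) + 1  # 'queen' appears in 1 document
row = np.array([idf_evil, idf_queen])           # raw counts are both 1
row / np.linalg.norm(row)                       # matches the nonzero entries above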
Another useful type of feature is one that is mathematically derived from existing input features: we can convert a linear regression into a polynomial regression not by changing the model, but by transforming the input. For example, this data clearly cannot be well described by a straight line:

%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
x = np.array([1, 2, 3, 4, 5])
y = np.array([4, 2, 1, 3, 7])
plt.scatter(x, y);
Still, we can fit a line to the data using LinearRegression and get the optimal result:

from sklearn.linear_model import LinearRegression
X = x[:, np.newaxis]
model = LinearRegression().fit(X, y)
yfit = model.predict(X)
plt.scatter(x, y)
plt.plot(x, yfit);
It is clear that we need a more sophisticated model to describe the relationship between x and y. One approach is to transform the data, adding extra columns of features to give the model more flexibility. For example, we can add polynomial features:

from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(degree=3, include_bias=False)
X2 = poly.fit_transform(X)
print(X2)
The derived feature matrix has one column representing x, a second column representing x^2, and a third representing x^3. Computing a linear regression on this expanded input gives a much closer fit to our data:

model = LinearRegression().fit(X2, y)
yfit = model.predict(X2)
plt.scatter(x, y)
plt.plot(x, yfit);
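As a quick check that the transform did what we expect, each derived column is simply a power of the original input (continuing with the x and X2 defined above):

# Each column of X2 is a power of the original input
assert np.allclose(X2[:, 0], x)       # x
assert np.allclose(X2[:, 1], x ** 2)  # x squared
assert np.allclose(X2[:, 2], x ** 3)  # x cubed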
Another common need in feature engineering is the handling of missing data. We discussed missing data in DataFrames in Handling Missing Data, and saw that the NaN value is often used to mark missing entries. For example, we might have a dataset that looks like this:

from numpy import nan
X = np.array([[ nan, 0,   3  ],
              [ 3,   7,   9  ],
              [ 3,   5,   2  ],
              [ 4,   nan, 6  ],
              [ 8,   8,   1  ]])
y = np.array([14, 16, -1, 8, -5])
When applying a typical machine learning model to such data, we will need to first replace the missing values with some appropriate fill value. This is known as imputation of missing values. For a baseline imputation approach using the mean, median, or most frequent value, Scikit-Learn provides the SimpleImputer class:

from sklearn.impute import SimpleImputer
imp = SimpleImputer(strategy='mean')
X2 = imp.fit_transform(X)
X2
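We see that the two missing values have been replaced with the mean of the remaining values in their columns. The same interface supports other fill strategies; a brief variant sketch (not part of the original example) using the median:

# Variant sketch: impute with the column median instead of the mean
imp_median = SimpleImputer(strategy='median')
imp_median.fit_transform(X)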
The mean-imputed data can then be fed directly into, for example, a LinearRegression estimator:

model = LinearRegression().fit(X2, y)
model.predict(X2)
With any of the preceding examples, it can quickly become tedious to do these transformations by hand, especially when stringing together multiple steps. For example, we might want a processing pipeline that imputes missing values using the mean, transforms the features to quadratic, and then fits a linear regression. To streamline this type of processing, Scikit-Learn provides a Pipeline object, which can be used as follows:

from sklearn.pipeline import make_pipeline

model = make_pipeline(SimpleImputer(strategy='mean'),
                      PolynomialFeatures(degree=2),
                      LinearRegression())
model.fit(X, y) # X with missing values, from above
print(y)
print(model.predict(X))
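All the steps of the model are applied automatically. Because the fitted pipeline stores the learned imputation means and the polynomial transform, it can be applied directly to new, possibly incomplete samples; a small sketch with a hypothetical new row:

# Hypothetical new sample with a missing entry; the pipeline fills it
# with the column mean learned during fit before predicting
X_new = np.array([[nan, 6, 5]])
print(model.predict(X_new))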