PyCaret is an open-source, low-code machine learning library in Python that automates machine learning workflows (https://pycaret.org/). Install it with:

pip install pycaret
# enable_colab() improves display rendering when running in Google Colab
from pycaret.utils import enable_colab
enable_colab()
from pycaret.datasets import get_data
dataset = get_data('iris')
| | sepal_length | sepal_width | petal_length | petal_width | species |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |
data = dataset.sample(frac=0.9, random_state=786)
data_unseen = dataset.drop(data.index)
data.reset_index(drop=True, inplace=True)
data_unseen.reset_index(drop=True, inplace=True)
print('Data for Modeling: ' + str(data.shape))
print('Unseen Data For Predictions: ' + str(data_unseen.shape))
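The 90/10 split above can be reproduced without PyCaret. A minimal sketch using scikit-learn's copy of the iris data (which uses integer target codes rather than the string labels in PyCaret's `get_data('iris')`):

```python
import pandas as pd
from sklearn.datasets import load_iris

# 150 rows: 4 feature columns plus the target column
dataset = load_iris(as_frame=True).frame

# Sample 90% for modeling; keep the remaining rows as truly unseen data.
data = dataset.sample(frac=0.9, random_state=786)
data_unseen = dataset.drop(data.index)
data = data.reset_index(drop=True)
data_unseen = data_unseen.reset_index(drop=True)

print('Data for Modeling:', data.shape)               # (135, 5)
print('Unseen Data For Predictions:', data_unseen.shape)  # (15, 5)
```

Because `drop` removes exactly the sampled index labels, the two frames partition the original 150 rows with no overlap.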
Preprocessing is done with `setup()`:

from pycaret.classification import *
exp_mclf101 = setup(data = data, target = 'species', session_id=123)
| | Description | Value |
|---|---|---|
| 0 | session_id | 123 |
| 1 | Target | species |
| 2 | Target Type | Multiclass |
| 3 | Label Encoded | Iris-setosa: 0, Iris-versicolor: 1, Iris-virginica: 2 |
| 4 | Original Data | (135, 5) |
| 5 | Missing Values | False |
| 6 | Numeric Features | 4 |
| 7 | Categorical Features | 0 |
| 8 | Ordinal Features | False |
| 9 | High Cardinality Features | False |
| 10 | High Cardinality Method | None |
| 11 | Transformed Train Set | (94, 4) |
| 12 | Transformed Test Set | (41, 4) |
| 13 | Shuffle Train-Test | True |
| 14 | Stratify Train-Test | False |
| 15 | Fold Generator | StratifiedKFold |
| 16 | Fold Number | 10 |
| 17 | CPU Jobs | -1 |
| 18 | Use GPU | False |
| 19 | Log Experiment | False |
| 20 | Experiment Name | clf-default-name |
| 21 | USI | caab |
| 22 | Imputation Type | simple |
| 23 | Iterative Imputation Iteration | None |
| 24 | Numeric Imputer | mean |
| 25 | Iterative Imputation Numeric Model | None |
| 26 | Categorical Imputer | constant |
| 27 | Iterative Imputation Categorical Model | None |
| 28 | Unknown Categoricals Handling | least_frequent |
| 29 | Normalize | False |
| 30 | Normalize Method | None |
| 31 | Transformation | False |
| 32 | Transformation Method | None |
| 33 | PCA | False |
| 34 | PCA Method | None |
| 35 | PCA Components | None |
| 36 | Ignore Low Variance | False |
| 37 | Combine Rare Levels | False |
| 38 | Rare Level Threshold | None |
| 39 | Numeric Binning | False |
| 40 | Remove Outliers | False |
| 41 | Outliers Threshold | None |
| 42 | Remove Multicollinearity | False |
| 43 | Multicollinearity Threshold | None |
| 44 | Remove Perfect Collinearity | True |
| 45 | Clustering | False |
| 46 | Clustering Iteration | None |
| 47 | Polynomial Features | False |
| 48 | Polynomial Degree | None |
| 49 | Trignometry Features | False |
| 50 | Polynomial Threshold | None |
| 51 | Group Features | False |
| 52 | Feature Selection | False |
| 53 | Feature Selection Method | classic |
| 54 | Features Selection Threshold | None |
| 55 | Feature Interaction | False |
| 56 | Feature Ratio | False |
| 57 | Interaction Threshold | None |
| 58 | Fix Imbalance | False |
| 59 | Fix Imbalance Method | SMOTE |
best = compare_models()
| | Model | Accuracy | AUC | Recall | Prec. | F1 | Kappa | MCC | TT (Sec) |
|---|---|---|---|---|---|---|---|---|---|
| lda | Linear Discriminant Analysis | 0.9678 | 0.9963 | 0.9667 | 0.9758 | 0.9669 | 0.9515 | 0.9560 | 0.0100 |
| nb | Naive Bayes | 0.9578 | 0.9897 | 0.9556 | 0.9713 | 0.9546 | 0.9364 | 0.9442 | 0.0090 |
| qda | Quadratic Discriminant Analysis | 0.9567 | 1.0000 | 0.9556 | 0.9708 | 0.9533 | 0.9348 | 0.9433 | 0.0100 |
| lr | Logistic Regression | 0.9478 | 0.9963 | 0.9444 | 0.9638 | 0.9444 | 0.9212 | 0.9304 | 0.2640 |
| knn | K Neighbors Classifier | 0.9467 | 0.9926 | 0.9444 | 0.9630 | 0.9432 | 0.9197 | 0.9291 | 0.1190 |
| lightgbm | Light Gradient Boosting Machine | 0.9456 | 0.9852 | 0.9444 | 0.9625 | 0.9419 | 0.9182 | 0.9282 | 0.0290 |
| ada | Ada Boost Classifier | 0.9256 | 0.9809 | 0.9222 | 0.9505 | 0.9194 | 0.8879 | 0.9026 | 0.0620 |
| gbc | Gradient Boosting Classifier | 0.9256 | 0.9815 | 0.9222 | 0.9505 | 0.9194 | 0.8879 | 0.9026 | 0.1270 |
| et | Extra Trees Classifier | 0.9256 | 0.9926 | 0.9222 | 0.9505 | 0.9194 | 0.8879 | 0.9026 | 0.4650 |
| dt | Decision Tree Classifier | 0.9144 | 0.9369 | 0.9111 | 0.9366 | 0.9086 | 0.8712 | 0.8843 | 0.0100 |
| rf | Random Forest Classifier | 0.9144 | 0.9852 | 0.9111 | 0.9305 | 0.9101 | 0.8712 | 0.8813 | 0.5310 |
| svm | SVM - Linear Kernel | 0.8522 | 0.0000 | 0.8361 | 0.8261 | 0.8197 | 0.7755 | 0.8099 | 0.0640 |
| ridge | Ridge Classifier | 0.8300 | 0.0000 | 0.8222 | 0.8544 | 0.8178 | 0.7433 | 0.7648 | 0.0080 |
| dummy | Dummy Classifier | 0.3822 | 0.5000 | 0.3333 | 0.1480 | 0.2128 | 0.0000 | 0.0000 | 0.0070 |
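Conceptually, `compare_models()` runs k-fold cross-validation for every candidate estimator and ranks them by the chosen metric. A hand-rolled sketch of that idea with scikit-learn, using two of the models from the leaderboard above (the exact scores will differ from PyCaret's, which uses its own internal train split):

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = load_iris(return_X_y=True)
# 10-fold stratified CV, mirroring the Fold Generator / Fold Number settings
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=123)

candidates = {
    'lda': LinearDiscriminantAnalysis(),
    'lr': LogisticRegression(max_iter=1000),
}
scores = {
    name: cross_val_score(model, X, y, cv=cv, scoring='accuracy').mean()
    for name, model in candidates.items()
}

# Sort descending by mean accuracy, like the leaderboard above.
for name, acc in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f'{name}: {acc:.4f}')
```

PyCaret also lets you narrow or reorder the comparison via `compare_models()`'s `include` and `sort` parameters.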
Next, build a single model with `create_model()`. Here we create a Logistic Regression model ('lr'):

lr = create_model('lr')
| Fold | Accuracy | AUC | Recall | Prec. | F1 | Kappa | MCC |
|---|---|---|---|---|---|---|---|
| 0 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 |
| 1 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 |
| 2 | 0.9000 | 1.0000 | 0.8889 | 0.9250 | 0.8971 | 0.8485 | 0.8616 |
| 3 | 0.8000 | 1.0000 | 0.7778 | 0.8800 | 0.7750 | 0.6970 | 0.7435 |
| 4 | 0.8889 | 0.9630 | 0.8889 | 0.9167 | 0.8857 | 0.8333 | 0.8492 |
| 5 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 |
| 6 | 0.8889 | 1.0000 | 0.8889 | 0.9167 | 0.8857 | 0.8333 | 0.8492 |
| 7 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 |
| 8 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 |
| 9 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 |
| Mean | 0.9478 | 0.9963 | 0.9444 | 0.9638 | 0.9444 | 0.9212 | 0.9304 |
| Std | 0.0689 | 0.0111 | 0.0745 | 0.0456 | 0.0751 | 0.1041 | 0.0905 |
#trained model object is stored in the variable 'lr'.
print(lr)
Hyperparameter tuning is done with `tune_model()`, which uses a random grid search and, by default, optimizes Accuracy:

tuned_lr = tune_model(lr)
| Fold | Accuracy | AUC | Recall | Prec. | F1 | Kappa | MCC |
|---|---|---|---|---|---|---|---|
| 0 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 |
| 1 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 |
| 2 | 0.9000 | 1.0000 | 0.8889 | 0.9250 | 0.8971 | 0.8485 | 0.8616 |
| 3 | 0.8000 | 1.0000 | 0.7778 | 0.8800 | 0.7750 | 0.6970 | 0.7435 |
| 4 | 0.8889 | 1.0000 | 0.8889 | 0.9167 | 0.8857 | 0.8333 | 0.8492 |
| 5 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 |
| 6 | 0.8889 | 1.0000 | 0.8889 | 0.9167 | 0.8857 | 0.8333 | 0.8492 |
| 7 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 |
| 8 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 |
| 9 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 |
| Mean | 0.9478 | 1.0000 | 0.9444 | 0.9638 | 0.9444 | 0.9212 | 0.9304 |
| Std | 0.0689 | 0.0000 | 0.0745 | 0.0456 | 0.0751 | 0.1041 | 0.0905 |
#tuned model object is stored in the variable 'tuned_lr'.
print(tuned_lr)
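Under the hood, `tune_model()`'s random grid search resembles scikit-learn's `RandomizedSearchCV`. A sketch of the same idea (the grid values here are illustrative, not PyCaret's internal grid; PyCaret's default `n_iter` for `tune_model` is also 10):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold

X, y = load_iris(return_X_y=True)

search = RandomizedSearchCV(
    LogisticRegression(max_iter=1000),
    param_distributions={
        'C': np.logspace(-3, 2, 50),       # regularization strength candidates
        'class_weight': [None, 'balanced'],
    },
    n_iter=10,                             # 10 random draws from the grid
    scoring='accuracy',                    # the metric being optimized
    cv=StratifiedKFold(n_splits=10, shuffle=True, random_state=123),
    random_state=123,
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 4))
```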
Visualize the model with `plot_model()`:

plot_model(tuned_lr, plot = 'confusion_matrix')
plot_model(tuned_lr, plot = 'class_report')
plot_model(tuned_lr, plot = 'boundary')
plot_model(tuned_lr, plot = 'error')

`evaluate_model()` shows these plots in an interactive user interface:

evaluate_model(tuned_lr)
| Parameter | Value |
|---|---|
| C | 2.833 |
| class_weight | balanced |
| dual | False |
| fit_intercept | True |
| intercept_scaling | 1 |
| l1_ratio | None |
| max_iter | 1000 |
| multi_class | auto |
| n_jobs | None |
| penalty | l2 |
| random_state | 123 |
| solver | lbfgs |
| tol | 0.0001 |
| verbose | 0 |
| warm_start | False |
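For reference, the tuned hyperparameters in the table above correspond to roughly this plain scikit-learn estimator (a sketch; PyCaret's 'lr' wraps sklearn's LogisticRegression, and the unlisted arguments are sklearn defaults):

```python
from sklearn.linear_model import LogisticRegression

# Equivalent of the tuned model's hyperparameter table
tuned_lr_equivalent = LogisticRegression(
    C=2.833,                  # found by the random grid search
    class_weight='balanced',
    max_iter=1000,
    solver='lbfgs',
    random_state=123,
)
print(tuned_lr_equivalent)
```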
Check performance on the hold-out test set with `predict_model()`:

predict_model(tuned_lr);

| | Model | Accuracy | AUC | Recall | Prec. | F1 | Kappa | MCC |
|---|---|---|---|---|---|---|---|---|
| 0 | Logistic Regression | 0.9512 | 1.0000 | 0.9556 | 0.9566 | 0.9509 | 0.9253 | 0.9287 |
Finalize the model with `finalize_model()`, which retrains it on the full dataset, including the hold-out set:

final_lr = finalize_model(tuned_lr)
print(final_lr)
unseen_predictions = predict_model(final_lr, data=data_unseen)
unseen_predictions.head()

| | Model | Accuracy | AUC | Recall | Prec. | F1 | Kappa | MCC |
|---|---|---|---|---|---|---|---|---|
| 0 | Logistic Regression | 0.0000 | 1.0000 | 0 | 0 | 0 | 0 | 0 |

| | sepal_length | sepal_width | petal_length | petal_width | species | Label | Score |
|---|---|---|---|---|---|---|---|
| 0 | 5.4 | 3.9 | 1.7 | 0.4 | Iris-setosa | Iris-setosa | 0.9831 |
| 1 | 5.4 | 3.4 | 1.7 | 0.2 | Iris-setosa | Iris-setosa | 0.9644 |
| 2 | 5.1 | 3.3 | 1.7 | 0.5 | Iris-setosa | Iris-setosa | 0.9699 |
| 3 | 4.8 | 3.1 | 1.6 | 0.2 | Iris-setosa | Iris-setosa | 0.9781 |
| 4 | 6.9 | 3.1 | 4.9 | 1.5 | Iris-versicolor | Iris-versicolor | 0.8227 |
You can see that Label and Score columns have been added to the predictions. Save the model with `save_model()` and load it back with `load_model()`:

save_model(final_lr, 'Final lr Model')
saved_final_lr = load_model('Final lr Model')
new_prediction = predict_model(saved_final_lr, data=data_unseen)
new_prediction.head()
| | Model | Accuracy | AUC | Recall | Prec. | F1 | Kappa | MCC |
|---|---|---|---|---|---|---|---|---|
| 0 | Logistic Regression | 0.0000 | 1.0000 | 0 | 0 | 0 | 0 | 0 |

| | sepal_length | sepal_width | petal_length | petal_width | species | Label | Score |
|---|---|---|---|---|---|---|---|
| 0 | 5.4 | 3.9 | 1.7 | 0.4 | Iris-setosa | Iris-setosa | 0.9831 |
| 1 | 5.4 | 3.4 | 1.7 | 0.2 | Iris-setosa | Iris-setosa | 0.9644 |
| 2 | 5.1 | 3.3 | 1.7 | 0.5 | Iris-setosa | Iris-setosa | 0.9699 |
| 3 | 4.8 | 3.1 | 1.6 | 0.2 | Iris-setosa | Iris-setosa | 0.9781 |
| 4 | 6.9 | 3.1 | 4.9 | 1.5 | Iris-versicolor | Iris-versicolor | 0.8227 |
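`save_model()`/`load_model()` essentially pickle the whole preprocessing-plus-model pipeline to disk and restore it. A minimal equivalent of that round trip with joblib and a plain scikit-learn pipeline (a sketch of the mechanism, not PyCaret's exact file format):

```python
import joblib
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipeline.fit(X, y)

joblib.dump(pipeline, 'final_lr_pipeline.pkl')    # save_model() analogue
restored = joblib.load('final_lr_pipeline.pkl')   # load_model() analogue

# The restored pipeline predicts identically to the original.
assert (restored.predict(X) == pipeline.predict(X)).all()
```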