# Scikit-Learn

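Scikit-learn is an open-source Python library that implements machine learning, preprocessing, cross-validation, and visualization algorithms behind a unified interface.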

## A basic example

In [93]:
# Import libraries
from sklearn import neighbors, datasets, preprocessing
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Load data (only the first two features are used)
iris = datasets.load_iris()
X, y = iris.data[:, :2], iris.target

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=33)

# Preprocess: fit the scaler on the training set only, then apply it to both sets
scaler = preprocessing.StandardScaler().fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

# Train and predict
knn = neighbors.KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)

# Evaluate
accuracy_score(y_test, y_pred)

0.631578947368421

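## Loading data

Scikit-learn works with numeric data stored as NumPy arrays or SciPy sparse matrices. Other types that can be converted to numeric arrays, such as pandas DataFrames, are also supported.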
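
To make the point about input types concrete, here is a minimal sketch that fits the same classifier on one toy dataset held in three containers. The data, the column names, and the choice of `KNeighborsClassifier` are assumptions made for illustration only.

```python
import numpy as np
import pandas as pd
from scipy import sparse
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.RandomState(0)
dense = rng.random_sample((10, 3))  # hypothetical toy features
labels = np.array([0, 1] * 5)       # hypothetical toy labels

# The same numbers in three different containers
as_frame = pd.DataFrame(dense, columns=["a", "b", "c"])  # hypothetical column names
as_sparse = sparse.csr_matrix(dense)

preds = {}
for name, data in [("ndarray", dense), ("DataFrame", as_frame), ("csr_matrix", as_sparse)]:
    model = KNeighborsClassifier(n_neighbors=3).fit(data, labels)
    preds[name] = model.predict(data)
    print(name, preds[name].shape)
```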

In [94]:
import numpy as np

In [95]:
X = np.random.random((10, 5))

In [96]:
y = np.array(["M", "M", "F", "F", "M", "F", "M", "M", "F", "F"])

In [97]:
X[X < 0.7] = 0  # zero out small values; zeros are treated as missing values in the imputation step

## Train/test split

In [98]:
from sklearn.model_selection import train_test_split

In [99]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
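
The split above relies on the defaults of `train_test_split`. As a small sketch on hypothetical toy arrays, the default `test_size` of 0.25 sends 3 of 10 samples to the test set, and a fixed `random_state` makes the split reproducible.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X_demo = np.arange(20).reshape(10, 2)  # hypothetical toy features
y_demo = np.arange(10)                 # hypothetical toy targets

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)
print(X_tr.shape, X_te.shape)  # (7, 2) (3, 2)

# The same random_state yields the same split again
X_tr2, X_te2, _, _ = train_test_split(X_demo, y_demo, random_state=0)
print(np.array_equal(X_te, X_te2))  # True
```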

## Data preprocessing

### Standardization

In [100]:
from sklearn.preprocessing import StandardScaler

In [101]:
scaler = StandardScaler().fit(X_train)  # fit on the training set

In [102]:
standardized_X = scaler.transform(X_train)  # transform the training set

In [103]:
standardized_X_test = scaler.transform(X_test)  # transform the test set
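
As a quick check of what standardization does, the sketch below (on hypothetical toy data) compares the transformer's output with the per-column formula `(x - mean) / std` and confirms that the result has zero mean and unit variance.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
X_demo = rng.random_sample((8, 3)) * 10  # hypothetical toy data

scaler_demo = StandardScaler().fit(X_demo)
X_scaled = scaler_demo.transform(X_demo)

# StandardScaler applies (x - mean_) / scale_ to each column
manual = (X_demo - scaler_demo.mean_) / scaler_demo.scale_

print(np.allclose(X_scaled, manual))             # True
print(np.allclose(X_scaled.mean(axis=0), 0.0))   # True
print(np.allclose(X_scaled.std(axis=0), 1.0))    # True
```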

### Normalization

In [104]:
from sklearn.preprocessing import Normalizer

In [105]:
scaler = Normalizer().fit(X_train)  # fit (Normalizer is stateless, so this learns nothing)

In [106]:
normalized_X = scaler.transform(X_train)  # transform the training set

In [107]:
normalized_X_test = scaler.transform(X_test)  # transform the test set
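
Unlike `StandardScaler`, which rescales each column, `Normalizer` rescales each row (sample) to unit norm. A minimal sketch with hypothetical values, using the default L2 norm:

```python
import numpy as np
from sklearn.preprocessing import Normalizer

X_demo = np.array([[3.0, 4.0], [1.0, 0.0], [5.0, 12.0]])  # hypothetical toy data

X_unit = Normalizer().fit_transform(X_demo)  # default norm="l2"
row_norms = np.linalg.norm(X_unit, axis=1)

print(X_unit[0])   # [0.6 0.8], i.e. [3, 4] divided by its length 5
print(row_norms)   # every row now has length 1
```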

### Binarization

In [108]:
from sklearn.preprocessing import Binarizer

In [109]:
binarizer = Binarizer(threshold=0.0).fit(X)  # fit

In [110]:
binary_X = binarizer.transform(X)  # transform
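
For reference, a small sketch of the thresholding rule on hypothetical values: entries strictly greater than `threshold` become 1, and all others become 0.

```python
import numpy as np
from sklearn.preprocessing import Binarizer

X_demo = np.array([[0.8, 0.0, -0.2], [0.0, 0.7, 1.5]])  # hypothetical toy data

binary_demo = Binarizer(threshold=0.0).fit_transform(X_demo)
print(binary_demo)
# [[1. 0. 0.]
#  [0. 1. 1.]]
```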

### Encoding categorical features

In [111]:
from sklearn.preprocessing import LabelEncoder

In [112]:
enc = LabelEncoder()

In [113]:
y = enc.fit_transform(y)
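
The sketch below shows, on a short hypothetical label array, how `LabelEncoder` maps strings to integers: classes are sorted, so `"F"` becomes 0 and `"M"` becomes 1, and `inverse_transform` recovers the original strings.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

labels = np.array(["M", "M", "F", "F", "M"])  # hypothetical labels

enc_demo = LabelEncoder()
codes = enc_demo.fit_transform(labels)

print(enc_demo.classes_)                  # ['F' 'M']
print(codes)                              # [1 1 0 0 1]
print(enc_demo.inverse_transform(codes))  # ['M' 'M' 'F' 'F' 'M']
```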

### Imputing missing values

In [114]:
from sklearn.impute import SimpleImputer

In [115]:
imp = SimpleImputer(missing_values=0, strategy="mean")  # mean imputer that treats 0 as missing

In [116]:
imp.fit_transform(X_train)  # replace missing values with the column mean

array([[0.79652038, 0.92301062, 0.86399495, 0.82339757, 0.74378358],
       [0.93489692, 0.92301062, 0.89916118, 0.82339757, 0.74378358],
       [0.79652038, 0.92301062, 0.86399495, 0.82339757, 0.74378358],
       [0.73038114, 0.92301062, 0.86399495, 0.80657447, 0.74378358],
       [0.79652038, 0.92301062, 0.86399495, 0.84022067, 0.74378358],
       [0.72428308, 0.92301062, 0.82882871, 0.82339757, 0.74378358],
       [0.79652038, 0.92301062, 0.86399495, 0.82339757, 0.74378358]])
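
The repeated values in each column of the output above are the column means of the non-missing entries. A minimal sketch on hypothetical data makes the arithmetic easy to follow:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical toy data in which 0 marks a missing value
X_demo = np.array([[1.0, 0.0],
                   [0.0, 4.0],
                   [3.0, 8.0]])

imp_demo = SimpleImputer(missing_values=0, strategy="mean")
X_filled = imp_demo.fit_transform(X_demo)

print(imp_demo.statistics_)  # [2. 6.]: mean of (1, 3) and mean of (4, 8)
print(X_filled)
# [[1. 6.]
#  [2. 4.]
#  [3. 8.]]
```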

### Generating polynomial features

In [117]:
from sklearn.preprocessing import PolynomialFeatures

In [118]:
poly = PolynomialFeatures(5)

In [119]:
poly.fit_transform(X)

array([[1.        , 0.72428308, 0.        , ..., 0.        , 0.        ,
        0.        ],
       [1.        , 0.93489692, 0.        , ..., 0.        , 0.        ,
        0.        ],
       [1.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       ...,
       [1.        , 0.73038114, 0.        , ..., 0.        , 0.        ,
        0.        ],
       [1.        , 0.71508617, 0.        , ..., 0.        , 0.        ,
        0.        ],
       [1.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ]])
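
The degree-5 output above is too wide to read, so here is a sketch with a hypothetical two-feature input and degree 2. It also counts the columns produced by the degree-5 expansion of five features, which should equal the binomial coefficient C(5 + 5, 5).

```python
import numpy as np
from math import comb
from sklearn.preprocessing import PolynomialFeatures

X_small = np.array([[2.0, 3.0]])  # hypothetical input: x0 = 2, x1 = 3

# Degree 2 yields the columns 1, x0, x1, x0^2, x0*x1, x1^2
expanded = PolynomialFeatures(degree=2).fit_transform(X_small)
print(expanded)  # [[1. 2. 3. 4. 6. 9.]]

# Number of output columns for 5 input features at degree 5
n_out = PolynomialFeatures(degree=5).fit(np.zeros((1, 5))).n_output_features_
print(n_out, comb(5 + 5, 5))  # 252 252
```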

## Creating models

### Supervised learning estimators

**Linear regression**

In [120]:
from sklearn.linear_model import LinearRegression

In [121]:
lr = LinearRegression()

**Support vector machine (SVM)**

In [122]:
from sklearn.svm import SVC

In [123]:
svc = SVC(kernel="linear")

**Naive Bayes**

In [124]:
from sklearn.naive_bayes import GaussianNB

In [125]:
gnb = GaussianNB()

**KNN**

In [126]:
from sklearn import neighbors

In [127]:
knn = neighbors.KNeighborsClassifier(n_neighbors=5)

### Unsupervised learning estimators

**Principal component analysis (PCA)**

In [128]:
from sklearn.decomposition import PCA

In [129]:
pca = PCA(n_components=0.95)
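
When `n_components` is a float between 0 and 1, PCA keeps the smallest number of components whose cumulative explained variance reaches that fraction. The following sketch uses synthetic data (an assumption for illustration: five features driven by two latent factors plus small noise) to show the effect.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
# Hypothetical data: 5 observed features generated from 2 latent factors
latent = rng.normal(size=(200, 2))
X_demo = latent @ rng.normal(size=(2, 5)) + 0.01 * rng.normal(size=(200, 5))

pca_demo = PCA(n_components=0.95).fit(X_demo)

print(pca_demo.n_components_)                     # number of components kept
print(pca_demo.explained_variance_ratio_.sum())   # at least 0.95
```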

**K-Means clustering**

In [130]:
from sklearn.cluster import KMeans

In [131]:
k_means = KMeans(n_clusters=3, random_state=0)

## Model fitting

### Supervised learning

In [133]:
lr.fit(X, y)  # fit the model to the data

In [134]:
knn.fit(X_train, y_train)

In [135]:
svc.fit(X_train, y_train)

### Unsupervised learning

In [136]:
k_means.fit(X_train)  # fit the model to the data

In [137]:
pca_model = pca.fit_transform(X_train)  # fit the model and transform the data

## Prediction

### Supervised estimators

In [138]:
y_pred = svc.predict(np.random.random((2, 5)))  # predict labels

In [139]:
y_pred = lr.predict(X_test)  # predict target values

In [140]:
y_pred = knn.predict_proba(X_test)  # estimate class probabilities
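
Note that `predict_proba` returns one row per sample and one column per class rather than a label vector. A minimal sketch, assuming the iris dataset purely for illustration:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

X_iris, y_iris = load_iris(return_X_y=True)
knn_demo = KNeighborsClassifier(n_neighbors=5).fit(X_iris, y_iris)

proba = knn_demo.predict_proba(X_iris[:3])
print(proba.shape)        # (3, 3): columns follow knn_demo.classes_
print(proba.sum(axis=1))  # each row sums to 1

# The most probable class for these samples matches predict()
top_class = knn_demo.classes_[proba.argmax(axis=1)]
print(top_class, knn_demo.predict(X_iris[:3]))
```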

### Unsupervised estimators

In [141]:
y_pred = k_means.predict(X_test)  # predict cluster labels

## Evaluating model performance

### Classification metrics

**Accuracy**

In [142]:
svc.fit(X_train, y_train)
svc.score(X_test, y_test)  # the estimator's own score method

0.3333333333333333

In [143]:
from sklearn.metrics import accuracy_score  # metric scoring function

In [144]:
y_pred = svc.predict(X_test)
accuracy_score(y_test, y_pred)  # compute accuracy

0.3333333333333333

**Classification report**

In [145]:
from sklearn.metrics import classification_report  # precision, recall, F1 score, and support

In [146]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           F       0.50      0.50      0.50         2
           M       0.00      0.00      0.00         1

    accuracy                           0.33         3
   macro avg       0.25      0.25      0.25         3
weighted avg       0.33      0.33      0.33         3



**Confusion matrix**

In [147]:
from sklearn.metrics import confusion_matrix

In [148]:
print(confusion_matrix(y_test, y_pred))

[[1 1]
 [1 0]]
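
To read the matrix: rows are true classes and columns are predicted classes, in sorted label order. The sketch below uses hypothetical labels chosen so that they reproduce the same matrix; they are not the notebook's actual `y_test` and `y_pred`.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical labels chosen to reproduce the matrix shown above
y_true_demo = ["F", "F", "M"]
y_pred_demo = ["F", "M", "F"]

cm = confusion_matrix(y_true_demo, y_pred_demo, labels=["F", "M"])
print(cm)
# [[1 1]   true F: 1 predicted F, 1 predicted M
#  [1 0]]  true M: 1 predicted F, 0 predicted M

# Accuracy is the diagonal sum divided by the total count
print(np.trace(cm) / cm.sum())  # 1 correct out of 3
```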


### Regression metrics

**Mean absolute error**

In [149]:
from sklearn.metrics import mean_absolute_error
import pandas as pd
import numpy as np

In [150]:
data_url = "http://lib.stat.cmu.edu/datasets/boston"
raw_df = pd.read_csv(data_url, sep=r"\s+", skiprows=22, header=None)  # raw string avoids an invalid-escape warning
data = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])
target = raw_df.values[1::2, 2]
X, y = data, target
house_X_train, house_X_test, house_y_train, house_y_test = train_test_split(
    X, y, random_state=0
)

In [151]:
from sklearn.tree import DecisionTreeRegressor

dt = DecisionTreeRegressor().fit(house_X_train, house_y_train)
house_y_pred = dt.predict(house_X_test)
mean_absolute_error(house_y_test, house_y_pred)

3.2968503937007867

**Mean squared error**

In [152]:
from sklearn.metrics import mean_squared_error

In [153]:
mean_squared_error(house_y_test, house_y_pred)

28.04307086614173

**R^2 score**

In [154]:
from sklearn.metrics import r2_score

In [155]:
r2_score(house_y_test, house_y_pred)

0.6567514220852202
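
For intuition, the three regression metrics can be computed by hand and compared with the library functions. The values below are hypothetical and chosen so that the arithmetic is easy to check.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true_demo = np.array([3.0, 5.0, 7.0, 9.0])   # hypothetical targets
y_pred_demo = np.array([2.0, 5.0, 8.0, 11.0])  # hypothetical predictions

errors = y_true_demo - y_pred_demo             # [ 1.  0. -1. -2.]
mae = np.mean(np.abs(errors))                  # (1 + 0 + 1 + 2) / 4 = 1.0
mse = np.mean(errors ** 2)                     # (1 + 0 + 1 + 4) / 4 = 1.5
ss_res = np.sum(errors ** 2)                   # 6.0
ss_tot = np.sum((y_true_demo - y_true_demo.mean()) ** 2)  # 20.0
r2 = 1 - ss_res / ss_tot                       # 1 - 6/20 = 0.7

print(mae, mean_absolute_error(y_true_demo, y_pred_demo))
print(mse, mean_squared_error(y_true_demo, y_pred_demo))
print(r2, r2_score(y_true_demo, y_pred_demo))
```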

### Clustering metrics

**Adjusted Rand index**

In [156]:
from sklearn.metrics import adjusted_rand_score

In [157]:
adjusted_rand_score(y_test, y_pred)

**Homogeneity**

In [None]:
from sklearn.metrics import homogeneity_score

In [None]:
homogeneity_score(y_test, y_pred)

**V-measure**

In [162]:
import sklearn.metrics as metrics

metrics.v_measure_score(y_test, y_pred)

0.2740175421212811

### Cross-validation

In [163]:
from sklearn.model_selection import cross_val_score

In [164]:
print(cross_val_score(knn, X_train, y_train, cv=4))

[0.5 0.5 0.5 1. ]




In [165]:
print(cross_val_score(lr, X, y, cv=2))

[ 0.6069688  -2.25273434]
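
As a rough sketch of what `cross_val_score` does for a regressor, the loop below splits hypothetical synthetic data with an unshuffled `KFold`, fits a fresh model on each training part, and scores it on the held-out part. The per-fold scores are R^2 values, which can be negative when a model does worse than predicting the mean, as in the output above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.RandomState(0)
X_demo = rng.random_sample((40, 3))  # hypothetical features
y_demo = X_demo @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=40)

auto_scores = cross_val_score(LinearRegression(), X_demo, y_demo, cv=4)

# Manual equivalent: one fit and one score per fold
manual_scores = []
for train_idx, test_idx in KFold(n_splits=4).split(X_demo):
    fitted = LinearRegression().fit(X_demo[train_idx], y_demo[train_idx])
    manual_scores.append(fitted.score(X_demo[test_idx], y_demo[test_idx]))

print(auto_scores)
print(np.array(manual_scores))
```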


## Model tuning and optimization

### Randomized search for hyperparameter optimization

In [170]:
from sklearn.model_selection import RandomizedSearchCV

params = {"n_neighbors": range(1, 5), "weights": ["uniform", "distance"]}

rsearch = RandomizedSearchCV(
    estimator=knn, param_distributions=params, cv=4, n_iter=8, random_state=5
)

rsearch.fit(X_train, y_train)
print(rsearch.best_score_)

0.75
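
Beyond `best_score_`, the fitted search object also exposes the winning parameter combination and every candidate that was tried. The sketch below assumes the iris dataset for illustration. Because the grid has only 4 x 2 = 8 combinations and `n_iter=8`, the random search here ends up trying every combination once.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import RandomizedSearchCV
from sklearn.neighbors import KNeighborsClassifier

X_iris, y_iris = load_iris(return_X_y=True)
params_demo = {"n_neighbors": range(1, 5), "weights": ["uniform", "distance"]}

search = RandomizedSearchCV(
    estimator=KNeighborsClassifier(),
    param_distributions=params_demo,
    cv=4,
    n_iter=8,
    random_state=5,
).fit(X_iris, y_iris)

print(search.best_params_)                # the best parameter combination found
print(search.best_score_)                 # its mean cross-validated accuracy
print(len(search.cv_results_["params"]))  # 8 candidates evaluated
```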


