These are notes on scikit-learn functions I recorded while studying, with some extensions.
The sklearn function train_test_split() splits arrays or matrices into random train and test subsets.

```python
from sklearn.model_selection import train_test_split

train_set, test_set = train_test_split(housing, test_size=0.2, random_state=42)
```

random_state : acts as the seed for the random number generator.
stratify : if not None, the data is split in a stratified fashion, using this array as the class labels (see the sketch below).
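A minimal sketch of a stratified split; the income_cat column here is a hypothetical stand-in for whatever class-label column you stratify on:

```python
# stratified split: train and test preserve the class proportions of income_cat
train_set, test_set = train_test_split(
    housing, test_size=0.2, random_state=42,
    stratify=housing["income_cat"])
```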
StratifiedShuffleSplit() splits the data into multiple randomly shuffled train/test pairs.

```python
from sklearn.model_selection import StratifiedShuffleSplit

StratifiedShuffleSplit(n_splits=10, test_size=None, train_size=None, random_state=None)
```

n_splits : the number of train/test pairs to generate; set as needed, default 10.
test_size and train_size : set the proportions of the test and train parts in each pair.
random_state : acts as the random number seed.
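A usage sketch (income_cat is again a hypothetical stratification column); split() yields index arrays for each generated pair:

```python
split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in split.split(housing, housing["income_cat"]):
    strat_train_set = housing.loc[train_index]   # rows selected for training
    strat_test_set = housing.loc[test_index]     # rows selected for testing
```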
The CategoricalEncoder class encodes an array with onehot or ordinal encoding. It returns a sparse array, which can be converted to a dense array with toarray(); alternatively, specify the encoding type as onehot-dense to get a dense matrix directly.

```python
from sklearn.preprocessing import CategoricalEncoder

cat_encoder = CategoricalEncoder()
housing_cat_reshaped = housing_cat.values.reshape(-1, 1)
housing_cat_1hot = cat_encoder.fit_transform(housing_cat_reshaped)
housing_cat_1hot
```

encoding : str, 'onehot', 'onehot-dense' or 'ordinal'; specifies the encoding type, default 'onehot' (see the sketch after this parameter list).
categories : ‘auto’ or a list of lists/arrays of values.
dtype : number type, default np.float64
handle_unknown : ‘error’ (default) or ‘ignore’
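Building on the example above, a sketch of the two alternatives to the default sparse onehot output, using only the encoding values documented here:

```python
# ordinal encoding: each category becomes an integer code
ordinal_encoder = CategoricalEncoder(encoding="ordinal")
housing_cat_ordinal = ordinal_encoder.fit_transform(housing_cat_reshaped)

# onehot-dense: same onehot encoding, but returns a dense matrix directly,
# avoiding the toarray() step
dense_encoder = CategoricalEncoder(encoding="onehot-dense")
housing_cat_dense = dense_encoder.fit_transform(housing_cat_reshaped)
```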
MinMaxScaler() performs MinMax scaling (normalization).

```python
>>> from sklearn.preprocessing import MinMaxScaler
>>>
>>> data = [[-1, 2], [-0.5, 6], [0, 10], [1, 18]]
>>> scaler = MinMaxScaler()
>>> print(scaler.fit(data))
MinMaxScaler(copy=True, feature_range=(0, 1))
>>> print(scaler.data_max_)
[  1.  18.]
>>> print(scaler.transform(data))
[[ 0.    0.  ]
 [ 0.25  0.25]
 [ 0.5   0.5 ]
 [ 1.    1.  ]]
>>> print(scaler.transform([[2, 2]]))
[[ 1.5  0. ]]
```

feature_range : tuple (min, max), default=(0, 1); the range of the scaled values.
copy : boolean, optional, default True; whether to copy the data and scale the copy rather than scaling in place.
StandardScaler() performs zero-mean standardization.

```python
>>> from sklearn.preprocessing import StandardScaler
>>>
>>> data = [[0, 0], [0, 0], [1, 1], [1, 1]]
>>> scaler = StandardScaler()
>>> print(scaler.fit(data))
StandardScaler(copy=True, with_mean=True, with_std=True)
>>> print(scaler.mean_)
[ 0.5  0.5]
>>> print(scaler.transform(data))
[[-1. -1.]
 [-1. -1.]
 [ 1.  1.]
 [ 1.  1.]]
>>> print(scaler.transform([[2, 2]]))
[[ 3.  3.]]
```

copy : boolean, optional, default True; whether to copy the data and operate on the copy.
with_mean : boolean, True by default; if True, center the data before scaling. This does not work on sparse matrices.
with_std : boolean, True by default; if True, scale the data to unit variance (equivalently, unit standard deviation).
mean_squared_error() computes the mean squared error (MSE, and from it the RMSE).

```python
import numpy as np
from sklearn.metrics import mean_squared_error

housing_predictions = lin_reg.predict(housing_prepared)
lin_mse = mean_squared_error(housing_labels, housing_predictions)
lin_rmse = np.sqrt(lin_mse)  # RMSE is the square root of MSE
lin_rmse
```

y_true : array-like of shape = (n_samples) or (n_samples, n_outputs); ground-truth target values.
y_pred : array-like of shape = (n_samples) or (n_samples, n_outputs); estimated target values.
Returns: loss : float or ndarray of floats.
mean_absolute_error() computes the mean absolute error (MAE).

```python
from sklearn.metrics import mean_absolute_error

lin_mae = mean_absolute_error(housing_labels, housing_predictions)
lin_mae
```

y_true : array-like of shape = (n_samples) or (n_samples, n_outputs); ground-truth target values.
y_pred : array-like of shape = (n_samples) or (n_samples, n_outputs); estimated target values.
Returns: loss : float or ndarray of floats.
LinearRegression() — the linear regression model.

```python
from sklearn.linear_model import LinearRegression

lin_reg = LinearRegression()
lin_reg.fit(housing_prepared, housing_labels)
# Out: LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

# let's try the full pipeline on a few training instances
some_data = housing.iloc[:5]
some_labels = housing_labels.iloc[:5]
some_data_prepared = full_pipeline.transform(some_data)

print("Predictions:", lin_reg.predict(some_data_prepared))
# Out: Predictions: [ 210644.60459286  317768.80697211  210956.43331178
#                      59218.98886849  189747.55849879]
```
| Methods | Description |
| --- | --- |
| fit(X, y[, sample_weight]) | Fit linear model. |
| get_params([deep]) | Get parameters for this estimator. |
| predict(X) | Predict using the linear model. |
| score(X, y[, sample_weight]) | Returns the coefficient of determination R^2 of the prediction. |
| set_params(**params) | Set the parameters of this estimator. |
DecisionTreeRegressor() — the decision tree regression model.

```python
from sklearn.tree import DecisionTreeRegressor

tree_reg = DecisionTreeRegressor(random_state=42)
tree_reg.fit(housing_prepared, housing_labels)
# Out:
# DecisionTreeRegressor(criterion='mse', max_depth=None, max_features=None,
#            max_leaf_nodes=None, min_impurity_decrease=0.0,
#            min_impurity_split=None, min_samples_leaf=1,
#            min_samples_split=2, min_weight_fraction_leaf=0.0,
#            presort=False, random_state=42, splitter='best')

housing_predictions = tree_reg.predict(housing_prepared)

# compute the RMSE
tree_mse = mean_squared_error(housing_labels, housing_predictions)
tree_rmse = np.sqrt(tree_mse)
tree_rmse
# Out: 0.0  (a perfect training score: the tree has overfit the training data)
```
| Methods | Description |
| --- | --- |
| apply(X[, check_input]) | Returns the index of the leaf that each sample is predicted as. |
| decision_path(X[, check_input]) | Return the decision path in the tree. |
| fit(X, y[, sample_weight, check_input, …]) | Build a decision tree regressor from the training set (X, y). |
| get_params([deep]) | Get parameters for this estimator. |
| predict(X[, check_input]) | Predict class or regression value for X. |
| score(X, y[, sample_weight]) | Returns the coefficient of determination R^2 of the prediction. |
| set_params(**params) | Set the parameters of this estimator. |
RandomForestRegressor() — random forest regression.

```python
from sklearn.ensemble import RandomForestRegressor

forest_reg = RandomForestRegressor(random_state=42)
forest_reg.fit(housing_prepared, housing_labels)

housing_predictions = forest_reg.predict(housing_prepared)
forest_mse = mean_squared_error(housing_labels, housing_predictions)
forest_rmse = np.sqrt(forest_mse)
forest_rmse
```
| Methods | Description |
| --- | --- |
| apply(X) | Apply trees in the forest to X, return leaf indices. |
| decision_path(X) | Return the decision path in the forest. |
| fit(X, y[, sample_weight]) | Build a forest of trees from the training set (X, y). |
| get_params([deep]) | Get parameters for this estimator. |
| predict(X) | Predict regression target for X. |
| score(X, y[, sample_weight]) | Returns the coefficient of determination R^2 of the prediction. |
| set_params(**params) | Set the parameters of this estimator. |
cross_val_score() performs K-fold cross-validation. It expects a utility function (greater is better) rather than a cost function, so with MSE the scoring function returns negative values; that is why the scores are negated (-scores) before taking the square root.

```python
from sklearn.model_selection import cross_val_score

scores = cross_val_score(tree_reg, housing_prepared, housing_labels,
                         scoring="neg_mean_squared_error", cv=10)
tree_rmse_scores = np.sqrt(-scores)
```

estimator : estimator object implementing 'fit'; the object used to fit the data, here the decision tree tree_reg.
X : array-like; the data to fit, e.g. a list or an array.
y : array-like, optional, default: None; the target variable to try to predict in supervised learning.
scoring : string, callable or None, optional, default: None; a string (see the model evaluation documentation).
cv : int, cross-validation generator or an iterable, optional; determines the cross-validation splitting strategy (K-fold).
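A small helper, common practice for summarizing the ten fold scores returned above:

```python
def display_scores(scores):
    print("Scores:", scores)
    print("Mean:", scores.mean())
    print("Standard deviation:", scores.std())

display_scores(tree_rmse_scores)
```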
joblib — saving a model.

```python
from sklearn.externals import joblib

joblib.dump(my_model, "my_model.pkl")
# and later...
my_model_loaded = joblib.load("my_model.pkl")
```
SVR() — ε-SVM regression.

```python
from sklearn.svm import SVR

svm_reg = SVR(kernel="linear")
svm_reg.fit(housing_prepared, housing_labels)

housing_predictions = svm_reg.predict(housing_prepared)
svm_mse = mean_squared_error(housing_labels, housing_predictions)
svm_rmse = np.sqrt(svm_mse)
svm_rmse
```

C : float, optional (default=1.0); penalty parameter of the error term.
kernel : string, optional (default='rbf'); must be 'linear', 'poly', 'rbf', 'sigmoid', 'precomputed', or a callable. If a callable is given, it is used to precompute the kernel matrix.
| Methods | Description |
| --- | --- |
| fit(X, y[, sample_weight]) | Fit the SVM model according to the given training data. |
| get_params([deep]) | Get parameters for this estimator. |
| predict(X) | Perform regression on samples in X. |
| score(X, y[, sample_weight]) | Returns the coefficient of determination R^2 of the prediction. |
| set_params(**params) | Set the parameters of this estimator. |
GridSearchCV() performs an exhaustive search over specified parameter values for an estimator.

```python
from sklearn.model_selection import GridSearchCV

param_grid = [
    # try 12 (3×4) combinations of hyperparameters
    {'n_estimators': [3, 10, 30], 'max_features': [2, 4, 6, 8]},
    # then try 6 (2×3) combinations with bootstrap set as False
    {'bootstrap': [False], 'n_estimators': [3, 10], 'max_features': [2, 3, 4]},
]

forest_reg = RandomForestRegressor(random_state=42)
# train across 5 folds, that's a total of (12+6)*5=90 rounds of training
grid_search = GridSearchCV(forest_reg, param_grid, cv=5,
                           scoring='neg_mean_squared_error')
grid_search.fit(housing_prepared, housing_labels)

grid_search.best_params_
# Out: {'max_features': 8, 'n_estimators': 30}

grid_search.best_estimator_
# Out:
# RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
#            max_features=8, max_leaf_nodes=None, min_impurity_decrease=0.0,
#            min_impurity_split=None, min_samples_leaf=1,
#            min_samples_split=2, min_weight_fraction_leaf=0.0,
#            n_estimators=30, n_jobs=1, oob_score=False, random_state=42,
#            verbose=0, warm_start=False)
```
estimator : estimator object; the estimator must provide a score function, or scoring must be specified.
param_grid : dict or list of dictionaries; a dictionary with parameter names as keys and lists of settings as values, or a list of such dictionaries.
scoring : string, callable, list/tuple, dict or None, default: None.
cv : int, cross-validation generator or an iterable, optional; if an integer, it specifies the number of folds in a KFold.
refit : boolean, or string, default=True; refit the estimator with the best found parameters on the whole dataset.
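The evaluation score of every combination tried is available in cv_results_; a sketch of printing the RMSE for each:

```python
import numpy as np

cvres = grid_search.cv_results_
for mean_score, params in zip(cvres["mean_test_score"], cvres["params"]):
    print(np.sqrt(-mean_score), params)   # RMSE of each parameter combination
```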
| Methods | Description |
| --- | --- |
| decision_function(X) | Call decision_function on the estimator with the best found parameters. |
| fit(X[, y, groups]) | Run fit with all sets of parameters. |
| get_params([deep]) | Get parameters for this estimator. |
| inverse_transform(Xt) | Call inverse_transform on the estimator with the best found params. |
| predict(X) | Call predict on the estimator with the best found parameters. |
| predict_log_proba(X) | Call predict_log_proba on the estimator with the best found parameters. |
| predict_proba(X) | Call predict_proba on the estimator with the best found parameters. |
| score(X[, y]) | Returns the score on the given data, if the estimator has been refit. |
| set_params(**params) | Set the parameters of this estimator. |
| transform(X) | Call transform on the estimator with the best found parameters. |
RandomizedSearchCV() samples parameter settings from the given distributions instead of trying every combination.

```python
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint

param_distribs = {
    'n_estimators': randint(low=1, high=200),
    'max_features': randint(low=1, high=8),
}

forest_reg = RandomForestRegressor(random_state=42)
rnd_search = RandomizedSearchCV(forest_reg, param_distributions=param_distribs,
                                n_iter=10, cv=5,
                                scoring='neg_mean_squared_error',
                                random_state=42)
rnd_search.fit(housing_prepared, housing_labels)
```
estimator : estimator object; the estimator to tune.
param_distributions : dict; a dictionary with parameter names as keys and either lists of settings or distributions as values. A distribution must provide an rvs method for sampling, such as those from scipy.stats.distributions.
n_iter : int, default=10; the number of parameter settings that are sampled.
scoring : string, callable, list/tuple, dict or None, default: None
cv : int, cross-validation generator or an iterable, optional
refit : boolean, or string default=True
random_state : int, RandomState instance or None, optional, default=None
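After fitting, the best sampled combination is exposed the same way as in GridSearchCV:

```python
rnd_search.best_params_     # best parameter combination found by the sampling
rnd_search.best_estimator_  # estimator refit on the whole training set (refit=True)
```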
Imputer() handles missing values; every attribute must be numerical.

```python
import pandas as pd
from sklearn.preprocessing import Imputer

# specify what replaces the missing values, here the median
imputer = Imputer(strategy="median")
# fit the instance to the data
imputer.fit(housing_num)
# the results are stored in the statistics_ attribute
imputer.statistics_

# replace the missing values
X = imputer.transform(housing_num)
housing_tr = pd.DataFrame(X, columns=housing_num.columns,
                          index=list(housing.index.values))
# preview
housing_tr.loc[sample_incomplete_rows.index.values]
```
fetch_mldata() downloads popular datasets.

```python
# for example, download the MNIST dataset
from sklearn.datasets import fetch_mldata

mnist = fetch_mldata('MNIST original')
```

dataname : str; the name of the dataset on mldata.org; the raw name is automatically converted to an mldata.org URL.
cross_val_predict() generates cross-validated estimates for each input data point.

```python
from sklearn.model_selection import cross_val_predict

y_train_pred = cross_val_predict(sgd_clf, X_train, y_train_5, cv=3)
```
estimator : estimator object implementing ‘fit’ and ‘predict’
X : array-like
y : array-like, optional, default: None
cv : int, cross-validation generator or an iterable, optional
Obtaining decision scores (rather than class predictions) via cross-validation:

```python
y_scores = cross_val_predict(sgd_clf, X_train, y_train_5, cv=3,
                             method="decision_function")
```
confusion_matrix() computes the confusion matrix to evaluate the accuracy of a classification; each row represents an actual class and each column a predicted class.

```python
from sklearn.metrics import confusion_matrix

confusion_matrix(y_train_5, y_train_pred)
```

y_true : array, shape = [n_samples]; the ground-truth target values.
y_pred : array, shape = [n_samples]; the estimated targets returned by the classifier.
labels : array, shape = [n_classes], optional; a list of labels to index the matrix.
sample_weight : array-like of shape = [n_samples], optional; sample weights.
f1_score() computes the F1 score. F1 can be interpreted as a weighted average of precision and recall; to get a high F1, both precision and recall must be high:

F1 = 2 * (precision * recall) / (precision + recall)

```python
from sklearn.metrics import f1_score

f1_score(y_train_5, y_train_pred)
```
y_true : 1d array-like, or label indicator array / sparse matrix; the ground-truth target values.
y_pred : 1d array-like, or label indicator array / sparse matrix; the estimated targets returned by the classifier.
average : string, [None, 'binary' (default), 'micro', 'macro', 'samples', 'weighted']; this parameter is required for multiclass/multilabel targets. If None, the score for each class is returned; otherwise, one of the following averaging strategies is performed:
| keys | Description |
| --- | --- |
| 'binary' | Only report results for the class specified by pos_label. This is applicable only if targets (y_{true,pred}) are binary. |
| 'micro' | Calculate metrics globally by counting the total true positives, false negatives and false positives. |
| 'macro' | Calculate metrics for each label, and find their unweighted mean. This does not take label imbalance into account. |
| 'weighted' | Calculate metrics for each label, and find their average, weighted by support (the number of true instances for each label). This alters 'macro' to account for label imbalance; it can result in an F-score that is not between precision and recall. |
| 'samples' | Calculate metrics for each instance, and find their average (only meaningful for multilabel classification where this differs from accuracy_score). |
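A small sketch of the averaging options, using made-up multiclass labels:

```python
from sklearn.metrics import f1_score

y_true = [0, 1, 2, 0, 1, 2]
y_pred = [0, 2, 1, 0, 0, 1]
f1_score(y_true, y_pred, average='macro')  # unweighted mean of per-class F1
f1_score(y_true, y_pred, average=None)     # one F1 score per class
```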
decision_function() returns a decision score for each instance, which makes it convenient to apply your own threshold.

```python
y_scores = sgd_clf.decision_function([some_digit])

threshold = 0  # set the threshold
y_some_digit_pred = (y_scores > threshold)

threshold = 200000  # set a higher threshold
y_some_digit_pred = (y_scores > threshold)
```
Obtaining the scores for the whole training set via cross-validation:

```python
y_scores = cross_val_predict(sgd_clf, X_train, y_train_5, cv=3,
                             method="decision_function")
```
precision_recall_curve() computes, for different probability thresholds, the precision ratio tp / (tp + fp) and the recall ratio tp / (tp + fn).

```python
from sklearn.metrics import precision_recall_curve

precisions, recalls, thresholds = precision_recall_curve(y_train_5, y_scores)
```

Parameters
y_true : array, shape = [n_samples]; true binary labels in {-1, 1} or {0, 1}.
probas_pred : array, shape = [n_samples]; estimated probabilities or decision function.
Returns
precision : array, shape = [n_thresholds + 1]
recall : array, shape = [n_thresholds + 1]
thresholds : array, shape = [n_thresholds <= len(np.unique(probas_pred))]
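A minimal plotting sketch (matplotlib assumed) of precision and recall against the threshold; since precisions and recalls have one more element than thresholds (see the shapes above), the last element is dropped:

```python
import matplotlib.pyplot as plt

def plot_precision_recall_vs_threshold(precisions, recalls, thresholds):
    plt.plot(thresholds, precisions[:-1], "b--", label="Precision")
    plt.plot(thresholds, recalls[:-1], "g-", label="Recall")
    plt.xlabel("Threshold")
    plt.legend(loc="center left")

plot_precision_recall_vs_threshold(precisions, recalls, thresholds)
plt.show()
```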
precision_score() — the precision is the ratio tp / (tp + fp).

```python
precision_score(y_train_5, y_train_pred_90)
```

Parameters
y_true : 1d array-like, or label indicator array / sparse matrix; the ground-truth target values.
y_pred : 1d array-like, or label indicator array / sparse matrix; the estimated targets returned by the classifier.
Returns
precision : float (if average is not None) or array of float, shape = [n_unique_labels]
recall_score() — the recall is the ratio tp / (tp + fn).

```python
recall_score(y_train_5, y_train_pred_90)
```

Parameters
y_true : 1d array-like, or label indicator array / sparse matrix; the ground-truth target values.
y_pred : 1d array-like, or label indicator array / sparse matrix; the estimated targets returned by the classifier.
Returns
recall : float (if average is not None) or array of float, shape = [n_unique_labels]
roc_curve() computes the ROC curve. Note: this implementation is restricted to binary classification tasks.

```python
from sklearn.metrics import roc_curve

fpr, tpr, thresholds = roc_curve(y_train_5, y_scores)
```

Parameters
y_true : array, shape = [n_samples]; true binary labels in {0, 1} or {-1, 1}. If the labels are not binary, pos_label should be explicitly given.
y_score : array, shape = [n_samples]; target scores: probability estimates of the positive class, confidence values, or a non-thresholded measure of decisions (as returned by a classifier's decision_function).
pos_label : int or str, default=None; the label considered positive; the others are considered negative.
Returns
fpr : array, shape = [>2]
tpr : array, shape = [>2]
thresholds : array, shape = [n_thresholds]
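A minimal ROC plotting sketch (matplotlib assumed); the dashed diagonal represents a purely random classifier:

```python
import matplotlib.pyplot as plt

def plot_roc_curve(fpr, tpr, label=None):
    plt.plot(fpr, tpr, linewidth=2, label=label)
    plt.plot([0, 1], [0, 1], 'k--')   # random-classifier baseline
    plt.xlabel("False Positive Rate")
    plt.ylabel("True Positive Rate")

plot_roc_curve(fpr, tpr)
plt.show()
```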
roc_auc_score() computes the ROC AUC from prediction scores. Note: this implementation is restricted to binary classification tasks, or multilabel classification tasks in label indicator format.

```python
from sklearn.metrics import roc_auc_score

roc_auc_score(y_train_5, y_scores)
```
Parameters
y_true : array, shape = [n_samples] or [n_samples, n_classes]; true binary labels in binary label indicators.
y_score : array, shape = [n_samples] or [n_samples, n_classes]; target scores: probability estimates of the positive class, confidence values, or a non-thresholded measure of decisions (as returned by a classifier's decision_function).
Returns: auc : float
classes_ (array) — the target classes the classifier was trained on; the list is stored in its classes_ attribute, ordered by value. For example:

```python
sgd_clf.classes_
# array([ 0.,  1.,  2.,  3.,  4.,  5.,  6.,  7.,  8.,  9.])

sgd_clf.classes_[5]
# 5.0
```
Forcing sklearn to use the OvO or OvA strategy. When a binary classifier is used to train a multiclass classifier, sklearn automatically applies the OvA strategy, except for SVM classifiers, for which it uses OvO.

To make sklearn use one-versus-one or one-versus-all explicitly, use the OneVsOneClassifier or OneVsRestClassifier classes.

For example, to force the OvO strategy:

```python
from sklearn.multiclass import OneVsOneClassifier

ovo_clf = OneVsOneClassifier(SGDClassifier(random_state=42))
ovo_clf.fit(X_train, y_train)
ovo_clf.predict([some_digit])
# Out:
# array([ 5.])
```
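As a usage note, with the 10 digit classes OvO trains one binary classifier per pair of classes:

```python
len(ovo_clf.estimators_)
# Out: 45   (10 * 9 / 2 pairwise classifiers)
```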
KNeighborsClassifier() — KNN classifier implementing the k-nearest neighbors vote.

```python
from sklearn.neighbors import KNeighborsClassifier

knn_clf = KNeighborsClassifier(n_jobs=-1, weights='distance', n_neighbors=4)
knn_clf.fit(X_train, y_train)
y_knn_pred = knn_clf.predict(X_test)
```

n_jobs : int, optional (default = 1); the number of parallel jobs to run for the neighbors search. If -1, the number of jobs is set to the number of CPU cores. Does not affect the fit method.
weights : str or callable, optional (default = 'uniform'); the weight function used in prediction. Possible values:
| keys | Description |
| --- | --- |
| 'uniform' | Uniform weights. All points in each neighborhood are weighted equally. |
| 'distance' | Weight points by the inverse of their distance; in this case, closer neighbors of a query point will have a greater influence than neighbors which are further away. |
| [callable] | A user-defined function which accepts an array of distances, and returns an array of the same shape containing the weights. |
n_neighbors : int, optional (default = 5); the number of neighbors to use by default for kneighbors queries.
DummyClassifier() — a classifier that makes predictions using simple rules. It is useful as a simple baseline to compare against other (real) classifiers. Do not use it for real problems.

```python
# a purely random classifier
from sklearn.dummy import DummyClassifier

dmy_clf = DummyClassifier()
y_probas_dmy = cross_val_predict(dmy_clf, X_train, y_train_5, cv=3,
                                 method="predict_proba")
y_scores_dmy = y_probas_dmy[:, 1]
```

strategy : str, default="stratified"; the strategy used to generate predictions. Since version 0.17, the "prior" fitting strategy is also supported. Possible values:
| keys | Description |
| --- | --- |
| "stratified" | Generates predictions by respecting the training set's class distribution. |
| "most_frequent" | Always predicts the most frequent label in the training set. |
| "prior" | Always predicts the class that maximizes the class prior (like "most_frequent") and predict_proba returns the class prior. |
| "uniform" | Generates predictions uniformly at random. |
| "constant" | Always predicts a constant label that is provided by the user. This is useful for metrics that evaluate a non-majority class. |
random_state : int, RandomState instance or None, optional, default=None
constant : int or str or array of shape = [n_outputs]; the explicit constant predicted by the "constant" strategy. This parameter is only useful for the "constant" strategy.
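A sketch of the "constant" strategy on the running example; since y_train_5 holds True/False labels, True is a valid constant here:

```python
const_clf = DummyClassifier(strategy="constant", constant=True)
const_clf.fit(X_train, y_train_5)
const_clf.predict(X_train[:3])   # always predicts True
```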
| Methods | Description |
| --- | --- |
| fit(X, y[, sample_weight]) | Fit the random classifier. |
| get_params([deep]) | Get parameters for this estimator. |
| predict(X) | Perform classification on test vectors X. |
| predict_log_proba(X) | Return log probability estimates for the test vectors X. |
| predict_proba(X) | Return probability estimates for the test vectors X. |
| score(X, y[, sample_weight]) | Returns the mean accuracy on the given test data and labels. |
| set_params(**params) | Set the parameters of this estimator. |
accuracy_score() — accuracy classification score.

```python
from sklearn.metrics import accuracy_score

accuracy_score(y_test, y_knn_pred)
```

Parameters
y_true : 1d array-like, or label indicator array / sparse matrix
y_pred : 1d array-like, or label indicator array / sparse matrix
normalize : bool, optional (default=True); if False, return the number of correctly classified samples; otherwise, return the fraction of correctly classified samples.
sample_weight : array-like of shape = [n_samples], optional; sample weights.
Returns
score : float; if normalize == True, the fraction of correctly classified samples (float); otherwise, the number of correctly classified samples (int).
SGDRegressor() — SGD regression: a linear model fitted by minimizing a regularized empirical loss with SGD (stochastic gradient descent).

```python
from sklearn.linear_model import SGDRegressor

sgd_reg = SGDRegressor(n_iter=50, penalty=None, eta0=0.1, random_state=42)
sgd_reg.fit(X, y.ravel())
```

Parameters
n_iter : int, optional; the number of passes over the training data (aka epochs). Defaults to None. Deprecated, will be removed in 0.21.
max_iter : int, optional; the maximum number of passes over the training data (aka epochs). Replaces the n_iter parameter.
penalty : str, 'none', 'l2', 'l1', or 'elasticnet'; the penalty (aka regularization) term; defaults to 'l2'.
eta0 : double, optional; the initial learning rate. Defaults to 0.01.
warm_start : bool, optional; when set to True, reuse the solution of the previous call to fit() as initialization; otherwise, just erase the previous solution.
random_state : int, RandomState instance or None, optional (default=None); random seed.
Attributes

| keys | Description |
| --- | --- |
| coef_ : array, shape (n_features,) | Weights assigned to the features. |
| intercept_ : array, shape (1,) | The intercept term. |
| average_coef_ : array, shape (n_features,) | Averaged weights assigned to the features. |
| average_intercept_ : array, shape (1,) | The averaged intercept term. |
| n_iter_ : int | The actual number of iterations to reach the stopping criterion. |
| Methods | Description |
| --- | --- |
| densify() | Convert coefficient matrix to dense array format. |
| fit(X, y[, coef_init, intercept_init, …]) | Fit linear model with Stochastic Gradient Descent. |
| get_params([deep]) | Get parameters for this estimator. |
| partial_fit(X, y[, sample_weight]) | Fit linear model with Stochastic Gradient Descent. |
| predict(X) | Predict using the linear model. |
| score(X, y[, sample_weight]) | Returns the coefficient of determination R^2 of the prediction. |
| set_params(*args, **kwargs) | Set the parameters of this estimator. |
| sparsify() | Convert coefficient matrix to sparse format. |
PolynomialFeatures() generates polynomial and interaction features: a new feature matrix consisting of all polynomial combinations of the features with degree less than or equal to the specified degree. For example, if an input sample is two-dimensional and of the form [a, b], the degree-2 polynomial features are [1, a, b, a^2, ab, b^2].

```python
from sklearn.preprocessing import PolynomialFeatures

poly_features = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly_features.fit_transform(X)

X[0]
# array([-0.75275929])
```

Parameters
degree : integer; the degree of the polynomial features. Default = 2.
include_bias : boolean; if True (default), include a bias column, i.e. the feature in which all polynomial powers are zero (acts as the intercept term in a linear model).
interaction_only : boolean, default = False; if True, only interaction features are produced (see the sketch below).
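Fitting a plain LinearRegression on the expanded features then yields a polynomial regression; a sketch assuming X, y come from a roughly quadratic dataset as above:

```python
from sklearn.linear_model import LinearRegression

lin_reg = LinearRegression()
lin_reg.fit(X_poly, y)
lin_reg.intercept_, lin_reg.coef_   # intercept plus coefficients for [x, x^2]
```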
Attributes

| keys | Description |
| --- | --- |
| powers_ : array, shape (n_output_features, n_input_features) | powers_[i, j] is the exponent of the jth input in the ith output. |
| n_input_features_ : int | The total number of input features. |
| n_output_features_ : int | The total number of polynomial output features. The number of output features is computed by iterating over all suitably sized combinations of input features. |
Ridge() — linear least squares with L2 regularization. This model solves a regression problem whose loss function is the linear least squares function and whose regularization is the l2-norm. The estimator has built-in support for multi-variate regression (i.e., when y is a 2d array of shape [n_samples, n_targets]).

```python
from sklearn.linear_model import Ridge

ridge_reg = Ridge(alpha=1, solver="cholesky", random_state=42)
ridge_reg.fit(X, y)
ridge_reg.predict([[1.5]])
# array([[ 1.55071465]])
```

```python
ridge_reg = Ridge(alpha=1, solver="sag", random_state=42)
ridge_reg.fit(X, y)
ridge_reg.predict([[1.5]])
# array([[ 1.5507201]])
```

Parameters
alpha : {float, array-like}, shape (n_targets); regularization strength; must be a positive float.
random_state : int, RandomState instance or None, optional, default None
fit_intercept : boolean; whether to calculate the intercept for this model. If set to False, no intercept will be used in calculations (e.g., the data is expected to be already centered).
solver : {'auto', 'svd', 'cholesky', 'lsqr', 'sparse_cg', 'sag', 'saga'}; the solver to use in the computational routines, in detail:
| keys | Description |
| --- | --- |
| 'auto' | Chooses the solver automatically based on the type of data. |
| 'svd' | Uses a Singular Value Decomposition of X to compute the Ridge coefficients. More stable for singular matrices than 'cholesky'. |
| 'cholesky' | Uses the standard scipy.linalg.solve function to obtain a closed-form solution. |
| 'sparse_cg' | Uses the conjugate gradient solver as found in scipy.sparse.linalg.cg. As an iterative algorithm, this solver is more appropriate than 'cholesky' for large-scale data (possibility to set tol and max_iter). |
| 'lsqr' | Uses the dedicated regularized least-squares routine scipy.sparse.linalg.lsqr. It is the fastest but may not be available in old scipy versions. It also uses an iterative procedure. |
| 'sag' / 'saga' | 'sag' uses a Stochastic Average Gradient descent, and 'saga' uses its improved, unbiased version named SAGA. Both methods use an iterative procedure, and are often faster than other solvers when both n_samples and n_features are large. Note that 'sag' and 'saga' fast convergence is only guaranteed on features with approximately the same scale. You can preprocess the data with a scaler from sklearn.preprocessing. |
Attributes

| keys | Description |
| --- | --- |
| coef_ : array, shape (n_features,) or (n_targets, n_features) | Weight vector(s). |
| intercept_ : float or array, shape = (n_targets,) | Independent term in decision function. Set to 0.0 if fit_intercept = False. |
| n_iter_ : array or None, shape (n_targets,) | Actual number of iterations for each target. Available only for sag and lsqr solvers; other solvers will return None. New in version 0.17. |
Lasso() — a linear model with L1 prior as regularizer. The Lasso optimization objective is:

(1 / (2 * n_samples)) * ||y - Xw||^2_2 + alpha * ||w||_1

```python
from sklearn.linear_model import Lasso

lasso_reg = Lasso(alpha=0.1)
lasso_reg.fit(X, y)
lasso_reg.predict([[1.5]])
```

alpha : float, optional; the constant that multiplies the L1 term.
fit_intercept : boolean; whether to calculate the intercept for this model. If set to False, no intercept will be used (e.g., the data is expected to be already centered).
random_state : int, RandomState instance or None, optional, default None
ElasticNet() — linear regression with combined L1 and L2 priors as regularizer.

```python
from sklearn.linear_model import ElasticNet

elastic_net = ElasticNet(alpha=0.1, l1_ratio=0.5, random_state=42)
elastic_net.fit(X, y)
elastic_net.predict([[1.5]])
```

alpha : float, optional; the constant that multiplies the penalty terms.
l1_ratio : float; the ElasticNet mixing parameter.
random_state : int, RandomState instance or None, optional, default None
clone() constructs a new estimator with the same parameters.

```python
from sklearn.base import clone

sgd_reg = SGDRegressor(n_iter=1, warm_start=True, penalty=None,
                       learning_rate="constant", eta0=0.0005, random_state=42)

minimum_val_error = float("inf")
best_epoch = None
best_model = None
for epoch in range(1000):
    sgd_reg.fit(X_train_poly_scaled, y_train)  # continues where it left off
    y_val_predict = sgd_reg.predict(X_val_poly_scaled)
    val_error = mean_squared_error(y_val_predict, y_val)
    if val_error < minimum_val_error:
        minimum_val_error = val_error
        best_epoch = epoch
        best_model = clone(sgd_reg)  # snapshots the hyperparameters only
```

Note that clone() copies the estimator's hyperparameters but not its learned weights; to snapshot the fitted model itself at the best epoch, use copy.deepcopy(sgd_reg) instead.

estimator : estimator object, or list, tuple or set of objects; the estimator(s) to be cloned.
datasets — loads popular datasets.

Loading the iris dataset:

```python
from sklearn import datasets

iris = datasets.load_iris()
```
LogisticRegression() — logistic regression.

```python
from sklearn.linear_model import LogisticRegression

log_reg = LogisticRegression(random_state=42)
log_reg.fit(X, y)
```
penalty : str, ‘l1’ or ‘l2’, default: ‘l2’
dual : bool, default: False; dual or primal formulation. The dual formulation is only implemented for the l2 penalty with the liblinear solver. Prefer dual=False when n_samples > n_features.
tol : float, default: 1e-4; tolerance for the stopping criteria.
C : float, default: 1.0; inverse of regularization strength; must be a positive float. As in support vector machines, smaller values specify stronger regularization.