sklearn调参与数据预处理

本文讲解一些常见处理数据的方法。

Hyperparameter Search

Grid Search 更适合微调，当超参数组合多、搜索空间大时更适合使用 Randomized Search。

GridSearchCV() 对估算器指定参数值进行详尽搜索

from sklearn.model_selection import GridSearchCV

param_grid = [
    # try 12 (3×4) combinations of hyperparameters
    {'n_estimators': [3, 10, 30], 'max_features': [2, 4, 6, 8]},
    # then try 6 (2×3) combinations with bootstrap set as False
    {'bootstrap': [False], 'n_estimators': [3, 10], 'max_features': [2, 3, 4]},
  ]

forest_reg = RandomForestRegressor(random_state=42)
# train across 5 folds, that's a total of (12+6)*5=90 rounds of training
grid_search = GridSearchCV(forest_reg, param_grid, cv=5,
                           scoring='neg_mean_squared_error')
grid_search.fit(housing_prepared, housing_labels)

grid_search.best_params_
# Out:{'max_features': 8, 'n_estimators': 30}

grid_search.best_estimator_
# Out:
# RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
#            max_features=8, max_leaf_nodes=None, min_impurity_decrease=0.0,
#            min_impurity_split=None, min_samples_leaf=1,
#            min_samples_split=2, min_weight_fraction_leaf=0.0,
#            n_estimators=30, n_jobs=1, oob_score=False, random_state=42,
#            verbose=0, warm_start=False)

estimator : estimator object.每个估算器需要提供一个score函数或填写scoring参数。
param_grid : dict or list of dictionaries，键作为参数名称，list 作为参数的字典。或存有这样的字典的列表。
scoring : string, callable, list/tuple, dict or None, default: None，
cv : int, cross-validation generator or an iterable, optional，如果是整数，则代表 KFold

refit : boolean, or string, default=True，应用已找到的最好的参数到整个数据集上。

Methods	description
decision_function(X)	Call decision_function on the estimator with the best found parameters.
fit(X[, y, groups])	Run fit with all sets of parameters.
get_params([deep])	Get parameters for this estimator.
inverse_transform(Xt)	Call inverse_transform on the estimator with the best found params.
predict(X)	Call predict on the estimator with the best found parameters.
predict_log_proba(X)	Call predict_log_proba on the estimator with the best found parameters.
predict_proba(X)	Call predict_proba on the estimator with the best found parameters.
score(X[, y])	Returns the score on the given data, if the estimator has been refit.
set_params(**params)	Set the parameters of this estimator.
transform(X)	Call transform on the estimator with the best found parameters.

RandomizedSearchCV()

from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint

param_distribs = {
        'n_estimators': randint(low=1, high=200),
        'max_features': randint(low=1, high=8),
    }

forest_reg = RandomForestRegressor(random_state=42)
rnd_search = RandomizedSearchCV(forest_reg, param_distributions=param_distribs,
                                n_iter=10, cv=5, scoring='neg_mean_squared_error', random_state=42)
rnd_search.fit(housing_prepared, housing_labels)

estimator : estimator object.指定估算器对象。
param_distributions : dict，给定以参数名为键，list 为参数的字典。或提供一个分布，分布必须提供一个rvs方法进行采样，例如来自 scipy.stats.distributions 的方法。
n_iter : int, default=10，采样参数设置数量。
scoring : string, callable, list/tuple, dict or None, default: None
cv : int, cross-validation generator or an iterable, optional
refit : boolean, or string default=True
random_state : int, RandomState instance or None, optional, default=None

Transformation Pipelines

执行 pipeline 结构的fit()方法时，依次执行 transformer 的fit_transform()方法，最后一个 transformer 执行fit()方法，并将前一个 transformer 返回的值作为参数输入给后一个 transformer。通过 FeatureUnion 类向已有的 pipeline 加入子 pipeline。每个字 pipeline 以 selector 开头，目的是挑选出主要属性，并转换为 NumPy array。

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

num_pipeline = Pipeline([
        ('imputer', Imputer(strategy="median")),
        ('attribs_adder', CombinedAttributesAdder()),
        ('std_scaler', StandardScaler()),
    ])

housing_num_tr = num_pipeline.fit_transform(housing_num)

steps : list；由（名称，变换）组成的元组列表，列表最后一个对象是一个估计器。

memory : None, str or object with the joblib.Memory interface, optional

Methods	description
decision_function(X)	Apply transforms, and decision_function of the final estimator
fit(X[, y])	Fit the model
fit_predict(X[, y])	Applies fit_predict of last step in pipeline after transforms.
fit_transform(X[, y])	Fit the model and transform with the final estimator
get_params([deep])	Get parameters for this estimator.
predict(X)	Apply transforms to the data, and predict with the final estimator
predict_log_proba(X)	Apply transforms, and predict_log_proba of the final estimator
predict_proba(X)	Apply transforms, and predict_proba of the final estimator
score(X[, y, sample_weight])	Apply transforms, and score with the final estimator
set_params(**kwargs)	Set the parameters of this estimator.

用 subpipeline 构建一个 fullpipeline

这需要用到FeatureUnion类。

from sklearn.pipeline import FeatureUnion
num_attribs = list(housing_num)
cat_attribs = ["ocean_proximity"]

num_pipeline = Pipeline([
        ('selector', DataFrameSelector(num_attribs)),
        ('imputer', Imputer(strategy="median")),
        ('attribs_adder', CombinedAttributesAdder()),
        ('std_scaler', StandardScaler()),
    ])

cat_pipeline = Pipeline([
        ('selector', DataFrameSelector(cat_attribs)),
        ('cat_encoder', CategoricalEncoder(encoding="onehot-dense")),
    ])


full_pipeline = FeatureUnion(transformer_list=[
        ("num_pipeline", num_pipeline),
        ("cat_pipeline", cat_pipeline),
    ])

housing_prepared = full_pipeline.fit_transform(housing)
housing_prepared

FeatureUnion 类链接多个 transdormer 对象的结果

transformer_list : list of (string, transformer) tuples
n_jobs : int, optional；并行作业数，默认为 1.
transformer_weights : dict, optional；每个 transformer 的乘法权重，键为 transformer 名称，值为权重。

DataFrameSeletor 示例

from sklearn.base import BaseEstimator, TransformerMixin

# Create a class to select numerical or categorical columns
# since Scikit-Learn doesn't handle DataFrames yet
class DataFrameSelector(BaseEstimator, TransformerMixin):
    def __init__(self, attribute_names):
        self.attribute_names = attribute_names
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return X[self.attribute_names].values

特征放缩

MinMaxScaler() MinMax scaling 归一化

该方法更容易受离散点影响
X_std = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

X_scaled = X_std * (max - min) + min
>>> from sklearn.preprocessing import MinMaxScaler
>>>
>>> data = [[-1, 2], [-0.5, 6], [0, 10], [1, 18]]
>>> scaler = MinMaxScaler()
>>> print(scaler.fit(data))
MinMaxScaler(copy=True, feature_range=(0, 1))
>>> print(scaler.data_max_)
[  1.  18.]
>>> print(scaler.transform(data))
[[ 0.    0.  ]
 [ 0.25  0.25]
 [ 0.5   0.5 ]
 [ 1.    1.  ]]
>>> print(scaler.transform([[2, 2]]))
[[ 1.5  0. ]]

feature_range : tuple (min, max), default=(0, 1)，归一化后值的范围
copy : boolean, optional, default True，是否复制数据在新的数据上归一化

StandardScaler() 0 均值标准化

>>> from sklearn.preprocessing import StandardScaler
>>>
>>> data = [[0, 0], [0, 0], [1, 1], [1, 1]]
>>> scaler = StandardScaler()
>>> print(scaler.fit(data))
StandardScaler(copy=True, with_mean=True, with_std=True)
>>> print(scaler.mean_)
[ 0.5  0.5]
>>> print(scaler.transform(data))
[[-1. -1.]
 [-1. -1.]
 [ 1.  1.]
 [ 1.  1.]]
>>> print(scaler.transform([[2, 2]]))
[[ 3.  3.]]

copy : boolean, optional, default True，是否复制数据在新的数据上执行
with_mean : boolean, True by default，若为 True 则在缩放前将数据居中。但在稀疏矩阵上是行不通的。
with_std : boolean, True by default，若为 True，则将数据放缩到单位方差或等效于单位标准差

处理空数据

Imputer() 处理丢失值

各属性必须是数值

from sklearn.preprocessing import Imputer
# 指定用何值替换丢失的值，此处为中位数
imputer = Imputer(strategy="median")

# 使实例适应数据
imputer.fit(housing_num)

# 结果在statistics_ 变量中
imputer.statistics_

# 替换
X = imputer.transform(housing_num)
housing_tr = pd.DataFrame(X, columns=housing_num.columns,
                          index = list(housing.index.values))

# 预览
housing_tr.loc[sample_incomplete_rows.index.values]

处理文本数据

pandas.factorize() 将输入值编码为枚举类型或分类变量

housing_cat = housing['ocean_proximity']
housing_cat.head(10)
# 输出
# 17606     <1H OCEAN
# 18632     <1H OCEAN
# 14650    NEAR OCEAN
# 3230         INLAND
# 3555      <1H OCEAN
# 19480        INLAND
# 8879      <1H OCEAN
# 13685        INLAND
# 4937      <1H OCEAN
# 4861      <1H OCEAN
# Name: ocean_proximity, dtype: object

housing_cat_encoded, housing_categories = housing_cat.factorize()
housing_cat_encoded[:10]
# 输出
# array([0, 0, 1, 2, 0, 2, 0, 2, 0, 0], dtype=int64)

参数

values : ndarray (1-d)；序列
sort : boolean, default False；根据值排序
na_sentinel : int, default -1；给未找到赋的值
size_hint : hint to the hashtable sizer

返回值

labels : the indexer to the original array
uniques : ndarray (1-d) or Index；当传递的值是 Index 或 Series 时，返回独特的索引。

OneHotEncoder 编码整数特征为 one-hot 向量

返回值为稀疏矩阵

from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder()
housing_cat_1hot = encoder.fit_transform(housing_cat_encoded.reshape(-1,1))
housing_cat_1hot

注意fit_transform()期望一个二维数组，所以这里将数据 reshape 了。

处理文本特征示例

housing_cat = housing['ocean_proximity']
housing_cat.head(10)
# 17606     <1H OCEAN
# 18632     <1H OCEAN
# 14650    NEAR OCEAN
# 3230         INLAND
# 3555      <1H OCEAN
# 19480        INLAND
# 8879      <1H OCEAN
# 13685        INLAND
# 4937      <1H OCEAN
# 4861      <1H OCEAN
# Name: ocean_proximity, dtype: object

housing_cat_encoded, housing_categories = housing_cat.factorize()
housing_cat_encoded[:10]
# array([0, 0, 1, 2, 0, 2, 0, 2, 0, 0], dtype=int64)

housing_categories
# Index(['<1H OCEAN', 'NEAR OCEAN', 'INLAND', 'NEAR BAY', 'ISLAND'], dtype='object')

from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder()
print(housing_cat_encoded.reshape(-1,1))
housing_cat_1hot = encoder.fit_transform(housing_cat_encoded.reshape(-1,1))
housing_cat_1hot
# [[0]
#  [0]
#  [1]
#  ...,
#  [2]
#  [0]
#  [3]]
# <16512x5 sparse matrix of type '<class 'numpy.float64'>'
# 	with 16512 stored elements in Compressed Sparse Row format>

LabelEncoder 标签编码

LabelEncoder`是一个可以用来将标签规范化的工具类，它可以将标签的编码值范围限定在[0,n_classes-1]。简单来说就是对不连续的数字或者文本进行编号。

>>> from sklearn import preprocessing
>>> le = preprocessing.LabelEncoder()
>>> le.fit([1, 2, 2, 6])
LabelEncoder()
>>> le.classes_
array([1, 2, 6])
>>> le.transform([1, 1, 2, 6])
array([0, 0, 1, 2])
>>> le.inverse_transform([0, 0, 1, 2])
array([1, 1, 2, 6])

当然，它也可以用于非数值型标签的编码转换成数值标签（只要它们是可哈希并且可比较的）:

>>> le = preprocessing.LabelEncoder()
>>> le.fit(["paris", "paris", "tokyo", "amsterdam"])
LabelEncoder()
>>> list(le.classes_)
['amsterdam', 'paris', 'tokyo']
>>> le.transform(["tokyo", "tokyo", "paris"])
array([2, 2, 1])
>>> list(le.inverse_transform([2, 2, 1]))
['tokyo', 'tokyo', 'paris']

LabelBinarizer 标签二值化

LabelBinarizer 是一个用来从多类别列表创建标签矩阵的工具类:

>>> from sklearn import preprocessing
>>> lb = preprocessing.LabelBinarizer()
>>> lb.fit([1, 2, 6, 4, 2])
LabelBinarizer(neg_label=0, pos_label=1, sparse_output=False)
>>> lb.classes_
array([1, 2, 4, 6])
>>> lb.transform([1, 6])
array([[1, 0, 0, 0],
       [0, 0, 0, 1]])

对于多类别是实例，可以使用:class:MultiLabelBinarizer:

>>> lb = preprocessing.MultiLabelBinarizer()
>>> lb.fit_transform([(1, 2), (3,)])
array([[1, 1, 0],
       [0, 0, 1]])
>>> lb.classes_
array([1, 2, 3])