sklearn Hyperparameter Tuning and Data Preprocessing

Published: 2020-01-23 Category: Data Science

This article walks through some common ways of processing data.

Hyperparameter Search

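Grid Search is better suited to fine-tuning. When there are many hyperparameter combinations and the search space is large, Randomized Search is the better choice.

GridSearchCV(): exhaustive search over the specified parameter values for an estimator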

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

param_grid = [
    # try 12 (3×4) combinations of hyperparameters
    {'n_estimators': [3, 10, 30], 'max_features': [2, 4, 6, 8]},
    # then try 6 (2×3) combinations with bootstrap set as False
    {'bootstrap': [False], 'n_estimators': [3, 10], 'max_features': [2, 3, 4]},
]

forest_reg = RandomForestRegressor(random_state=42)
# train across 5 folds, that's a total of (12+6)*5=90 rounds of training
grid_search = GridSearchCV(forest_reg, param_grid, cv=5,
                           scoring='neg_mean_squared_error')
# housing_prepared / housing_labels: the prepared features and the target values
grid_search.fit(housing_prepared, housing_labels)

grid_search.best_params_
# Out:{'max_features': 8, 'n_estimators': 30}

grid_search.best_estimator_
# Out:
# RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
# max_features=8, max_leaf_nodes=None, min_impurity_decrease=0.0,
# min_impurity_split=None, min_samples_leaf=1,
# min_samples_split=2, min_weight_fraction_leaf=0.0,
# n_estimators=30, n_jobs=1, oob_score=False, random_state=42,
# verbose=0, warm_start=False)
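
  • estimator : estimator object. Each estimator must either provide a score function or be given a scoring argument.
  • param_grid : dict or list of dictionaries. A dictionary with parameter names as keys and lists of values to try as values, or a list of such dictionaries.
  • scoring : string, callable, list/tuple, dict or None, default: None.
  • cv : int, cross-validation generator or an iterable, optional. An integer sets the number of folds of a KFold split (StratifiedKFold for classifiers).
  • refit : boolean, or string, default=True. Refit the estimator on the whole dataset using the best parameters found.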
    Method : Description
    decision_function(X) : Call decision_function on the estimator with the best found parameters.
    fit(X[, y, groups]) : Run fit with all sets of parameters.
    get_params([deep]) : Get parameters for this estimator.
    inverse_transform(Xt) : Call inverse_transform on the estimator with the best found params.
    predict(X) : Call predict on the estimator with the best found parameters.
    predict_log_proba(X) : Call predict_log_proba on the estimator with the best found parameters.
    predict_proba(X) : Call predict_proba on the estimator with the best found parameters.
    score(X[, y]) : Returns the score on the given data, if the estimator has been refit.
    set_params(**params) : Set the parameters of this estimator.
    transform(X) : Call transform on the estimator with the best found parameters.
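
To make the candidate counting concrete, here is a minimal sketch on synthetic data. The data from make_regression stands in for housing_prepared and housing_labels, and the smaller grid is only an assumption to keep the run short. ParameterGrid expands a grid the same way GridSearchCV does, so it is a cheap way to check how many candidates a grid produces before training anything.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, ParameterGrid

# Synthetic stand-in for housing_prepared / housing_labels (an assumption).
X, y = make_regression(n_samples=120, n_features=8, noise=10.0, random_state=42)

param_grid = [
    # 2 x 2 = 4 candidates
    {"n_estimators": [3, 10], "max_features": [2, 4]},
    # 1 x 1 x 2 = 2 more candidates, with bootstrap disabled
    {"bootstrap": [False], "n_estimators": [3], "max_features": [2, 3]},
]

# Count the candidates without fitting anything.
n_candidates = len(ParameterGrid(param_grid))

grid_search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid,
    cv=3,
    scoring="neg_mean_squared_error",
)
grid_search.fit(X, y)

# cv_results_ holds one entry per candidate; each one was trained on every fold.
n_results = len(grid_search.cv_results_["params"])

# The score is a negated MSE, so flip the sign before taking the square root.
best_rmse = (-grid_search.best_score_) ** 0.5

print(n_candidates, n_results, grid_search.best_params_, best_rmse)
```

Because refit defaults to True, grid_search.best_estimator_ is already retrained on all of X and can be used for predict() directly.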

RandomizedSearchCV()

from scipy.stats import randint
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# randint samples integers from the half-open interval [low, high)
param_distribs = {
    'n_estimators': randint(low=1, high=200),
    'max_features': randint(low=1, high=8),
}

forest_reg = RandomForestRegressor(random_state=42)
rnd_search = RandomizedSearchCV(forest_reg, param_distributions=param_distribs,
                                n_iter=10, cv=5, scoring='neg_mean_squared_error',
                                random_state=42)
rnd_search.fit(housing_prepared, housing_labels)

  • estimator : estimator object. The estimator to tune.
  • param_distributions : dict. A dictionary with parameter names as keys and either lists of values or distributions as values. A distribution must provide an rvs method for sampling, such as those from scipy.stats.distributions.
  • n_iter : int, default=10. The number of parameter settings that are sampled.
  • scoring : string, callable, list/tuple, dict or None, default: None
  • cv : int, cross-validation generator or an iterable, optional
  • refit : boolean, or string default=True
  • random_state : int, RandomState instance or None, optional, default=None
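
The sketch below shows how n_iter controls the search budget. It again uses synthetic data from make_regression as a stand-in for the housing data, and the narrower n_estimators range is an assumption chosen only so the example runs quickly.

```python
from scipy.stats import randint
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for housing_prepared / housing_labels (an assumption).
X, y = make_regression(n_samples=120, n_features=8, noise=10.0, random_state=42)

param_distribs = {
    "n_estimators": randint(low=1, high=20),  # integers in [1, 20)
    "max_features": randint(low=1, high=8),   # integers in [1, 8)
}

rnd_search = RandomizedSearchCV(
    RandomForestRegressor(random_state=42),
    param_distributions=param_distribs,
    n_iter=5,  # exactly 5 parameter settings are sampled and evaluated
    cv=3,
    scoring="neg_mean_squared_error",
    random_state=42,
)
rnd_search.fit(X, y)

# One dict per sampled setting, regardless of how large the search space is.
sampled = rnd_search.cv_results_["params"]
for params in sampled:
    print(params)
```

Unlike a grid, the cost here is fixed by n_iter times the number of folds, which is why this approach scales to large search spaces.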

Transformation Pipelines

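When you call a pipeline's fit() method, it calls fit_transform() on each transformer in turn, passing the output of one transformer as the input to the next, and calls only fit() on the final estimator. The FeatureUnion class lets you combine sub-pipelines into a larger pipeline. Each sub-pipeline below starts with a selector, which picks out the wanted attributes and converts them to a NumPy array.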

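from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# SimpleImputer replaces the old Imputer class, which was removed in scikit-learn 0.22.
# CombinedAttributesAdder is a custom transformer assumed to be defined elsewhere.
num_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy="median")),
    ('attribs_adder', CombinedAttributesAdder()),
    ('std_scaler', StandardScaler()),
])

housing_num_tr = num_pipeline.fit_transform(housing_num)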

  • steps : list. A list of (name, transform) tuples; the last object in the list is an estimator.
  • memory : None, str or object with the joblib.Memory interface, optional
    Method : Description
    decision_function(X) : Apply transforms, and decision_function of the final estimator
    fit(X[, y]) : Fit the model
    fit_predict(X[, y]) : Applies fit_predict of last step in pipeline after transforms.
    fit_transform(X[, y]) : Fit the model and transform with the final estimator
    get_params([deep]) : Get parameters for this estimator.
    predict(X) : Apply transforms to the data, and predict with the final estimator
    predict_log_proba(X) : Apply transforms, and predict_log_proba of the final estimator
    predict_proba(X) : Apply transforms, and predict_proba of the final estimator
    score(X[, y, sample_weight]) : Apply transforms, and score with the final estimator
    set_params(**kwargs) : Set the parameters of this estimator.
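
As a rough illustration of what a pipeline does internally, the sketch below compares Pipeline.fit_transform() with the same two steps chained by hand. The tiny array is made up for the example, and the custom attribute-adding step is left out so the block is self-contained.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# A small made-up array with missing values.
X = np.array([[1.0, 10.0], [2.0, np.nan], [3.0, 30.0], [np.nan, 40.0]])

pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("std_scaler", StandardScaler()),
])
X_pipe = pipe.fit_transform(X)

# The same work done by hand: each step's output feeds the next step.
imputer = SimpleImputer(strategy="median")
scaler = StandardScaler()
X_manual = scaler.fit_transform(imputer.fit_transform(X))

same = np.allclose(X_pipe, X_manual)
print(same)

# Fitted steps remain accessible by name.
print(pipe.named_steps["imputer"].statistics_)
```

The named_steps attribute is handy for inspecting what each step learned, for example the medians stored by the imputer.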

Building a full pipeline from sub-pipelines

This requires the FeatureUnion class.

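from sklearn.impute import SimpleImputer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import StandardScaler

num_attribs = list(housing_num)
cat_attribs = ["ocean_proximity"]

# DataFrameSelector is shown in the DataFrameSelector example of this article.
# CombinedAttributesAdder and CategoricalEncoder are custom classes assumed
# to be defined elsewhere.
num_pipeline = Pipeline([
    ('selector', DataFrameSelector(num_attribs)),
    # SimpleImputer replaces Imputer, which was removed in scikit-learn 0.22
    ('imputer', SimpleImputer(strategy="median")),
    ('attribs_adder', CombinedAttributesAdder()),
    ('std_scaler', StandardScaler()),
])

cat_pipeline = Pipeline([
    ('selector', DataFrameSelector(cat_attribs)),
    ('cat_encoder', CategoricalEncoder(encoding="onehot-dense")),
])

# Run both sub-pipelines on the same input and concatenate their outputs
full_pipeline = FeatureUnion(transformer_list=[
    ("num_pipeline", num_pipeline),
    ("cat_pipeline", cat_pipeline),
])

housing_prepared = full_pipeline.fit_transform(housing)
housing_prepared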

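The FeatureUnion class concatenates the results of multiple transformer objects

  • transformer_list : list of (string, transformer) tuples
  • n_jobs : int, optional. The number of jobs to run in parallel; defaults to 1.
  • transformer_weights : dict, optional. Multiplicative weights per transformer; keys are transformer names and values are the weights.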
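
A minimal sketch of what "concatenates the results" means in practice: the union's output is the column-wise stack of each transformer's output, and transformer_weights simply multiplies one block of columns. The two scalers and the toy array are assumptions picked for illustration, not the sub-pipelines used above.

```python
import numpy as np
from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0, 10.0], [2.0, 20.0], [4.0, 40.0]])

union = FeatureUnion(transformer_list=[
    ("std", StandardScaler()),
    ("minmax", MinMaxScaler()),
])
X_union = union.fit_transform(X)  # 2 + 2 = 4 output columns

# Equivalent manual version: fit each transformer separately, then stack columns.
X_manual = np.hstack([
    StandardScaler().fit_transform(X),
    MinMaxScaler().fit_transform(X),
])

# transformer_weights scales the output of the named transformers.
weighted = FeatureUnion(
    transformer_list=[("std", StandardScaler()), ("minmax", MinMaxScaler())],
    transformer_weights={"std": 1.0, "minmax": 10.0},
)
X_weighted = weighted.fit_transform(X)

print(X_union.shape)
print(np.allclose(X_union, X_manual))
```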

DataFrameSelector example

from sklearn.base import BaseEstimator, TransformerMixin

# Create a class to select numerical or categorical columns
# since Scikit-Learn doesn't handle DataFrames yet
class DataFrameSelector(BaseEstimator, TransformerMixin):
    def __init__(self, attribute_names):
        self.attribute_names = attribute_names
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return X[self.attribute_names].values
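
A short usage sketch for the selector follows. The class is repeated so the block runs on its own, and the three-row DataFrame with its column names is invented for the example.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class DataFrameSelector(BaseEstimator, TransformerMixin):
    """Select the given DataFrame columns and return them as a NumPy array."""
    def __init__(self, attribute_names):
        self.attribute_names = attribute_names
    def fit(self, X, y=None):
        return self  # nothing to learn
    def transform(self, X):
        return X[self.attribute_names].values

# Made-up data: two numerical columns and one categorical column.
df = pd.DataFrame({
    "rooms": [3, 5, 4],
    "income": [2.5, 8.1, 4.0],
    "ocean_proximity": ["INLAND", "NEAR BAY", "INLAND"],
})

# fit_transform comes from TransformerMixin: it calls fit() and then transform().
num_values = DataFrameSelector(["rooms", "income"]).fit_transform(df)
cat_values = DataFrameSelector(["ocean_proximity"]).fit_transform(df)

print(num_values.shape, cat_values.shape)
```

Passing a list of names, even for a single column, keeps the result two-dimensional, which is what the downstream transformers expect.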

Feature scaling

MinMaxScaler(): min-max scaling (normalization)

This method is more sensitive to outliers.
X_std = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
X_scaled = X_std * (max - min) + min

>>> from sklearn.preprocessing import MinMaxScaler
>>>
>>> data = [[-1, 2], [-0.5, 6], [0, 10], [1, 18]]
>>> scaler = MinMaxScaler()
>>> print(scaler.fit(data))
MinMaxScaler(copy=True, feature_range=(0, 1))
>>> print(scaler.data_max_)
[ 1. 18.]
>>> print(scaler.transform(data))
[[ 0. 0. ]
[ 0.25 0.25]
[ 0.5 0.5 ]
[ 1. 1. ]]
>>> print(scaler.transform([[2, 2]]))
[[ 1.5 0. ]]

  • feature_range : tuple (min, max), default=(0, 1). The range of the values after scaling.
  • copy : boolean, optional, default True. Whether to copy the data and scale the copy instead of scaling in place.
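
The two formulas above can be checked by hand. This sketch applies them with NumPy to the same data as the example and compares the result with MinMaxScaler; the target range (-1, 1) is an arbitrary choice for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[-1.0, 2.0], [-0.5, 6.0], [0.0, 10.0], [1.0, 18.0]])
lo, hi = -1.0, 1.0  # the (min, max) of feature_range

# Step 1: squeeze each column into [0, 1].
X_std = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
# Step 2: stretch and shift into [lo, hi].
X_manual = X_std * (hi - lo) + lo

X_sklearn = MinMaxScaler(feature_range=(lo, hi)).fit_transform(X)

print(np.allclose(X_manual, X_sklearn))
print(X_manual[:, 0])
```

Since the column minimum and maximum appear directly in the formula, a single extreme value changes the scaling of every other sample, which is why this method is sensitive to outliers.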

StandardScaler(): zero-mean standardization

>>> from sklearn.preprocessing import StandardScaler
>>>
>>> data = [[0, 0], [0, 0], [1, 1], [1, 1]]
>>> scaler = StandardScaler()
>>> print(scaler.fit(data))
StandardScaler(copy=True, with_mean=True, with_std=True)
>>> print(scaler.mean_)
[ 0.5 0.5]
>>> print(scaler.transform(data))
[[-1. -1.]
[-1. -1.]
[ 1. 1.]
[ 1. 1.]]
>>> print(scaler.transform([[2, 2]]))
[[ 3. 3.]]
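
  • copy : boolean, optional, default True. Whether to copy the data and operate on the copy instead of in place.
  • with_mean : boolean, True by default. If True, center the data before scaling. This does not work on sparse matrices.
  • with_std : boolean, True by default. If True, scale the data to unit variance (or equivalently, unit standard deviation).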
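
For comparison, standardization can also be reproduced by hand as z = (x - mean) / std, using the population standard deviation. The sketch below uses the same data as the example above.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[0.0, 0.0], [0.0, 0.0], [1.0, 1.0], [1.0, 1.0]])

mean = X.mean(axis=0)
std = X.std(axis=0)  # NumPy's default ddof=0, i.e. the population standard deviation
X_manual = (X - mean) / std

scaler = StandardScaler().fit(X)
X_sklearn = scaler.transform(X)

# New data is scaled with the statistics learned during fit.
new_point = scaler.transform([[2.0, 2.0]])

print(np.allclose(X_manual, X_sklearn))
print(new_point)
```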

Handling missing data

SimpleImputer() (formerly Imputer()): filling in missing values

All attributes must be numeric.

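import pandas as pd
# SimpleImputer replaces sklearn.preprocessing.Imputer, which was removed in scikit-learn 0.22
from sklearn.impute import SimpleImputer

# Choose the value that replaces missing entries, here the median
imputer = SimpleImputer(strategy="median")

# Fit the imputer to the data
imputer.fit(housing_num)

# The computed medians are stored in the statistics_ attribute
imputer.statistics_

# Replace the missing values
X = imputer.transform(housing_num)
housing_tr = pd.DataFrame(X, columns=housing_num.columns,
                          index=list(housing.index.values))

# Preview the rows that originally had missing values
housing_tr.loc[sample_incomplete_rows.index.values]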
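
Here is a self-contained sketch of the same steps on a tiny made-up DataFrame. The column names and values are invented, and it uses SimpleImputer from sklearn.impute, the replacement for the Imputer class in current scikit-learn releases.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up numerical data with one missing value per column.
housing_num = pd.DataFrame(
    {
        "total_bedrooms": [100.0, np.nan, 300.0, 200.0],
        "median_income": [2.0, 4.0, np.nan, 9.0],
    },
    index=[10, 11, 12, 13],
)

imputer = SimpleImputer(strategy="median")
imputer.fit(housing_num)

# One median per column, computed while ignoring the missing entries.
print(imputer.statistics_)

# transform returns a plain NumPy array, so rebuild the DataFrame afterwards.
X = imputer.transform(housing_num)
housing_tr = pd.DataFrame(X, columns=housing_num.columns, index=housing_num.index)
print(housing_tr)
```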

Handling text data

pandas.factorize(): encode input values as an enumerated type or categorical variable

housing_cat = housing['ocean_proximity']
housing_cat.head(10)
# Output
# 17606 <1H OCEAN
# 18632 <1H OCEAN
# 14650 NEAR OCEAN
# 3230 INLAND
# 3555 <1H OCEAN
# 19480 INLAND
# 8879 <1H OCEAN
# 13685 INLAND
# 4937 <1H OCEAN
# 4861 <1H OCEAN
# Name: ocean_proximity, dtype: object

housing_cat_encoded, housing_categories = housing_cat.factorize()
housing_cat_encoded[:10]
# Output
# array([0, 0, 1, 2, 0, 2, 0, 2, 0, 0], dtype=int64)
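Parameters
  • values : ndarray (1-d). The sequence to encode.
  • sort : boolean, default False. Sort the unique values and reorder the codes accordingly.
  • na_sentinel : int, default -1. The value used to mark "not found" entries.
  • size_hint : hint to the hashtable sizer
Returns
  • labels : the indexer to the original array
  • uniques : ndarray (1-d) or Index. The unique values; an Index is returned when the passed values are an Index or a Series.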
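
The effect of sort and the handling of missing values are easier to see on a small example. The Series below is made up; the category names are borrowed from the housing data.

```python
import pandas as pd

cats = pd.Series(["NEAR BAY", "INLAND", "NEAR BAY", None, "<1H OCEAN"])

# Default: codes follow the order of first appearance; missing values get -1.
codes, uniques = cats.factorize()

# sort=True: the unique values are sorted and the codes follow that order.
codes_sorted, uniques_sorted = cats.factorize(sort=True)

print(codes.tolist(), list(uniques))
print(codes_sorted.tolist(), list(uniques_sorted))
```

Indexing uniques with a non-negative code gives back the original category, so the encoding can be reversed.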

OneHotEncoder: encode integer features as one-hot vectors

The return value is a sparse matrix.

from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder()
housing_cat_1hot = encoder.fit_transform(housing_cat_encoded.reshape(-1,1))
housing_cat_1hot

Note that fit_transform() expects a 2D array, which is why the data is reshaped here.
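
A small sketch of the reshape step and the sparse output, using made-up integer codes. It also shows that, in scikit-learn 0.20 and later, OneHotEncoder accepts string categories directly, so the factorize step can be skipped there.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

codes = np.array([0, 0, 1, 2, 0])  # 1D array of category codes

encoder = OneHotEncoder()
# reshape(-1, 1) turns the 1D array into a single-column 2D array.
one_hot = encoder.fit_transform(codes.reshape(-1, 1))

# The result is sparse; convert it only when a dense array is really needed.
dense = one_hot.toarray()
print(dense)

# String categories work as well (scikit-learn 0.20+).
cat_encoder = OneHotEncoder()
cat_dense = cat_encoder.fit_transform(
    np.array([["INLAND"], ["NEAR BAY"], ["INLAND"]])
).toarray()
print(cat_encoder.categories_[0], cat_dense)
```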

Example: encoding a text feature

housing_cat = housing['ocean_proximity']
housing_cat.head(10)
# 17606 <1H OCEAN
# 18632 <1H OCEAN
# 14650 NEAR OCEAN
# 3230 INLAND
# 3555 <1H OCEAN
# 19480 INLAND
# 8879 <1H OCEAN
# 13685 INLAND
# 4937 <1H OCEAN
# 4861 <1H OCEAN
# Name: ocean_proximity, dtype: object

housing_cat_encoded, housing_categories = housing_cat.factorize()
housing_cat_encoded[:10]
# array([0, 0, 1, 2, 0, 2, 0, 2, 0, 0], dtype=int64)

housing_categories
# Index(['<1H OCEAN', 'NEAR OCEAN', 'INLAND', 'NEAR BAY', 'ISLAND'], dtype='object')

from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder()
print(housing_cat_encoded.reshape(-1,1))
housing_cat_1hot = encoder.fit_transform(housing_cat_encoded.reshape(-1,1))
housing_cat_1hot
# [[0]
# [0]
# [1]
# ...,
# [2]
# [0]
# [3]]
# <16512x5 sparse matrix of type '<class 'numpy.float64'>'
# with 16512 stored elements in Compressed Sparse Row format>

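LabelEncoder: label encoding

LabelEncoder is a utility class for normalizing labels so that the encoded values fall in the range [0, n_classes-1]. Put simply, it assigns consecutive integer ids to non-consecutive numbers or to text.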

>>> from sklearn import preprocessing
>>> le = preprocessing.LabelEncoder()
>>> le.fit([1, 2, 2, 6])
LabelEncoder()
>>> le.classes_
array([1, 2, 6])
>>> le.transform([1, 1, 2, 6])
array([0, 0, 1, 2])
>>> le.inverse_transform([0, 0, 1, 2])
array([1, 1, 2, 6])

It can also turn non-numerical labels into numerical ones, as long as they are hashable and comparable:

>>> le = preprocessing.LabelEncoder()
>>> le.fit(["paris", "paris", "tokyo", "amsterdam"])
LabelEncoder()
>>> list(le.classes_)
['amsterdam', 'paris', 'tokyo']
>>> le.transform(["tokyo", "tokyo", "paris"])
array([2, 2, 1])
>>> list(le.inverse_transform([2, 2, 1]))
['tokyo', 'tokyo', 'paris']

LabelBinarizer: label binarization

LabelBinarizer is a utility class for creating a label indicator matrix from a list of multi-class labels:

>>> from sklearn import preprocessing
>>> lb = preprocessing.LabelBinarizer()
>>> lb.fit([1, 2, 6, 4, 2])
LabelBinarizer(neg_label=0, pos_label=1, sparse_output=False)
>>> lb.classes_
array([1, 2, 4, 6])
>>> lb.transform([1, 6])
array([[1, 0, 0, 0],
[0, 0, 0, 1]])

For instances that carry multiple labels, use MultiLabelBinarizer:

>>> lb = preprocessing.MultiLabelBinarizer()
>>> lb.fit_transform([(1, 2), (3,)])
array([[1, 1, 0],
[0, 0, 1]])
>>> lb.classes_
array([1, 2, 3])
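
Author: HeoLis
Original link: https://ishero.net/sklearn%E8%B0%83%E5%8F%82%E4%B8%8E%E6%95%B0%E6%8D%AE%E9%A2%84%E5%A4%84%E7%90%86.html
Copyright notice: Unless otherwise stated, all articles on this blog are licensed under CC BY-NC-SA 4.0. Please credit the source when reposting!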
