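Handling dates

```python
import pandas as pd

birth = trainData['birth_date']
birthDate = pd.to_datetime(birth)
# pd.datetime is gone from recent pandas releases; pd.Timestamp replaces it
end = pd.Timestamp(2018, 1, 1)
# Number of days from the birth date to the reference date
birthDay = end - birthDate
# Convert timedelta64 to an integer day count
trainData['birth_date'] = birthDay.dt.days
```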
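As a self-contained illustration of the date arithmetic, here is a minimal sketch on made-up data. The frame `df` and the derived columns `age_days` and `age_years` are assumptions for the example, not part of the original dataset.

```python
import pandas as pd

# Toy data: the values are invented for illustration.
df = pd.DataFrame({"birth_date": ["1990-01-01", "2000-06-15", "2017-12-31"]})

birth = pd.to_datetime(df["birth_date"])
end = pd.Timestamp(2018, 1, 1)

# Timestamp minus a datetime Series gives a timedelta Series;
# .dt.days extracts the whole-day count as integers.
df["age_days"] = (end - birth).dt.days

# Approximate age in years (365.25 days per year on average).
df["age_years"] = df["age_days"] / 365.25
```

Keeping the day count in a new column, rather than overwriting `birth_date`, makes it easier to re-run the cell without converting the column twice.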
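Computing the mean (or other statistics) across several columns

```python
# Row-wise mean over a positional range of columns (end index is exclusive)
trainData['operate_able'] = trainData.iloc[:, 20:53].mean(axis=1)
trainData['local_able'] = trainData.iloc[:, 53:64].mean(axis=1)
```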
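Positional slices such as `20:53` break silently if the column order changes. The sketch below shows the same row-wise mean on a toy frame, once by position and once by column name. All column names here are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 2],
    "skill_a": [60, 80],
    "skill_b": [70, 90],
    "skill_c": [80, 100],
})

# By position: columns 1, 2 and 3 (the end index 4 is exclusive).
df["mean_by_position"] = df.iloc[:, 1:4].mean(axis=1)

# By name: unaffected by columns being added or reordered.
skill_cols = ["skill_a", "skill_b", "skill_c"]
df["mean_by_name"] = df[skill_cols].mean(axis=1)
```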
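Computing correlations between features and displaying them sorted

```python
# numeric_only=True (pandas >= 1.5) skips non-numeric columns
corr_matrix = trainData.corr(numeric_only=True)
# Correlation of every feature with the target "y", strongest positive first
corr_matrix["y"].sort_values(ascending=False)
```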
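Sorting by the raw coefficient pushes strong negative correlations to the bottom of the list. If the goal is to rank features by strength of association, sorting by absolute value may be more useful. A minimal sketch on invented data (the columns `x1`, `x2`, and `name` are assumptions):

```python
import pandas as pd

df = pd.DataFrame({
    "x1": [1, 2, 3, 4, 5],
    "x2": [5, 3, 4, 1, 2],
    "name": list("abcde"),  # non-numeric column, excluded below
    "y": [2, 4, 6, 8, 10],
})

corr_with_y = (
    df.select_dtypes(include="number")  # keep numeric columns only
      .corr()["y"]                      # correlations with the target
      .drop("y")                        # drop the trivial self-correlation
      .sort_values(key=abs, ascending=False)  # rank by absolute value
)
```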
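Filling missing values

```python
# Fill missing 'gk' values with the column median.
# Assigning the result back avoids chained-assignment problems with inplace=True.
testData['gk'] = testData['gk'].fillna(testData['gk'].median())
```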
Splitting a column into indicator columns (one-hot encoding)

```python
train_test = pd.get_dummies(train_test, columns=["Embarked"])
train_test = pd.get_dummies(train_test, columns=['SibSp', 'Parch', 'SibSp_Parch'])
```
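To see what `pd.get_dummies` produces, here is a small sketch on invented values. The `dtype=int` argument is an optional choice made for this example so the indicators are 0/1 rather than booleans.

```python
import pandas as pd

df = pd.DataFrame({"Embarked": ["S", "C", "Q", "S"], "SibSp": [1, 0, 0, 2]})

# Each listed column is replaced by one indicator column per distinct value.
encoded = pd.get_dummies(df, columns=["Embarked", "SibSp"], dtype=int)
```

Encoding the combined `train_test` frame, as the snippet above does, keeps the indicator columns identical for both sets. Encoding them separately can yield mismatched columns when a value appears in only one of them.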
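Extracting content with a regular expression
`df['Name'].str.extract()` is the extraction function; it is used together with a regular expression.

```python
# Take the text after the comma, then keep what precedes the first period (the title).
# expand=False returns a Series, so the .str calls can be chained.
train_test['Name1'] = (
    train_test['Name']
    .str.extract(r'.+,(.+)', expand=False)
    .str.extract(r'^(.+?)\.', expand=False)
    .str.strip()
)
```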
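The two-step extraction can also be written as a single pattern. The sketch below is one possible equivalent, shown on a few names in the Titanic format. The pattern is an assumption for this example, not the original author's.

```python
import pandas as pd

names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
    "Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)",
])

# After the comma and optional spaces, capture everything up to the first period.
titles = names.str.extract(r",\s*([^.]+)\.", expand=False).str.strip()
```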
Replacing values by category

```python
# Group rare titles into broader categories
train_test['Name1'] = train_test['Name1'].replace(
    ['Capt', 'Col', 'Major', 'Dr', 'Rev'], 'Officer')
train_test['Name1'] = train_test['Name1'].replace(
    ['Jonkheer', 'Don', 'Sir', 'the Countess', 'Dona', 'Lady'], 'Royalty')
```
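When the number of groups grows, a single lookup dictionary can be easier to maintain than one `replace` call per group. A minimal sketch, where `title_groups` and `title_to_group` are names introduced for this example:

```python
import pandas as pd

titles = pd.Series(["Mr", "Capt", "Dr", "Lady", "Miss", "Don"])

title_groups = {
    "Officer": ["Capt", "Col", "Major", "Dr", "Rev"],
    "Royalty": ["Jonkheer", "Don", "Sir", "the Countess", "Dona", "Lady"],
}

# Invert to a flat {raw_title: group} lookup.
title_to_group = {
    raw: group for group, raws in title_groups.items() for raw in raws
}

# Titles without an entry in the lookup are left unchanged.
grouped = titles.replace(title_to_group)
```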
Dropping a column
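A minimal sketch of dropping a column with `DataFrame.drop`. The frame and the column names are illustrative assumptions.

```python
import pandas as pd

df = pd.DataFrame({
    "PassengerId": [1, 2],
    "Ticket": ["A/5", "PC 1"],
    "Age": [22.0, 38.0],
})

# drop returns a new frame; assign it back to keep the result.
df = df.drop(columns=["Ticket"])
```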
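Aligning and merging datasets

```python
# Placeholder target so both frames share the same columns
testData['Survived'] = 0
# DataFrame.append was removed in pandas 2.0; pd.concat replaces it
train_test = pd.concat([trainData, testData])
```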
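The combined frame usually has to be split back into its two parts before modelling. The sketch below shows one way to do that on toy data. `ignore_index=True` is an optional choice for this example that gives the combined frame a fresh 0..n-1 index instead of keeping duplicate index labels.

```python
import pandas as pd

train = pd.DataFrame({"Age": [22.0, 38.0], "Survived": [0, 1]})
test = pd.DataFrame({"Age": [30.0]})

test["Survived"] = 0  # placeholder so both frames share the same columns
combined = pd.concat([train, test], ignore_index=True)

# ... shared feature engineering on `combined` goes here ...

# Split back by position: the first len(train) rows are the training set.
n_train = len(train)
train_part = combined.iloc[:n_train]
test_part = combined.iloc[n_train:]
```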
Flagging rows by whether a value is missing

```python
# age_nan = 1 where Age is missing, 0 otherwise
train_test.loc[train_test["Age"].isnull(), "age_nan"] = 1
train_test.loc[train_test["Age"].notnull(), "age_nan"] = 0
```
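The same indicator can be built in one step by casting the boolean mask to integers. A short sketch on invented values:

```python
import pandas as pd

df = pd.DataFrame({"Age": [22.0, None, 35.0]})

# isnull() gives True/False; astype(int) turns that into 1/0.
df["age_nan"] = df["Age"].isnull().astype(int)
```

This yields an integer column directly, whereas assigning through two `loc` calls produces a float column.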
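Writing predicted values into the dataset with loc

```python
# lin is a fitted model; missing_age_X_test holds the features of the rows whose Age is missing
train_test.loc[train_test['Age'].isnull(), 'Age'] = lin.predict(missing_age_X_test)
```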
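The snippet above relies on `lin` and `missing_age_X_test` being defined elsewhere. The following sketch shows the full round trip on toy data, assuming a `LinearRegression` model and a single hypothetical predictor column `Fare`. The original may use a different model and different features.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.DataFrame({
    "Fare": [10.0, 20.0, 30.0, 40.0, 50.0],
    "Age": [20.0, 30.0, None, 50.0, None],
})

known = df[df["Age"].notnull()]    # rows used to fit the model
missing = df[df["Age"].isnull()]   # rows whose Age will be predicted

lin = LinearRegression()
lin.fit(known[["Fare"]], known["Age"])

# The mask selects the missing rows in the same order as `missing`,
# so the predictions line up with the rows they are written into.
df.loc[df["Age"].isnull(), "Age"] = lin.predict(missing[["Fare"]])
```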
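Binning by interval (pd.cut)
Returns, for each x, the label of the bin it falls into. Bins are half-open: with the default `right=True`, each bin is `(a, b]`.

```python
# Split age into five groups: 10 and under, 10-18, 18-30, 30-50, over 50
train_test['Age'] = pd.cut(train_test['Age'], bins=[0, 10, 18, 30, 50, 100],
                           labels=[1, 2, 3, 4, 5])
```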
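Because the bins are open on the left, a value equal to the lowest edge falls outside every bin and becomes NaN. The sketch below shows this on a few boundary values and how `include_lowest=True` changes it. The sample ages are made up.

```python
import pandas as pd

ages = pd.Series([0, 5, 10, 18, 25, 50, 80])
bins = [0, 10, 18, 30, 50, 100]
labels = [1, 2, 3, 4, 5]

# Default right=True: bins are (0, 10], (10, 18], ... so 0 is not covered.
right_closed = pd.cut(ages, bins=bins, labels=labels)

# include_lowest=True makes the first bin [0, 10], so 0 gets label 1.
with_lowest = pd.cut(ages, bins=bins, labels=labels, include_lowest=True)
```

Values above the last edge (here, over 100) also become NaN in both variants.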
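Standardization
Linear models should be trained on standardized data, whereas tree-based models do not need it.
When standardizing, fit the scaler on the training set only, then apply that same transform to the test set.
StandardScaler

```python
from sklearn.preprocessing import StandardScaler

ss2 = StandardScaler()
# Learn mean and standard deviation from the training data only
ss2.fit(train_data_X)
train_data_X_sd = ss2.transform(train_data_X)
# Reuse the training statistics for the test data
test_data_X_sd = ss2.transform(test_data_X)
```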
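A small numeric sketch makes the fit/transform split concrete. The arrays are invented. Note that the test set is scaled with the training mean and standard deviation, so its own mean after scaling is generally not zero.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

train_X = np.array([[1.0], [2.0], [3.0]])
test_X = np.array([[2.0], [4.0]])

scaler = StandardScaler()
train_sd = scaler.fit_transform(train_X)  # fit on train, then transform it
test_sd = scaler.transform(test_X)        # transform only, no refit

# Statistics learned from the training data.
train_mean = scaler.mean_[0]
train_std = scaler.scale_[0]
```

Refitting the scaler on the test set would leak test statistics into preprocessing and put the two sets on different scales.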