Pandas linear regression: applying standardization (StandardScaler) only to non-categorical values

Posted by icnyk63a on 2022-10-23 in Other

I have the following dataset, which I am reading into a pandas DataFrame:

age gender  bmi     smoker  married region  value
39  female  23.0    yes     no      us      136
28  male    22.0    no      no      us      143
23  male    34.0    no      yes     europe  153
17  male    29.0    no      no      asia    162

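Gender, smoker, and region are categorical values, so I converted them: I used the replace function for gender and smoker, and one-hot encoding for region. The result looks like this: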

age sex bmi     smoker  married value r_asia r_europe r_us
39  1   23.0    1       0       136   0      0        1
28  0   22.0    0       0       143   0      0        1
23  0   34.0    0       1       153   0      1        0
17  0   29.0    0       0       162   1      0        0
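
For readers who want to reproduce the encoded table, here is a minimal sketch of the encoding step. The question does not show this code, so the details are assumptions: `map` is used in place of `replace`, `pd.get_dummies` does the one-hot encoding, and the `r` prefix and the rename of `gender` to `sex` are chosen only to match the column names in the table above.

```python
import pandas as pd

# The four sample rows from the question
dataset = pd.DataFrame({
    "age": [39, 28, 23, 17],
    "gender": ["female", "male", "male", "male"],
    "bmi": [23.0, 22.0, 34.0, 29.0],
    "smoker": ["yes", "no", "no", "no"],
    "married": ["no", "no", "yes", "no"],
    "region": ["us", "us", "europe", "asia"],
    "value": [136, 143, 153, 162],
})

# Binary columns: map the two labels to 1/0 (assumed mapping, taken from the table)
dataset["gender"] = dataset["gender"].map({"female": 1, "male": 0})
dataset = dataset.rename(columns={"gender": "sex"})
for col in ["smoker", "married"]:
    dataset[col] = dataset[col].map({"yes": 1, "no": 0})

# One-hot encode region into r_asia, r_europe, r_us
dataset = pd.get_dummies(dataset, columns=["region"], prefix="r", dtype=int)

print(dataset)
```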

Then I split it into features and target:

y = dataset['value'].values
X = dataset.drop('value',axis=1).values

Next, I split the data into training and test sets:

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)

As the next step, I want to standardize the features. Normally I would do it like this:

scaler = StandardScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)

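However, this also standardizes the categorical values. I only want to standardize the non-categorical values (in this example, the only non-categorical value is "bmi").
How can I standardize only the "bmi" column and put the standardized values back into X_train and X_test?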

Answer 1, by rslzwgfq:

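Because X was built with .values, train_test_split returns a list of NumPy arrays (X_train is one of them) rather than DataFrames, so X_train["bmi"] raises an exception. The same goes for StandardScaler, whose transform returns a NumPy array by default.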

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)
print(type(X_train))
print(X_train)

# Output

<class 'numpy.ndarray'>
[[28 'male' 22.0 'no' 'no' 'us']
 [39 'female' 23.0 'yes' 'no' 'us']]
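
As a side note, train_test_split keeps the type of its inputs, so the column names survive the split if the `.values` calls are dropped. The sketch below assumes the already-encoded table from the question and rebuilds it inline only to be self-contained.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Encoded sample data from the question
dataset = pd.DataFrame({
    "age": [39, 28, 23, 17],
    "sex": [1, 0, 0, 0],
    "bmi": [23.0, 22.0, 34.0, 29.0],
    "smoker": [1, 0, 0, 0],
    "married": [0, 0, 1, 0],
    "value": [136, 143, 153, 162],
    "r_asia": [0, 0, 0, 1],
    "r_europe": [0, 0, 1, 0],
    "r_us": [1, 1, 0, 0],
})

# No .values here: X stays a DataFrame and y stays a Series
X = dataset.drop("value", axis=1)
y = dataset["value"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0
)

print(type(X_train))         # a pandas DataFrame, not a NumPy array
print(X_train["bmi"])        # selecting by column name works
```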

So here is one way to do it:


# Back to Pandas

X_train = pd.DataFrame(X_train)

# Fit and transform the target column (2 == "bmi")

scaler = StandardScaler()
scaler.fit(X_train.loc[:, 2].to_numpy().reshape(-1, 1))
X_train[2] = scaler.transform(X_train.loc[:, 2].to_numpy().reshape(-1, 1))

# Revert to Numpy

X_train = X_train.to_numpy()
print(X_train)

# Output

[[28 'male' -1.0 'no' 'no' 'us']
 [39 'female' 1.0 'yes' 'no' 'us']]
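
The question also asks about X_test. A possible variant, sketched here under the assumption that the features are kept as DataFrames (no `.values`), fits the scaler on the training rows only and reuses it for the test rows. The `numeric_cols` list is a name introduced for illustration and is not in the original post.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Encoded sample data from the question
dataset = pd.DataFrame({
    "age": [39, 28, 23, 17],
    "sex": [1, 0, 0, 0],
    "bmi": [23.0, 22.0, 34.0, 29.0],
    "smoker": [1, 0, 0, 0],
    "married": [0, 0, 1, 0],
    "value": [136, 143, 153, 162],
    "r_asia": [0, 0, 0, 1],
    "r_europe": [0, 0, 1, 0],
    "r_us": [1, 1, 0, 0],
})

X = dataset.drop("value", axis=1)
y = dataset["value"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0
)

# Work on copies so the assignments below do not touch the original frame
X_train = X_train.copy()
X_test = X_test.copy()

# Hypothetical helper list: the columns that should be standardized
numeric_cols = ["bmi"]

scaler = StandardScaler()
# Learn mean and standard deviation from the training rows only
X_train[numeric_cols] = scaler.fit_transform(X_train[numeric_cols])
# Apply the same training statistics to the test rows
X_test[numeric_cols] = scaler.transform(X_test[numeric_cols])

print(X_train)
print(X_test)
```

The categorical 0/1 columns are left untouched, and calling `.to_numpy()` on either frame afterwards gives arrays that can be passed to the model as before.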
