LabelEncoder order of fit for a pandas df

Posted by qvk1mo1f 12 months ago in Other

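I am fitting a scikit-learn LabelEncoder on a column of a pandas df.
How is the order in which the encountered strings are mapped to integers determined? Is it deterministic?
More importantly, can I specify this order?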

import pandas as pd
from sklearn import preprocessing

df = pd.DataFrame(data=["first", "second", "third", "fourth"], columns=['x'])
le = preprocessing.LabelEncoder()
le.fit(df['x'])
print(le.classes_.tolist())
# this prints ['first', 'fourth', 'second', 'third']
encoded = le.transform(["first", "second", "third", "fourth"])
print(encoded)
# this prints [0 2 3 1]

I would like le.classes_ to be ["first", "second", "third", "fourth"], and encoded to then be [0 1 2 3], since that is the order in which the strings appear in the column. Can this be done?

pandas

Source: https://stackoverflow.com/questions/38749305/labelencoder-order-of-fit-for-a-pandas-df

6 Answers

vktxenjb1#

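It is done in sorted order. For strings, that means alphabetical (lexicographic) order. There is no documentation for this, but looking at the source code of LabelEncoder.transform, we can see that the work is mostly delegated to the function numpy.setdiff1d, whose documentation reads:
Find the set difference of two arrays.
Return the sorted, unique values in ar1 that are not in ar2.
(Emphasis mine.)
Note that since this is undocumented, it is probably implementation-defined and may change between versions. It may just be that the version I looked at uses sorted order, and other versions of scikit-learn could change this behavior (by not using numpy.setdiff1d).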
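
The sorted ordering is easy to check empirically. The following is a minimal sketch, assuming a reasonably recent scikit-learn and NumPy. It only compares the fitted classes_ against Python's sorted() and against np.unique (which is documented to return sorted unique values). It does not show which internal function the encoder actually calls.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

labels = ["first", "second", "third", "fourth"]

le = LabelEncoder().fit(labels)
classes = le.classes_.tolist()

# classes_ comes back in sorted order, not in order of appearance
print(classes)
print(classes == sorted(set(labels)))
print(classes == np.unique(labels).tolist())

# the integer codes are positions in that sorted list
codes = le.transform(labels).tolist()
print(codes)
```

Because the codes are positions in the sorted list, the result is deterministic for a given set of labels, but it ignores the order in which the labels appear in the column.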

ycl3bljg2#

I was also a bit surprised that I could not supply an order to LabelEncoder. A one-line solution could look like this:

df['col1_num'] = df['col1'].apply(lambda x: ['first', 'second', 'third', 'fourth'].index(x))
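
The list lookup above scans the list once per row. A sketch of two pandas-only variants of the same idea follows, assuming the desired order is known up front. The column names and the `order` list are illustrative, not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({"col1": ["third", "first", "fourth", "second", "first"]})
order = ["first", "second", "third", "fourth"]  # desired order, chosen by hand

# Option 1: dict lookup, one hash lookup per row instead of a list scan
mapping = {label: code for code, label in enumerate(order)}
df["col1_num"] = df["col1"].map(mapping)

# Option 2: ordered categorical; codes follow the position in `order`
cat = pd.Categorical(df["col1"], categories=order, ordered=True)
df["col1_cat"] = cat.codes

print(df)
```

The two options differ for values that are not in `order`: `map` yields NaN, while the categorical code is -1. The original `.index(x)` version raises a ValueError instead.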

inkz8wg93#

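I wanted to specify the order for LabelEncoder in one of my applications, because I did not want to migrate code and use some other library. I managed to implement a temporary solution.
Since I knew the categories in the dataset from the start, I created dummy categories that sort alphabetically in the specific order I wanted. For example: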

{
0: 'ARejected',
1: 'ZApproved'
}

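After that, I fit the label encoder on the dataset. Once it was fitted, I replaced the label encoder's classes to make sure that from then on it maps the labels the way I want.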

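import numpy as np
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
le.fit(X)  # X holds the placeholder labels 'ARejected' / 'ZApproved'
le.classes_ = np.array(['Rejected', 'Approved'])  # swap in the real labels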

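This might help some people in specific situations. It is certainly not a permanent solution: the information may be lost when the encoder is fitted again, and the approach becomes impractical when the number of categories is large.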

xytpbqjk4#

It has been seven years, but I needed this now. So here is a one-line solution similar to Ezra's approach:

import pandas as pd
from sklearn import preprocessing
df = pd.DataFrame(data=["first", "second", "third", "fourth"], columns=['x'])
le = preprocessing.LabelEncoder()

# ONE LINE SOLUTION, approach similar to Ezra's, but without fitting the encoder
le.classes_ = df['x']   # after this line I got the desired result

print(list(le.classes_))
# this prints ['first', 'second', 'third', 'fourth']

encoded = le.transform(["first", "second", "third", "fourth"])
print(encoded)
# this prints [0 1 2 3]
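
Assigning the column directly only yields one class per label when the column has no repeated values. For a column with repeats, a sketch using pandas.factorize may be closer to what the question asks for, since it assigns codes in order of first appearance. The example data here is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"x": ["first", "second", "first", "third", "fourth", "second"]})

# factorize assigns codes in order of first appearance (sort=False is the default)
codes, uniques = pd.factorize(df["x"])

print(codes.tolist())    # codes for each row
print(uniques.tolist())  # distinct labels, in order of first appearance
```

`uniques` plays the role of classes_ here: `uniques[code]` recovers the original label for a given code.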

xoshrz7s5#

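I suggest using OrdinalEncoder from the category_encoders package. It has a mapping parameter where you can set the desired encoding for each category. You can read more in its documentation.
Here is an example implementation: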

from category_encoders import OrdinalEncoder

# Descending order according to value counts (the most frequent category maps to 0)
keys = df.columnName.value_counts().sort_values(ascending=False).index
values = list(range(len(keys))) # do np.array()+1 in case you want it to start with 1
mapping = [{
    'col': 'columnName',
    'mapping': dict(zip(keys, values))
}]
oe = OrdinalEncoder(cols=['columnName'], mapping=mapping)
df.columnName = oe.fit_transform(df).columnName # Read note

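Note: I suggest assigning the result this way, writing back only the encoded column, because the encoder may cause problems by changing the dtype of other columns.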
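
If adding a dependency is not an option, the same frequency-based mapping can be sketched with pandas alone. This is an illustration under assumptions, not the category_encoders implementation: the column name `columnName` is the placeholder from the answer, and the data is made up so that no two categories share a count.

```python
import pandas as pd

df = pd.DataFrame({"columnName": ["b", "a", "b", "c", "b", "a"]})

# value_counts() sorts by count, most frequent first
keys = df["columnName"].value_counts().index
mapping = {key: code for code, key in enumerate(keys)}

# write the codes to a new column so the original labels stay available
df["columnName_num"] = df["columnName"].map(mapping)

print(mapping)
print(df["columnName_num"].tolist())
```

Only the mapped column is touched, so the dtypes of the other columns are left alone.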

jgovgodb6#

Another way is to change the classes_ attribute of the label encoder.

import numpy as np
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
le.fit(["first", "second", "third", "fourth"])
le.classes_ = np.array(["first", "second", "third", "fourth"])  # override the sorted classes
le.transform(["first", "second", "third", "fourth"])

The code above encodes the labels in the same order as le.classes_, and its output is:

array([0, 1, 2, 3])
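
Overwriting classes_ relies on how transform happens to look labels up, so it may behave differently across scikit-learn versions. A sketch of an alternative that uses a documented parameter is scikit-learn's own OrdinalEncoder, whose categories argument takes an explicit list per column. Note that it is intended for feature columns (2D input) rather than targets, and the example data is made up.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

df = pd.DataFrame({"x": ["first", "second", "third", "fourth", "second"]})
order = ["first", "second", "third", "fourth"]

# One list of categories per encoded column; position in the list = code
enc = OrdinalEncoder(categories=[order])

# fit_transform expects 2D input and returns floats, so flatten and cast
codes = enc.fit_transform(df[["x"]]).ravel().astype(int)

print(codes.tolist())
print(enc.categories_[0].tolist())
```

Here the order is part of the encoder's configuration, so refitting the encoder does not lose it.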
