Why is the one-hot encoder provided by sklearn so slow?
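import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Note: scikit-learn 1.2 renamed `sparse` to `sparse_output`
enc = OneHotEncoder(sparse=False, handle_unknown='ignore')
# Column vector holding the 26 lowercase letters
categories = np.expand_dims(np.array(list(map(chr, np.arange(97,123)))), axis=1)
enc.fit(categories)
vec = np.expand_dims(list(map(chr, np.random.randint(97,123,1000))), axis=1)
# Time the encoder against a plain NumPy comparison for a single value
%timeit enc.transform(np.array([['b']]))
%timeit (categories == 'b')[:,0] *1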
Output:
569 µs ± 3.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
3.23 µs ± 58.9 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
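This is more than 100 times slower than the naive approach (about 175x by the timings above), and the naive approach could be made faster still by exploiting the fact that the elements of categories are unique.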
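To make the uniqueness idea concrete, here is a minimal sketch of what such a shortcut might look like: since every category appears exactly once, a dictionary can map each value straight to its column. The names `index_of` and `one_hot` are hypothetical and not part of scikit-learn, and the all-zero row for unknown values is only meant to mimic handle_unknown='ignore'.

```python
import numpy as np

# The 26 lowercase letters from the question; each category is unique.
categories = np.array([chr(c) for c in range(97, 123)])

# Hypothetical lookup table: category value -> column index.
index_of = {cat: i for i, cat in enumerate(categories.tolist())}

def one_hot(values):
    """Encode values as rows of a (len(values), n_categories) array.

    Values not present in `categories` produce an all-zero row.
    """
    out = np.zeros((len(values), len(categories)), dtype=np.int64)
    for row, value in enumerate(values):
        col = index_of.get(value)
        if col is not None:
            out[row, col] = 1
    return out

encoded = one_hot(['b', 'z', '?'])
```

Whether this beats the NumPy comparison for a single value depends on the input size, so it is worth timing on your own data.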
1 Answer
In case anyone lands here: https://datascience.stackexchange.com/questions/116582/why-does-scikit-learns-onehotencoder-take-so-long-on-a-large-dataset answers this question and suggests switching to pandas.get_dummies, which is much faster.
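A rough sketch of the pandas.get_dummies route, applied to data shaped like the question's `vec`, might look as follows. Wrapping the values in a pd.Categorical with a fixed category list is my own assumption, not something the linked answer is known to prescribe; it is there so that all 26 columns appear even when a letter is missing from the sample. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Same 26 lowercase letters as in the question.
letters = [chr(c) for c in range(97, 123)]

# 1000 random letters, seeded so the example is reproducible.
rng = np.random.default_rng(0)
vec = rng.choice(letters, size=1000)

# Assumption: fix the category list up front so the output always has
# one column per letter, in a stable order.
cat = pd.Series(pd.Categorical(vec, categories=letters))

# dtype=int gives 0/1 columns instead of the boolean default in pandas 2.x.
dummies = pd.get_dummies(cat, dtype=int)
```

Unlike a fitted OneHotEncoder, get_dummies keeps no state between calls, so the fixed category list is what keeps the column layout consistent across batches.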