With sklearn, each time you create a new KFold object with shuffle=True, it produces different, newly randomized fold indices. However, every generator obtained from a given KFold object yields the same indices for each fold, even when shuffle is true. Why does it work this way?
Example:
import numpy as np
from sklearn.cross_validation import KFold  # legacy API, removed in sklearn 0.20

X = np.array([[1, 2], [3, 4], [1, 2], [3, 4]])
y = np.array([1, 2, 3, 4])
kf = KFold(4, n_folds=2, shuffle=True)
for fold in kf:
    print(fold)
print('---second round----')
for fold in kf:
    print(fold)
Output:
(array([2, 3]), array([0, 1]))
(array([0, 1]), array([2, 3]))
---second round----#same indices for the folds
(array([2, 3]), array([0, 1]))
(array([0, 1]), array([2, 3]))
This question was motivated by a comment on this answer. I decided to split it into a new question to prevent that answer from becoming too long.
2 Answers
Answer 1:
A new iteration over the same KFold object will not reshuffle the indices; that happens only once, when the object is instantiated. KFold() never sees the data itself, only the number of samples, and it uses that count to shuffle the indices. Every time the generator is called to iterate over the folds, it reuses those same shuffled indices and partitions them the same way.
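You can check this directly (a minimal sketch against the legacy cross_validation API shown in the question; the np.array_equal comparison is my own illustration, not part of the original answer):

import numpy as np
from sklearn.cross_validation import KFold  # legacy API, as in the question

kf = KFold(4, n_folds=2, shuffle=True)

# Materialize two separate passes over the same object.
folds_1 = list(kf)
folds_2 = list(kf)

# Both passes walk the same shuffled indices, so the folds match exactly.
print(all(np.array_equal(tr1, tr2) and np.array_equal(te1, te2)
          for (tr1, te1), (tr2, te2) in zip(folds_1, folds_2)))  # True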
Take a look at the code for _PartitionIterator(with_metaclass(ABCMeta)), the base class of KFold, where __iter__ is defined. The __iter__ method in the base class calls KFold's _iter_test_indices to partition the data and yield the train and test indices for each fold.
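To make that mechanism concrete, here is a toy reimplementation of the pattern (illustrative only, not the actual sklearn source): the shuffle runs once in __init__, and every __iter__ simply re-walks the stored permutation.

import numpy as np

class ToyKFold:
    # Toy model of the legacy KFold: shuffle once, iterate many times.
    def __init__(self, n, n_folds, shuffle=False, random_state=None):
        idxs = np.arange(n)
        if shuffle:
            # The only place randomness enters: at construction time.
            np.random.RandomState(random_state).shuffle(idxs)
        self.folds = np.array_split(idxs, n_folds)

    def __iter__(self):
        # Mirrors __iter__ delegating to _iter_test_indices:
        # yield (train, test) from the already-shuffled folds.
        for i, test in enumerate(self.folds):
            train = np.concatenate(self.folds[:i] + self.folds[i + 1:])
            yield train, test

Iterating a ToyKFold(4, 2, shuffle=True) twice prints identical folds, matching the output in the question.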
Answer 2:

With newer versions of sklearn, where KFold is imported via from sklearn.model_selection import KFold, the KFold generator with shuffle enabled gives different indices on each call to split():
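A minimal sketch of that behavior (not the original answer's exact code; the printed indices vary between rounds because random_state is left unset):

import numpy as np
from sklearn.model_selection import KFold

X = np.array([[1, 2], [3, 4], [1, 2], [3, 4]])
kf = KFold(n_splits=2, shuffle=True)  # random_state=None: reshuffled on every split()
for train, test in kf.split(X):
    print(train, test)
print('---second round----')
for train, test in kf.split(X):  # a fresh call to split() reshuffles
    print(train, test)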
Alternatively, the same folds can be generated on every iteration:
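A sketch of that, under the assumption that the reproducibility comes from fixing random_state:

import numpy as np
from sklearn.model_selection import KFold

X = np.array([[1, 2], [3, 4], [1, 2], [3, 4]])
kf = KFold(n_splits=2, shuffle=True, random_state=0)
for train, test in kf.split(X):
    print(train, test)
print('---second round----')
for train, test in kf.split(X):  # re-seeded from random_state: same folds as above
    print(train, test)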