To split the dataset into train, validation and test sets, I tried the following function:
import numpy as np

X = df.drop("RISK DECISION", axis=1).values
y = df["RISK DECISION"].values

def train_validation_test_split(X, y, validation_size=0.1, test_size=0.1, random_state=None):
    if random_state is not None:
        np.random.seed(random_state)
    n = X.shape[0]
    validation_indices = np.random.choice(n, int(n*validation_size), replace=False)
    test_indices = np.random.choice(n, int(n*test_size), replace=False)
    all_indices = np.concatenate((validation_indices, test_indices))
    X_validation = X[validation_indices]
    y_validation = y[validation_indices]
    X_test = X[test_indices]
    y_test = y[test_indices]
    X_train = np.delete(X, all_indices, axis=0)
    y_train = np.delete(y, all_indices, axis=0)
    return X_train, X_validation, X_test, y_train, y_validation, y_test
The initial dataset has 34322 records.
After applying the function, the combined lengths of X_train, X_validation and X_test are greater than the initial length of the dataset.
The problem is probably caused by np.random.choice: the arrays validation_indices and test_indices may contain some of the same indices. How can I solve this without modifying the train_validation_test_split function too much?
1 Answer
Yes, the duplicated indices are indeed the problem. A simple solution is to draw all the indices in a single call, and then split the result into validation and test sets.
Since numpy.random.choice returns the indices in random order, this works fine. You might also include an assert to ensure validation_size + test_size < 1 (or, perhaps more robustly, n_validation + n_test <= n).
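A sketch of that fix, keeping the original function's signature and return order: the validation and test indices are drawn together with a single np.random.choice call (so they cannot overlap), then sliced apart.

```python
import numpy as np

def train_validation_test_split(X, y, validation_size=0.1, test_size=0.1, random_state=None):
    if random_state is not None:
        np.random.seed(random_state)
    n = X.shape[0]
    n_validation = int(n * validation_size)
    n_test = int(n * test_size)
    # Guard against asking for more held-out samples than exist
    assert n_validation + n_test <= n
    # One draw without replacement: validation and test can never overlap
    all_indices = np.random.choice(n, n_validation + n_test, replace=False)
    validation_indices = all_indices[:n_validation]
    test_indices = all_indices[n_validation:]
    X_validation, y_validation = X[validation_indices], y[validation_indices]
    X_test, y_test = X[test_indices], y[test_indices]
    # Training set = everything not held out
    X_train = np.delete(X, all_indices, axis=0)
    y_train = np.delete(y, all_indices, axis=0)
    return X_train, X_validation, X_test, y_train, y_validation, y_test
```

With this version the three split lengths always sum to the original dataset length, since the held-out indices are distinct by construction.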