I wrote this function, which takes a dataframe and returns ndarrays of inputs and labels.
import time

import numpy as np
import pandas as pd

def transform_to_array(dataframe, chunk_size=100):
    grouped = dataframe.groupby('id')
    # initialize accumulators
    X, y = np.zeros([0, 1, chunk_size, 4]), np.zeros([0,])  # original input shape: [0, 1, chunk_size, 4]
    # loop over each group (df[df.id==1] and df[df.id==2])
    for _, group in grouped:
        inputs = group.loc[:, 'A':'D'].values
        label = group.loc[:, 'label'].values[0]
        # calculate number of splits
        N = (len(inputs)-1) // chunk_size
        if N > 0:
            inputs = np.array_split(
                inputs, [chunk_size + (chunk_size*i) for i in range(N)])
        else:
            inputs = [inputs]
        # loop over splits
        for inpt in inputs:
            # zero-pad the last (short) split up to chunk_size rows
            inpt = np.pad(
                inpt, [(0, chunk_size-len(inpt)), (0, 0)],
                mode='constant')
            # add each input split to the accumulators
            X = np.concatenate([X, inpt[np.newaxis, np.newaxis]], axis=0)
            y = np.concatenate([y, label[np.newaxis]], axis=0)
    return X, y
The function returns X with shape (n_samples, 1, chunk_size, 4) and y with shape (n_samples,).

For example:
N = 10_000
id = np.arange(N)
labels = np.random.randint(5, size=N)
df = pd.DataFrame(data = np.random.randn(N, 4), columns=list('ABCD'))
df['label'] = labels
df.insert(0, 'id', id)
df = df.loc[df.id.repeat(157)]
df.head()
   id         A         B         C         D  label
0   0 -0.571676 -0.337737 -0.019276 -1.377253      1
0   0 -0.571676 -0.337737 -0.019276 -1.377253      1
0   0 -0.571676 -0.337737 -0.019276 -1.377253      1
0   0 -0.571676 -0.337737 -0.019276 -1.377253      1
0   0 -0.571676 -0.337737 -0.019276 -1.377253      1
Calling the function on it produces the following:
X, y = transform_to_array(df)
X.shape # shape of input
(20000, 1, 100, 4)
y.shape # shape of label
(20000,)
The function works as expected, but it takes a long time to finish:
start_time = time.time()
X, y = transform_to_array(df)
end_time = time.time()
print(f'Time taken: {end_time - start_time} seconds.')
Time taken: 227.83956217765808 seconds.
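A likely reason for the slowness is the accumulation pattern: each np.concatenate call allocates a new array and copies everything collected so far, so the total copying grows roughly quadratically with the number of chunks. The sketch below contrasts that pattern with collecting chunks in a list and stacking once. The helper names and the small sizes are hypothetical, chosen only for illustration.

```python
import numpy as np

def accumulate_with_concatenate(chunks):
    # Hypothetical helper: grow the result one chunk at a time.
    # Every np.concatenate call builds a new array and copies all earlier rows.
    out = np.zeros([0, 1, *chunks[0].shape])
    for chunk in chunks:
        out = np.concatenate([out, chunk[np.newaxis, np.newaxis]], axis=0)
    return out

def accumulate_with_list(chunks):
    # Hypothetical helper: collect chunks in a list and stack them once.
    collected = [chunk[np.newaxis] for chunk in chunks]
    return np.stack(collected, axis=0)

rng = np.random.default_rng(0)
chunks = [rng.standard_normal((100, 4)) for _ in range(50)]

a = accumulate_with_concatenate(chunks)
b = accumulate_with_list(chunks)
print(a.shape, b.shape, np.array_equal(a, b))
```

Both helpers produce the same array; only the amount of intermediate copying differs.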
To improve the function's performance (minimize its execution time), I wrote the following modified version:
def modified_transform_to_array(dataframe, chunk_size=100):
    # group data by 'id'
    grouped = dataframe.groupby('id')
    # initialize lists to store transformed data
    X, y = [], []
    # loop over each group (df[df.id==1] and df[df.id==2])
    for _, group in grouped:
        # get input and label data for group
        inputs = group.loc[:, 'A':'D'].values
        label = group.loc[:, 'label'].values[0]
        # calculate number of splits
        N = (len(inputs)-1) // chunk_size
        if N > 0:
            # split input data into chunks
            inputs = np.array_split(
                inputs, [chunk_size + (chunk_size*i) for i in range(N)])
        else:
            inputs = [inputs]
        # loop over splits
        for inpt in inputs:
            # pad input data to have a chunk size of chunk_size
            inpt = np.pad(
                inpt, [(0, chunk_size-len(inpt)), (0, 0)],
                mode='constant')
            # add each input split and corresponding label to lists
            X.append(inpt)
            y.append(label)
    # convert lists to numpy arrays
    X = np.array(X)
    y = np.array(y)
    return X, y
At first, it looked like I had managed to cut the running time:
start_time = time.time()
X2, y2 = modified_transform_to_array(df)
end_time = time.time()
print(f'Time taken: {end_time - start_time} seconds.')
Time taken: 5.842168092727661 seconds.
However, it turns out that the change also altered the shape of the returned array:
X2.shape # this should be (20000, 1, 100, 4)
(20000, 100, 4)
y2.shape # this is fine
(20000, )
**Question**
How can I modify modified_transform_to_array() so that it returns the expected array shape (n_samples, 1, chunk_size, 4), given that it is so much faster?
2 Answers

#1 xt0899hw
Add a new axis to X just before returning it from modified_transform_to_array.

#2 oxcyiej7
You can simply reshape X at the end of modified_transform_to_array(), just before returning it.
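A minimal sketch of both suggestions (adding an axis and reshaping), assuming X is the (n_samples, chunk_size, 4) array built at the end of modified_transform_to_array. The stand-in data and variable names are hypothetical, and the exact code the answerers had in mind may differ.

```python
import numpy as np

chunk_size = 100

# Stand-in for the list built inside modified_transform_to_array:
# 6 chunks, each of shape (chunk_size, 4).
X_list = [np.full((chunk_size, 4), i, dtype=float) for i in range(6)]
X = np.array(X_list)  # shape (6, 100, 4), the singleton axis is missing

# Option 1: insert a new axis at position 1.
X_axis = X[:, np.newaxis]

# Option 2: reshape to the target layout; -1 lets NumPy infer n_samples.
X_reshaped = X.reshape(-1, 1, chunk_size, 4)

print(X_axis.shape, X_reshaped.shape)
```

Inside the function, either line would go right after X = np.array(X). For a contiguous array like this one, both options return views of the same data, so the extra cost should be negligible next to the grouping loop.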