pandas: How can I modify this function to return a 4D array instead of a 3D array?

Posted by 6g8kf2rb on 2022-12-25 in Other

I wrote this function, which takes a dataframe and returns the inputs and labels as ndarrays.

import numpy as np
import pandas as pd

def transform_to_array(dataframe, chunk_size=100):

    grouped = dataframe.groupby('id')

    # initialize accumulators
    X, y = np.zeros([0, 1, chunk_size, 4]), np.zeros([0,]) # original inpt shape: [0, 1, chunk_size, 4]

    # loop over each group (df[df.id==1] and df[df.id==2])
    for _, group in grouped:

        inputs = group.loc[:, 'A':'D'].values 
        label = group.loc[:, 'label'].values[0]

        # calculate number of splits
        N = (len(inputs)-1) // chunk_size

        if N > 0:
            inputs = np.array_split(
                inputs, [chunk_size + (chunk_size*i) for i in range(N)])
        else:
            inputs = [inputs]

        # loop over splits
        for inpt in inputs:
            inpt = np.pad(
                inpt, [(0, chunk_size-len(inpt)),(0, 0)], 
                mode='constant')
            # add each inputs split to accumulators
            X = np.concatenate([X, inpt[np.newaxis, np.newaxis]], axis=0)
            y = np.concatenate([y, label[np.newaxis]], axis=0) 

    return X, y

The function returns X with shape (n_samples, 1, chunk_size, 4) and y with shape (n_samples,).
For example:

N = 10_000
id = np.arange(N)
labels = np.random.randint(5, size=N)
df = pd.DataFrame(data = np.random.randn(N, 4),  columns=list('ABCD'))

df['label'] = labels
df.insert(0, 'id', id)
df = df.loc[df.id.repeat(157)]

df.head()
    id      A            B          C            D    label
0   0   -0.571676   -0.337737   -0.019276   -1.377253   1
0   0   -0.571676   -0.337737   -0.019276   -1.377253   1
0   0   -0.571676   -0.337737   -0.019276   -1.377253   1
0   0   -0.571676   -0.337737   -0.019276   -1.377253   1
0   0   -0.571676   -0.337737   -0.019276   -1.377253   1

This produces the following:

X, y = transform_to_array(df)

X.shape   # shape of input
(20000, 1, 100, 4)
y.shape   # shape of label
(20000,)
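
The 20000 in the output follows from the chunking arithmetic: each of the 10,000 ids has 157 rows, which split into one full chunk of 100 rows and one chunk of 57 rows that is zero-padded to 100. Below is a minimal sketch of that step for a single group. `chunk_and_pad` is a hypothetical helper name used only for illustration, not part of the original function:

```python
import numpy as np

def chunk_and_pad(inputs, chunk_size=100):
    """Split a (n_rows, n_cols) array into zero-padded chunks of chunk_size rows.

    Hypothetical helper that mirrors the splitting logic of transform_to_array.
    """
    n_splits = (len(inputs) - 1) // chunk_size
    if n_splits > 0:
        # split at row indices chunk_size, 2*chunk_size, ...
        pieces = np.array_split(
            inputs, [chunk_size * (i + 1) for i in range(n_splits)])
    else:
        pieces = [inputs]
    # pad every piece with zero rows at the bottom up to chunk_size rows
    return [np.pad(p, [(0, chunk_size - len(p)), (0, 0)], mode='constant')
            for p in pieces]

group = np.ones((157, 4))            # stand-in for one id's 'A':'D' values
chunks = chunk_and_pad(group)
print(len(chunks))                   # number of chunks for this group
print([c.shape for c in chunks])     # every chunk has chunk_size rows
print(int(chunks[1].sum()))          # 57 real rows * 4 columns of ones
```

With two chunks per id and 10,000 ids, the first axis of X comes out as 20000.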

The function works as expected, but it takes a very long time to run:

import time

start_time = time.time()
X, y = transform_to_array(df)
end_time = time.time()
print(f'Time taken: {end_time - start_time} seconds.')
Time taken: 227.83956217765808 seconds.
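
A plausible explanation for the long run time, stated as an assumption rather than a profiled result: np.concatenate allocates a new array and copies the whole accumulator on every call, so growing X one chunk at a time costs roughly quadratic time in the number of chunks. The sketch below contrasts that pattern with collecting chunks in a list and converting once. The sizes are arbitrary illustration values, and the timings will vary by machine:

```python
import time
import numpy as np

n_chunks, chunk_size, n_cols = 500, 100, 4   # arbitrary sizes for illustration
chunk = np.ones((chunk_size, n_cols))

# Pattern 1: grow the array with np.concatenate (copies the accumulator each time)
t0 = time.perf_counter()
X_concat = np.zeros([0, 1, chunk_size, n_cols])
for _ in range(n_chunks):
    X_concat = np.concatenate([X_concat, chunk[np.newaxis, np.newaxis]], axis=0)
t_concat = time.perf_counter() - t0

# Pattern 2: append to a Python list, then build the array once at the end
t0 = time.perf_counter()
parts = []
for _ in range(n_chunks):
    parts.append(chunk[np.newaxis])       # each part has shape (1, chunk_size, n_cols)
X_list = np.stack(parts, axis=0)          # shape (n_chunks, 1, chunk_size, n_cols)
t_list = time.perf_counter() - t0

print(X_concat.shape, X_list.shape)
print(f'concatenate in loop: {t_concat:.3f} s, list then stack: {t_list:.3f} s')
```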

To improve the function's performance (reduce the execution time), I wrote the following modified version:

def modified_transform_to_array(dataframe, chunk_size=100):
    # group data by 'id'
    grouped = dataframe.groupby('id')
    # initialize lists to store transformed data
    X, y = [], []

    # loop over each group (df[df.id==1] and df[df.id==2])
    for _, group in grouped:
        # get input and label data for group
        inputs = group.loc[:, 'A':'D'].values 
        label = group.loc[:, 'label'].values[0]

        # calculate number of splits
        N = (len(inputs)-1) // chunk_size

        if N > 0:
            # split input data into chunks
            inputs = np.array_split(
                inputs, [chunk_size + (chunk_size*i) for i in range(N)])
        else:
            inputs = [inputs]

        # loop over splits
        for inpt in inputs:
            # pad input data to have a chunk size of chunk_size
            inpt = np.pad(
                inpt, [(0, chunk_size-len(inpt)),(0, 0)],
                mode='constant')
            # add each input split and corresponding label to lists
            X.append(inpt)
            y.append(label)

    # convert lists to numpy arrays
    X = np.array(X)
    y = np.array(y)

    return X, y

At first, it looked as if I had successfully reduced the running time:

start_time = time.time()
X2, y2 = modified_transform_to_array(df)
end_time = time.time()
print(f'Time taken: {end_time - start_time} seconds.')
Time taken: 5.842168092727661 seconds.

However, it turned out that the modified function changes the shape of the returned value.

X2.shape  # this should be (20000, 1, 100, 4)
(20000, 100, 4)

y2.shape  # this is fine
(20000,)
Question

Since modified_transform_to_array() is much faster, how can I modify it so that it returns the expected array shape (n_samples, 1, chunk_size, 4)?

Answer #1 (xt0899hw)

In modified_transform_to_array, add a new axis to X just before returning it, for example:

def modified_transform_to_array( ... ):

    ...

    # convert lists to numpy arrays
    X = np.array(X)
    y = np.array(y)
    X = X[:, np.newaxis, ...]  # <-- add the new axis here
    # equivalent: X = X[:, None, :, :]
    return X, y
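
As a quick check of the indexing above, the following sketch compares the two spellings from the answer with np.expand_dims, which performs the same operation as a function call. The array here is a small stand-in for the real X, with an arbitrary number of samples:

```python
import numpy as np

X = np.zeros((6, 100, 4))          # stand-in for np.array(X) inside the function

a = X[:, np.newaxis, ...]          # the line used in the answer
b = X[:, None, :, :]               # np.newaxis is an alias for None
c = np.expand_dims(X, axis=1)      # same result via a function call

print(a.shape, b.shape, c.shape)   # all three have a length-1 axis at position 1
print(np.shares_memory(a, X))      # the new axis is added on a view, without copying data
```

Because no data is copied, adding the axis this way costs essentially nothing compared with the rest of the function.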

Answer #2 (oxcyiej7)

You can simply reshape X at the end of modified_transform_to_array(), just before returning it, for example:

def modified_transform_to_array( ... ):

    ...

    # convert lists to numpy arrays
    X = np.array(X)
    y = np.array(y)
    X = X.reshape((X.shape[0], 1, *X.shape[1:]))  # <-- THIS LINE
    return X, y
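
To try the fix end to end, here is a compact sketch that applies the reshape to the list-based version and runs it on a small synthetic dataframe. The name `transform_to_array_fast` and the dataframe sizes are assumptions chosen for illustration and do not come from the question:

```python
import numpy as np
import pandas as pd

def transform_to_array_fast(dataframe, chunk_size=100):
    """List-based variant with the length-1 axis restored (hypothetical name)."""
    X, y = [], []
    for _, group in dataframe.groupby('id'):
        inputs = group.loc[:, 'A':'D'].values
        label = group['label'].values[0]
        n_splits = (len(inputs) - 1) // chunk_size
        if n_splits > 0:
            pieces = np.array_split(
                inputs, [chunk_size * (i + 1) for i in range(n_splits)])
        else:
            pieces = [inputs]
        for piece in pieces:
            # zero-pad each piece to chunk_size rows, then collect it
            X.append(np.pad(piece, [(0, chunk_size - len(piece)), (0, 0)],
                            mode='constant'))
            y.append(label)
    X = np.array(X)                                  # (n, chunk_size, 4)
    X = X.reshape((X.shape[0], 1, *X.shape[1:]))     # (n, 1, chunk_size, 4)
    return X, np.array(y)

# small synthetic frame: 3 ids with 157 rows each (sizes chosen for illustration)
n_ids = 3
df_small = pd.DataFrame(np.random.randn(n_ids, 4), columns=list('ABCD'))
df_small['label'] = np.arange(n_ids)
df_small.insert(0, 'id', np.arange(n_ids))
df_small = df_small.loc[df_small.id.repeat(157)]

X_small, y_small = transform_to_array_fast(df_small)
print(X_small.shape)   # 3 ids * 2 chunks each on the first axis
print(y_small.shape)
```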
