pandas: split a DataFrame wherever a specific string is found, and create multiple DataFrames from the same DataFrame

Posted by w3nuxt5m on 2023-01-15 in Other

I have data in the following format:

col_1         col_2                            col_3

NaN            NaN                              NaN
Date         21-04-2022                         NaN
Id            Name                            status
01            A11                              Pass
02            A22                              F_1
03            A33                              P_2
SUMMARY    'Total :$20  Approved $ 10'         NaN
NaN            NaN                             NaN
Date         22-04-2022                        NaN
Id            Name                           status
04            A12                              P_2
05            A23                              F_1
06            A34                              P_2
SUMMARY    'Total :$30  Approved $ 20'         NaN

Expected output: df_1

Id            Name                            status
01            A11                              Pass
02            A22                              F_1
03            A33                              P_2
SUMMARY    'Total :$20  Approved $ 10'         NaN

df_2

Id            Name                           status
04            A12                              P_2
05            A23                              F_1
06            A34                              P_2
SUMMARY    'Total :$30  Approved $ 20'         NaN

The above is only sample data. The actual data has around 24K columns, so many DataFrames will be created. How can I approach this?


Answer #1 (bq3bfh9z)

You can use:

grp = df['col_1'].eq('Id').cumsum()  # create virtual groups
msk = ~df.isna().all(axis=1) & df['col_1'].ne('Date')  # keep wanted rows

# create a dict with subset dataframes
dfs = {f'df{name}': pd.DataFrame(temp.values[1:], columns=temp.iloc[0].tolist()) 
           for name, temp in df[msk].groupby(grp)}

Output:

>>> dfs['df1']
        Id                       Name status
0       01                        A11   Pass
1       02                        A22    F_1
2       03                        A33    P_2
3  SUMMARY  Total :$20  Approved $ 10    NaN

>>> dfs['df2']
        Id                       Name status
0       04                        A12    P_2
1       05                        A23    F_1
2       06                        A34    P_2
3  SUMMARY  Total :$30  Approved $ 20    NaN
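
For anyone who wants to try this end to end, here is a minimal sketch that rebuilds the sample from the question and wraps the same grouping idea in a function. The function name `split_on_header` and its parameters are hypothetical, and the extra `grp.gt(0)` condition (dropping anything that appears before the first header row) is an assumption that is not part of the answer above.

```python
import numpy as np
import pandas as pd

# Sample data rebuilt from the question (two blocks).
rows = [
    [np.nan, np.nan, np.nan],
    ["Date", "21-04-2022", np.nan],
    ["Id", "Name", "status"],
    ["01", "A11", "Pass"],
    ["02", "A22", "F_1"],
    ["03", "A33", "P_2"],
    ["SUMMARY", "Total :$20  Approved $ 10", np.nan],
    [np.nan, np.nan, np.nan],
    ["Date", "22-04-2022", np.nan],
    ["Id", "Name", "status"],
    ["04", "A12", "P_2"],
    ["05", "A23", "F_1"],
    ["06", "A34", "P_2"],
    ["SUMMARY", "Total :$30  Approved $ 20", np.nan],
]
df = pd.DataFrame(rows, columns=["col_1", "col_2", "col_3"])


def split_on_header(df, key_col="col_1", header="Id", skip=("Date",)):
    """Hypothetical helper: one DataFrame per block that starts at a header row."""
    # The running count of header rows gives every row a block number.
    grp = df[key_col].eq(header).cumsum()
    # Drop all-NaN rows, rows whose key is in `skip`, and rows before the first header.
    keep = ~df.isna().all(axis=1) & ~df[key_col].isin(list(skip)) & grp.gt(0)
    out = {}
    for name, temp in df[keep].groupby(grp[keep]):
        # The first row of each block is the header; the rest is the body.
        out[f"df{name}"] = pd.DataFrame(temp.values[1:], columns=temp.iloc[0].tolist())
    return out


dfs = split_on_header(df)
print(dfs["df1"])
print(dfs["df2"])
```

Because the result is a plain dict, it scales to any number of blocks without creating one variable per DataFrame.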

Update: export to Excel:

with pd.ExcelWriter('data.xlsx') as writer:
    for name, temp in dfs.items():
        temp.to_excel(writer, index=False, sheet_name=name)

Answer #2 (but5z9lq)

You can create an auxiliary boolean column and use it to split the DataFrame into smaller parts:

import pandas as pd
df = pd.DataFrame({'col_1': [1,2,'Id',3,4,5,'SUMMARY',1,2,'Id',3,4,5,'SUMMARY']})

# a new group starts at each 'Id' row and at the row right after each 'SUMMARY'
mask = df['col_1'].eq('Id') | df['col_1'].eq('SUMMARY').shift(fill_value=False)
df['group_id'] = mask.cumsum()

# odd group ids are the Id..SUMMARY blocks; even ids are the rows in between
dfs = [part for group_id, part in df.groupby('group_id') if group_id % 2 != 0]

print(dfs[0])
print(dfs[1])
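
The same helper-column idea can be sketched on the three-column sample from the question. Two things here are assumptions rather than part of the answer: a block is recognised by its first row being 'Id' instead of by the parity of the group id (parity would break if an 'Id' row directly followed a 'SUMMARY' row), and the 'Id' row is promoted to column names. The names `group_id` and `pieces` are illustrative.

```python
import numpy as np
import pandas as pd

# Sample data rebuilt from the question (two blocks).
rows = [
    [np.nan, np.nan, np.nan],
    ["Date", "21-04-2022", np.nan],
    ["Id", "Name", "status"],
    ["01", "A11", "Pass"],
    ["02", "A22", "F_1"],
    ["03", "A33", "P_2"],
    ["SUMMARY", "Total :$20  Approved $ 10", np.nan],
    [np.nan, np.nan, np.nan],
    ["Date", "22-04-2022", np.nan],
    ["Id", "Name", "status"],
    ["04", "A12", "P_2"],
    ["05", "A23", "F_1"],
    ["06", "A34", "P_2"],
    ["SUMMARY", "Total :$30  Approved $ 20", np.nan],
]
df = pd.DataFrame(rows, columns=["col_1", "col_2", "col_3"])

col = df["col_1"]
starts = col.eq("Id")                                    # first row of a block
after_end = col.eq("SUMMARY").shift(fill_value=False)    # row right after a block
group_id = (starts | after_end).cumsum()

pieces = []
for _, part in df.groupby(group_id):
    # Groups that do not begin with the header row are filler between blocks.
    if part["col_1"].iloc[0] != "Id":
        continue
    body = part.iloc[1:].reset_index(drop=True)
    body.columns = part.iloc[0].tolist()  # promote the 'Id' row to column names
    pieces.append(body)

print(pieces[0])
print(pieces[1])
```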

Answer #3 (jyztefdp)

Inspired by an answer from piRSquared, you can approach your goal like this:

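import pandas as pd
import numpy as np

df.columns = ["Id", "Name", "Status"]

# True on the SUMMARY row that closes each block
m = df["Id"].eq("SUMMARY")

# cut right after every SUMMARY row; iloc slicing stands in for np.split(df, ...),
# which can emit deprecation warnings on recent pandas versions
bounds = [0, *(np.flatnonzero(m) + 1), len(df)]
l_df = [df.iloc[a:b] for a, b in zip(bounds, bounds[1:]) if b > a]

# drop the 3 leading rows (blank, Date, header) and bind each piece to df_1, df_2, ...
for idx, sub in enumerate(l_df, start=1):
    globals()[f"df_{idx}"] = sub.iloc[3:].reset_index(drop=True)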

NB: we use globals to create the variables / sub-DataFrames dynamically.

# Output:
print(len(l_df), "DataFrames were created!")
2 DataFrames were created!

print(df_1, type(df_1)), print(df_2, type(df_2))

        Id                         Name Status
0       01                          A11   Pass
1       02                          A22    F_1
2       03                          A33    P_2
3  SUMMARY   'Total :$20 Approved $ 10'    NaN <class 'pandas.core.frame.DataFrame'>

        Id                         Name Status
0       04                          A12    P_2
1       05                          A23    F_1
2       06                          A34    P_2
3  SUMMARY   'Total :$30 Approved $ 20'    NaN <class 'pandas.core.frame.DataFrame'>
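
With thousands of blocks, a dict is usually easier to work with than dynamically created global variables. The sketch below is one possible variant, not the answer's method: it cuts after each SUMMARY row in the same way, but stores every piece in a dict keyed by the value found in the block's Date row. It assumes every block contains exactly one Date row and one Id row and that the dates are unique; `frames` and the other names are illustrative.

```python
import numpy as np
import pandas as pd

# Sample data rebuilt from the question (two blocks).
rows = [
    [np.nan, np.nan, np.nan],
    ["Date", "21-04-2022", np.nan],
    ["Id", "Name", "status"],
    ["01", "A11", "Pass"],
    ["02", "A22", "F_1"],
    ["03", "A33", "P_2"],
    ["SUMMARY", "Total :$20  Approved $ 10", np.nan],
    [np.nan, np.nan, np.nan],
    ["Date", "22-04-2022", np.nan],
    ["Id", "Name", "status"],
    ["04", "A12", "P_2"],
    ["05", "A23", "F_1"],
    ["06", "A34", "P_2"],
    ["SUMMARY", "Total :$30  Approved $ 20", np.nan],
]
df = pd.DataFrame(rows, columns=["col_1", "col_2", "col_3"])

# Positions just past each SUMMARY row mark where one block ends.
is_summary = df["col_1"].eq("SUMMARY").to_numpy()
bounds = [0, *(np.flatnonzero(is_summary) + 1)]

frames = {}
for start, stop in zip(bounds, bounds[1:]):
    block = df.iloc[start:stop]
    # Assumption: each block has one Date row; its col_2 value becomes the key.
    key = block.loc[block["col_1"].eq("Date"), "col_2"].iloc[0]
    # Locate the header row inside the block instead of hardcoding an offset of 3.
    header_pos = np.flatnonzero(block["col_1"].eq("Id").to_numpy())[0]
    body = block.iloc[header_pos + 1:].reset_index(drop=True)
    body.columns = block.iloc[header_pos].tolist()
    frames[key] = body

print(frames["21-04-2022"])
```

Looking a block up by its date (`frames["22-04-2022"]`) avoids having to remember which numbered variable holds which day.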

Answer #4 (oewdyzsn)

With the code below you can pick the desired row as the column names and the desired rows as the rows of the new DataFrame.

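import numpy as np
import pandas as pd

# positions refer to the sample above: header rows at 2 and 9, data rows at 3-6 and 10-13
df_1 = pd.DataFrame(data=np.array(df.iloc[3:7]), columns=np.array(df.iloc[2:3])[0])
df_2 = pd.DataFrame(data=np.array(df.iloc[10:14]), columns=np.array(df.iloc[9:10])[0])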

Output df_1:

Id            Name                            status
01            A11                              Pass
02            A22                              F_1
03            A33                              P_2
SUMMARY    'Total :$20  Approved $ 10'         NaN

Output df_2:

Id            Name                           status
04            A12                              P_2
05            A23                              F_1
06            A34                              P_2
SUMMARY    'Total :$30  Approved $ 20'         NaN
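
Hardcoded row positions do not scale to a large file, so here is a rough sketch of how the positions could be computed instead. It assumes every 'Id' row is eventually followed by its own 'SUMMARY' row, so the two position arrays pair up in order; `header_pos`, `summary_pos` and `frames` are illustrative names.

```python
import numpy as np
import pandas as pd

# Sample data rebuilt from the question (two blocks).
rows = [
    [np.nan, np.nan, np.nan],
    ["Date", "21-04-2022", np.nan],
    ["Id", "Name", "status"],
    ["01", "A11", "Pass"],
    ["02", "A22", "F_1"],
    ["03", "A33", "P_2"],
    ["SUMMARY", "Total :$20  Approved $ 10", np.nan],
    [np.nan, np.nan, np.nan],
    ["Date", "22-04-2022", np.nan],
    ["Id", "Name", "status"],
    ["04", "A12", "P_2"],
    ["05", "A23", "F_1"],
    ["06", "A34", "P_2"],
    ["SUMMARY", "Total :$30  Approved $ 20", np.nan],
]
df = pd.DataFrame(rows, columns=["col_1", "col_2", "col_3"])

# Integer positions of the header rows and of the closing SUMMARY rows.
header_pos = np.flatnonzero(df["col_1"].eq("Id").to_numpy())
summary_pos = np.flatnonzero(df["col_1"].eq("SUMMARY").to_numpy())

# Assumption: the i-th header belongs with the i-th SUMMARY row.
frames = [
    pd.DataFrame(df.iloc[h + 1 : s + 1].to_numpy(), columns=df.iloc[h].tolist())
    for h, s in zip(header_pos, summary_pos)
]

print(frames[0])
print(frames[1])
```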
