Split multiple string columns into rows in a Pandas DataFrame with NaN

9q78igpj  posted on 2023-06-28  in Other

I have a sample DataFrame as follows:
| track_id | track_date | status | status_info |
|---|---|---|---|
| track_1 | 2021-01-01 | approved | None |
| track_2 | 2021-01-02 | None | accredited |
| track_3 | 2021-01-03 | approved | accredited |
| track_4 | 2021-01-04 | approved | approved |
| track_5 | 2021-01-05 | approved\|approved | accredited\|cancelled |
| track_6 | 2021-01-06 | None | accredited\|cancelled |
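
For anyone who wants to reproduce this, the sample frame can be built along the following lines. This is only a sketch: it assumes the empty cells are real missing values (None) rather than the string "None", and that track_date is stored as plain strings; neither is stated in the question.

```python
import pandas as pd

# Sample data from the table; None marks a missing value (assumption).
df = pd.DataFrame({
    "track_id": ["track_1", "track_2", "track_3", "track_4", "track_5", "track_6"],
    "track_date": ["2021-01-01", "2021-01-02", "2021-01-03",
                   "2021-01-04", "2021-01-05", "2021-01-06"],
    "status": ["approved", None, "approved", "approved",
               "approved|approved", None],
    "status_info": [None, "accredited", "accredited", "approved",
                    "accredited|cancelled", "accredited|cancelled"],
})

print(df)
```
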
I need to split status and status_info into rows, so that the output looks like this:
| track_id | track_date | status | status_info |
|---|---|---|---|
| track_1 | 2021-01-01 | approved | None |
| track_2 | 2021-01-02 | None | accredited |
| track_3 | 2021-01-03 | approved | accredited |
| track_4 | 2021-01-04 | approved | approved |
| track_5 | 2021-01-05 | approved | accredited |
| track_5 | 2021-01-05 | approved | cancelled |
| track_6 | 2021-01-06 | None | accredited |
| track_6 | 2021-01-06 | None | cancelled |
I have tried the code below, using this answer in another question as a reference:

# splitting string values into lists
new_status = df['status'].str.split('|', expand=True).stack().reset_index(level=1, drop=True)
new_status_info = df['status_info'].str.split('|', expand=True).stack().reset_index(level=1, drop=True)

# generating a temporary DataFrame to join later (error here)
df_split = pd.concat([new_status, new_status_info], axis=1, keys=['status', 'status_info'])

# then, we join both DataFrames
df.drop(columns=['status','status_info'], axis=1).join(df_split).reset_index(drop=True)

But it gives me a ValueError:

ValueError: cannot reindex from a duplicate axis

When I change .reset_index(level=1, drop=True) to .reset_index(drop=True) in the split step, the join operation gives me only one of the values instead of the expected two:
| track_id | track_date | status | status_info |
|---|---|---|---|
| track_1 | 2021-01-01 | approved | None |
| track_2 | 2021-01-02 | None | accredited |
| track_3 | 2021-01-03 | approved | accredited |
| track_4 | 2021-01-04 | approved | approved |
| track_5 | 2021-01-05 | approved | cancelled |
| track_6 | 2021-01-06 | None | accredited |
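
Both symptoms can be traced to what each reset_index call does to the index. The sketch below uses a hypothetical two-row frame (not the full sample) and only inspects the resulting index; it does not try to reproduce the exception itself, whose exact type and message may differ between pandas versions.

```python
import pandas as pd

# Hypothetical two-row frame: the second row splits into two pieces.
df = pd.DataFrame({
    "track_id": ["track_3", "track_5"],
    "status_info": ["accredited", "accredited|cancelled"],
})

pieces = df["status_info"].str.split("|", expand=True).stack()

# Variant 1: drop only the inner level. The original row labels survive,
# so a row that split into two pieces contributes its label twice.
keep_labels = pieces.reset_index(level=1, drop=True)
print(keep_labels.index.is_unique)

# Variant 2: drop every level. The pieces are renumbered 0..n-1 and the
# labels no longer say which original row a piece came from.
renumbered = pieces.reset_index(drop=True)
print(renumbered.index.tolist())
```

With duplicated labels, concatenating two such Series of different shape along axis=1 forces pandas to align on a non-unique index, which it refuses to do. With renumbered labels the concat succeeds, but since join aligns on index labels, row i of df is paired with the i-th split piece, which is not necessarily a piece that came from row i.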

brc7rcf0 #1

You can use melt, explode, and pivot:

cols = ['track_id', 'track_date']

(df.melt(cols, ignore_index=False).reset_index()
   .assign(value=lambda d: d['value'].str.split('|'))
   .explode('value')
   .assign(n=lambda d: d.groupby(level=0).cumcount())
   .pivot(index=cols+['index', 'n'], columns='variable', values='value')
   .reset_index(cols).droplevel('n')
   .groupby(level=0).ffill()
   .rename_axis(index=df.index.name, columns=df.columns.name)
)

Output:

  track_id  track_date    status status_info
0  track_1  2021-01-01  approved        None
1  track_2  2021-01-02      None  accredited
2  track_3  2021-01-03  approved  accredited
3  track_4  2021-01-04  approved    approved
4  track_5  2021-01-05  approved  accredited
4  track_5  2021-01-05  approved   cancelled
5  track_6  2021-01-06      None  accredited
5  track_6  2021-01-06      None   cancelled
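
The groupby(level=0).ffill() step matters when the two columns split into a different number of pieces. A minimal sketch, using a hypothetical single-row frame in which status has one piece and status_info has two, shows the gap left by pivot and how the forward fill closes it:

```python
import pandas as pd

# Hypothetical row: one piece in `status`, two pieces in `status_info`.
df = pd.DataFrame({
    "track_id": ["track_5"],
    "track_date": ["2021-01-05"],
    "status": ["approved"],
    "status_info": ["accredited|cancelled"],
})
cols = ["track_id", "track_date"]

# Same pipeline as in the answer, stopped just before the forward fill.
wide = (
    df.melt(cols, ignore_index=False).reset_index()
      .assign(value=lambda d: d["value"].str.split("|"))
      .explode("value")
      .assign(n=lambda d: d.groupby(level=0).cumcount())
      .pivot(index=cols + ["index", "n"], columns="variable", values="value")
      .reset_index(level=cols).droplevel("n")
)
print(wide)    # the second row has no `status` piece to pair with

# Forward fill within each original row (index level 0).
filled = wide.groupby(level=0).ffill()
print(filled)  # the missing `status` is copied down from the first row
```
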

kmbjn2e3 #2

You can try using itertools.zip_longest:

from itertools import zip_longest

df['status'] = df['status'].str.split('|')
df['status_info'] = df['status_info'].str.split('|')

# make `status`, `status_info` columns the same length
df[['status', 'status_info']] = df[['status', 'status_info']].apply(lambda x: [*zip(*zip_longest(x['status'], x['status_info']))], axis=1, result_type='expand')
print(df.explode(['status', 'status_info']))

Prints:

  track_id  track_date    status status_info
0  track_1  2021-01-01  approved        None
1  track_2  2021-01-02      None  accredited
2  track_3  2021-01-03  approved  accredited
3  track_4  2021-01-04  approved    approved
4  track_5  2021-01-05  approved  accredited
4  track_5  2021-01-05  approved   cancelled
5  track_6  2021-01-06      None  accredited
5  track_6  2021-01-06      None   cancelled
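
To see what the zip_longest line does to a single row, here is the same transformation applied to two plain lists. The values are hypothetical and chosen to have different lengths:

```python
from itertools import zip_longest

# Hypothetical split results for one row.
status = ["approved"]
status_info = ["accredited", "cancelled"]

# Pair the pieces position by position, padding the shorter list with None.
pairs = list(zip_longest(status, status_info))
print(pairs)

# Transpose back into one tuple per column, now of equal length.
padded_status, padded_info = zip(*pairs)
print(padded_status)
print(padded_info)
```

Note that the shorter list is padded with None rather than repeating its last value. Passing a list of columns to DataFrame.explode requires pandas 1.3 or later, and the lists in each row must have matching lengths, which is what the padding ensures.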
