pandas: split a list column into rows without duplicating data

Posted by fd3cxomn on 2023-05-05 in Other

I have a DataFrame whose first column holds a list. How can I iterate over each list and write values into the matching predefined columns?

workflow                cost      cam    gdp     ott    pdl
['cam', 'gdp', 'ott']   $2,346
['pdl', 'ott']          $1,200

It should be transformed into:

workflow                cost      cam    gdp     ott    pdl
['cam', 'gdp', 'ott']   $2,346    782    782     782
['pdl', 'ott']          $1,200                   600    600

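I can get the length of each list, but I don't know how to iterate over the list to match its items against the column headers. Basically, the cost is just split evenly among the processes in the list.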
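To reproduce the example, the frame can be rebuilt roughly as follows. This is a minimal sketch, assuming `workflow` holds real Python lists and `cost` holds strings such as `"$2,346"`; the names `amount` and `share` are introduced here only for illustration.

```python
import pandas as pd

# Rebuild the example frame (assumption: lists in 'workflow', strings in 'cost').
df = pd.DataFrame({
    "workflow": [["cam", "gdp", "ott"], ["pdl", "ott"]],
    "cost": ["$2,346", "$1,200"],
})

# Predefined, still-empty target columns, as in the question.
for col in ["cam", "gdp", "ott", "pdl"]:
    df[col] = float("nan")

# Strip "$" and "," so the cost can be treated as a number.
amount = df["cost"].str.replace(r"[$,]", "", regex=True).astype(float)

# Even split: cost divided by the number of processes in each list.
share = amount / df["workflow"].str.len()
print(share.tolist())  # [782.0, 600.0]
```

The remaining task is only to place each row's `share` into the columns named in that row's list.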


j7dteeu8 #1

Another option:

df1 = (
    df.assign(cost=
        df["cost"].str.replace(r"\$|,", "", regex=True).astype("float")
        / df["workflow"].str.len()
    )
    .explode("workflow")
    .pivot(columns="workflow", values="cost")
)
df = pd.concat([df[["workflow", "cost"]], df1], axis=1)

Sample result:

          workflow    cost    cam    gdp    ott    pdl
0  [cam, gdp, ott]  $2,346  782.0  782.0  782.0    NaN
1       [pdl, ott]  $1,200    NaN    NaN  600.0  600.0
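To see what each step of that chain contributes, it can be unrolled into named intermediates. This is a sketch on the question's data; the names `per_step`, `long`, `wide` and `result` are illustrative and not part of the answer.

```python
import pandas as pd

df = pd.DataFrame({
    "workflow": [["cam", "gdp", "ott"], ["pdl", "ott"]],
    "cost": ["$2,346", "$1,200"],
})

# 1) Numeric cost divided by the number of items in each list.
per_step = (
    df["cost"].str.replace(r"\$|,", "", regex=True).astype(float)
    / df["workflow"].str.len()
)

# 2) One row per (original row, process); the original index is repeated.
long = df.assign(cost=per_step).explode("workflow")

# 3) Spread the process names into columns, keyed by the original index.
wide = long.pivot(columns="workflow", values="cost")

# 4) Put the untouched workflow/cost columns back next to the split values.
result = pd.concat([df[["workflow", "cost"]], wide], axis=1)
print(result)
```

Because `explode` keeps the original index, `pivot` can line the values back up with their source rows without an explicit key column.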

lg40wkob #2

You can do it like this:

for i in df.index:
    cost = float(df.loc[i,'cost'][1:].replace(',',''))
    cols = df.loc[i, 'workflow']
    df.loc[i,cols] = cost / len(cols)

Output:

          workflow    cost    cam    gdp    ott    pdl
0  [cam, gdp, ott]  $2,346  782.0  782.0  782.0    NaN
1       [pdl, ott]  $1,200    NaN    NaN  600.0  600.0

doinxwow #3

You can try the following:

import pandas as pd
import numpy as np

df = pd.DataFrame(dict(workflow=[['cam', 'gdp', 'ott'], ['pdl', 'ott']],
                       cost=[2346, 1200]), dtype=object)

# Per-row share of the cost, hard-coded here for the two example rows
defaults = [782, 600]

for default_value, (index, row) in zip(defaults, df.iterrows()):
    for col in row['workflow']:
        # Create the target column the first time it is needed
        if col not in df:
            df[col] = [np.nan] * len(df)

        df.at[index, col] = default_value

print(df)
# yields
#           workflow  cost    cam    gdp    ott    pdl
# 0  [cam, gdp, ott]  2346  782.0  782.0  782.0    NaN
# 1       [pdl, ott]  1200    NaN    NaN  600.0  600.0
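The `defaults` list in this answer is typed in by hand. A variant that derives each share from the row itself might look like the sketch below, assuming `cost` is already numeric as in this answer's frame; the name `shares` is illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "workflow": [["cam", "gdp", "ott"], ["pdl", "ott"]],
    "cost": [2346, 1200],
})

# Per-row share: cost split evenly across the processes in the list.
shares = df["cost"] / df["workflow"].str.len()

for index, share in shares.items():
    for col in df.at[index, "workflow"]:
        # Create the target column the first time it is needed.
        if col not in df:
            df[col] = np.nan
        df.at[index, col] = share

print(df)
```

This keeps the same row-by-row loop, so it is just as slow on large frames; it only removes the hard-coded values.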

bqf10yzr #4

Use a custom function with df.apply:

def fill_cols(x):
    cost = float(x['cost'][1:].replace(',', ''))
    x[x['workflow']] = cost / len(x['workflow'])
    return x

df = df.apply(fill_cols, axis=1)

Output:

          workflow    cost    cam    gdp    ott    pdl
0  [cam, gdp, ott]  $2,346  782.0  782.0  782.0    NaN
1       [pdl, ott]  $1,200    NaN    NaN  600.0  600.0

gojuced7 #5

Another possible solution:

df['workflow'].to_frame().join(
   pd.DataFrame.from_records(
      [(x, y, x1, y/len(x), z) for x, y, z in 
       zip(df['workflow'], df['cost'], df.index)
       for x1 in x],
      columns = df.columns.to_list() + ['aux1', 'aux2', 'id'])
   .pivot(index=['id', 'cost'], columns='aux1', values='aux2')
   .rename_axis(None, axis=1).reset_index()).drop('id', axis=1)

Output:

          workflow  cost    cam    gdp    ott    pdl
0  [cam, gdp, ott]  2346  782.0  782.0  782.0    NaN
1       [pdl, ott]  1200    NaN    NaN  600.0  600.0

798qvoo8 #6

This isn't the best approach, but you can try:

splits = (
   df.iloc[:, :2].explode("workflow").set_index("workflow", append=True)
     .assign(cost= lambda x: pd.to_numeric(x["cost"].replace(r"\$|,", "", regex=True)))
     .T.groupby(level=0, axis=1, group_keys=False).apply(lambda g: g/len(*g.to_numpy()))
     .stack(0).reset_index(drop=True)
)

out = df.iloc[:, :2].join(splits)

Output:

print(out)

          workflow    cost    cam    gdp    ott    pdl
0  [cam, gdp, ott]  $2,346 782.00 782.00 782.00    NaN
1       [pdl, ott]  $1,200    NaN    NaN 600.00 600.00
