csv 遍历存在于dataframe中的多索引字典并将值存储在列表中

nue99wik  于 2023-04-09  发布在  其他
关注(0)|答案(2)|浏览(144)

上面的链接包含 Dataframe ,它有2列[CREATEDAT,RESPONSE],目标是遍历响应列,在该响应列中有一个估计键,在该键中,我必须获取提供程序&storeexternalid

在代码片段中添加样本行,位置为['store-boundary-dsp']['estimates']
我已经写了一个函数,它可能需要一些修改添加在片段

def traverse_dsp(data_frame,column):
    provider = []
    store_id= []

    #Iterate over each row in Dataframe
    for index, row in data_frame.iterrows():
        
        # Iterate over each json object in each row in DataFrame
        for i in range(0,len(row[column])):

            for k,v in row[column]['store-boundary-dsp']['estimates'][i].items():

                if k=="storeExternalId":

                    store_val=v
                    store.append(store_val)

                if k=="provider":

                    provider_val=v
                    provider.append(provider_val)

               

    return provider,store_id
ubof19bj

ubof19bj1#

如果你use nested list comprehension instead嵌套的for循环:

# import pandas as pd
# import ast ## I just needed this to parse the RESPONSE column from csv
# df = pd.read_csv('https://raw.githubusercontent.com/ajayvd/dataframe/main/data_sub.csv')
# df['RESPONSE'] = df['RESPONSE'].apply(ast.literal_eval) # maybe only after read_csv

k3List = ('provider', 'storeExternalId') 
get_e = lambda resp_v:resp_v['store-boundary-dsp']['estimates']
def get_separate_lists(data_frame,column='RESPONSE', k3List=k3List, get_l=get_e): 
    def get_k3(k3):
        return [e[k3] for resp in data_frame[column] for e in get_l(resp) if k3 in e]
      
    isList = isinstance(k3List, (list,tuple,set)) 
    lists = [get_k3(k) for k in (k3List if isList else [k3List])]
    return lists if isList else lists[0]

provider, store_id = get_separate_lists(df)
# provider = get_separate_lists(df, k3List='provider')
# store_id = get_separate_lists(df, 'RESPONSE', 'storeExternalId')

[k3List可以是单个键或键的列表(或元组或集合),get_l应该是一个函数。
如果你想要并行列表,你可以从一个元组列表开始,然后unpackzip基本上将它们“解压缩”到单独的列表中:

# k3List, get_e = ... ## as before
def get_tuple_lists(data_frame,column='RESPONSE', k3List=k3List, get_l=get_e): 
    return [    tuple(e.get(k3) for k3 in k3List) 
                for resp in data_frame[column] for e in get_l(resp)    ]

provider_stores = get_tuple_lists(df)
provider, store_id = [list(t) for t in zip(*provider_stores)]
# provider, store_id = list(zip(*provider_stores)) ## 2 tuples instead of 2 lists

无论使用哪种函数,print(f'{store_id=}\n{provider=}')都应该打印

store_id=['1504', '1504', '9346', '9346', '1035', '4883', '3791', '5464', '5464', '3869', '3869', '7510', '6221', '5708', '5708', '3465']
provider=['Instacart', 'DoorDash', 'DoorDash', 'Uber', 'DoorDash', 'DoorDash', 'DoorDash', 'Postmates', 'DoorDash', 'Skipcart', 'DoorDash', 'DoorDash', 'DoorDash', 'Postmates', 'DoorDash', 'DoorDash']

get_tuple_lists的直接输出如下所示:

provider_stores=[('Instacart', '1504'), ('DoorDash', '1504'), ('DoorDash', '9346'), ('Uber', '9346'), ('DoorDash', '1035'), ('DoorDash', '4883'), ('DoorDash', '3791'), ('Postmates', '5464'), ('DoorDash', '5464'), ('Skipcart', '3869'), ('DoorDash', '3869'), ('DoorDash', '7510'), ('DoorDash', '6221'), ('Postmates', '5708'), ('DoorDash', '5708'), ('DoorDash', '3465')]

如果你不确定所使用的外部键(如上面的 * store-boundary-dsp * 和 * estimates *)是否存在于每一行,你可以在get_l中使用try...except

def get_b(resp_v):
    try: return resp_v['store-boundary-dsp']['boundaries']
    except: return [] 
boundary_names = set(get_separate_lists(df, k3List='name', get_l=get_b))
# --> # boundary_names={'9346 - Area - 1', '1504 - Primary', '1504-Primary'}

只是一些关于你问题中的片段的注解:

for index, row in data_frame.iterrows():
        for i in range(0,len(row[column])):
            for k,v in row[column]['store-boundary-dsp']['estimates'][i].items():
               # if k==....
  • 这里实际上不需要使用.iterrows()range.items()-可以使用
for rc in data_frame[column]:
        for est in rc['store-boundary-dsp']['estimates']:
            if 'storeExternalId' in est: store_id.append(est['storeExternalId'])
            if 'provider' in est: provider.append(est['provider'])
  • 即使在 * for k,v * 循环中,定义store_valprovider_val(作为v)也是多余的[除非您计划在其各自的if块之外使用它们,或者计划以某种方式修改v],此时您只需**.append(v)**
  • 您还可以添加/提取任意数量的列表[从store-boundary-dsp.estimates],方法是将它们添加到下面的all_lists字典中[而不是在 * for est... * 中编写更多的if]
all_lists = {
        'storeExternalId': (store_id := []),
        'provider': (provider := []),
    }

    for rc in data_frame[column]:
          for est in rc['store-boundary-dsp']['estimates']:
              for k in all_lists:
                  if k in est: all_lists[k].append(est[k])

def get_k3(k3, k2='estimates', k1='store-boundary-dsp'):
        return [e[k3] for resp in data_frame[column] for e in resp[k1][k2] if k3 in e]

    k3List = ['provider', 'storeExternalId'] ## line them up EXACTLY
    provider, store_id = [get_k3(k) for k in k3List]

顺便说一句,你也可以使用.explodejson_normalize来完全扁平化DataFrame:

df1 = pd.concat([df[['CREATEDAT']], pd.json_normalize(df['RESPONSE'])], axis=1)
df1.columns = [c.split('.',1)[-1] for c in df1.columns]
lCols = ['value', 'errors', 'tags.1504']
dlCols = ['boundaries', 'distances', 'estimates', 'fulfillments']
for c in (lCols+dlCols): df1 = df1.explode(c)

df1 = pd.concat([df1.drop(dlCols, axis=1).reset_index(drop=True), *[
    pd.json_normalize(df1[c]).rename(columns=lambda cn: f'{c}.{cn}')
    for c in dlCols 
]],  axis=1)#.dropna(axis='columns', thresh=140) 
## dropna(axis='columns',thresh=N)--> only keep columns with < N empty cells

62lalag4

62lalag42#

这将创建一个元组列表。每个元组将是store_id和provider对。apply函数用于迭代RESPONSE列,并将该单元格的值传递给extract_details函数,该函数处理主要的数据提取部分。

def extract_details(response):

    res = []

    # Check to ensure first two keys are in the response, otherwise quit early
    if not 'store-boundary-dsp' in response:
        return res

    if not 'estimates' in response['store-boundary-dsp']:
        return res

    # Iterate through each estimate
    for estimate in response['store-boundary-dsp']['estimates']:
        # Iterate through each key, value pair for the estimage
        store_id = ''
        providor = ''
        for k, v in estimate.items():
            
            if k == 'storeExternalId':
                store_id = v
            if k == 'provider':
                providor = v
        res.append((store_id, providor))
                
    return res

# For each cell in the 'RESPONSE' column, extract out the store_id, providor pair
response_values = data_frame['RESPONSE'].apply(extract_details).tolist()
response_values

pair_values = [val for sublist in response_values for val in sublist]
pair_values

相关问题