将json字段中的关键字展开并提取到csv中

jjjwad0x  于 2023-03-04  发布在  其他
关注(0)|答案(1)|浏览(115)

我在df中有一个名为diff的列,其中的值类似于json字符串,格式为:

{'info': {'version': {'from': '2.0.0', 'to': '2.3.4'}}, 'paths': {'modified': {'/dummy': {'operations': {'added': ['PUT']}}}}, 'endpoints': {'added': [{'method': 'PUT', 'path': '/dummy'}]}, 'components': {'schemas': {'added': ['ObjectOfObjects', 'inline_object', 'ObjectOfObjects_inner']}, 'requestBodies': {'added': ['inline_object', 'nested_response']}}}

在这里infopathsendpointscomponents代表第一组嵌套元素。和第一个类别一样,我们还有下一个类别:例如info具有不同的字段,如:titledescription等,components具有如下字段:schemas等等。
df列看起来如下所示:

我想扁平化json,意思是分割所有参数,所以这意味着我会得到大约5-6个新列(因为这些是所有参数的变化,第一组元素)。我不想像你在图片from:... to:..中看到的那样保留这些变化,我只想改变fieldsub-fieldsub-sub-field
所以输出如下所示:

info      paths      endpoints  components
version    modified   added      schemas:added
                                 requestBodies:added

我研究了json_normalizeflattenjsonpath,但不知何故,这些都不适用于这个用例。它产生了与我想要的完全不同的输出。如果有人能帮助我解决这个问题,那就太好了!我似乎有点卡住了。

h9a6wy2h

h9a6wy2h1#

例如,如果您有一个DataFrame df,其中diff列包含嵌套字典,如下所示:

# import pandas as pd 
df = pd.DataFrame({'diff': [
    {'info': {'version': {'from': '2.0.0_1', 'to': '2.3.4_1'}}, 'paths': {'modified': {'/dummy': {'operations': {'added': ['PUT_1']}}}}, 'endpoints': {'added': [{'method': 'PUT_1', 'path': '/dummy_1'}]}, 'components': {'schemas': {'added': ['ObjectOfObjects_1', 'inline_object_1', 'ObjectOfObjects_inner_1']}, 'requestBodies': {'added': ['inline_object_1', 'nested_response_1']}}},
    {'info': {'version': {'from': '2.0.0_2', 'to': '2.3.4_2'}}, 'paths': {'modified': {'/dummy': {'operations': {'added': ['PUT_2']}}}}, 'endpoints': {'added': [{'method': 'PUT_2', 'path': '/dummy_2'}]}, 'components': {'schemas': {'added': ['ObjectOfObjects_2', 'inline_object_2', 'ObjectOfObjects_inner_2']}, 'requestBodies': {'added': ['inline_object_2', 'nested_response_2']}}}
] })

[我假设diff中的字典没有字符串化;如果是,则应使用json.loads/ast.literal_eval/等对其进行解析]
您可以使用类似下面的命令将特定的嵌套值.map到其他列

def try_get(obj, *keys, defaultVal=None):
    try:
        for k in keys: obj = obj[k]
        return obj
    except: return defaultVal

kSep = '.' ## specify a separator
extractKeys = [ ('info', 'version'), 
                ('paths', 'modified'), 
                ('endpoints', 'added'),
                ('components', 'schemas', 'added'), 
                ('components', 'requestBodies', 'added') ]
for kl in extractKeys:
    df[kSep.join(kl)] = df['diff'].map(lambda d: try_get(d, *kl))

如果希望新列出现在新DataFrame中,而不是添加到df中:

f_df = pd.DataFrame([{
    # kSep.join(kl): try_get(d, *kl) for kl in df['diff'] ## NO diff COLUMN
    'diff': d, **{kSep.join(kl): try_get(d, *kl) for kl in extractKeys}
} for d in df['diff']])

或者,如果您只想将diff中的所有嵌套字典扁平化:

def flattenDict(obj:dict, kSep='.', kRoot=[], unlistSingles=True):
    if unlistSingles and isinstance(obj, list) and len(obj)==1: return obj[0]
    if isinstance(obj, (list, set, tuple)) and not kRoot:
        return type(obj)(flattenDict(i, kSep, [], unlistSingles) for i in obj)

    if isinstance(obj, dict):
        kJoin = lambda kx: kSep.join(kRoot + [kx])
        flatDict = {}
        for k, v in obj.items():
            fv = flattenDict(v, kSep, kRoot+[k], unlistSingles)
            if not isinstance(fv, dict):
                flatDict[kSep.join(kRoot + [k])] = fv
                continue
            for kn, vn in fv.items(): flatDict[kn] = vn
        return flatDict
    
    return obj

# f_df = pd.DataFrame(flattenDict(list(df['diff']))) ## NO diff COLUMN
f_df = pd.DataFrame([{'diff': d, **flattenDict(d)} for d in df['diff']])

相关问题