在Python中扫描多个json文件并将所有数据(不重复)合并到.csv文件中

gojuced7  于 11个月前  发布在  Python
关注(0)|答案(1)|浏览(133)

我试图遍历多个json文件,将它们附加到一个大列表中,然后在最后扫描大列表,将每个(唯一)数据条目添加到CSV文件中,但不知道如何做到这一点。
哦,我还有一个问题,我希望每个JSON文件都有一个时间戳,它们是这样创建的 “AJSONFILE_2023-11-21_11-18-21.json”。我希望每个时间戳都是我最后创建的CSV文件中唯一的一列。
JSON文件的内容如下所示:

{
  "apps": {
    "module_a": {
      "state": "Starting",
      "restart counter": 0
    },
    "module_b": {
      "state": "Starting",
      "restart counter": 1
    }
  }
}

字符串
我的代码目前看起来像这样:

import csv
import json
import os
import re

def scan_caps_files() -> None:
    json_files_directory = "FILE_DIRECTORY"

    all_data = []
    timestamp_pattern = re.compile(r"\d{4}-\d{2}-\d{2}_\d{2}-\d{2}-\d{2}")
    filetime = None
    extracted_timestamps = []
    extracted_info = []

    extracted_info.append({"Name": "Name", "Timestamp": "-", "Restart Counter": "Restart Counter"})

    result = list()

    for filename in os.listdir(json_files_directory):
        if filename.endswith(".json"):
            file_path = os.path.join(json_files_directory, filename)
            filetime = timestamp_pattern.search(filename)

            try:
                with open(file_path, "r") as file:
                    json_content = json.load(file)
                    all_data.append(json_content)
                    extracted_timestamps.append(filetime.group())
            except Exception as e:
                print(f"Error while reading file {file_path}: {e}")

    with open('all_crashes.json', 'w') as output_file:
        json.dump(result, output_file)
        print("Path to newly created JSON file:", os.path.abspath("all_crashes.json"))

    try:
        for data in all_data:
            crash_data = data.get("apps", {})

            for app_name, app_data in crash_data.items():
                new_info = {"Name": "", "Timestamp": "", "Restart Counter": 0}

                new_info["Name"] = app_name
                new_info["Timestamp"] = app_data.get("state", "N/A")
                new_info["Restart Counter"] = app_data.get("restart counter", 0)

                result.extend(new_info)
                print(result)
                if app_name not in extracted_info:
                    extracted_info.append(new_info)
                else:
                    print("Already in the list")
    except Exception as e:
        print(f"An error occurred when trying to scan the file(s): {e}")

    csv_file_path = "output.csv"
    csv_header = ["Name", "Timestamp", "Restart Counter"]

    with open(csv_file_path, mode="w", newline="", encoding="utf-8") as csv_file:
        writer = csv.DictWriter(csv_file, fieldnames=csv_header)
        writer.writeheader()
        writer.writerows(extracted_info)

    print("Absolute Path for CSV File:", os.path.abspath(csv_file_path))

if __name__ == "__main__":
    scan_caps_files()


但是当我使用“csvlook output.csv”时得到的结果是这样的:

**| Name                      | Timestamp  | Restart Counter |
| --------------------------- | ---------- | --------------- |
| Name                        | -          | Restart Counter |
| module_a                    | Starting   | 0               |
| module_b                    | Starting   | 1               |
| module_a                    | Running    | 0               |
| module_b                    | Terminated | 1               |**


目前我只使用两个JSON文件(用于测试目的),名为“AJSONFILE_2023-11-21_11-18-21.json”和“AJSONFILE_2023-11-21_11-19-23.json"。
我想要的问题或更多:我希望它看起来更像没有重复的模块名称,并且每个文件/时间戳都有一个唯一的列,再加上**在CSV文件的末尾添加所有的“restart_counter”**数据,看起来像这样:

**| Name                      | 2023-11-21_11-18-21 | 2023-11-21_11-19-24 | Restart Counter |
| --------------------------- | ------------------- | ------------------- | --------------- |
| module_a                    | Starting            | Running             | 0               |
| module_b                    | Starting            | Terminated          | 2               |**


如果有人能给予我一些指点,我会非常感激!
我试着谷歌,在Stackoverflow中查找,使用ChatGPT。

yshpjwxd

yshpjwxd1#

我会通过阅读所有JSON文件并建立中间数据结构来实现这一点,然后将其转换为最终的CSV。
我模拟了这三个输入JSON,添加了一个模块_c,并使其不会出现在每个文件中:

  • input_2023-11-21_11-18-21.json:
{
  "apps": {
      "module_a": {"state": "Starting", "restart counter": 0},
      "module_b": {"state": "Starting", "restart counter": 1},
      "module_c": {"state": "Starting", "restart counter": 1}
  }
}

字符串

  • input_2023-11-21_11-19-24.json:
{
  "apps": {
    "module_a": {"state": "Running", "restart counter": 0},
    "module_b": {"state": "Terminated", "restart counter": 1}
  }
}

  • input_2023-11-21_11-20-07.json:
{
  "apps": {
    "module_c": {"state": "Running", "restart counter": 0}
  }
}


中间数据结构需要累积重新启动,并将时间戳与状态配对。对于上面的示例文件,我将结构描述为:

data = {
    "module_a": {
        "Restart Counter": 0,
        "2023-11-21_11-18-21": "Starting",
        "2023-11-21_11-19-24": "Running",
    },
    "module_b": {
        "Restart Counter": 2,
        "2023-11-21_11-18-21": "Starting",
        "2023-11-21_11-19-24": "Terminated",
    },
    "module_c": {
        "Restart Counter": 1,
        "2023-11-21_11-18-21": "Starting",
        "2023-11-21_11-20-07": "Running",
    },
}


你可以像这样构建这个结构:

data = {}
for fname in sorted(glob.glob("*.json")):
    # ':-5' to chomp '.json', '-19:' to take just the date-time stamp
    dt_stamp = fname[:-5][-19:]

    with open(fname) as f:
        for module_name, values in json.load(f)["apps"].items():
            module = data.get(module_name, {"Restart Counter": 0})

            module["Restart Counter"] += values["restart counter"]
            module[dt_stamp] = values["state"]

            data[module_name] = module


您可以使用csv模块的DictWriter将该结构(只需要一点工作)转换为最终的CSV。
DictWriter要求你告诉它在它试图行化的dicts中期望什么字段名,你还可以控制这些字段名的顺序:

timestamps = set()
for values in data.values():
    for x in values:
        if x != "Restart Counter":
            timestamps.add(x)
final_fieldnames = ["Name"] + sorted(timestamps) + ["Restart Counter"]

使用这些字段名创建DictWriter,然后循环遍历数据中的项。您需要获取键(模块名称)和值,并将它们合并组合成单个dict {"Name": module_name, **values},以传递给DictWriter的writerow()方法:

with open("output.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=final_fieldnames, )
    writer.writeheader()
    for module_name, values in sorted(data.items()):
        writer.writerow({"Name": module_name, **values})
| Name     | 2023-11-21_11-18-21 | 2023-11-21_11-19-24 | 2023-11-21_11-20-07 | Restart Counter |
|----------|---------------------|---------------------|---------------------|-----------------|
| module_c | Starting            |                     | Running             | 1               |
| module_b | Starting            | Terminated          |                     | 2               |
| module_a | Starting            | Running             |                     | 0               |

DictWriter通过写入空字符串来处理一行没有所有字段名的情况,您可以通过设置restval=属性来更改此情况:

...
writer = csv.DictWriter(f, fieldnames=final_fieldnames, restval="FOO-BAR")
...
| Name     | 2023-11-21_11-18-21 | 2023-11-21_11-19-24 | 2023-11-21_11-20-07 | Restart Counter |
|----------|---------------------|---------------------|---------------------|-----------------|
| module_a | Starting            | Running             | FOO-BAR             | 0               |
| module_b | Starting            | Terminated          | FOO-BAR             | 2               |
| module_c | Starting            | FOO-BAR             | Running             | 1               |

相关问题