在嵌套循环中通过键合并两个csv文件只在第一次迭代中起作用

ca1c2owp  于 2023-07-31  发布在  其他
关注(0)|答案(1)|浏览(87)

我有两个csv文件,它们有一个公共密钥,第二列是file 1,第一列是file 2。
我想写一个新的csv文件,其中包含两个文件中的列(不是两个文件中的所有列,只是我需要的列)。
这就是我所尝试的:

import numpy as np
import csv
import os

output_dir = "<output_dir>"
file1 = "f1.csv"
file2 = "f2.csv"

path1 = os.path.join(output_dir, file1)
path2 = os.path.join(output_dir, file2)

file3 = "fd_labels_processed.csv"
output_file = os.path.join(output_dir, file3)

with open(path1, 'r') as f1, open(path2, 'r') as f2, \
        open(output_file, 'w+', newline='') as f3:
    f1_reader = csv.reader(f1, delimiter=',')
    f2_reader = csv.reader(f2, delimiter=',')
    header_f1 = []
    header_f1 = next(f1_reader) # reading the next line after header of csv file.

    header_f2 = []
    header_f2 = next(f2_reader) # reading the next line after header of csv file.
    count = 0
    writer = csv.writer(f3, delimiter=',') #preparing the file f3 for writing the file.

    writer.writerow(["ATTRIBUTE_1", "ATTRIBUTE_2", "ATTRIBUTE_3", "ATTRIBUTE_4", "ATTRIBUTE_5", "ATTRIBUTE_6", "ATTRIBUTE_7", "ATTRIBUTE_8", "ATTRIBUTE_9"])

    for row_f1 in f1_reader: # looking each row from csv file f1
        for row_f2 in f2_reader: # looking for each row from csv file f2
            if row_f1[1] == row_f2[0]: #checking the condition; worse case Time complexity o(n2)
                if (row_f1[3] == row_f2[2] and row_f1[4] == row_f2[3] and row_f1[5] == row_f2[4] and row_f1[6] == row_f2[5]):
                    print(count)
                    writer = csv.writer(f3)
                    row_f2.append(row_f1[9])
                    row_f2.append(row_f1[11])
                    row_f2.append(row_f1[12])
                    writer.writerows([row_f2])
                    count +=1

字符串
但由于某种原因,它从f1_reader中获取第一行(在header之后),遍历f2_reader中的所有行一次,然后停止。所以外面的for-循环在第一行之后停止。

67up9zun

67up9zun1#

要在公共列上连接两个CSV,首先需要读取一个CSV并将其行存储在公共列上,然后读取第二个CSV并更新第一次读取的行。
我模仿了这两个CSV:

input1.csv
==========
| ID | Name | Day |
|----|------|-----|
| 1  | Foo  | Sun |
| 2  | Bar  | Mon |
| 3  | Baz  | Wed |

input2.csv
==========
| ID | Week |
|----|------|
| 3  | 15   |
| 2  | 10   |
| 1  | 5    |

字符串
如果我想要一个最终的合并CSV(减去Day列),如:

| ID | Name | Week |
|----|------|------|
| 1  | Foo  | 5    |
| 2  | Bar  | 10   |
| 3  | Baz  | 15   |


然后我需要一些代码如下:

import csv

rows: dict[str, dict[str, str]] = {}

with open("input1.csv", newline="", encoding="utf-8") as f:
    reader = csv.DictReader(f)

    for row in reader:
        id_ = row["ID"]
        rows[id_] = row

with open("input2.csv", newline="", encoding="utf-8") as f:
    reader = csv.DictReader(f)

    for row in reader:
        id_ = row["ID"]
        if id_ not in rows:
            continue
        rows[id_].update(row)

with open("output.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["ID", "Name", "Week"], extrasaction="ignore")
    writer.writeheader()
    writer.writerows(rows.values())


我存储所有信息的数据结构是一个dict,它的键作为公共列值(在我的示例中是ID),并且dict值本身就是表示完整行的dict。
阅读input1.csv后,行看起来像:

{
    "1": {"ID": "1", "Name": "Foo", "Day": "Sun"},
    "2": {"ID": "2", "Name": "Bar", "Day": "Mon"},
    "3": {"ID": "3", "Name": "Baz", "Day": "Wed"},
}


在阅读input2.csv并使用新值更新之前的行之后,行看起来像这样:

{
    "1": {"ID": "1", "Name": "Foo", "Day": "Sun", "Week": "5"},
    "2": {"ID": "2", "Name": "Bar", "Day": "Mon", "Week": "10"},
    "3": {"ID": "3", "Name": "Baz", "Day": "Wed", "Week": "15"},
}


最后,我使用csv模块的DictWriter写回行var的值。我告诉DictWriter我只需要字段名ID、Name和Week,然后我告诉它忽略它可能在写出来的行中找到的任何额外的键/字段名(即“忽略Day列”):

writer = csv.DictWriter(f, fieldnames=["ID", "Name", "Week"], extrasaction="ignore")


您决定首先读取哪个CSV可能取决于一个CSV是否比另一个具有更多的行/键:

  • 是否只需要两个CSV的交集?然后首先读取较小的CSV,并保留“如果ID不在行中,则继续”逻辑。
  • 是否希望所有行都包含空列(如果其他CSV中不存在这些行)?比如:
input2.csv
==========
| ID | Week |
|----|------|
| 4  | 20   |
| 3  | 15   |
| 2  | 10   |
| 1  | 5    |


然后:

...
with open("input2.csv", newline="", encoding="utf-8") as f:
reader = csv.DictReader(f)

for row in reader:
    id_ = row["ID"]
    if id_ not in rows:
        rows[id_] = row
    else:
        rows[id_].update(row)
...


现在输出如下所示:

| ID  | Name | Week |
| --- | ---- | ---- |
| 1   | Foo  | 5    |
| 2   | Bar  | 10   |
| 3   | Baz  | 15   |
| 4   |      | 20   |


如果您不需要自己编写程序,并且可以使用已经编写和测试过的程序,我推荐使用GoCSV的join命令。对于您的情况,file 1将是 * 左文件 *,file 2将是 * 右文件 *,命令如下所示:

gocsv join -c 2,1 file1 file2


对左文件的第二列和右文件的第一列执行默认的内部联接。

相关问题