CSV DataFrame is creating duplicates, not sure what is going wrong

pn9klfpd  posted on 2022-12-15  in  Other

My Python code:

import pandas as pd

student_dict = {
"ID":[101,102,103,104,105],
"Student":["AAA","BBB","CCC","DDD","EEE"],
"Mark":[50,100,99,60,80],
"Address":["St.AAA","St.BBB","St.CCC","St.DDD","St.EEE"],
"PhoneNo":[1111111111,2222222222,3333333333,4444444444,5555555555]
}

df = pd.DataFrame(student_dict)

print("First dataframe")
print(df)

fName = "Student_CSVresult.csv"

def dups(df):
    df = df.drop_duplicates(keep=False)
    return df

try:
    data = pd.read_csv(fName)
    print("CSV file data")
    print(data)
    df_merged = pd.concat([data, df])
    df = dups(df_merged)
    print("After removing dups")
    print(df)
    df.to_csv(fName, mode='a', index=False,header=False)
except FileNotFoundError:
    print("File Not Found Error")
    df = df.drop_duplicates()
    df.to_csv(fName, index=False)
    print("New file created and data imported")
except Exception as e:
    print(e)

On the first run, all the data is imported without any duplicates.

On the next run I supplied a different DataFrame:

student_dict = {
"ID":[101,102,103,104,105,101,102,103,104,105,106,107],
"Student":["AAA","BBB","CCC","DDD","EEE","AAA","BBB","CCC","DDD","EEE","YYY","ZZZ"],
"Mark":[50,100,99,60,80,50,100,99,60,80,100,80],
"Address":["St.AAA","St.BBB","St.CCC","St.DDD","St.EEE","St.AAA","St.BBB","St.CCC","St.DDD","St.EEE","St.AYE","St.ZZZ"],
"PhoneNo":[1111111111,2222222222,3333333333,4444444444,5555555555,1111111111,2222222222,3333333333,4444444444,5555555555,6666666666,7777777777]
}

That also worked without problems. Then I supplied the first DataFrame again:

student_dict = {
"ID":[101,102,103,104,105],
"Student":["AAA","BBB","CCC","DDD","EEE"],
"Mark":[50,100,99,60,80],
"Address":["St.AAA","St.BBB","St.CCC","St.DDD","St.EEE"],
"PhoneNo":[1111111111,2222222222,3333333333,4444444444,5555555555]
}

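This time the file ended up with duplicate rows.

Can anyone help me fix this? I don't want to overwrite the main file (Student_CSVresult.csv), only append to it.
Also, is there a way to add a new column to the file that automatically records the timestamp of each data entry?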

tzdcorbm, answer #1

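Here is what your program currently does:

  • df contains the new records.
  • The contents of "Student_CSVresult.csv" are loaded into data.
  • df and data are concatenated into df_merged.
  • Duplicates are removed from df_merged using keep=False, so that df_merged contains only the records that appear exactly once across df and data.
  • The contents of df_merged are appended to "Student_CSVresult.csv".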

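I don't think this is what you intended. The problem lies in this step:

  • Duplicates are removed from df_merged using keep=False, so that df_merged contains only the records that appear exactly once across df and data.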

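With keep=False, every copy of a duplicated record is dropped, not just the extra ones. In the third step of your test, when you supply the data from step 1 again, the records that were in step 2 but not in step 1 occur exactly once (only in the CSV file). They therefore survive the deduplication and are appended to the CSV file a second time, which produces duplicates in the file. That does not seem to be what you want.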
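The effect can be reproduced without touching the disk. The following is a minimal sketch, assuming a hypothetical helper append_step that mimics one run of the original script with an in-memory DataFrame standing in for the CSV file, and a shortened two-column version of the student data:

```python
import pandas as pd


def append_step(csv_rows, new_rows):
    """Mimic one run of the original script; csv_rows plays the role of the file."""
    merged = pd.concat([csv_rows, new_rows])
    # keep=False drops every copy of a duplicated row, so only rows that
    # occur exactly once across both inputs survive.
    to_append = merged.drop_duplicates(keep=False)
    # Appending to the "file" is modelled as concatenation.
    return pd.concat([csv_rows, to_append], ignore_index=True)


first = pd.DataFrame({"ID": [101, 102], "Student": ["AAA", "BBB"]})
second = pd.DataFrame({"ID": [101, 102, 106], "Student": ["AAA", "BBB", "YYY"]})

csv_rows = first.drop_duplicates()        # run 1: the file is created
csv_rows = append_step(csv_rows, second)  # run 2: only 106 is appended
csv_rows = append_step(csv_rows, first)   # run 3: 106 is appended again
print(csv_rows)
```

In run 3 the row with ID 106 exists only in the simulated file, so it is the one row that keep=False leaves behind, and it gets appended a second time.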
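Your question is not 100% clear, but I think you want to append to the CSV file one copy of each record in the incoming student data that is not already in the CSV file. To do that, you need to find the records that appear in the new data but not in the CSV file. A way of doing this is described in this answer. Here is a function I wrote based on it: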

def in_df2_only(df1, df2):
    merged = pd.merge(df1, df2, how='outer', indicator=True)
    return merged[merged['_merge'] == 'right_only'].drop(columns=['_merge'])
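To see what the function relies on, here is a small sketch using made-up two-column data (csv_data and incoming are illustrative names, not taken from the program). With indicator=True, pd.merge adds a _merge column that labels each row as left_only, right_only or both; the function keeps only the right_only rows.

```python
import pandas as pd


def in_df2_only(df1, df2):
    merged = pd.merge(df1, df2, how='outer', indicator=True)
    return merged[merged['_merge'] == 'right_only'].drop(columns=['_merge'])


# Illustrative data: 102 is in both frames, 101 only in the CSV, 103 only incoming.
csv_data = pd.DataFrame({"ID": [101, 102], "Student": ["AAA", "BBB"]})
incoming = pd.DataFrame({"ID": [102, 103], "Student": ["BBB", "CCC"]})

# The outer merge on all shared columns, with the indicator column still attached.
labelled = pd.merge(csv_data, incoming, how="outer", indicator=True)
print(labelled)

# Only the row that exists in incoming but not in csv_data remains.
new_rows = in_df2_only(csv_data, incoming)
print(new_rows)
```

Because no on= argument is given, the merge uses all columns the two frames share, so a record counts as already present only if every field matches.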

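While editing your code I renamed some variables to make it more readable, and I added a loop that runs all three steps of your test in one go, so that I don't have to run the program three times and replace student_dict before each run: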

import pandas as pd

student_dicts = [
    {
        "ID":[101,102,103,104,105],
        "Student":["AAA","BBB","CCC","DDD","EEE"],
        "Mark":[50,100,99,60,80],
        "Address":["St.AAA","St.BBB","St.CCC","St.DDD","St.EEE"],
        "PhoneNo":[1111111111,2222222222,3333333333,4444444444,5555555555]
    },
    {
        "ID":[101,102,103,104,105,101,102,103,104,105,106,107],
        "Student":["AAA","BBB","CCC","DDD","EEE","AAA","BBB","CCC","DDD","EEE","YYY","ZZZ"],
        "Mark":[50,100,99,60,80,50,100,99,60,80,100,80],
        "Address":["St.AAA","St.BBB","St.CCC","St.DDD","St.EEE","St.AAA","St.BBB","St.CCC","St.DDD","St.EEE","St.AYE","St.ZZZ"],
        "PhoneNo":[1111111111,2222222222,3333333333,4444444444,5555555555,1111111111,2222222222,3333333333,4444444444,5555555555,6666666666,7777777777]
    },
    {
        "ID":[101,102,103,104,105],
        "Student":["AAA","BBB","CCC","DDD","EEE"],
        "Mark":[50,100,99,60,80],
        "Address":["St.AAA","St.BBB","St.CCC","St.DDD","St.EEE"],
        "PhoneNo":[1111111111,2222222222,3333333333,4444444444,5555555555]
    },
]

fName = "Student_CSVresult.csv"

# ====================
def in_df2_only(df1, df2):
    merged = pd.merge(df1, df2, how='outer', indicator=True)
    return merged[merged['_merge'] == 'right_only'].drop(columns=['_merge'])

# ====================
for student_dict in student_dicts:
    student_df = pd.DataFrame(student_dict)
    print("Student data: ")
    print(student_df)
    try:
        csv_data = pd.read_csv(fName)
        print("Existing CSV data: ")
        print(csv_data)
        # Deduplicate once so that the printed rows match the rows written to the file
        not_in_csv = in_df2_only(df1=csv_data, df2=student_df).drop_duplicates()
        not_in_csv.to_csv(fName, mode='a', index=False, header=False)
        print("Records added to CSV: ")
        print(not_in_csv)
    except FileNotFoundError:
        print("File Not Found Error")
        student_df = student_df.drop_duplicates()
        student_df.to_csv(fName, index=False)
        print("New file created and data imported")
    except Exception as e:
        print(e)
    print()
    print('====================')
    print()

The final contents of the CSV file are:

ID,Student,Mark,Address,PhoneNo
101,AAA,50,St.AAA,1111111111
102,BBB,100,St.BBB,2222222222
103,CCC,99,St.CCC,3333333333
104,DDD,60,St.DDD,4444444444
105,EEE,80,St.EEE,5555555555
106,YYY,100,St.AYE,6666666666
107,ZZZ,80,St.ZZZ,7777777777
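The question also asks about recording a timestamp for each entry. One possible sketch follows; it is not part of the original answer. It assumes a hypothetical column name Timestamp and a hypothetical helper new_rows_with_timestamp. The comparison is restricted to the student columns, because a timestamp that took part in the merge would make every incoming row look new.

```python
from datetime import datetime

import pandas as pd

TS_COL = "Timestamp"  # hypothetical column name, not in the original code


def new_rows_with_timestamp(csv_data, student_df, now=None):
    """Return rows of student_df that are missing from csv_data, with a timestamp added."""
    # Compare on the student columns only, ignoring the timestamp column
    # that is already stored in the CSV data.
    key_cols = list(student_df.columns)
    merged = pd.merge(
        csv_data[key_cols],
        student_df.drop_duplicates(),
        how="outer",
        on=key_cols,
        indicator=True,
    )
    new_rows = merged[merged["_merge"] == "right_only"].drop(columns=["_merge"])
    # 'now' can be passed in for reproducible runs; otherwise use the current time.
    stamp = now or datetime.now().isoformat(sep=" ", timespec="seconds")
    return new_rows.assign(**{TS_COL: stamp})


# Illustrative data: the existing file already carries a timestamp column.
existing = pd.DataFrame({
    "ID": [101, 102],
    "Student": ["AAA", "BBB"],
    TS_COL: ["2022-12-15 10:00:00", "2022-12-15 10:00:00"],
})
incoming = pd.DataFrame({"ID": [102, 103, 103], "Student": ["BBB", "CCC", "CCC"]})

to_append = new_rows_with_timestamp(existing, incoming, now="2022-12-15 11:00:00")
print(to_append)
```

The result could then be appended with to_csv(fName, mode='a', index=False, header=False) as in the program above. For this to line up, the file created on the first run would need the same extra column.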
