pandas: Why does Python's pandas report "difference found" when comparing two CSV files whose cells are empty?

pvabu6sv  posted 12 months ago in Python


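(1) I am using Python's pandas to compare two CSV files. Both files contain exactly the same data set, so the comparison should return a statement like "the two files are identical". However, one column, titled "Error", is empty because no error values were recorded.
(2) When I run the file comparison, the script flags the "Error" column as True (that is, difference found).
(3) My code is below.
(4) Can anyone help? How can I avoid this when a cell is empty? I also have another data set in which both files contain columns with the value "None", and it shows the same behavior: file a and file b have "None" at the same positions, yet the comparison reports differences.

My code: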

import pandas as pd
import numpy as np  # Import numpy for NaN values

# List of file paths
file_paths = ['test_file_1.csv', 'test_file_2.csv']

# Create a list to store DataFrames
dataframes = []

# Load all CSV files into DataFrames
for file_path in file_paths:
    df = pd.read_csv(file_path)
    dataframes.append(df)

# Initialize a dictionary to store differences
differences = {}

# Compare each pair of DataFrames
for i in range(len(dataframes)):
    for j in range(i + 1, len(dataframes)):
        df1 = dataframes[i]
        df2 = dataframes[j]

        # Check if either DataFrame is None or has errors
        if df1 is None or df2 is None:
            continue

        # Fill empty cells with NaN
        df1 = df1.fillna(np.nan)
        df2 = df2.fillna(np.nan)

        # Compare the DataFrames cell by cell
        comparison_df = df1 != df2  # Use != to create a boolean DataFrame where differences are True
        print("BreakPoint")
        # Find the row and column indices where differences occur
        diff_locations = comparison_df.stack().reset_index()
        diff_locations.columns = ['Row', 'Column', 'Different']

        # Filter rows where differences are True
        diff_locations = diff_locations[diff_locations['Different']]

        # Store differences in the dictionary
        key = f'({file_paths[i]}) vs ({file_paths[j]})'
        differences[key] = diff_locations
        print("break point")

# Output the differences
for key, diff_locations in differences.items():
    if diff_locations.empty:
        print(f"{key}: The two CSV files are identical.")
    else:
        print(f"{key}: The two CSV files have differences at the following locations:")
        print(diff_locations)


li9yvcax #1

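NaN never compares equal to itself, so every cell that is NaN in both files is reported as a difference. Calling df1.fillna(np.nan) at the start is pointless, since the columns already contain NaN. I suggest you use: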

df1 = df1.fillna('')
df2 = df2.fillna('')
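
To see why the fill matters, here is a minimal sketch using two small hand-built frames. The column names and values are made up for illustration and are not taken from the question's files. It shows that an element-wise != flags NaN cells as different, and that filling with an empty string first removes those false positives.

```python
import numpy as np
import pandas as pd

# Two hypothetical frames with identical content, including an empty "Error" column.
df1 = pd.DataFrame({"ID": [1, 2], "Error": [np.nan, np.nan]})
df2 = df1.copy()

# NaN != NaN evaluates to True, so every empty cell is reported as a difference.
raw_diff = df1 != df2
nan_flagged = bool(raw_diff["Error"].all())

# After replacing NaN with '', the empty cells compare equal.
filled_diff = df1.fillna("") != df2.fillna("")
any_diff_after_fill = bool(filled_diff.any().any())

print(nan_flagged, any_diff_after_fill)
```

Here the non-in-place form of fillna is used, so the original frames keep their NaN values and only the copies used for the comparison are changed.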

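Or, to fill the frames in place (note that recent pandas versions may emit a warning when an in-place fill has to change a column's dtype, as it would for an all-NaN numeric column):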

df1.fillna('', inplace=True)
df2.fillna('', inplace=True)
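
If filling with '' is undesirable, for example because it turns numeric columns into object columns, one alternative is to keep the NaN values and treat cells that are NaN in both frames as equal. The sketch below is not part of the original answer. It assumes both frames have the same shape, index and column labels, and the helper name diff_locations_nan_aware and the sample data are hypothetical.

```python
import numpy as np
import pandas as pd


def diff_locations_nan_aware(df1: pd.DataFrame, df2: pd.DataFrame) -> pd.DataFrame:
    """Return (Row, Column) pairs where df1 and df2 differ, treating NaN as equal to NaN."""
    # True where values differ under ordinary comparison (NaN cells included).
    differs = df1 != df2
    # True where both frames have a missing value in the same cell.
    both_missing = df1.isna() & df2.isna()
    # A real difference is a mismatch that is not just "missing on both sides".
    real_diff = differs & ~both_missing
    locations = real_diff.stack().reset_index()
    locations.columns = ["Row", "Column", "Different"]
    return locations.loc[locations["Different"], ["Row", "Column"]].reset_index(drop=True)


# Hypothetical sample data: one empty column and one genuine difference.
a = pd.DataFrame({"ID": [1, 2], "Value": [10.0, 20.0], "Error": [np.nan, np.nan]})
b = pd.DataFrame({"ID": [1, 2], "Value": [10.0, 25.0], "Error": [np.nan, np.nan]})

result = diff_locations_nan_aware(a, b)
print(result)
```

For a plain yes/no answer, df1.equals(df2) also treats NaN values in the same location as equal, although it additionally requires the column dtypes to match.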
