pandas 使用DataFrame进行版本控制

dxxyhpgq  于 2023-02-02  发布在  其他
关注(0)|答案(2)|浏览(105)

我试图比较两个 Dataframe ,以检查它们之间有什么变化。这是版本控制脚本的一部分,所以我做了一个简化版本,试图找到一个解决方案:

data = {'ID':  ['1', '2', '3', '4'],
        'Date':  ['23-01-2023', '01-12-1995', '03-07-2013', '05-09-2013'],
        'Time':  ['01:45:08', '02:15:21', '23:57:14', '03:57:15'],
        'Path':  ['//server/test/File1.txt', '//server/test/File2.txt', '//server/test/File3.txt', '//server/test/File4.txt'],
        }
data2 = {'ID':  ['1', '2', '3'],

        'Date':  ['23-01-2023', '03-07-2013', '01-12-1995', '05-09-2013'],
        'Time':  ['01:45:08', '23:57:14', '02:17:21', '03:18:31'],
        'Path':  ['//server/test/File1.txt', '//server/test/File3.txt', '//server/test/File2.txt', '//server/test/File5.txt'],
        }

df = pd.DataFrame(data)
df2 = pd.DataFrame(data2)

因此,我创建了2个 Dataframe ,如下所示:

数据框1

| ID | Date       | Time      | Path                       |
 |  1 | 23-01-2023 | 01:45:08  | //server/test/File1.txt    |
 |  2 | 01-12-1995 | 02:15:21  | //server/test/File2.txt    |
 |  3 | 03-07-2013 | 23:57:14  | //server/test/File3.txt    |
 |  4 | 05-09-2013 | 03:57:15  | //server/test/File4.txt    |

数据框2

| ID | Date       | Time      | Path                       |
 |  1 | 23-01-2023 | 01:45:08  | //server/test/File1.txt    |
 |  2 | 03-07-2013 | 23:57:14  | //server/test/File3.txt    |
 |  3 | 01-12-1995 | 02:17:21  | //server/test/File2.txt    |
 |  4 | 21-11-1991 | 03:18:31  | //server/test/File5.txt    |

以我知道的第一个为参考:

  1. ID为4的文件已被删除
    1.文件2已修改
    1.已添加新文件(表dataframe 2中的ID 4)
    最后,我希望得到以下输出:
| ID | Date       | Time      | Path                       |  Status  |
 |  1 | 23-01-2023 | 01:45:08  | //server/test/File1.txt    |     -    |
 |  2 | 01-12-1995 | 02:15:21  | //server/test/File2.txt    |  UPDATED |
 |  3 | 03-07-2013 | 23:57:14  | //server/test/File3.txt    |     -    |
 |  4 | 05-09-2013 | 03:57:15  | //server/test/File4.txt    |  DELETED |
 |  5 | 21-11-1991 | 03:18:31  | //server/test/File5.txt    |   ADDED  |

可以只使用Pandas的JOIN来完成吗?

nzkunb0c

nzkunb0c1#

如果您打算比较路径,请将其移动到索引中并使用concat

compare = pd.concat([
    df.drop(columns="ID").set_index("Path"),
    df2.drop(columns="ID").set_index("Path")
], keys=["Old", "New"], axis=1)

# I assume `Date` is non-nullable. You can pick any non-nullable
# column to use for the ADDED / DELETED check
compare["Status"] = np.select(
    [
        (compare["Old"] == compare["New"]).all(axis=1),
        compare[("Old", "Date")].isnull(),
        compare[("New", "Date")].isnull(),
    ],
    [
        "-",
        "ADDED",
        "DELETED",
    ],
    "UPDATED",
)
juzqafwq

juzqafwq2#

解决方案包括:

  • Path键上的merging Dataframe ,具有关键indicator参数,以获得指示哪一侧(或两侧)成功合并的标记
  • Status字段Map到预定义Map{'left_only': 'DELETED', 'right_only': 'ADDED'}
  • 通过基本比较两侧的Date/Time查找UPDATED记录
  • 调整列名和空字段
df_ = df.merge(df2, on='Path', how='outer', indicator=True)
status_map = {'left_only': 'DELETED', 'right_only': 'ADDED'}
df_['Status'] = df_['_merge'].map(status_map)
df_['Status'] = np.where(df_['_merge'] == 'both',
                     np.where((df_['Date_y'] != df_['Date_x'])
                              | (df_['Time_y'] != df_['Time_x']),
                              'UPDATED', '-'), df_['Status'])
df_.loc[df_['Date_x'].isnull(), ['Date_x', 'Time_x']] = df_[df_['Date_x'].isnull()][['Date_y', 'Time_y']].values
df_.drop(df_.filter(regex='.*(_y|_merge)$').columns, axis=1, inplace=True)
df_.columns = df_.columns.str.replace(r'_x$', '', regex=True)
df_.loc[df_['ID'].isnull(), 'ID'] = range(int(df_['ID'].max()) + 1, df_.index.size + 1)
print(df_)
ID        Date      Time                     Path   Status
0  1.0  23-01-2023  01:45:08  //server/test/File1.txt        -
1  2.0  01-12-1995  02:15:21  //server/test/File2.txt  UPDATED
2  3.0  03-07-2013  23:57:14  //server/test/File3.txt        -
3  4.0  05-09-2013  03:57:15  //server/test/File4.txt  DELETED
4  5.0  21-11-1991  03:18:31  //server/test/File5.txt    ADDED

相关问题