numpy 为什么在Pandas中没有删除重复项

q9yhzks0  于 2022-11-24  发布在  其他
关注(0)|答案(1)|浏览(180)

对Python和Pandas来说非常新...但问题是我的最终输出文件并没有排除“客户编号”上的任何重复项。任何关于为什么会发生这种情况的建议都将不胜感激!

import pandas as pd
import numpy as np #numpy is the module which can replace errors from huge datasets 
from openpyxl import load_workbook
from openpyxl.styles import Font

df_1 = pd.read_excel('PRT Tracings 2020.xlsx', sheet_name='Export') #this is reading the Excel document shifts and looks at sheet
df_2 = pd.read_excel('PRT Tracings 2021.xlsx', sheet_name='Export') #this reads the same Excel document but looks at a different sheet
df_3 = pd.read_excel('PRT Tracings YTD 2022.xlsx', sheet_name='Export') #this reads a different Excel file, and only has one sheet so no need to have it read a sheet

df_all = pd.concat([df_1, df_2, df_3], sort=False) #this combines the sheets from 1,2,3 and the sort function as false so our columns stay in the same order

to_excel = df_all.to_excel('Combined_PRT_Tracings.xlsx', index=None) #this Excel file combines all three sheets into one spreadsheet

df_all = df_all.replace(np.nan, 'N/A', regex=True) #replaces errors with N/A

remove = ['ORDERNUMBER', 'ORDER_TYPE', 'ORDERDATE', 'Major Code Description', 'Product_Number_And_Desc', 'Qty', 'Order_$', 'Order_List_$'] #this will remove all unwanted columns
df_all.drop(columns=remove, inplace=True)

df_all.drop_duplicates(subset=['Customer Number'], keep=False) #this will remove all duplicates from the tracing number syntax with pandas module

to_excel = df_all.to_excel('Combined_PRT_Tracings.xlsx', index=None) #this Excel file combines all three sheets into one spreadsheet

wb = load_workbook('Combined_PRT_Tracings.xlsx') #we are using this to have openpyxl read the data, from the spreadsheet already created
ws = wb.active #this workbook is active

wb.save('Combined_PRT_Tracings.xlsx')
ippsafx7

ippsafx71#

您应该将df_all.drop_duplicates的传回值指派给变数,或设定inplace=True以覆写DataFrame内容。这是为了防止原始数据发生不想要的变更。
请尝试:

df_all = df_all.drop_duplicates(subset='Customer Number', keep=False)

或等效的:

df_all.drop_duplicates(subset='Customer Number', keep=False, inplace=True)

这将从DataFrame中删除所有重复行。如果要保留包含重复行的第一行或最后一行,请将keep更改为firstlast

相关问题