对Python和Pandas来说非常新...但问题是我的最终输出文件并没有排除“客户编号”上的任何重复项。任何关于为什么会发生这种情况的建议都将不胜感激!
import pandas as pd
import numpy as np #numpy is the module which can replace errors from huge datasets
from openpyxl import load_workbook
from openpyxl.styles import Font
df_1 = pd.read_excel('PRT Tracings 2020.xlsx', sheet_name='Export') #this is reading the Excel document shifts and looks at sheet
df_2 = pd.read_excel('PRT Tracings 2021.xlsx', sheet_name='Export') #this reads the same Excel document but looks at a different sheet
df_3 = pd.read_excel('PRT Tracings YTD 2022.xlsx', sheet_name='Export') #this reads a different Excel file, and only has one sheet so no need to have it read a sheet
df_all = pd.concat([df_1, df_2, df_3], sort=False) #this combines the sheets from 1,2,3 and the sort function as false so our columns stay in the same order
to_excel = df_all.to_excel('Combined_PRT_Tracings.xlsx', index=None) #this Excel file combines all three sheets into one spreadsheet
df_all = df_all.replace(np.nan, 'N/A', regex=True) #replaces errors with N/A
remove = ['ORDERNUMBER', 'ORDER_TYPE', 'ORDERDATE', 'Major Code Description', 'Product_Number_And_Desc', 'Qty', 'Order_$', 'Order_List_$'] #this will remove all unwanted columns
df_all.drop(columns=remove, inplace=True)
df_all.drop_duplicates(subset=['Customer Number'], keep=False) #this will remove all duplicates from the tracing number syntax with pandas module
to_excel = df_all.to_excel('Combined_PRT_Tracings.xlsx', index=None) #this Excel file combines all three sheets into one spreadsheet
wb = load_workbook('Combined_PRT_Tracings.xlsx') #we are using this to have openpyxl read the data, from the spreadsheet already created
ws = wb.active #this workbook is active
wb.save('Combined_PRT_Tracings.xlsx')
1条答案
按热度按时间ippsafx71#
您应该将
df_all.drop_duplicates
的传回值指派给变数,或设定inplace=True
以覆写DataFrame内容。这是为了防止原始数据发生不想要的变更。请尝试:
或等效的:
这将从DataFrame中删除所有重复行。如果要保留包含重复行的第一行或最后一行,请将
keep
更改为first
或last
。