pandas: align concatenated DataFrames in order and replace NaN with a string

Posted by xxls0lw8 on 2023-05-12 in Other

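Below is my code. It collects the mismatched rows of two DataFrames into a single DataFrame and shows them side by side. However, the rows are not aligned by entity: a mismatched record from one DataFrame appears next to an unrelated record from the other.
I derive the mismatches from two DataFrames, raw and bob.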

import pandas as pd

# DataFrame 1
raw = pd.DataFrame({'EntityID': ['Appple', 'Mango', 'Grapes', 'Banana'],
                    'Date': [10, 20, 30, 55], 'Val': [10, 30, 15, 89]})

# DataFrame 2, built from a dictionary whose values are lists
bob = pd.DataFrame({'EntityID': ['Appple', 'Dongo', 'Banana', 'Mango', 'Rea', 'Hpo', 'Pauy'],
                    'Date': [10, 20, 30, 5, 9, 8, 98], 'Val': [100, 200, 15, 21, 33, 44, 44]})

# Rows that are identical in both frames
match = raw.merge(bob, how='inner', indicator=False)

#print(match.to_string(index=False))
# Rows that occur in only one frame, placed side by side by row index
mismatch = (pd.concat([raw, bob], keys=["Mismatch raw", "Mismatch bob"])
                .drop_duplicates(keep=False).unstack(0)
                .swaplevel(axis=1).sort_index(axis=1, ascending=False)
           )
print(mismatch)

This is the output I get:

Mismatch raw                Mismatch bob               
           Val EntityID  Date          Val EntityID  Date
0         10.0   Appple  10.0        100.0   Appple  10.0
1         30.0    Mango  20.0        200.0    Dongo  20.0
2         15.0   Grapes  30.0         15.0   Banana  30.0
3         89.0   Banana  55.0         21.0    Mango   5.0
4          NaN      NaN   NaN         33.0      Rea   9.0
5          NaN      NaN   NaN         44.0      Hpo   8.0
6          NaN      NaN   NaN         44.0     Pauy  98.0
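
A likely reason for the jumbled pairing is that `unstack(0)` lines rows up by their original row index (0, 1, 2, ...), not by `EntityID`. The following is a minimal sketch of that behavior, using trimmed, hypothetical two-row versions of the frames rather than the full data above:

```python
import pandas as pd

# Trimmed, hypothetical versions of the two frames from the question.
raw = pd.DataFrame({"EntityID": ["Mango", "Grapes"], "Val": [30, 15]})
bob = pd.DataFrame({"EntityID": ["Dongo", "Banana"], "Val": [200, 15]})

# concat keeps each frame's own 0..n-1 index as the inner index level.
stacked = pd.concat([raw, bob], keys=["raw", "bob"])

# unstack(0) moves the outer key into the columns, so rows that share
# index label 0 end up on the same line whatever their EntityID is.
side_by_side = stacked.unstack(0)
print(side_by_side["EntityID"])
```

Here row 0 pairs Mango with Dongo simply because both were the first row of their frame, which suggests that pairing by entity needs a join on `EntityID` rather than on position.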

My expected output is:

Mismatch raw                 Mismatch bob               
           Val EntityID   Date          Val EntityID  Date
0         10.0   Appple   10.0        100.0   Appple  10.0
1         30.0    Mango   20.0        200.0    Mango  20.0
2         15.0   Banana  300.0         15.0   Banana  30.0
3         89.0   Grapes   55.0          NaN      NaN   NaN
3          NaN      NaN    NaN         21.0    Dongo   5.0
4          NaN      NaN    NaN         33.0      Rea   9.0
5          NaN      NaN    NaN         44.0      Hpo   8.0
6          NaN      NaN    NaN         44.0     Pauy  98.0

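Also, instead of NaN I would like to show a string such as 'No record found for this entity ID'. As you can see in the expected output, matching entities (Apple/Apple, Banana/Banana, Mango/Mango, and so on) sit side by side, which is user-friendly. In the actual output they are jumbled.
Is there a way to achieve this in pandas?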

p5fdfcr1 #1

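You can try doing this with an outer merge on EntityID, after first copying the EntityID column in each DataFrame under a new name that matches the suffixes used in the merge. Afterwards, restrict the result to the rows where at least one pair of corresponding values differs. For example: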

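mismatch = (
    # Copy EntityID under suffixed names so each side keeps its own copy
    raw.assign(EntityID_raw=raw["EntityID"])
    .merge(bob.assign(EntityID_bob=bob["EntityID"]),
           on="EntityID", how="outer", suffixes=("_raw", "_bob"))
    .drop(columns="EntityID")
    # Keep rows where Date or Val differs (NaN never equals NaN, so one-sided rows stay)
    .loc[lambda df: df["Date_raw"].ne(df["Date_bob"]) | df["Val_raw"].ne(df["Val_bob"])]
    .fillna("No match")
)
# Turn "Date_raw" into ("raw", "Date") to get two-level column headers
cols = (col.split("_") for col in mismatch.columns)
mismatch.columns = pd.MultiIndex.from_tuples((s, p) for p, s in cols)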

Sample result:

raw                           bob                    
       Date       Val  EntityID      Date       Val  EntityID
0      10.0      10.0    Appple      10.0     100.0    Appple
1      20.0      30.0     Mango       5.0      21.0     Mango
2      30.0      15.0    Grapes  No match  No match  No match
3      55.0      89.0    Banana      30.0      15.0    Banana
4  No match  No match  No match      20.0     200.0     Dongo
5  No match  No match  No match       9.0      33.0       Rea
6  No match  No match  No match       8.0      44.0       Hpo
7  No match  No match  No match      98.0      44.0      Pauy
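
For readers who want the exact message from the question and a predictable row order, here is a self-contained sketch of a variant of the approach above. It is not part of the original answer: the `MISSING` constant, the temporary `_rank` column, and the ordering rule (entities present in both frames first, then raw-only, then bob-only) are assumptions made for illustration. The explicit sort is there because the row order produced by an outer merge may differ between pandas versions.

```python
import pandas as pd

raw = pd.DataFrame({"EntityID": ["Appple", "Mango", "Grapes", "Banana"],
                    "Date": [10, 20, 30, 55], "Val": [10, 30, 15, 89]})
bob = pd.DataFrame({"EntityID": ["Appple", "Dongo", "Banana", "Mango", "Rea", "Hpo", "Pauy"],
                    "Date": [10, 20, 30, 5, 9, 8, 98],
                    "Val": [100, 200, 15, 21, 33, 44, 44]})

MISSING = "No record found for this entity ID"  # text requested in the question

# Outer merge on EntityID; indicator=True adds a "_merge" column that says
# whether each row came from both frames, only raw, or only bob.
merged = (
    raw.assign(EntityID_raw=raw["EntityID"])
    .merge(bob.assign(EntityID_bob=bob["EntityID"]),
           on="EntityID", how="outer", suffixes=("_raw", "_bob"), indicator=True)
)

# Keep rows where Date or Val differs; NaN != NaN, so one-sided rows stay.
differs = (merged["Date_raw"].ne(merged["Date_bob"])
           | merged["Val_raw"].ne(merged["Val_bob"]))

# Assumed ordering: present in both, then raw only, then bob only.
rank = merged["_merge"].astype(str).map({"both": 0, "left_only": 1, "right_only": 2})

mismatch = (
    merged[differs]
    .assign(_rank=rank)
    .sort_values(["_rank", "EntityID"])
    .drop(columns=["EntityID", "_merge", "_rank"])
    .reset_index(drop=True)
    .astype(object)          # allow text and numbers in the same column
    .fillna(MISSING)
)

# Turn "Date_raw" into ("raw", "Date") to get two-level column headers.
mismatch.columns = pd.MultiIndex.from_tuples(
    [tuple(reversed(col.split("_"))) for col in mismatch.columns]
)
print(mismatch)
```

Sorting on the merge indicator keeps entities that exist in both frames at the top, which is close to the side-by-side layout the question asks for; drop the `sort_values` step if the merge's own order is acceptable.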
