pandas 在数据框中搜索LOV列,并替换为使用其他数据框的代码

bkhjykvo  于 2023-03-28  发布在  其他
关注(0)|答案(1)|浏览(129)

想象一下,我有一个肮脏的数据框架,里面有员工的ID和每个国家的合同相关信息。
此数据框的某些列是LOV列(取决于国家/地区,某些列仅是一个国家/地区的LOV,其他列是其中的部分或全部),某些LOV列是强制性的,而有些则不是(仅用于了解是否接受空白值)。
我们需要使用另一个Map Dataframe 进行检查:

  • 如果所提供的值存在于Map Dataframe 中,
  • 如果是,则用 Dataframe 上的对应代码替换所提供的值。

如果提供的值不在列表中,则在主 Dataframe 上创建一个名为“Errors”的新列,其中它表示出错的列的名称(如果多于1列,则可以将名称保存在该列的列表中)。
从这个dataframe:

ID  Country Contract      Type
1   CZ      Permanent     BOFF
1   ES      Fixed-term    .
2   CZ      Contractor    Front-Office
3   PT      Permanent   
4   PT      2022-01-01    Employee
4   PT      Fixed-term    Office
4   ES                    Employee
5   SK      Permanent     Employee

使用此Map:

Country   Field       Values          Code      Mandadory
CZ        Contract    Permanent       PE        Yes
CZ        Contract    Fixed-term      FX        Yes
CZ        Contract    Contractor      CT        Yes
ES        Contract    Permanent       PERMA     No
SK        Contract    Permanent       PER-01    Yes
SK        Contract    Fixed-term      FIX-01    Yes
ES        Type        Office          OFF       Yes
CZ        Type        Back-Office     BOFF      Yes 
CZ        Type        Front-Office    FOFF      Yes 
PT        Type        Employee        EMP       No
PT        Type        Front-Office    FRONT     No

会导致这个dataframe:

ID  Country Contract    Type       Errors
1   CZ      PE          BOFF       ['Type']
1   ES      Fixed-term  .          ['Contract','Type']
2   CZ      CT          FOFF    
3   PT      Permanent       
4   PT      2022-01-01  FRONT      ['Type']
4   PT      Fixed-term  Office     ['Type']
4   ES      Employee               ['Contract','Type']
5   SK      PER-01      Employee

非常感谢您的支持!

jtoj6r0c

jtoj6r0c1#

使用您提供的 Dataframe :

import pandas as pd

df = pd.DataFrame(
    {
        "ID": [1, 1, 2, 3, 4, 4, 4, 5],
        "Country": ["CZ", "ES", "CZ", "PT", "PT", "PT", "ES", "SK"],
        "Contract": [
            "Permanent",
            "Fixed-term",
            "Contractor",
            "Permanent",
            "2022-01-01",
            "Fixed-term",
            "",
            "Permanent",
        ],
        "Type": [
            "BOFF",
            ".",
            "Front-Office",
            "",
            "Employee",
            "Office",
            "Employee",
            "Employee",
        ],
    }
)
other = pd.DataFrame(
    {
        "Country": ["CZ", "CZ", "CZ", "ES", "SK", "SK", "ES", "CZ", "CZ", "PT", "PT"],
        "Field": [
            "Contract",
            "Contract",
            "Contract",
            "Contract",
            "Contract",
            "Contract",
            "Type",
            "Type",
            "Type",
            "Type",
            "Type",
        ],
        "Values": [
            "Permanent",
            "Fixed-term",
            "Contractor",
            "Permanent",
            "Permanent",
            "Fixed-term",
            "Office",
            "Back-Office",
            "Front-Office",
            "Employee",
            "Front-Office",
        ],
        "Code": [
            "PE",
            "FX",
            "CT",
            "PERMA",
            "PER-01",
            "FIX-01",
            "OFF",
            "BOFF",
            "FOFF",
            "EMP",
            "FRONT",
        ],
        "Mandadory": [
            "Yes",
            "Yes",
            "Yes",
            "No",
            "Yes",
            "Yes",
            "Yes",
            "Yes",
            "Yes",
            "No",
            "No",
        ],
    }
)

下面是使用Pandas mergeapply执行此操作的一种方法:

# Merge dataframes on relevant columns with filter on "Contract"
new_df = pd.merge(
    left=df,
    right=other.loc[other["Field"] == "Contract", ["Country", "Values", "Code"]],
    how="left",
    left_on=["Country", "Contract"],
    right_on=["Country", "Values"],
)

# Update "Contract" column
new_df["Contract"] = new_df.apply(
    lambda x: x["Code"] if not pd.isna(x["Code"]) else x["Contract"], axis=1
)

# Add new "Errors" column
new_df["Errors"] = new_df.apply(
    lambda x: [] if not pd.isna(x["Code"]) else ["Contract"], axis=1
)
new_df = new_df.drop(columns=["Values", "Code"])

# Merge dataframes on relevant columns with filter on "Type"
new_df = pd.merge(
    left=new_df,
    right=other.loc[other["Field"] == "Type", ["Country", "Values", "Code"]],
    how="left",
    left_on=["Country", "Type"],
    right_on=["Country", "Values"],
)

# Update "Type" column
new_df["Type"] = new_df.apply(
    lambda x: x["Code"] if not pd.isna(x["Code"]) else x["Type"], axis=1
)

# Update "Errors" column
new_df["Errors"] = new_df.apply(
    lambda x: x["Errors"] if not pd.isna(x["Code"]) else x["Errors"] + ["Type"], axis=1
)
new_df = new_df.drop(columns=["Values", "Code"])

然后:

print(new_df)
# Output

   ID Country    Contract      Type            Errors
0   1      CZ          PE      BOFF            [Type]
1   1      ES  Fixed-term         .  [Contract, Type]
2   2      CZ          CT      FOFF                []
3   3      PT   Permanent            [Contract, Type]
4   4      PT  2022-01-01       EMP        [Contract]
5   4      PT  Fixed-term    Office  [Contract, Type]
6   4      ES              Employee  [Contract, Type]
7   5      SK      PER-01  Employee            [Type]

相关问题