如何选择一个Pandas dataframe,并附加一个不会改变结果的条件?

sr4lhrrt  于 2023-04-19  发布在  其他
关注(0)|答案(3)|浏览(142)

我有一个这样的dataframe:
| | 通货|显示符号|菲吉|伊辛|麦克风|shareClassFIGI|符号|类型|
| --------------|--------------|--------------|--------------|--------------|--------------|--------------|--------------|--------------|
| 0|美元|广东广电网络|BBG014HVCMB9|无|XNAS公司||广东广电网络|权益WRT|
| 1|美元|DCHPF|BBG00D8RQQS7|无|OOTC|BBG001SG1ZV8|DCHPF|普通股|
| 二|美元|雷|BBG0142KKR10|无|ARCX|BBG0142KKSD5|雷|ETP|
| 三|美元|TRNIF|BBG01FZS3RZ4|无|OOTC|BBG00M1B7Y30|TRNIF|封闭式基金|
| 四|美元|POTN|BBG000BXHT20|无|OOTC|BBG001S7BTV1|POTN|普通股|
| ……|……|……|……|……|……|……|……|……|
| 28030|美元|Au|BBG000BCMDR8|无|XNYS|BBG001S5NYW0|Au|ADR|
| 28031|美元|ECCC|BBG011DWM8Z1|无|XNYS||ECCC|公众|
| 28032|美元|BZLFF|BBG000C0V028|无|OOTC|BBG001S61C32|BZLFF|普通股|
| 28033|美元|DRIV|BBG00KLHY7D7|无|XNAS公司|BBG00KLHY836|DRIV|ETP|
| 28034|美元|AHAHF|BBG00CSP14G0|无|OOTC|BBG001S9YD10|AHAHF|普通股|
我想通过多个条件选择一些行,如下所示:

dirty_data = df[
    (df['description'] == '')       # condition 1
    | (df['description'] == 'Test') # condition 2
    | (df['shareClassFIGI'] == '')  # condition 3
    | ...
    ]

这个代码安排让我能够注解出一些条件,以便轻松查看:

dirty_data = df[
    (df['description'] == '')
    # | (df['description'] == 'Test')
    # | (df['shareClassFIGI'] == '')
    | ...
    ]

但注解掉第一个条件会引发错误:

dirty_data = df[
    # (df['description'] == '')
    | (df['description'] == 'Test')  # Invalid syntax! There is a `|` before the first condition
    | (df['shareClassFIGI'] == '')
    | ...
    ]

我在所有条件之前添加False,它解决了我的问题:

dirty_data = df[
    False
    | (df['description'] == '')     # Now I can comment out this line without errors
    | (df['description'] == 'Test')
    | (df['shareClassFIGI'] == '')
    | ...
    ]

但有时我想注解掉所有的contendions来显示和检查原始 Dataframe ,但它会引发KeyError

dirty_data = df[
    False                           # KeyError: False
    # | (df['description'] == '')
    # | (df['description'] == 'Test')
    # | (df['shareClassFIGI'] == '')
    # | ...
    ]

dirty_data = df[False]是无效语法。
有没有一个表达式可以替代False,以满足我的需要?

编辑:

我觉得我原来的逻辑代码是错的,没有条件就返回truthy,有条件就返回falsy,这不容易。

cyvaqqii

cyvaqqii1#

恕我直言,注解不应该是代码的任何部分,所以我会尽可能避免做你所要求的事情。如果你必须用不同的条件列表来切片 Dataframe ,为什么不像这样编写一个函数呢:

def slice_with_cond(df: pd.DataFrame, conditions: List[pd.Series]=None) -> pd.DataFrame:
    if not conditions:
        return df

    # or use `np.logical_or.reduce` as in cs95's answer
    agg_conditions = False
    for cond in conditions:
        agg_conditions = agg_conditions | cond
        
    return df[agg_conditions]

然后你可以切片:

# return df
slice_with_cond(df)

# return a slice
slice_with_cond(df, [cond1])
4dc9hkyq

4dc9hkyq2#

虽然我同意@Quang Hoang,这里有一个如何实现你想要的选择:

  • 定义一个长度为len(df.index)True的布尔掩码,因此它将始终选择所有数据
  • 通过&而不是|将所选内容与要应用的其他条件组合

这应该可以确保,每当你删除(或注解掉)除了第一个条件之外的所有条件时,整个数据框都会被选中。
示例:

import pandas as pd

data = {
    'currency': ['USD', 'USD', 'USD', 'USD', 'USD'],
    'displaySymbol': ['GDNRW', 'DCHPF', 'RAYE', 'TRNIF', 'POTN'],
    'figi': ['BBG014HVCMB9', 'BBG00D8RQQS7', 'BBG0142KKR10', 'BBG01FZS3RZ4', 'BBG000BXHT20'],
    'isin': ['None', 'None', 'None', 'None', 'None'],
    'mic': ['XNAS', 'AHA', 'ARCX', 'OOTC', 'OOTC'],
    'shareClassFIGI': ['', 'BBG001SG1ZV8', 'BBG0142KKSD5', 'BBG00M1B7Y30', 'BBG001S7BTV1'],
    'symbol': ['GDNRW', 'DCHPF', 'RAYE', 'TRNIF', 'POTN'],
    'type': ['Equity WRT', 'Common Stock', 'ETP', 'Closed-End Fund', 'Common Stock']
}

df = pd.DataFrame(data)

mask_true = [True] * len(df.index)

选择定义了某些条件的数据:

df[
    (mask_true) # make sure to return all data
    & (         # define remaining conditions
        (df['type'] == 'ETP')
        | (df['mic'] == 'OOTC')
    )
    
]

输出

currency    displaySymbol   figi    isin    mic shareClassFIGI  symbol  type
2   USD RAYE    BBG0142KKR10    None    ARCX    BBG0142KKSD5    RAYE    ETP
3   USD TRNIF   BBG01FZS3RZ4    None    OOTC    BBG00M1B7Y30    TRNIF   Closed-End Fund
4   USD POTN    BBG000BXHT20    None    OOTC    BBG001S7BTV1    POTN    Common Stock

选择除默认布尔掩码外的所有条件都已删除的数据

df[
    (mask_true) # make sure to return all data
    # & (         # define remaining conditions
    #     (df['type'] == 'ETP')
    #     | (df['mic'] == 'OOTC')
    # )
]

输出

currency    displaySymbol   figi    isin    mic shareClassFIGI  symbol  type
0   USD GDNRW   BBG014HVCMB9    None    XNAS        GDNRW   Equity WRT
1   USD DCHPF   BBG00D8RQQS7    None    AHA BBG001SG1ZV8    DCHPF   Common Stock
2   USD RAYE    BBG0142KKR10    None    ARCX    BBG0142KKSD5    RAYE    ETP
3   USD TRNIF   BBG01FZS3RZ4    None    OOTC    BBG00M1B7Y30    TRNIF   Closed-End Fund
4   USD POTN    BBG000BXHT20    None    OOTC    BBG001S7BTV1    POTN    Common Stock

希望这就是你一直在寻找的。

xdnvmnnf

xdnvmnnf3#

我通过创建一个helper函数得到了一个想法:

def hp(df, empty_selection, boolean_indexing):
    if type(boolean_indexing) == pd.core.series.Series:
        return boolean_indexing
    if type(boolean_indexing) == bool:
        return [empty_selection] * len(df)
    raise Exception('boolean_indexing type error')

# Show all data (without any condition)
dirty_data = df[ hp(df, True, False
    # | (df['description'] == '')       # condition 1
    # | (df['description'] == 'Test') # condition 2
    # | (df['shareClassFIGI'] == '')  # condition 3
    # | ...
    )]

# Show the results for condition 3
dirty_data = df[ hp(df, True, False
    # | (df['description'] == '')       # condition 1
    # | (df['description'] == 'Test') # condition 2
    | (df['shareClassFIGI'] == '')  # condition 3
    # | ...
    )]

为了简化代码并提高易用性,可以将该函数添加到DataFrame类中:

def search(df, boolean_indexing):
    if type(boolean_indexing) == pd.core.series.Series:
        return df[boolean_indexing]
    if type(boolean_indexing) == bool:
        return df
    raise Exception('boolean_indexing type error')

# Show all data (without any condition)
r = search(df, False
    # | (df['description'] == '')       # condition 1
    # | (df['description'] == 'Test') # condition 2
    # | (df['shareClassFIGI'] == '')  # condition 3
    )

# Show the results for condition 3
r = search(df, False
    # | (df['description'] == '')       # condition 1
    # | (df['description'] == 'Test') # condition 2
    | (df['shareClassFIGI'] == '')  # condition 3
    )

# You can also bind the function to the class DataFrame
pd.core.frame.DataFrame.search = search

# Show the results for condition 3
r = df.search(False
    # | (df['description'] == '')       # condition 1
    # | (df['description'] == 'Test') # condition 2
    | (df['shareClassFIGI'] == '')  # condition 3
    )

现在代码排列就像一个搜索栏。默认情况下显示所有数据,当有任何条件输入时显示结果。您可以打开/关闭任何条件。您可以通过在代码中添加一行来尝试任何条件,而无需编写样板代码(创建一个Jupyter Notebook单元格,再次写入df[...]),并通过注解将条件保留为搜索历史。以这种方式,你可以专注于条件和结果来理解数据。
有什么建议吗?

相关问题