pandas: Filter a DataFrame based on a list of codes, but each value of the column in question contains a list of many keys

Posted by c6ubokkw on 2023-02-11 in Other

My data (df1) looks like this:

INC_KEY        AISPREDOT
180008916795   "[110402.0, 110602.0, 140651.0, 140694.0, 150402.0, 161002.0]"
180008916796   "[140655.0, 140694.0]"
180008916797   "[853151.0]"
180008916798   "[110402.0, 140652.0, 150202.0]"
180008916799   "[857300.0]"
180008916800   "[650634.0]"
180008916801   "[710402.0, 772430.0, 854362.0, 854456.0, 877131.0]"
180008916802   "[816018.0, 854472.0]"
180008916803   "[442200.0, 442202.0, 450203.0]"
180008916804   "[853151.0]"

where INC_KEY is set as the index. I also have a list of codes:

codes = [110402.0, 854362.0]
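
For anyone who wants to reproduce this, here is a minimal sketch of how df1 could be built. It assumes the double quotes shown above are part of each stored string; the variable name `rows` is only for illustration.

```python
import pandas as pd

# Rows copied from the table above. Assumption: the surrounding double quotes
# are literally stored in each cell, as the display suggests.
rows = {
    180008916795: '"[110402.0, 110602.0, 140651.0, 140694.0, 150402.0, 161002.0]"',
    180008916796: '"[140655.0, 140694.0]"',
    180008916797: '"[853151.0]"',
    180008916798: '"[110402.0, 140652.0, 150202.0]"',
    180008916799: '"[857300.0]"',
    180008916800: '"[650634.0]"',
    180008916801: '"[710402.0, 772430.0, 854362.0, 854456.0, 877131.0]"',
    180008916802: '"[816018.0, 854472.0]"',
    180008916803: '"[442200.0, 442202.0, 450203.0]"',
    180008916804: '"[853151.0]"',
}

# Build a one-column DataFrame indexed by INC_KEY
df1 = pd.Series(rows, name="AISPREDOT").rename_axis("INC_KEY").to_frame()

codes = [110402.0, 854362.0]

# Every cell is a plain string at this point, not a list
print(type(df1.loc[180008916797, "AISPREDOT"]))
```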

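As you can see, each index entry has a list of different codes (AISPREDOT), but that list is stored in the DataFrame as a string. I need to somehow read these strings as lists, then filter df1 to create a new DataFrame df2 that contains only the rows whose list includes at least one of the codes in codes.
So the resulting DataFrame (df2) would look like this: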

INC_KEY        AISPREDOT
180008916795   "[110402.0, 110602.0, 140651.0, 140694.0, 150402.0, 161002.0]"
180008916798   "[110402.0, 140652.0, 150202.0]"
180008916801   "[710402.0, 772430.0, 854362.0, 854456.0, 877131.0]"

How should I go about this?

pandas

Source: https://stackoverflow.com/questions/75377412/filter-dataframe-based-on-a-list-of-codes-but-each-value-of-the-column-in-quest

2 Answers

jdgnovmf1#

This looks like a good use case for a regular expression with str.contains:

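import re

codes = [110402.0, 854362.0]

# Escape each code so that "." matches a literal dot, not any character
pattern = fr"\b(?:{'|'.join(re.escape(str(c)) for c in codes)})\b"
# '\\b(?:110402\\.0|854362\\.0)\\b'

# Keep the rows whose string contains at least one of the codes
out = df.loc[df['AISPREDOT'].str.contains(pattern)]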

Output:

INC_KEY                                                       AISPREDOT
0  180008916795  "[110402.0, 110602.0, 140651.0, 140694.0, 150402.0, 161002.0]"
3  180008916798                                "[110402.0, 140652.0, 150202.0]"
6  180008916801            "[710402.0, 772430.0, 854362.0, 854456.0, 877131.0]"
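
One caveat when a pattern is built from floats: an unescaped `.` matches any character, so a pattern like `110402.0` can also match a longer number. The sketch below compares an unescaped and an escaped pattern on a small frame. The last row (INC_KEY 1, code 11040200.0) is a hypothetical value that does not appear in the question, and the names `loose` and `strict` are only for illustration.

```python
import re

import pandas as pd

df = pd.DataFrame(
    {
        "INC_KEY": [180008916795, 180008916796, 180008916801, 1],
        "AISPREDOT": [
            '"[110402.0, 110602.0]"',
            '"[140655.0, 140694.0]"',
            '"[710402.0, 854362.0]"',
            '"[11040200.0]"',  # hypothetical code, not in the question
        ],
    }
)
codes = [110402.0, 854362.0]

# Unescaped: "." is a wildcard, so "110402.0" also matches "11040200"
unescaped = fr"\b(?:{'|'.join(map(str, codes))})\b"
# Escaped: "." only matches a literal dot
escaped = fr"\b(?:{'|'.join(re.escape(str(c)) for c in codes)})\b"

loose = df.loc[df["AISPREDOT"].str.contains(unescaped)]
strict = df.loc[df["AISPREDOT"].str.contains(escaped)]

print(loose["INC_KEY"].tolist())   # [180008916795, 180008916801, 1]
print(strict["INC_KEY"].tolist())  # [180008916795, 180008916801]
```

With the codes from the question both patterns happen to select the same rows; escaping only matters once the data contains numbers like the hypothetical one above.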

htrmnn0y2#

Use ast.literal_eval to convert the strings to lists, then explode the lists and select the matching rows:

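import ast

import numpy as np

# Parse each string into a list, give every code its own row, and collect the
# index labels of the rows whose code is in `codes`
idx = (df['AISPREDOT'].str.strip('"').map(ast.literal_eval).explode()
                      .isin(codes).loc[lambda x: x].index)
out = df.loc[np.unique(idx)]
print(out)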

# Output
                                                      AISPREDOT
INC_KEY                                                        
180008916795  "[110402.0, 110602.0, 140651.0, 140694.0, 1504...
180008916798                   "[110402.0, 140652.0, 150202.0]"
180008916801  "[710402.0, 772430.0, 854362.0, 854456.0, 8771...
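
To make the chain easier to follow, here is a sketch that splits it into named steps on two rows from the question. The intermediate names (`lists`, `exploded`, `hits`) are only for illustration.

```python
import ast

import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"AISPREDOT": ['"[110402.0, 140652.0, 150202.0]"', '"[857300.0]"']},
    index=pd.Index([180008916798, 180008916799], name="INC_KEY"),
)
codes = [110402.0, 854362.0]

# 1. Strip the literal quotes and parse each string into a Python list
lists = df["AISPREDOT"].str.strip('"').map(ast.literal_eval)

# 2. One row per code; the INC_KEY label is repeated for each code in a list
exploded = lists.explode()

# 3. True where the individual code is one of the wanted codes
hits = exploded.isin(codes)

# 4. Keep only the True entries and take their (possibly repeated) labels
idx = hits.loc[lambda x: x].index

# 5. Deduplicate the labels and select those rows from the original frame
out = df.loc[np.unique(idx)]

print(exploded.index.tolist())
print(out.index.tolist())  # [180008916798]
```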

You can also make the conversion persistent:

df['AISPREDOT'] = df['AISPREDOT'].str.strip('"').map(ast.literal_eval)
idx = df['AISPREDOT'].explode().isin(codes).loc[lambda x: x].index
out = df.loc[np.unique(idx)]
print(out)

# Output
                                                      AISPREDOT
INC_KEY                                                        
180008916795  [110402.0, 110602.0, 140651.0, 140694.0, 15040...
180008916798                     [110402.0, 140652.0, 150202.0]
180008916801  [710402.0, 772430.0, 854362.0, 854456.0, 87713...
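
As a possible alternative that skips explode and np.unique, each cell can be parsed and checked against a set of codes, which yields a boolean mask aligned with the original index. This is a different technique from the one above, shown only as a sketch; `wanted` and `has_wanted_code` are hypothetical names, and the quotes are again assumed to be part of each string.

```python
import ast

import pandas as pd

# A few rows copied from the question
df = pd.Series(
    {
        180008916796: '"[140655.0, 140694.0]"',
        180008916797: '"[853151.0]"',
        180008916798: '"[110402.0, 140652.0, 150202.0]"',
        180008916799: '"[857300.0]"',
        180008916801: '"[710402.0, 772430.0, 854362.0, 854456.0, 877131.0]"',
    },
    name="AISPREDOT",
).rename_axis("INC_KEY").to_frame()

codes = [110402.0, 854362.0]
wanted = set(codes)


def has_wanted_code(cell: str) -> bool:
    """Return True if the stringified list in `cell` shares a code with `wanted`."""
    values = ast.literal_eval(cell.strip('"'))
    return not wanted.isdisjoint(values)


# Boolean mask with one entry per row, so it can be used directly in .loc
mask = df["AISPREDOT"].map(has_wanted_code)
out = df.loc[mask]

print(out.index.tolist())  # [180008916798, 180008916801]
```

Because the mask has exactly one entry per row, this also behaves predictably if INC_KEY contains duplicate labels.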
