pandas: Pythonic (performant) way to mark parent-child based on interval

Posted by bkhjykvo on 2022-12-09 in Python


I have the following sample DataFrame:

import pandas as pd

df = pd.DataFrame([[1, 12], [4, 9], [6, 7], [10, 11]],
     index=['A', 'B', 'C', 'D'],
     columns=['start', 'end'])
     
print(df)

Based on start and end (time intervals expressed in ns), I want to label each child row with its parent, for example:

   start  end parent
A      1   12      A  # can be NA
B      4    9      A
C      6    7      B
D     10   11      A

So far, I have come up with this, which is O(N^2):

df.sort_values(by='start', inplace=True, ascending=False)
df["parent"] = ['A', 'A', 'A', 'A']

# Rows are sorted by start descending, so the first enclosing row found
# is the innermost parent.
for row in df.itertuples():
    for ref_row in df.itertuples():
        if row.start > ref_row.start and row.end < ref_row.end:
            df.loc[row.Index, "parent"] = ref_row.Index
            break

df.sort_values(by='start', inplace=True)        
print(df)

This works, but it is obviously inefficient. Please suggest an efficient solution, perhaps one that uses intervals.

Thanks.

pandas

Source: https://stackoverflow.com/questions/74694982/pythonic-performant-way-to-mark-parent-child-based-on-interval

1 answer


zed5wv101#

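Try a list comprehension, which is usually several times faster than explicit loops; a single pass is enough.
On each iteration, the my_func function is called to check the condition. It uses positional iloc indexing, with the row position on the left and the column position on the right. If there are several matches, the first one is taken with aaa[0].
Note that the row C 6 7 matches both 'A' and 'B'. Because the frame is sorted by start in descending order, 'B' comes first, so the innermost parent is selected. If nothing is found, the function returns 'A'. Finally, the resulting list is assigned to the 'parent' column.
In your example, the second loop is not needed. In my example, you can drop .index, and the result will be the filtered rows themselves.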

df.sort_values(by='start', inplace=True, ascending=False)
df["parent"] = ['A', 'A', 'A', 'A']

def my_func(x):
    aaa = df[(df.iloc[x, 0] > df['start']) & (df.iloc[x, 1] < df['end'])].index
    if len(aaa) > 0:
        aaa = aaa[0]#take only the first value if there are several
    else:
        aaa = 'A'#if there is nothing then return 'A'

    return aaa

df['parent'] = [my_func(i) for i in range(len(df))]

df.sort_values(by='start', inplace=True)

print(df)
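
To make the "first match" step concrete, the filter can be run for a single row. The sketch below rebuilds the sample frame from the question and inspects the candidates for row C only; the names start_c, end_c and candidates are illustrative and do not appear in the answer.

```python
import pandas as pd

df = pd.DataFrame([[1, 12], [4, 9], [6, 7], [10, 11]],
                  index=['A', 'B', 'C', 'D'],
                  columns=['start', 'end'])

# Same ordering as in the answer: largest start first.
df = df.sort_values(by='start', ascending=False)

# Rows whose interval strictly contains C's interval (6, 7).
start_c, end_c = df.loc['C', 'start'], df.loc['C', 'end']
candidates = df[(start_c > df['start']) & (end_c < df['end'])].index

print(list(candidates))  # ['B', 'A']
print(candidates[0])     # B
```

Both A and B enclose C, but the descending sort puts B ahead of A, which is why taking element 0 yields the innermost parent.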

A variant without the function:

df.sort_values(by='start', inplace=True, ascending=False)

df["parent"] = [df[(df.iloc[i, 0] > df['start']) &
                   (df.iloc[i, 1] < df['end'])].index for i in range(len(df))]

df["parent"] = df['parent'].str[0]
df.fillna(value='A', inplace=True)#fill in 'A' where there were no matches

df.sort_values(by='start', inplace=True)
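
Both variants still compare every row against every other row, so the work stays quadratic in the number of rows. If that becomes a bottleneck, one possible alternative (not part of the answer) is a sort followed by a single sweep with a stack. The sketch below assumes the intervals are either disjoint or strictly nested and that all endpoints are distinct; mark_parents is a hypothetical helper name, and roots are left as NA rather than 'A'.

```python
import pandas as pd

def mark_parents(frame):
    """Hypothetical helper: label each row with its innermost enclosing row.

    Assumes intervals are disjoint or strictly nested, with distinct endpoints.
    """
    # Visit intervals by increasing start, so parents come before children.
    ordered = frame.sort_values(by='start')
    parents = {}
    stack = []  # (label, end) of intervals that may still enclose later ones
    for label, end in zip(ordered.index, ordered['end']):
        # Drop intervals that end before this one ends: they cannot contain it.
        while stack and stack[-1][1] < end:
            stack.pop()
        # The remaining top of the stack, if any, is the innermost parent.
        parents[label] = stack[-1][0] if stack else pd.NA
        stack.append((label, end))
    return pd.Series(parents).reindex(frame.index)

df = pd.DataFrame([[1, 12], [4, 9], [6, 7], [10, 11]],
                  index=['A', 'B', 'C', 'D'],
                  columns=['start', 'end'])
df['parent'] = mark_parents(df)
print(df)
```

Each row is pushed and popped at most once, so after the O(N log N) sort the sweep itself is linear. Partially overlapping intervals or ties in start would need extra handling that this sketch does not attempt.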

To measure the computation time, put this before the computation:

import datetime

now = datetime.datetime.now()

And use the following lines at the end:

time_ = datetime.datetime.now() - now
print('elapsed time', time_)
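
For very short snippets, a single datetime difference can be noisy. A minimal sketch of an alternative, assuming the code to be measured is wrapped in a zero-argument callable: time.perf_counter() from the standard library is a monotonic, high-resolution clock intended for timing, and repeating the call and keeping the best run reduces noise. time_call is a hypothetical helper name, and the lambda is only a stand-in workload.

```python
import time

def time_call(func, repeats=5):
    """Hypothetical helper: best wall-clock time of func() over several runs."""
    best = float('inf')
    for _ in range(repeats):
        begin = time.perf_counter()
        func()
        best = min(best, time.perf_counter() - begin)
    return best

# Stand-in workload; replace the lambda with the code being compared.
elapsed = time_call(lambda: sum(range(100_000)))
print(f'elapsed time: {elapsed:.6f} s')
```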


Answered 2022-12-09
