Pandas: create a new column containing the most recent index that satisfies a condition relative to the current row

tquggr8v posted 9 months ago in Other

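In the example below, I want to return, for each row, the most recent earlier index where "lower" is >= the current row's "upper". The code below gives the expected result, but it is not truly vectorized and is inefficient for larger DataFrames.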

import pandas as pd

# Sample DataFrame
data = {'lower': [7, 1, 6, 1, 1, 1, 1, 11, 1, 1],
        'upper': [2, 3, 4, 5, 6, 7, 8, 9, 10, 11]}

df = pd.DataFrame(data=data)

df['DATE'] = pd.date_range('2020-01-01', periods=len(data['lower']))
df['DATE'] = pd.to_datetime(df['DATE'])
df.set_index('DATE', inplace=True)

# new columns that contains the most recent index of previous rows, where the previous "lower" is greater than or equal to the current "upper"
def get_most_recent_index(row):
    previous_indices = df.loc[:row.name - pd.Timedelta(minutes=1)]  
    recent_index = previous_indices[previous_indices['lower'] >= row['upper']].index.max()
    return recent_index

df['prev'] = df.apply(get_most_recent_index, axis=1) 

print(df)

How can I rewrite this to be as efficient as possible?

Answer #1 (zaqlnxep)

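I'm not sure this can be vectorized (because each result depends on past state). But you can try using binary search to speed up the computation, for example: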

from bisect import bisect_left

def get_prev(lower, upper, _date):
    uniq_lower = sorted(set(lower))
    last_seen = {}

    for l, u, d in zip(lower, upper, _date):
        # find index of element that is >= u
        idx = bisect_left(uniq_lower, u)

        max_date = None
        for lv in uniq_lower[idx:]:
            if lv in last_seen:
                if max_date is None:
                    max_date = last_seen[lv]
                elif last_seen[lv] > max_date:
                    max_date = last_seen[lv]
        yield max_date
        last_seen[l] = d

df["prev_new"] = list(get_prev(df["lower"], df["upper"], df.index))
print(df)

Prints:

            lower  upper       prev   prev_new
DATE                                          
2020-01-01      7      2        NaT        NaT
2020-01-02      1      3 2020-01-01 2020-01-01
2020-01-03      6      4 2020-01-01 2020-01-01
2020-01-04      1      5 2020-01-03 2020-01-03
2020-01-05      1      6 2020-01-03 2020-01-03
2020-01-06      1      7 2020-01-01 2020-01-01
2020-01-07      1      8        NaT        NaT
2020-01-08     11      9        NaT        NaT
2020-01-09      1     10 2020-01-08 2020-01-08
2020-01-10      1     11 2020-01-08 2020-01-08
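The binary-search idea can be pushed a little further. The following is only a sketch, not the answerer's method: it keeps a stack of candidate rows with strictly decreasing `lower` values, so each lookup is a single `bisect` call instead of a scan over all larger values. The function name `prev_index_monotonic` and the plain-list demo data are illustrative assumptions, and the negation trick assumes numeric values.

```python
from bisect import bisect_right
from datetime import date, timedelta

def prev_index_monotonic(lower, upper, labels):
    """For each row, return the label of the most recent earlier row whose
    lower >= the current upper, or None if there is no such row."""
    neg_lowers = []   # negated lower values of the candidates, ascending
    cand_labels = []  # labels of the same candidates, oldest first
    out = []
    for l, u, lab in zip(lower, upper, labels):
        # Candidates are stored with strictly decreasing lower values, so the
        # ones with lower >= u form a prefix; the last of them is the newest.
        k = bisect_right(neg_lowers, -u)
        out.append(cand_labels[k - 1] if k else None)
        # An older candidate with lower <= l can never win again, because the
        # current row is newer and satisfies every bound the older one does.
        while neg_lowers and -neg_lowers[-1] <= l:
            neg_lowers.pop()
            cand_labels.pop()
        neg_lowers.append(-l)
        cand_labels.append(lab)
    return out

# Same sample values as in the question, with plain dates as labels
lower = [7, 1, 6, 1, 1, 1, 1, 11, 1, 1]
upper = [2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
days = [date(2020, 1, 1) + timedelta(days=i) for i in range(len(lower))]
result = prev_index_monotonic(lower, upper, days)
```

Each row is pushed and popped at most once, so the loop should cost roughly O(n log n) overall. With the DataFrame from the question, the call would presumably look like `df["prev_new"] = prev_index_monotonic(df["lower"], df["upper"], df.index)`.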

Answer #2 (nbysray5)

Here is an alternative answer that gives slightly different results.

import pandas as pd

# Sample DataFrame
data = {'lower': [7, 1, 6, 1, 1, 1, 1, 11, 1, 1],
        'upper': [2, 3, 4, 5, 6, 7, 8, 9, 10, 11]}

df = pd.DataFrame(data=data)

df['DATE'] = pd.date_range('2020-01-01', periods=len(data['lower']))

# keep DATE only on rows where lower >= upper (same row), NaT elsewhere
df['prev'] = df['DATE'].where(df['lower'] >= df['upper'])
# shift so each row sees only earlier rows, then carry the last date forward
# (.ffill() replaces the deprecated fillna(method='ffill'))
df['prev'] = df['prev'].shift(1).ffill()

print(df)

   lower  upper       DATE       prev
0      7      2 2020-01-01        NaT
1      1      3 2020-01-02 2020-01-01
2      6      4 2020-01-03 2020-01-01
3      1      5 2020-01-04 2020-01-03
4      1      6 2020-01-05 2020-01-03
5      1      7 2020-01-06 2020-01-03
6      1      8 2020-01-07 2020-01-03
7     11      9 2020-01-08 2020-01-03
8      1     10 2020-01-09 2020-01-08
9      1     11 2020-01-10 2020-01-08

I'm not sure why the expected output has NaT for the two dates in the middle. My solution does not produce NaT in those places.
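One likely explanation for the difference: the question compares the current row's `upper` against `lower` values from earlier rows, while the code in this answer compares `lower` and `upper` within the same row. The sketch below (plain lists, with the hypothetical names `running_max` and `no_match_rows`) checks which rows have no earlier `lower` that reaches their `upper`.

```python
from itertools import accumulate

# Same sample values as in the question
lower = [7, 1, 6, 1, 1, 1, 1, 11, 1, 1]
upper = [2, 3, 4, 5, 6, 7, 8, 9, 10, 11]

# running_max[i] is the largest lower value among rows 0..i
running_max = list(accumulate(lower, max))

# Row i can only have a match if the largest earlier lower reaches upper[i];
# row 0 has no earlier rows at all.
has_match = [False] + [running_max[i - 1] >= upper[i]
                       for i in range(1, len(upper))]

# Positions where the question's definition yields no match (NaT)
no_match_rows = [i for i, ok in enumerate(has_match) if not ok]
print(no_match_rows)
```

Positions 6 and 7 (2020-01-07 and 2020-01-08) have `upper` values of 8 and 9, and the largest `lower` seen before them is 7, so under the question's definition no earlier row qualifies. In pandas the same check could presumably be written as `df['lower'].shift().cummax() >= df['upper']`.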

Answer #3 (x6yk4ghg)

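As I understand it, looping over Python objects such as lists and dicts is faster than looping over pandas rows (I may be wrong). So here is what I tried; it works for your input df: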

# assumes DATE is a regular column of df (not the index)
date_list = df["DATE"].tolist()
lower_list = df["lower"].tolist()
upper_list = df["upper"].tolist()
new_list = []
for i, y in enumerate(upper_list):
    # walk backwards over the earlier rows and stop at the first lower >= y
    for ll, dl in zip(reversed(lower_list[:i]), reversed(date_list[:i])):
        if ll >= y:
            new_list.append(dl)
            break
    else:
        # no earlier row qualifies (this also covers the first row)
        new_list.append(None)
df['prev'] = new_list
df['prev'] = pd.to_datetime(df['prev'])


Answer #4 (jslywgbw)

You can use a range join to fetch the matches efficiently; conditional_join from pyjanitor handles this. Please share your performance tests if you can.

# pip install pyjanitor
import pandas as pd
import janitor

# this assumes DATE is a regular column and df has the default integer index
# (if DATE is the index, call df.reset_index() first and
# set DATE back as the index after the operation)
left_df = df.assign(index_prev=df.index)
right_df = df.assign(index_next=df.index)
out=(left_df
    .conditional_join(
        right_df, 
        ('lower','upper','>='), 
        ('index_prev','index_next','<'), 
        df_columns='index_prev', 
        right_columns=['index_next','lower','upper'])
    )
# based on the matches, we may have multiple returns
# what we need is the closest to the current row
closest=out.index_next-out.index_prev
grouper=[out.index_next, out.lower,out.upper]
min_closest=closest.groupby(grouper).transform('min')
closest=closest==min_closest
# we now have our matches, identified by `index_prev`
# use index_prev to get the relevant DATE
prev=out.loc[closest,'index_prev']
prev=df.loc[prev,'DATE'].array # avoid index alignment here
index_next=out.loc[closest,'index_next']
# now assign back to df, based on index_next and prev
prev=pd.Series(prev,index=index_next)
df.assign(prev=prev)

   lower  upper       DATE       prev
0      7      2 2020-01-01        NaT
1      1      3 2020-01-02 2020-01-01
2      6      4 2020-01-03 2020-01-01
3      1      5 2020-01-04 2020-01-03
4      1      6 2020-01-05 2020-01-03
5      1      7 2020-01-06 2020-01-01
6      1      8 2020-01-07        NaT
7     11      9 2020-01-08        NaT
8      1     10 2020-01-09 2020-01-08
9      1     11 2020-01-10 2020-01-08
