pandas: a faster way to run a for loop over a very large list of DataFrames

Posted by i86rm4rw on 2022-12-16 in Other


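I am using two nested for loops to compute a value from every combination of elements in a list of DataFrames. The list contains a large number of DataFrames, and running the two loops takes a considerable amount of time.
Is there a way to do this faster?
The functions I refer to by dummy names are the ones in which I calculate the results.
My code looks like this: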

conf_list = []

for tr in range(len(trajectories)):
    df_1 = trajectories[tr]

    if len(df_1) == 0:
        continue

    for tt in range(len(trajectories)):
        df_2 = trajectories[tt]

        if len(df_2) == 0:
            continue

        # Skip identical frames and pairs whose time ranges do not overlap
        if df_1.equals(df_2) or df_1['time'].iloc[0] > df_2['time'].iloc[-1] or df_2['time'].iloc[0] > df_1['time'].iloc[-1]:
            continue

        df_temp = cartesian_product_basic(df_1, df_2)

        flg, df_temp = another_function(df_temp)

        if flg == 0:
            continue

        flg_h = some_other_function(df_temp)

        if flg_h == 1:
            conf_list.append(1)

My input list consists of about 5000 DataFrames, each with a few hundred rows, that look like this:
| id | x | y | z | time |
| --- | --- | --- | --- | --- |
| 1 | 5 | 7 | 2 | 5 |
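What I do is take the Cartesian product of each pair of DataFrames and compute another value, c, for every pair. If c satisfies a condition, I append an element to conf_list, so that at the end I have the number of pairs that meet the requirement.
For further information:
a_function(df_1, df_2), which appears as cartesian_product_basic in the code above, is the function that returns the Cartesian product of two DataFrames.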
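The body of that helper is not shown in the question. A minimal sketch of what it might look like, assuming pandas 1.2 or later (for `how="cross"`) and assuming the default `_x` / `_y` suffixes that the later code relies on; the sample frames are made up for illustration:

```python
import pandas as pd

def cartesian_product_basic(df_1: pd.DataFrame, df_2: pd.DataFrame) -> pd.DataFrame:
    # Hypothetical stand-in for the asker's helper: pair every row of df_1
    # with every row of df_2. Columns present in both frames receive the
    # default "_x" / "_y" suffixes (x_x, x_y, time_x, time_y, ...).
    return df_1.merge(df_2, how="cross")

# Illustrative data only, shaped like the table in the question
df_a = pd.DataFrame({"id": [1, 1], "x": [5, 6], "y": [7, 8], "z": [2, 3], "time": [5, 6]})
df_b = pd.DataFrame({"id": [2, 2, 2], "x": [1, 2, 3], "y": [4, 5, 6], "z": [7, 8, 9], "time": [5, 6, 7]})

pairs = cartesian_product_basic(df_a, df_b)
```

With a few hundred rows per frame, each product has tens of thousands of rows, which is part of why the nested loop is slow.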
another_function looks like this:

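import numpy as np

def another_function(df_temp):
    # Keep only row pairs recorded at the same time and store their vertical separation
    df_temp['z_dif'] = np.where(df_temp['time_x'] == df_temp['time_y'],
                                abs(df_temp['z_x'] - df_temp['z_y']), np.nan)
    df_temp = df_temp.dropna()

    # Drop pairs whose vertical separation is 1000 or more
    df_temp['vert_conf'] = np.where(df_temp['z_dif'] >= 1000, np.nan, 1)
    df_temp = df_temp.dropna()

    # flg is 0 when no candidate rows remain, otherwise 1
    flg = 0 if len(df_temp) == 0 else 1

    return flg, df_temp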

And some_other_function looks like this:

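def some_other_function(df_temp):
    # As posted these are products, although the column names suggest differences
    df_temp['x_dif'] = df_temp['x_x'] * df_temp['x_y']
    df_temp['y_dif'] = df_temp['y_x'] * df_temp['y_y']
    df_temp['hor_dif'] = np.hypot(df_temp['x_dif'], df_temp['y_dif'])

    df_temp['conf'] = np.where(df_temp['hor_dif'] <= 5, 1, np.nan)

    # Default to 0 so that flg_h is always defined before it is returned
    flg_h = 1 if df_temp['conf'].sum() > 0 else 0

    return flg_h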

pandas

Source: https://stackoverflow.com/questions/71276796/faster-way-to-run-a-for-loop-for-a-very-large-dataframe-list

1 Answer


jk9hmnmh #1

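Here are some ways to make the code run faster:

Use list comprehensions instead of for loops.
Use built-in functions such as map, filter and sum; they are implemented in C and are usually faster than an equivalent hand-written loop.
Avoid repeated attribute lookups with the dot operator (.) in hot code, for example: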

import datetime

a = datetime.datetime.now()  # avoid this: two attribute lookups on every call

timenow = datetime.datetime.now  # bind the method to a name once
a = timenow()  # use this: a single name lookup per call
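As an illustration of the first two tips, here is a small sketch applied to a list shaped like the one in the question. The sample frames and the names `non_empty` and `early_count` are made up for the example; they are not from the original post.

```python
import pandas as pd

# Illustrative stand-in for the asker's list of DataFrames
trajectories = [
    pd.DataFrame({"time": [1, 2, 3]}),
    pd.DataFrame({"time": []}),
    pd.DataFrame({"time": [10, 11]}),
]

# List comprehension: drop empty frames once, instead of re-checking
# len(df) == 0 inside both loops.
non_empty = [df for df in trajectories if len(df) > 0]

# Built-in sum over a generator: count the frames that start at or before
# time 5 without building an intermediate list.
early_count = sum(1 for df in non_empty if df["time"].iloc[0] <= 5)
```

Filtering once up front also shortens the list that the nested loops have to traverse.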

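Use libraries implemented in C/C++, such as numpy.
Do not convert between data types unnecessarily.
In infinite loops, use 1 instead of True (this only helped in Python 2; in Python 3, while 1 and while True compile to the same bytecode).
Use the built-in libraries.
If the data does not change, store it in a tuple.
Concatenate strings efficiently, for example with str.join rather than repeated + in a loop.
Use multiple assignment (a, b = 1, 2).
Use generators.
When testing a boolean in an if-else, avoid the explicit comparison operator: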

a = 1

# Instead of this approach
if a == 1:
    print('a is 1')
else:
    print('a is 0')

# Try this approach (equivalent only when a is always 0 or 1)
if a:
    print('a is 1')
else:
    print('a is 0')

# This saves the time that was spent comparing the two values.
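For the workload in the question, the numpy tip is likely to matter most. The rough sketch below rests on several assumptions. Because another_function keeps only rows with equal timestamps, an inner merge on time can stand in for the full Cartesian product. The names `has_conflict` and `count_conflicts` are hypothetical. The horizontal distance is computed from differences in x and y, whereas the posted code multiplies. Each unordered pair is visited once and self-pairs are skipped by index rather than with `equals`, so the count is not directly comparable to the length of the posted conf_list.

```python
from itertools import combinations

import numpy as np
import pandas as pd

def has_conflict(df_1, df_2, z_limit=1000, hor_limit=5):
    # Inner join on time keeps only rows with matching timestamps
    merged = df_1.merge(df_2, on="time", suffixes=("_x", "_y"))
    if merged.empty:
        return False
    z_dif = np.abs(merged["z_x"].to_numpy() - merged["z_y"].to_numpy())
    # Assumption: horizontal distance from coordinate differences
    hor_dif = np.hypot(merged["x_x"].to_numpy() - merged["x_y"].to_numpy(),
                       merged["y_x"].to_numpy() - merged["y_y"].to_numpy())
    return bool(np.any((z_dif < z_limit) & (hor_dif <= hor_limit)))

def count_conflicts(trajectories):
    frames = [df for df in trajectories if len(df) > 0]
    # Cache first/last timestamps once (frames assumed sorted by time)
    spans = [(df["time"].iloc[0], df["time"].iloc[-1]) for df in frames]
    count = 0
    for i, j in combinations(range(len(frames)), 2):
        (start_i, end_i), (start_j, end_j) = spans[i], spans[j]
        if start_i > end_j or start_j > end_i:
            continue  # time ranges do not overlap
        if has_conflict(frames[i], frames[j]):
            count += 1
    return count

# Illustrative data only
traj_a = pd.DataFrame({"id": 1, "x": [0, 1, 2], "y": [0, 0, 0], "z": [1000, 1000, 1000], "time": [1, 2, 3]})
traj_b = pd.DataFrame({"id": 2, "x": [100, 1, 50], "y": [0, 3, 0], "z": [1200, 1500, 5000], "time": [1, 2, 3]})
traj_c = pd.DataFrame({"id": 3, "x": [0, 0], "y": [0, 0], "z": [1000, 1000], "time": [10, 11]})
traj_empty = pd.DataFrame(columns=["id", "x", "y", "z", "time"])

n_conflicts = count_conflicts([traj_a, traj_b, traj_c, traj_empty])
```

With about 5000 frames there are still roughly 12.5 million pairs to consider, so the cheap time-range check before any merge is what keeps most iterations inexpensive.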

