pandas: Is there a way to run this task in parallel mode so that it is faster?

Posted by qgelzfjb on 2023-02-07 in Other


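Here is my code. I am applying this function to a large dataset (37k rows) and would like to run it on multiple threads, or find some other way to make it faster. I have already tried the Spark and Dask libraries, but I ran into errors I could not resolve. Any ideas would be much appreciated.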

import time

import cv2
import matplotlib.pyplot as plt
import numpy as np
import torch

# get_caption, device and dg are defined elsewhere in the asker's code.
def caption_from_image_file(x):
    # Caption every image loaded from one row's `images` entry.
    return [str(get_caption(i, device)) for i in x.load()]

df = dg.getData("train")

df_test = df  # note: this is an alias of df, not a copy

# start timer
start_time = time.time()

df_test['captions'] = df_test.images.apply(caption_from_image_file)

# end timer (in minutes)
print("--- %s minutes ---" % ((time.time() - start_time) / 60))

df_test.to_csv('test.csv', index=False)

# free up CUDA memory
torch.cuda.empty_cache()

df_test.captions

pandas

Source: https://stackoverflow.com/questions/75347988/is-there-a-way-to-run-this-task-in-a-parallel-mode-so-that-it-is-faster

1 Answer


8dtrkrch1#

Welcome to the Stack Overflow community. Please take @Jérome's comment into account. I see that you have written your own function:

def caption_from_image_file(x):
    return [str(get_caption(i,device)) for i in x.load()]

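With this approach you are defeating any parallel processing, because the for loop has to walk through every array in the resulting list one at a time.
Note that this is the part of the code you need to work on. Unfortunately, parallelization has not yet been implemented in pandas. I suggest you look at this thread, which we have had open since 2013: https://github.com/pandas-dev/pandas/issues/5751
I also suggest reading the Python threading documentation to help you rework your function: https://docs.python.org/3/library/threading.html
This link may also help: multithreading for data from dataframe pandas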
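As a rough illustration of the threading suggestion above, here is a minimal sketch that replaces `Series.apply` with a thread pool from the standard library's `concurrent.futures`. The names `fake_get_caption`, `caption_row`, `threaded_apply` and `df_demo` are hypothetical stand-ins and are not part of the original code; the real `get_caption(i, device)` and `x.load()` calls are replaced by a dummy so the example runs on its own. Threads generally help only when the per-row work is I/O-bound or releases the GIL, so any speed-up with a real captioning model is not guaranteed.

```python
from concurrent.futures import ThreadPoolExecutor

import pandas as pd


def fake_get_caption(image):
    # Hypothetical stand-in for the asker's get_caption(i, device).
    return f"caption for {image}"


def caption_row(images):
    # Same shape as caption_from_image_file, but takes an already
    # loaded list of images instead of calling x.load().
    return [str(fake_get_caption(i)) for i in images]


def threaded_apply(series, func, max_workers=4):
    # Run func on every element of the Series in a thread pool.
    # Executor.map returns results in input order, so the original
    # index can be reattached directly.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(func, series))
    return pd.Series(results, index=series.index)


# Tiny demo frame; the asker's frame comes from dg.getData("train").
df_demo = pd.DataFrame({"images": [["a.png", "b.png"], ["c.png"], []]})
df_demo["captions"] = threaded_apply(df_demo["images"], caption_row)
```

In the asker's code this would correspond to something like `threaded_apply(df_test.images, caption_from_image_file)`, assuming `get_caption` can safely be called from several threads at once.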

Answered on 2023-02-07
