pandas: What is the fastest way to read a CSV file, sort the data, then write the sorted data to another CSV?

Posted by 2g32fytz on 2023-09-29 in Other


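I have a huge dataset, about 600 GB, spread across several CSV files. Each CSV file contains about 1.3 million rows by 17 columns of data. It looks like this: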

index        duration  is_buy_order       issued        location_id  min_volume  order_id        price   range  system_id  type_id  volume_remain  volume_total  region_id    http_last_modified  station_id  constellation_id universe_id
0              90          True  2021-05-04T23:31:50Z     60014437           1  5980151223         5.05  region   30000001       18         249003        250000   10000001  2021-06-19T16:45:32Z  60014437.0          20000001         eve
1              90          True  2021-04-29T07:40:27Z     60012145           1  5884280397         5.01  region   30000082       18          13120        100000   10000001  2021-06-19T16:45:32Z  60012145.0          20000012         eve
2              90         False  2021-04-28T11:46:09Z     60013867           1  5986716666     12500.00  region   30000019       19            728           728   10000001  2021-06-19T16:45:32Z  60013867.0          20000003         eve
3              90         False  2021-05-22T14:13:15Z     60013867           1  6005466300      6000.00  region   30000019       19           5560          9191   10000001  2021-06-19T16:45:32Z  60013867.0          20000003         eve
4              90         False  2021-05-27T08:14:29Z     60013867           1  6008912593      5999.00  region   30000019       19              1             1   10000001  2021-06-19T16:45:32Z

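Right now I load it into a DataFrame. I run it through some logic that filters out all the rows for the particular "region_id" I'm looking for, then put them into an empty DataFrame. Something like this: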

path = pathlib.Path('somePath')
data = pd.read_csv(path)
region_index = data.columns.get_loc('region_id')

newData = pd.DataFrame(columns=data.columns)

for row in data.values:
  if row[region_index] == region.THE_FORGE.value:
    
    newData.loc[len(newData)] = row.tolist()
  
newData.to_csv(newCSVName, index=False)

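However, this takes about 74 minutes to get through a single file... and I need to do this for over 600 GB worth of files...
So, as the title says, what is the fastest way I can/should do this, in a way that I can apply iteratively to all the CSVs? I have considered using async, but I'm not sure whether that is the best approach.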

pandas

Source: https://stackoverflow.com/questions/77146829/what-is-the-fastest-way-to-read-a-csv-file-sort-the-data-then-write-the-sorted-d

4 Answers

pgccezyw1#

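pandas provides optimized C-based functions that operate on whole tables using native data types. When you iterate over rows, look at individual values, and convert things to lists, pandas has to convert its native data types into Python objects, which can be slow. When you assign a new row, pandas has to copy the table you have built so far, and that gets more and more expensive as the table grows.
It looks like you can filter the DataFrame by a known region_id and write straight to CSV: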

path = pathlib.Path('somePath')
data = pd.read_csv(path)
data[data['region_id'] == region.THE_FORGE.value].to_csv(newCSVName, index=False)
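
If holding a whole file in memory is a concern, the same vectorized filter can be applied one chunk at a time. This is only a minimal sketch and not part of the original answer: the function name `filter_csv_in_chunks` and the default chunk size are assumptions made for illustration, and the region id values you pass in are whatever your data uses.

```python
import pandas as pd

def filter_csv_in_chunks(src, dst, region_id, chunksize=200_000):
    """Stream src through pandas in chunks and write rows matching region_id to dst.

    Hypothetical helper: name and chunksize default are illustrative assumptions.
    """
    first = True
    for chunk in pd.read_csv(src, chunksize=chunksize):
        # Vectorized boolean mask on the current chunk only
        matched = chunk[chunk["region_id"] == region_id]
        # First chunk creates the file and writes the header; later chunks append
        matched.to_csv(dst, mode="w" if first else "a", header=first, index=False)
        first = False
```

Only one chunk is in memory at a time, so peak memory depends on `chunksize` rather than on the file size.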

2023-09-29

cngwdvgl2#

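So, I would probably stick with the csv module. But if you want to use pandas, you need to use the vectorized operations it was designed for; otherwise you are being very inefficient. In particular, the way you build newData takes quadratic time. All you actually need is a simple filtering operation: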

data = pd.read_csv(path)
data[data['region_id'] == region.THE_FORGE.value].to_csv(newCSVName, index=False)
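
To make the filtering step concrete, here is a small sketch on made-up data (the three rows and the region id values are illustrative, not taken from the real files). It shows the boolean mask that the comparison produces, and, for contrast, a row-by-row loop that collects matches in a list and builds the result once instead of growing a DataFrame.

```python
import pandas as pd

# Tiny illustrative frame; the values are invented for the example
data = pd.DataFrame({
    "order_id": [5980151223, 5884280397, 5986716666],
    "region_id": [10000001, 10000002, 10000001],
})

# Vectorized: the comparison yields one boolean per row
mask = data["region_id"] == 10000001
filtered = data[mask]  # keeps only the rows where mask is True

# Row-by-row alternative, for comparison only: append to a plain list,
# then build the DataFrame once, which avoids copying the table on every insert
rows = [row for row in data.itertuples(index=False) if row.region_id == 10000001]
looped = pd.DataFrame(rows, columns=data.columns)
```

Both approaches select the same rows; the mask version does the comparison inside pandas rather than in a Python loop.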

2023-09-29

8ehkhllq3#

20 seconds is already pretty good.
Here is my solution, which may be a bit faster, since you want to process 600 GB of data:

def csv_filter_function(input_path, output_path, filter_value):
    filter_value = int(filter_value)
    with open(input_path, 'r', encoding='utf-8') as input_file:
        # The first line is the header; find the position of the region_id column.
        # Strip the line ending so the last column name still matches.
        line = input_file.readline()
        target_index = line.rstrip("\r\n").split(",").index('region_id')
        with open(output_path, 'w', encoding='utf-8') as output_file:
            output_file.write(line)
            # Copy only matching rows; a plain split assumes no field contains a comma.
            while line := input_file.readline():
                if int(line.split(",")[target_index]) == filter_value:
                    output_file.write(line)

csv_filter_function('file.csv', 'out.csv', region.THE_FORGE.value)

2023-09-29

falq053o4#

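@juanpa.arrivillaga suggested the csv module, which is another good option. Its loop runs in Python and is slower, but it does not read the whole file into memory, which is an advantage of its own. Here is a second solution that reads and writes row by row: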

import csv
import pathlib

path = pathlib.Path('somePath')
region_id = str(region.THE_FORGE.value)

with open(path, newline="") as infile, open(newCSVName, "w", newline="") as outfile:
    reader = csv.reader(infile)
    writer = csv.writer(outfile)
    header = next(reader)
    index = header.index('region_id')
    writer.writerow(header)
    writer.writerows(row for row in reader
        if row[index] == region_id)
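
Since the question asks about doing this over all the CSV files, the row-by-row approach above could be wrapped in a function and run over a directory. This is a rough sketch under assumptions: the names `filter_csv` and `filter_directory`, the `*.csv` glob pattern, and the idea of writing results to a separate output directory under the same file names are all illustrative and not from the original answer.

```python
import csv
import pathlib

def filter_csv(src, dst, region_id, column="region_id"):
    """Copy the header and the rows whose `column` equals region_id from src to dst."""
    region_id = str(region_id)  # csv.reader yields strings, so compare as text
    with open(src, newline="") as infile, open(dst, "w", newline="") as outfile:
        reader = csv.reader(infile)
        writer = csv.writer(outfile)
        header = next(reader)
        index = header.index(column)
        writer.writerow(header)
        writer.writerows(row for row in reader if row[index] == region_id)

def filter_directory(in_dir, out_dir, region_id):
    """Hypothetical driver: filter every *.csv in in_dir into out_dir (must differ)."""
    out_dir = pathlib.Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    for src in sorted(pathlib.Path(in_dir).glob("*.csv")):
        filter_csv(src, out_dir / src.name, region_id)
```

Each file is processed independently and streamed, so memory use stays flat no matter how many files there are.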

2023-09-29
