Python pandas: reading a large CSV in chunks

rn0zuynd  posted 11 months ago in Python

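I am trying to optimize my code for reading large CSV files.
I have seen on several websites that "chunksize" can be used with pandas for this.
This is the code I use to read the CSV file: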

data = pd.read_csv(zf.open(f), skiprows=[0, 1, 2, 3, 4, 5], header=None, low_memory=False)
for _, df in data.groupby(data[0].eq("No.").cumsum()):
    dfs = []
    df = pd.DataFrame(df.iloc[1:].values, columns=df.iloc[0].fillna(99))
    dfs.append(df.rename_axis(columns=None))
    date_pattern = '%Y/%m/%d %H:%M'
    df['epoch'] = df.apply(lambda row: int(time.mktime(time.strptime(row.time, date_pattern))), axis=1)  # create epoch as a column
    for each_column in list(df.columns)[2:-1]:
        # other lines using each_column ...
        ...

I tried the code below with chunksize, but I get an error.

data = pd.read_csv(zf.open(f), skiprows=[0, 1, 2, 3, 4, 5], header=None, low_memory=False, chunksize=1000)
for _, df in data.groupby(data[0].eq("No.").cumsum()):
    dfs = []
    df = pd.DataFrame(df.iloc[1:].values, columns=df.iloc[0].fillna(99))
    dfs.append(df.rename_axis(columns=None))
    date_pattern = '%Y/%m/%d %H:%M'
    df['epoch'] = df.apply(lambda row: int(time.mktime(time.strptime(row.time, date_pattern))), axis=1)  # create epoch as a column
    for each_column in list(df.columns)[2:-1]:
        ...  # other lines using each_column


Error:

on 0: Process Process-22:9:
Traceback (most recent call last):
  File "/usr/lib64/python3.6/multiprocessing/process.py", line 258, in _bootstrap
          self.run()
  File "/usr/lib64/python3.6/multiprocessing/process.py", line 93, in run
          self._target(*self._args, **self._kwargs)
  File "/opt/import2grafana/libs/parsecsv.py", line 236, in readcsv_multi_thread
          for _, df in data.groupby(data[0].eq("No.").cumsum()):
AttributeError: 'TextFileReader' object has no attribute 'groupby'


Is it possible to use chunksize together with groupby?
Any help is much appreciated.

xytpbqjk  #1

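With chunksize, read_csv returns a TextFileReader instead of a DataFrame, and it has to be iterated to get the data:

for df in data:
    ...

Each df is a DataFrame, so DataFrame methods should be applied to df. For example: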

data = pd.read_csv(zf.open(f), skiprows=[0, 1, 2, 3, 4, 5], header=None, low_memory=False, chunksize=1000)
for df in data:
    # df is a DataFrame here, so groupby is available
    groups = df.groupby(df[0].eq("No.").cumsum())
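
To see the difference without the real file, here is a minimal sketch on an in-memory CSV. The sample text, the chunk size of 3 and the names `csv_text`, `chunk_shapes` and `group_counts` are made up for illustration; only the `"No."` marker in column 0 comes from the question.

```python
import io

import pandas as pd

# Hypothetical stand-in for the real file: two blocks, each starting with a "No." row.
csv_text = (
    "No.,time,a\n"
    "1,2023/01/01 00:00,10\n"
    "2,2023/01/01 00:01,11\n"
    "No.,time,b\n"
    "1,2023/01/01 00:02,20\n"
)

# With chunksize set, read_csv hands back an iterator rather than a DataFrame.
reader = pd.read_csv(io.StringIO(csv_text), header=None, chunksize=3)
print(type(reader).__name__)  # TextFileReader

chunk_shapes = []
group_counts = []
for chunk in reader:
    # Each chunk is an ordinary DataFrame, so groupby works on it.
    chunk_shapes.append(chunk.shape)
    keys = chunk[0].eq("No.").cumsum()
    group_counts.append(chunk.groupby(keys).ngroups)

print(chunk_shapes, group_counts)
```

The reader itself has no `groupby` attribute, which is exactly what the AttributeError in the question reports. Only the chunks it yields do.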

wswtfjt7  #2

You can't call groupby on data, because with chunksize it is a TextFileReader rather than a DataFrame. Try the following:

dfs = []
for chunk in data:
    # group within each chunk, using the chunk's own first column
    for _, subdf in chunk.groupby(chunk[0].eq("No.").cumsum()):
        # do stuff here
        dfs.append(subdf)
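
For the "do stuff here" part, the question builds the epoch column with a row-wise apply, which can be slow on large groups. A vectorised version could look roughly like the sketch below. The sample frame is invented; the column name `time` and the date format are taken from the question. Note the assumption that the timestamps are UTC: `time.mktime` in the question interprets them in the machine's local time zone, so the numbers only match on a machine running in UTC.

```python
import pandas as pd

# Hypothetical sub-frame standing in for one group; "time" and the format come from the question.
subdf = pd.DataFrame(
    {"time": ["1970/01/01 00:01", "2023/01/01 00:00"], "value": [1.0, 2.0]}
)

# Parse the whole column at once instead of calling strptime per row.
parsed = pd.to_datetime(subdf["time"], format="%Y/%m/%d %H:%M")

# Assumption: timestamps are UTC. Seconds since the Unix epoch as integers.
subdf["epoch"] = (parsed - pd.Timestamp("1970-01-01")) // pd.Timedelta(seconds=1)

print(subdf["epoch"].tolist())  # [60, 1672531200]
```

If the timestamps are really local times, they would need to be localized to that zone before the subtraction.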


4zcjmb1e  #3

I found the right way to do it:

# Set your desired chunk size here. chunksize is a number of rows, so a value
# larger than the file's row count yields the whole file as one single chunk.
chunk_size = 100000000000
data_reader = pd.read_csv(zf.open(f), skiprows=[0, 1, 2, 3, 4, 5], header=None, low_memory=False, chunksize=chunk_size)
for data_chunk in data_reader:
    # each data_chunk is a DataFrame, so groupby works here
    for _, df in data_chunk.groupby(data_chunk[0].eq("No.").cumsum()):
        dfs = []
        df = pd.DataFrame(df.iloc[1:].values, columns=df.iloc[0].fillna(99))
        dfs.append(df.rename_axis(columns=None))
        date_pattern = '%Y/%m/%d %H:%M'
        df['epoch'] = df.apply(lambda row: int(time.mktime(time.strptime(row.time, date_pattern))), axis=1)
        for each_column in list(df.columns)[2:-1]:
            ...  # some stuff lines
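
One caveat with a smaller chunk size: a block that starts with a "No." row can be cut in two at a chunk boundary, and the second half then has no header row. Below is a minimal sketch of one way to handle that, by holding back the last (possibly incomplete) group of each chunk and prepending it to the next one. The helper name `iter_blocks`, the sample data and `dtype=str` are assumptions for illustration and are not part of the original code.

```python
import io

import pandas as pd


def iter_blocks(reader, marker="No."):
    """Yield one DataFrame per marker-delimited block, even if it spans chunks."""
    pending = None  # rows of a block that may continue in the next chunk
    for chunk in reader:
        if pending is not None:
            chunk = pd.concat([pending, chunk], ignore_index=True)
        keys = chunk[0].eq(marker).cumsum()
        groups = [group for _, group in chunk.groupby(keys)]
        # Every group except the last is complete and can be handed out.
        for block in groups[:-1]:
            yield block
        # The last group might continue in the next chunk, so keep it back.
        pending = groups[-1]
    if pending is not None:
        yield pending


# Hypothetical sample: the first block (4 rows) is split across chunks of 2 rows.
csv_text = (
    "No.,time,a\n"
    "1,2023/01/01 00:00,10\n"
    "2,2023/01/01 00:01,11\n"
    "3,2023/01/01 00:02,12\n"
    "No.,time,b\n"
    "1,2023/01/01 00:03,20\n"
)
# dtype=str keeps column types identical across chunks (an assumption of this sketch).
reader = pd.read_csv(io.StringIO(csv_text), header=None, dtype=str, chunksize=2)

blocks = list(iter_blocks(reader))
block_lengths = [len(block) for block in blocks]

# Same header promotion as in the code above: first row becomes the column names.
tables = [pd.DataFrame(b.iloc[1:].values, columns=b.iloc[0].tolist()) for b in blocks]
print(block_lengths)  # [4, 2]
```

Memory use is then bounded by roughly one chunk plus the largest block, instead of the whole file. Rows that appear before the first "No." row would come out as a block without a header, so that case would need its own handling.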

Thanks for your help.
