Arguments too long exception to delete 10k files - Python - HDFS

Posted by sg3maiej on 2022-12-09 in HDFS


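I have 15k files with different dates in an HDFS folder. I am trying to delete the files older than 10 days, which leaves 10k files to delete from HDFS. When I try to delete these 10k files in one go by passing them as a list, I get an error. Could you help me solve this? I tried using find and xargs, but it accepts a single file, not multiple files.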

delete_cmd = 'hadoop fs -find /test/folder/* -name ' +filepath+ ' | xargs hadoop fs -rm -r -skipTrash'

OSError: [Errno 7] Argument list too long: '/bin/sh'

import subprocess
from datetime import datetime

folder = '/test/folder/*'
no_of_days = 10
now_dt = datetime.now()

def list_files():
    # List the folder and keep only the lines that contain a path
    hdfs_cmd = "hdfs dfs -ls -r " + folder + " | grep / "
    hdfs_output = subprocess.getoutput(hdfs_cmd)
    return hdfs_output

def delete_cmd(filepath):
    # Run a single hdfs rm command (through the shell) and print its output
    cmd = 'hdfs dfs -rm -r -skipTrash ' + filepath
    delete_output = subprocess.getoutput(cmd)
    print(delete_output)

def delete_olderfiles():
    hdfs_outputfun = list_files()
    delete_list = []
    # The path is the last space-separated field of each listing line
    filepath_list = [line.split(' ')[-1] for line in hdfs_outputfun.splitlines()]
    for filepath in filepath_list:
        filename = filepath.split('/')[-1]
        # File names are expected to look like 20210901.csv
        filename_dt = datetime.strptime(filename, '%Y%m%d.csv')
        diff_days = (now_dt - filename_dt).days
        if diff_days > no_of_days:
            delete_list.append(filepath)
    # All paths are joined into one command line; with ~10k paths this raises the error
    string = " "
    delete_folders = string.join(delete_list)
    delete_cmd(delete_folders)

delete_olderfiles()

The resulting delete command: hdfs dfs -rm -r -skipTrash 20210901.csv 20210801.csv 20210903.csv ... 10k files


Source: https://stackoverflow.com/questions/69403842/arguments-too-long-exception-to-delete-10k-files-python-hdfs

1 Answer

mwngjboj #1

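I think the problem is that the command line can only accept so much data as arguments, as explained here: https://unix.stackexchange.com/questions/45583/argument-list-too-long-how-do-i-deal-with-it-without-changing-my-command
On Linux, the maximum amount of space for command arguments is 1/4 of the available stack space. So the solution is to increase the amount of space available for the stack.
Short version: run something like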

ulimit -s 65536
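
To see which limits apply on a given machine, something like the following could be run first. This is only a minimal sketch, assuming a Linux (or other POSIX) system where Python's `resource` module is available; the helper name `arg_limits` is made up for illustration.

```python
import os
import resource

def arg_limits():
    """Return (ARG_MAX in bytes, soft stack limit in bytes or None if unlimited).

    arg_limits is a hypothetical helper name, not part of any library.
    """
    # Total space the system allows for arguments plus environment
    arg_max = os.sysconf("SC_ARG_MAX")
    # Soft stack limit, i.e. the value that `ulimit -s` reports (in bytes here)
    soft_stack, _hard_stack = resource.getrlimit(resource.RLIMIT_STACK)
    if soft_stack == resource.RLIM_INFINITY:
        soft_stack = None
    return arg_max, soft_stack

if __name__ == "__main__":
    arg_max, soft_stack = arg_limits()
    print("ARG_MAX (bytes):", arg_max)
    print("soft stack limit (bytes):", soft_stack)
```

Comparing the two numbers before and after changing `ulimit -s` shows whether the larger stack limit actually took effect in the shell that launches the Python script.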

Another solution is to call the rm command once per file:

for item in delete_list:
    delete_cmd(item)

Put this code at the end of delete_olderfiles(), instead of calling delete_cmd() on the joined list of all files to delete.
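
A possible middle ground between one call per file and one call for everything is to delete in batches. The sketch below is an assumption-laden illustration, not part of the original answer: the helper names (`chunked`, `build_delete_commands`, `delete_in_batches`) and the batch size of 500 are arbitrary choices. It also passes the arguments as a list to `subprocess.run` rather than as one string through `/bin/sh`; as far as I know, Linux additionally caps the size of a single argument (about 128 KiB on typical systems), and with `subprocess.getoutput` the whole command string is one argument to `/bin/sh -c`, so raising the stack limit alone may not be enough.

```python
import subprocess

def chunked(items, size):
    """Yield successive lists of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

def build_delete_commands(paths, batch_size=500):
    """Build one argument list per batch (hypothetical helper).

    batch_size=500 is an arbitrary assumption; pick a value that keeps
    each command comfortably below the argument-size limit.
    """
    base = ["hdfs", "dfs", "-rm", "-r", "-skipTrash"]
    return [base + batch for batch in chunked(paths, batch_size)]

def delete_in_batches(paths, batch_size=500):
    """Run one `hdfs dfs -rm` per batch, without a shell, and print its output."""
    for cmd in build_delete_commands(paths, batch_size):
        result = subprocess.run(cmd, capture_output=True, text=True)
        print(result.stdout + result.stderr)
```

In delete_olderfiles(), this would mean calling delete_in_batches(delete_list) instead of joining the list and calling delete_cmd(). With 10k files and a batch size of 500, that is 20 invocations of hdfs rather than 10k.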

2022-12-09
