HDFS: Get the filesystem directory sizes using Hive SQL

Posted by qybjjes1 on 2023-03-16 in HDFS


I load filesystem information into Hive every day, and I just want to get the sizes of all directories.
I have a table like this:

Path                   Size               Date
/                        0             01-07-2021
/tmp                     0             01-07-2021
/tmp/file1               2             01-07-2021
/tmp/file2               2             01-07-2021
/tmp/dir1                0             01-07-2021
/tmp/dir1/file3          3             01-07-2021
/opt/                    0             01-07-2021
/opt/file1               2             01-07-2021
/opt/dir1                0             01-07-2021
/opt/dir1/file2          3             01-07-2021
/opt/dir2/               0             01-07-2021
/opt/dir2/file3          4             01-07-2021
...
...
...
/                        0             02-07-2021
/tmp                     0             02-07-2021
/tmp/file1               2             02-07-2021
/tmp/file2               2             02-07-2021
/tmp/dir1                0             02-07-2021
/tmp/dir1/file3          3             02-07-2021
/opt/                    0             02-07-2021
/opt/file1               2             02-07-2021
/opt/dir1                0             02-07-2021
/opt/dir1/file2          3             02-07-2021
/opt/dir2/               0             02-07-2021
/opt/dir2/file3          4             02-07-2021

I would like a query that outputs a result like this, or that creates a new table like this:

Path                   Size               Date
/                        16            01-07-2021
/tmp                     7             01-07-2021
/tmp/dir1                3             01-07-2021
/opt                     9             01-07-2021
/opt/dir1                3             01-07-2021
/opt/dir2                4             01-07-2021
...
...
...
/                        16            02-07-2021
/tmp                     7             02-07-2021
/tmp/dir1                3             02-07-2021
/opt                     9             02-07-2021
/opt/dir1                3             02-07-2021
/opt/dir2                4             02-07-2021

I am new to SQL, so please help me. Thank you.

hdfs

Source: https://stackoverflow.com/questions/68204800/get-the-filesystem-directory-sizes-using-hive-sql

2 Answers


nimxete2 1#

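MatBaille has the right idea. The idea is to multiply each row, once for each level of its directory path, and then aggregate. I think a safer way is to use substring_index(), which behaves like the one in MySQL. This does require generating a series of numbers, which is done here with the split(space()) trick: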

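-- `date` is quoted because DATE is a reserved word in recent Hive versions.
-- Note: the root prefix comes out as '' and every file also appears under its own path.
select substring_index(t.path, '/', pe.i + 1) as path,
       t.`date`,
       sum(t.size) as total_size
from (select t.*,
             -- number of '/' characters in the path
             length(path) - length(replace(path, '/', '')) as slashes
      from t
     ) t lateral view
     -- split(space(n), ' ') yields n + 1 elements, so pe.i runs from 0 to slashes
     posexplode(split(space(t.slashes), ' ')) pe as i, x
group by substring_index(t.path, '/', pe.i + 1), t.`date`;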
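To make "multiply each row for each level, then aggregate" concrete, here is a minimal Python sketch of the same logic. It is only a model for checking the idea on a few rows, not Hive code; the `substring_index` helper is a hypothetical stand-in that mimics the SQL function for positive counts, and the rows are copied from the question's sample table.

```python
from collections import defaultdict

def substring_index(s, delim, count):
    """Stand-in for SQL substring_index with a positive count:
    returns the text before the count-th occurrence of delim."""
    return delim.join(s.split(delim)[:count])

# (path, size, date) rows taken from the question's sample table
rows = [
    ("/tmp/file1", 2, "01-07-2021"),
    ("/tmp/file2", 2, "01-07-2021"),
    ("/tmp/dir1/file3", 3, "01-07-2021"),
]

totals = defaultdict(int)
for path, size, date in rows:
    depth = 1 + path.count("/")       # one level per '/' plus the root
    for i in range(1, depth + 1):     # one copy of the row per level
        totals[(substring_index(path, "/", i), date)] += size

for (prefix, date), size in sorted(totals.items()):
    print(repr(prefix), size, date)
```

In this sketch the root shows up as the empty string and every file also appears under its own full path, so a real query would probably need to map '' to '/' and filter out the file rows.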

Answered 2023-03-16

wxclj1h5 2#

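I'm no HiveQL maven, but from what I've read it doesn't support recursive queries very well, so here is a brute-force approach.
It splits each file path into its parts and then re-joins them, so that each file is duplicated once for every one of its parent directories, and then it aggregates all files that share a parent directory.
Hopefully someone can give a better answer.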

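-- `date` is quoted because DATE is a reserved word in recent Hive versions.
select sub_path, sum(size) as size, `date`
from
(
  -- one row per (path, level): sub_path is the path truncated at level a_pos
  -- collect_list does not formally guarantee element order, so check the result on your data
  select path, size, `date`, concat_ws('/', collect_list(b_element)) as sub_path
  from YourTable
  lateral view outer posexplode(split(path, '/')) a as a_pos, a_element
  lateral view outer posexplode(split(path, '/')) b as b_pos, b_element
  where b_pos <= a_pos
  group by path, size, `date`, a_pos
)
  sub_paths
group by sub_path, `date`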

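I recommend testing the subquery on its own against a small dummy data set (just a few files in a few directories). That will show what it is doing and help you debug any typos.
It also assumes there are no messy directory or file names containing escaped slashes. This code will not recognize that a slash has been escaped; it will treat it as a normal slash that starts a new level in the directory structure.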
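As a small dummy-data check of the split-and-rejoin idea, the following Python sketch (a model of the logic, not Hive code) runs over the question's rows for 01-07-2021. It rests on two assumptions that are not part of the answer above: empty path pieces are dropped, so `/opt/` and `/opt` count as the same directory, and each row is credited only to its proper ancestors, so files do not show up as directories. The `ancestors` helper is a hypothetical name used only for illustration.

```python
from collections import defaultdict

def ancestors(path):
    """Return every proper ancestor directory of path, root first."""
    # Dropping empty pieces ignores the leading '/' and any trailing '/'.
    parts = [p for p in path.split("/") if p]
    return ["/" + "/".join(parts[:n]) for n in range(len(parts))]

# (path, size, date) rows from the question's sample table
rows = [
    ("/", 0, "01-07-2021"),
    ("/tmp", 0, "01-07-2021"),
    ("/tmp/file1", 2, "01-07-2021"),
    ("/tmp/file2", 2, "01-07-2021"),
    ("/tmp/dir1", 0, "01-07-2021"),
    ("/tmp/dir1/file3", 3, "01-07-2021"),
    ("/opt/", 0, "01-07-2021"),
    ("/opt/file1", 2, "01-07-2021"),
    ("/opt/dir1", 0, "01-07-2021"),
    ("/opt/dir1/file2", 3, "01-07-2021"),
    ("/opt/dir2/", 0, "01-07-2021"),
    ("/opt/dir2/file3", 4, "01-07-2021"),
]

totals = defaultdict(int)
for path, size, date in rows:
    for directory in ancestors(path):   # duplicate the row per parent directory
        totals[(directory, date)] += size

for (directory, date), size in sorted(totals.items()):
    print(directory, size, date)
```

Under these assumptions the totals match the table the question asks for. A directory with nothing inside it would not appear, because it is nobody's ancestor.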

Answered 2023-03-16
