HDFS: How to keep Dataproc YARN nm-local-dir size manageable

4zcjmb1e  posted on 2022-12-09 in HDFS

I am running a Spark job on a GCP Dataproc cluster configured with 1 master, 2 primary workers (each with 4 local SSDs used for shuffling) and N secondary workers (without any SSD).
My job processes the data in daily batches, so I expect the temporary data (shuffles, checkpoints, etc.) to grow while one day is being processed and to be cleaned up before the next day starts.
However, when I run the job over many days (15-20 days), it eventually fails with an error about YARN not having enough disk space (I cannot remember the exact error message).
The Dataproc GCP console shows a graph in which the HDFS capacity decreases monotonically, while the other metrics cycle up and down with each batch start/stop.

I logged in to one of the primary workers to investigate the SSD usage, as I remember that the error message referred to /mnt/{1,2,3,4}, which are the mount points of the SSDs.

$ df -h
[...]
/dev/sdb        369G  178G  172G  51% /mnt/1
/dev/sdc        369G  178G  173G  51% /mnt/2
/dev/sdd        369G  179G  171G  52% /mnt/3
/dev/sde        369G  174G  176G  50% /mnt/4
[...]

And the disk usage keeps increasing (it was at 43% before I wrote this post). Digging further led me to the following directory:

$ pwd
/mnt/1/hadoop/yarn/nm-local-dir/application_1622441105718_0001

$ du -sh .
178G    .

$ ls
00  03  06  09  0c  0f  12  15  18  1b  1e  21  24  27  2a  2d  30  33  36  39  3c  3f
01  04  07  0a  0d  10  13  16  19  1c  1f  22  25  28  2b  2e  31  34  37  3a  3d
02  05  08  0b  0e  11  14  17  1a  1d  20  23  26  29  2c  2f  32  35  38  3b  3e

All these folders contain files named shuffle_NN_NNNN_0.{index,data}. A lot of these files: 38577 currently.
I suppose that these files are temporary data, but why are they not deleted after each batch? Can I delete them manually without breaking my job (with something like find . -type f -mmin +120 -delete to delete all files older than 120 minutes; my batches are about 60 minutes long)? Is there a good way to manage these files?

Edit

Actually, I tested deleting old files with something like:

for I in $(seq 1 4); do
( cd /mnt/${I}/hadoop/yarn/nm-local-dir/application_* && sudo find . -type f ! -newerat 2021-05-31T12:00:00 -delete )
done

My job is still running and did not seem to notice anything, but disk usage dropped to 16% (instead of 54%). This is a manual solution; I am still searching for a better one.
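Before deleting anything, it may be safer to preview which files a given age threshold would match. The following is a minimal sketch, assuming GNU find and the shuffle_* file naming seen above; the function name list_stale_shuffle_files is hypothetical and not part of Dataproc or YARN.

```shell
#!/usr/bin/env bash
# Sketch: list (without deleting) shuffle files not modified for more than N minutes.
# list_stale_shuffle_files is a hypothetical helper name.
list_stale_shuffle_files() {
  local dir="$1"       # e.g. /mnt/1/hadoop/yarn/nm-local-dir
  local minutes="$2"   # age threshold in minutes
  # -mmin +N matches files whose data was last modified more than N minutes ago
  find "$dir" -type f -name 'shuffle_*' -mmin +"$minutes" -print
}
```

For example, list_stale_shuffle_files /mnt/1/hadoop/yarn/nm-local-dir 120 | wc -l would count the candidates on the first SSD without touching them.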

Edit 2

Some more details following the answer of @Ben Sidhom.
I use Spark in batch mode. The "nominal" use case is to process the data of the previous day, each morning. So, each day D, I launch a Spark job that reads the data of day D-1, transforms it and saves the resulting dataset. This process takes roughly 1 hour and, in this case, I don't notice any data leak.
However, sometimes we need to do some catch-up. For example, I implement a new transformation job and need to apply it to each day of the previous year (to populate some historical data). In this case, instead of launching 365 separate Spark jobs, I launch 1 which processes each day in sequence. Now the job runs much longer, and data leaks happen. After about 15 hours (so after processing 15 days of data) the job fails with "no space left on device".
Disabling EFM does not seem to be a good solution, because it is precisely on this kind of long-running job that it brings the most value, preventing the job from failing because a preemptible node was lost.
So for the moment, I will stick with the manual solution. Note that the delete command must be run on all primary workers.
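Running the same delete command on every primary worker could be scripted from a workstation. This is only a sketch under stated assumptions: the cluster name, zone and worker count are hypothetical placeholders, the primary workers are assumed to be named <cluster>-w-<index>, and gcloud compute ssh with its --zone and --command flags is assumed to be available and authenticated. By default it only prints the commands it would run.

```shell
#!/usr/bin/env bash
# Sketch: build the cleanup command once and run it on every primary worker.
# CLUSTER, ZONE and NUM_PRIMARY are hypothetical placeholders; adapt them.
CLUSTER="${CLUSTER:-my-cluster}"
ZONE="${ZONE:-europe-west1-b}"
NUM_PRIMARY="${NUM_PRIMARY:-2}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the gcloud commands, do not run them

# Print the remote command: delete files not accessed for more than $1 minutes
# in the application directories of the 4 SSDs (same paths as in the question).
build_cleanup_cmd() {
  local minutes="$1"
  echo "for I in 1 2 3 4; do sudo find /mnt/\${I}/hadoop/yarn/nm-local-dir/application_* -type f -amin +${minutes} -delete; done"
}

run_on_primary_workers() {
  local minutes="$1" i
  for i in $(seq 0 $((NUM_PRIMARY - 1))); do
    # Assumption: primary workers are named <cluster>-w-<index>
    if [ "$DRY_RUN" = "1" ]; then
      echo gcloud compute ssh "${CLUSTER}-w-${i}" --zone "$ZONE" --command "$(build_cleanup_cmd "$minutes")"
    else
      gcloud compute ssh "${CLUSTER}-w-${i}" --zone "$ZONE" --command "$(build_cleanup_cmd "$minutes")"
    fi
  done
}
```

Calling run_on_primary_workers 120 prints one gcloud line per worker; setting DRY_RUN=0 would actually execute them.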

Edit 3

One more edit to present my "production grade" solution.
After creating a cluster, connect via ssh to each of your primary workers, then start a screen and run an infinite loop that deletes old files (not accessed for more than 120 minutes in the example below) every 600 seconds:

$ screen -RD
<new screen created>

$ while true; do
date --iso-8601=seconds
for I in $(seq 1 4); do
( cd /mnt/${I}/hadoop/yarn/nm-local-dir/application_* && sudo find . -type f -amin +120 -delete )
done
df -h | grep /mnt/
sleep 600
done

While this command is running, detach from the screen (Ctrl+A D). You can check that your command is still running with htop.
That's it.
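As a variation on the loop above, a single cleanup pass could be wrapped in a function and scheduled (for example by cron) rather than kept alive in a screen session. This is a minimal sketch, not a tested production setup: clean_shuffle_once is a hypothetical name, it assumes it runs with enough privileges to delete the files (no sudo inside), and it assumes access times are actually updated on these mounts.

```shell
#!/usr/bin/env bash
# Sketch: one cleanup pass over all SSD mount points.
# clean_shuffle_once is a hypothetical helper name.
clean_shuffle_once() {
  local mount_root="${1:-/mnt}"   # parent of the SSD mount points (1..4)
  local minutes="${2:-120}"       # delete files not accessed for this many minutes
  local dir
  for dir in "$mount_root"/*/hadoop/yarn/nm-local-dir/application_*; do
    [ -d "$dir" ] || continue     # skip when the glob matched nothing
    find "$dir" -type f -amin +"$minutes" -delete
  done
}
```

Taking the mount root as a parameter makes it possible to try the function on a scratch directory first, for example clean_shuffle_once /tmp/fake-mnt 120, before pointing it at /mnt.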

ulmd4ohb  #1

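There is a known issue where this intermediate shuffle data can leak. In the meantime, possible workarounds are:

  • Your solution of manually deleting these staging files.
  • Disabling EFM for now. In that case, staging files are managed directly by YARN and do not rely on Spark for cleanup.

Can you clarify which Spark mode you are using? Is this a batch job that is run repeatedly, or a long-running streaming job with mini-batches? If the latter, the Spark cleanup hooks may only be issued at the end of the job, which would explain the leak.

Update: the shuffle data leak issue has been fixed since the July 20, 2021 release.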
