Download a tarball from HDFS and untar it on the fly

rks48beu posted on 2021-05-31 in Hadoop

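I have a large dataset stored in HDFS as an (uncompressed) tarball. The tarball is about 250 GiB in size.
I would like to download this tarball and untar it on the fly, to save space on my machine's fast SSD. I want to avoid first fetching it with hadoop fs -get ... and then untarring it locally.
Currently, I am using hadoop fs -cat to fetch it and piping it into tar, with pv for a progress bar: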

hadoop fs -cat my_big_tar.tar | pv -s "$TAR_SIZE" | tar xf -

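However, when I do this, I get some (non-fatal) errors while untarring. The final result looks fine, but some data (a few dozen GiB) is missing. The errors look like this: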

Grabbing data from hadoop and untaring on the fly...                                                                     
tar: Skipping to next header============================================================>               ] 81% ETA 0:05:40
tar: Archive contains ‘\201\021\260e\210\333\357J\201\200\av’ where numeric off_t value expected        ] 81% ETA 0:05:37
tar: Archive contains ‘W\341\034\267\t\0ꑻ\317{\374’ where numeric off_t value expected                                   
tar: Archive contains ‘s{AZ\224\235 F.\317\342d’ where numeric off_t value expected======>              ] 82% ETA 0:05:10
tar: Archive contains ‘\264\357\036\272ud.W\235cL\204’ where numeric off_t value expected====>          ] 86% ETA 0:03:50
tar: Archive contains ‘\251\203\204\236\207\374\246"\255\240i\017’ where numeric off_t value expected                    
tar: Archive contains ‘T\242\b[(\372\357*e\032\255S’ where numeric off_t value expected======>          ] 87% ETA 0:03:46
tar: Archive contains ‘\300굕\277t\025o\207\373CK’ where numeric off_t value expected========>           ] 87% ETA 0:03:37
tar: Archive base-256 value is out of off_t range=============================================>         ] 88% ETA 0:03:26
tar: Archive contains ‘\204\274\234\366z\335<D\201-\306\361’ where numeric off_t value expected         ] 88% ETA 0:03:24
tar: Archive contains ‘\341ֶ\207\334-5\034\267C\v\017’ where numeric off_t value expected======>        ] 88% ETA 0:03:18
tar: Archive contains ‘c\3307\247\343ჯ\033瓸’ where numeric off_t value expected===============>         ] 89% ETA 0:03:11
tar: Archive contains ‘Vj+&!\242f$\212\374_\276’ where numeric off_t value expected=============>       ] 91% ETA 0:02:35
tar: Archive contains ‘\v5\374\273\375\302e\251ݝ\247O’ where numeric off_t value expected=======>       ] 91% ETA 0:02:33
tar: Archive contains ‘\027ȷJ\316j\203\025\027\033\264R’ where numeric off_t value expected=====>       ] 91% ETA 0:02:21
tar: Archive contains ‘Ks[L\325x\005\341\301’ where numeric off_t value expected================>       ] 92% ETA 0:02:19
tar: Archive contains obsolescent base-64 headers================================================>      ] 92% ETA 0:02:12
<snip>
tar: Archive contains ‘\177q\375\230Y<QE\0\367\242\207’ where numeric off_t value expected=============> ] 99% ETA 0:00:1
tar: Archive contains ‘\264e\260k\340,d\206\242^\022\032’ where numeric off_t value expected===========> ] 99% ETA 0:00:0
tar: Exiting with failure status due to previous errors================================================> ] 99% ETA 0:00:0
 260GB 0:28:54 [ 154MB/s] [============================================================================>] 100%

Copying the data out of Hadoop first with hadoop fs -get my_tar.tar and then untarring it locally works fine.
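
Because the copy fetched with hadoop fs -get untars cleanly, it could serve as a reference for the streamed bytes. The following is a minimal sketch rather than a confirmed diagnosis: first_difference is a hypothetical helper name, and it assumes the reference copy is still on local disk. It pipes a command's stdout into cmp, which reports the first byte offset at which the stream and the file differ.

```shell
#!/usr/bin/env bash
# first_difference REFERENCE CMD [ARGS...]
# Hypothetical helper (not part of Hadoop): runs CMD, discards its stderr, and
# compares its stdout byte-for-byte against the file REFERENCE.
# cmp prints the first differing offset, if any, and returns 0 only when the
# stream and the file are identical.
first_difference() {
    local reference="$1"
    shift
    "$@" 2>/dev/null | cmp - "$reference"
}

# Example call (needs access to the cluster, so it is left commented out):
# first_difference my_tar.tar hadoop fs -cat my_tar.tar
```

If cmp reports a difference, the offset shows where the stream first departs from the known-good copy, which can then be set against the point where tar starts to complain.
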
This is my hadoop version output:

Hadoop 2.7.2
Subversion Unknown -r Unknown
Compiled by <redacted> on 2016-04-21T22:04Z
Compiled with protoc 2.5.0
From source with checksum d0fda26633fa762bff87ec759ebe689c
This command was run using /opt/hadoop/hadoop-bin-2.7.2-1/share/hadoop/common/hadoop-common-2.7.2.jar

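The full script is available here: https://github.com/andreibarsan/dotfiles/blob/master/bin/get-hdfs-tar.sh
What is causing these errors when using hadoop fs -cat? (Maybe some stray Hadoop log output is getting mixed into the pipe that tar reads from? How could I check that?)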
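
One possible way to check the stray-output idea is to keep stdout and stderr apart and measure what actually arrives on stdout. The sketch below is only an illustration under assumptions: stream_report is a hypothetical helper name and the log file path is an arbitrary choice. It redirects the command's stderr to a file, counts the bytes written to stdout, and reports the command's own exit status, which the original pipeline never inspects.

```shell
#!/usr/bin/env bash
# stream_report STDERR_LOG CMD [ARGS...]
# Hypothetical helper (not part of Hadoop): runs CMD with stderr redirected to
# STDERR_LOG and prints one line with the number of bytes CMD wrote to stdout,
# CMD's exit status, and the number of bytes it wrote to stderr.
stream_report() {
    local stderr_log="$1"
    shift
    local bytes status
    # PIPESTATUS[0] is the exit status of CMD itself, not of wc.
    if bytes="$("$@" 2>"$stderr_log" | wc -c; exit "${PIPESTATUS[0]}")"; then
        status=0
    else
        status=$?
    fi
    # The arithmetic expansions strip any padding some wc builds add.
    echo "stdout_bytes=$((bytes)) exit_status=$status stderr_bytes=$(( $(wc -c <"$stderr_log") ))"
}

# Example call (needs access to the cluster, so it is left commented out):
# stream_report cat_stderr.log hadoop fs -cat my_big_tar.tar
```

A stdout_bytes value larger than "$TAR_SIZE" would suggest extra bytes are being written to stdout; a smaller value or a nonzero exit_status would point to a truncated or failed read instead. Anything captured in the stderr log never reaches tar in the original pipeline, since only stdout is piped.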
