How to unzip .gz files in a new directory in Hadoop?

Posted by zyfwsgd6 on 2021-05-29 in Hadoop


I have a bunch of .gz files in a folder in HDFS. I want to unzip all of these .gz files into a new folder in HDFS. How should I do this?

hadoop hdfs GZIP

Source: https://stackoverflow.com/questions/34573279/how-to-unzip-gz-files-in-a-new-directory-in-hadoop

5 answers


#1 vcudknz3

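Hadoop's FileUtil class has unTar() and unZip() methods for this. The unTar() method also works on .tar.gz and .tgz files. Unfortunately, these methods only work on files on the local filesystem. You will have to use one of the same class's copy() methods to copy the files to and from whichever distributed file system you need to use.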

Answered 2021-05-30

#2 dgsult0t

If you have compressed text files, hadoop fs -text supports gzip along with other common compression formats (snappy, lzo).

hadoop fs -text /tmp/a.gz | hadoop fs -put - /tmp/uncompressed_a
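The command above handles a single file. To cover the original question (every .gz file in one HDFS folder, written to a new folder), the same pipe can be wrapped in a loop. This is only a sketch under a few assumptions: the function names and the HADOOP_BIN override are hypothetical, the listing is parsed with awk on the eighth column of hadoop fs -ls output, and paths containing whitespace are not handled.

```shell
#!/usr/bin/env bash
# Sketch: decompress every .gz file in one HDFS directory into another.
# HADOOP_BIN, gz_target_name and decompress_gz_dir are illustrative names.
HADOOP_BIN="${HADOOP_BIN:-hadoop}"

# Strip the directory part and a trailing .gz from an HDFS path.
gz_target_name() {
    local base="${1##*/}"
    printf '%s\n' "${base%.gz}"
}

# decompress_gz_dir SRC_DIR DST_DIR
decompress_gz_dir() {
    local src_dir="$1" dst_dir="$2" path
    "$HADOOP_BIN" fs -mkdir -p "$dst_dir" || return 1
    # The glob is quoted so that HDFS, not the local shell, expands it.
    "$HADOOP_BIN" fs -ls "$src_dir/*.gz" | awk 'NF >= 8 {print $8}' |
    while read -r path; do
        # </dev/null keeps the command from consuming the list of paths.
        "$HADOOP_BIN" fs -text "$path" < /dev/null |
            "$HADOOP_BIN" fs -put - "$dst_dir/$(gz_target_name "$path")"
    done
}
```

With the functions loaded, a call such as decompress_gz_dir /tmp/gz_in /tmp/unzipped (both paths are placeholders) would write one uncompressed file per input. Since hadoop fs -text is meant for text data, this is best suited to compressed text files, as the answer notes.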

Answered 2021-05-30

#3 zqry0prt

I was able to do this in three different ways.
Using the Linux command line
The following command worked for me.

hadoop fs -cat /tmp/Links.txt.gz | gzip -d | hadoop fs -put - /tmp/unzipped/Links.txt

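My gzipped file is Links.txt.gz, and the output is stored in /tmp/unzipped/Links.txt.
Using a Java program
The book Hadoop: The Definitive Guide has a section on Codecs. That section includes a program that decompresses a file using CompressionCodecFactory. I am reproducing that code as is: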

package com.myorg.hadooptests;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

import java.io.InputStream;
import java.io.OutputStream;
import java.net.URI;

public class FileDecompressor {
    public static void main(String[] args) throws Exception {
        String uri = args[0];
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);
        Path inputPath = new Path(uri);
        CompressionCodecFactory factory = new CompressionCodecFactory(conf);
        CompressionCodec codec = factory.getCodec(inputPath);
        if (codec == null) {
            System.err.println("No codec found for " + uri);
            System.exit(1);
        }
        // Derive the output path by stripping the codec's extension (e.g. ".gz").
        String outputUri =
            CompressionCodecFactory.removeSuffix(uri, codec.getDefaultExtension());
        InputStream in = null;
        OutputStream out = null;
        try {
            in = codec.createInputStream(fs.open(inputPath));
            out = fs.create(new Path(outputUri));
            IOUtils.copyBytes(in, out, conf);
        } finally {
            IOUtils.closeStream(in);
            IOUtils.closeStream(out);
        }
    }
}

This code takes the path of the gz file as input.
You can run it like this:

FileDecompressor <gzipped file name>

For example, when I ran it on my gzipped file:

FileDecompressor /tmp/Links.txt.gz

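I got the decompressed file at /tmp/Links.txt. The program stores the decompressed file in the same folder as the input, so you need to modify the code to take two input arguments: <input file path> and <output folder>.
Once this program works, you can write a shell/Perl/Python script to call it for each input file.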
Using a Pig script
You can write a simple Pig script to achieve this.
I wrote the following script, which works:

A = LOAD '/tmp/Links.txt.gz' USING PigStorage();
Store A into '/tmp/tmp_unzipped/' USING PigStorage();
mv /tmp/tmp_unzipped/part-m-00000 /tmp/unzipped/Links.txt
rm /tmp/tmp_unzipped/

When you run this script, the decompressed contents are stored in a temporary folder, /tmp/tmp_unzipped. This folder will contain:

/tmp/tmp_unzipped/_SUCCESS
/tmp/tmp_unzipped/part-m-00000

The file part-m-00000 contains the decompressed data.
So we need to rename it explicitly with the following commands and finally delete the /tmp/tmp_unzipped folder:

mv /tmp/tmp_unzipped/part-m-00000 /tmp/unzipped/Links.txt
rm /tmp/tmp_unzipped/

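So if you use this Pig script, you only need to take care of parameterizing the file names (Links.txt.gz and Links.txt).
Again, once this script works, you can write a shell/Perl/Python script to call the Pig script for each input file.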
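One possible way to parameterize the file names is Pig's parameter substitution: the script refers to $input and $tmpdir, and a shell wrapper supplies the values with -param. The sketch below is an assumption-laden illustration rather than part of the original answer. The names write_pig_script, pig_unzip, PIG_BIN and HADOOP_BIN are hypothetical, the rename and cleanup are done with hadoop fs instead of inside the Pig script, and the job is assumed to produce a single part-m-00000 file as described above.

```shell
#!/usr/bin/env bash
# Sketch: run a parameterized Pig script for one .gz file, then rename the result.
# PIG_BIN, HADOOP_BIN, write_pig_script and pig_unzip are illustrative names.
PIG_BIN="${PIG_BIN:-pig}"
HADOOP_BIN="${HADOOP_BIN:-hadoop}"

# Write the Pig script to the given local path. The quoted heredoc keeps
# $input and $tmpdir literal so that Pig, not the shell, substitutes them.
write_pig_script() {
    cat > "$1" <<'EOF'
A = LOAD '$input' USING PigStorage();
STORE A INTO '$tmpdir' USING PigStorage();
EOF
}

# pig_unzip SCRIPT INPUT_GZ TMP_DIR OUTPUT_FILE
pig_unzip() {
    local script="$1" input="$2" tmpdir="$3" output="$4"
    # Decompress by loading the .gz file and storing it as plain text.
    "$PIG_BIN" -param input="$input" -param tmpdir="$tmpdir" "$script" || return 1
    # Rename the single part file, then remove the temporary folder.
    "$HADOOP_BIN" fs -mv "$tmpdir/part-m-00000" "$output" || return 1
    "$HADOOP_BIN" fs -rm -r "$tmpdir"
}
```

A call for the example file might look like: write_pig_script unzip.pig, followed by pig_unzip unzip.pig /tmp/Links.txt.gz /tmp/tmp_unzipped /tmp/unzipped/Links.txt. Looping over a list of inputs then only requires changing the second and fourth arguments.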

Answered 2021-05-30

#4 bnlyeluc

Bash solution

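In my case, I did not want to decompress the files through a pipe, because I was not sure of their contents. Instead, I wanted to make sure that every file inside the zip files would end up extracted on HDFS.
I created a simple bash script. The comments should give you a clue about what is going on. There is a short description below.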


#!/bin/bash

workdir=/tmp/unziphdfs/
# stop if the work dir is missing, so the cleanup below never runs elsewhere
cd "$workdir" || exit 1

# get all zip files in a folder (quoted so HDFS expands the glob, not the local shell)
zips=$(hadoop fs -ls '/yourpath/*.zip' | awk '{print $8}')
for hdfsfile in $zips
do
    echo "$hdfsfile"

    # copy to temp folder to unpack
    hdfs dfs -copyToLocal "$hdfsfile" "$workdir"

    hdfsdir=$(dirname "$hdfsfile")
    zipname=$(basename "$hdfsfile")

    # unpack locally and remove the local copy of the archive
    unzip "$zipname"
    rm -rf "$zipname"

    # copy the extracted files back to hdfs, next to the original zip
    for file in "$workdir"/*; do
       hdfs dfs -copyFromLocal "$file" "$hdfsdir"
       rm -rf "$file"
    done

    # optionally remove the zip file from hdfs?
    # hadoop fs -rm -skipTrash $hdfsfile
done

Explanation

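1. Get all the *.zip files in an HDFS directory.
2. One by one: copy the zip to a temp directory (on the local filesystem).
3. Unzip it.
4. Copy all the extracted files to the directory of the zip file.
5. Clean up.
I managed to get it working with a sub-directory structure containing many zip files in each directory, using /mypath/*/*.zip.
Good luck :)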

Answered 2021-05-30

#5 vktxenjb

You can do this using Hive (assuming it is text data).

create external table source (t string) location '<directory_with_gz_files>';
create external table target (t string) location '<target_dir>';
insert into table target select * from source;

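The data will be decompressed into a new set of files.
If you do not want the file names to change, and you have enough storage on the node where you are running, you can do this instead: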

hadoop fs -get <your_source_directory> <directory_name>
# This creates <directory_name> where you ran the hadoop command; cd into it and gunzip all the files
cd <directory_name>
gunzip *.gz
cd ..
hadoop fs -moveFromLocal <directory_name> <target_hdfs_path>

Answered 2021-05-29
