Read Parquet files from HDFS using PyArrow

Posted by rt4zxlrg on 2023-05-27 in HDFS


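I know I can connect to an HDFS cluster through pyarrow using pyarrow.hdfs.connect().
I also know I can read a Parquet file using pyarrow.parquet's read_table().
However, read_table() accepts a file path, whereas hdfs.connect() gives me a HadoopFileSystem instance.
Is it possible to use only pyarrow (with libhdfs3 installed) to get hold of a Parquet file or folder residing in an HDFS cluster? What I want to reach is the to_pydict() function, so that I can pass the data along.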

hdfs

Source: https://stackoverflow.com/questions/47443151/read-a-parquet-files-from-hdfs-using-pyarrow

3 Answers


2w3rbyxf #1

Try:

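import pyarrow as pa

# Legacy API: connect to the cluster, then read the file through the filesystem object
fs = pa.hdfs.connect(...)
table = fs.read_parquet('/path/to/hdfs-file', **other_options)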

or

import pyarrow.parquet as pq

# read_table() also accepts an open file object, not only a path
with fs.open(path) as f:
    table = pq.read_table(f, **read_options)

I opened https://issues.apache.org/jira/browse/ARROW-1848 to add more explicit documentation about this.


7xzttuei #2

I tried the same thing through the Pydoop library with engine='pyarrow', and it worked well for me.

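# In a notebook, install the dependencies first: !pip install pydoop pyarrow
import logging

import pandas as pd
import pydoop.hdfs as hd

logger = logging.getLogger(__name__)

# Read a Parquet file via Pydoop and return it as a DataFrame
def readParquetFilesPydoop(path):
    with hd.open(path) as f:
        df = pd.read_parquet(f, engine='pyarrow')
        logger.info('file: ' + path + ' : ' + str(df.shape))
        return df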


fxnxkyjh #3

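You can read and write with pyarrow as shown in the accepted answer. However, the API used there was deprecated long ago and does not work with recent versions of Hadoop. Use this instead: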

from pyarrow import fs
import pyarrow.parquet as pq

# Connect to Hadoop
hdfs = fs.HadoopFileSystem('hostname', 8020)

# Read a single file from HDFS
with hdfs.open_input_file(path) as pqt:
    df = pq.read_table(pqt).to_pandas()

# Read a directory full of partitioned Parquet files (e.g. written by Spark)
df = pq.ParquetDataset(path, filesystem=hdfs).read().to_pandas()

