tensorflow: Most efficient way to parse a dataset generated using petastorm from parquet

Posted by jxct1oxe on 2023-05-29 in Storm

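Versions: Python 3.7.13, TensorFlow 2.9.1, Petastorm 0.12.1
I am trying to implement a data-loading framework that uses petastorm to create a tf.data.Dataset from parquet files stored in S3.
The dataset is created as follows: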

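import tensorflow as tf
from petastorm import make_batch_reader
from petastorm.tf_utils import make_petastorm_dataset

cols = [col1_nm, col2_nm, ...]

def parse(e):
    # e is a namedtuple with one tensor per column; look each field up by name
    x_features = [getattr(e, c) for c in cols]
    # Stack the per-column tensors into a (batch, n_features) matrix
    X = tf.stack(x_features, axis=1)
    y = getattr(e, 'target')
    return X, y

with make_batch_reader(s3_paths, schema_fields=cols + ['target']) as reader:
    dataset = make_petastorm_dataset(reader).map(parse)
    for e in dataset.take(3):
        print(e)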

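This all works, but I would like to know whether there is an alternative (more efficient and more maintainable) way to do it.
Before parsing, dataset has type DatasetV1Adapter, and each element e of the dataset (obtained via dataset.take(1)) has type inferred_schema_view, which holds one EagerTensor per feature.
I tried splitting X and y by index, but reading the last element with [-1] does not return the EagerTensor for target.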
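A small sketch of why looking fields up by name is safer than positional indexing. It assumes (not verified against petastorm) that the field order of the inferred_schema_view namedtuple need not match the order of `cols`, which would explain why `[-1]` misses `target`. `Row` is a hypothetical stand-in for that namedtuple, and NumPy arrays stand in for the tensors.

```python
from collections import namedtuple

import numpy as np

# Hypothetical stand-in for petastorm's inferred_schema_view.
# The fields are deliberately declared in a different order than `cols`.
Row = namedtuple("Row", ["col1_nm", "target", "col2_nm"])

cols = ["col1_nm", "col2_nm"]


def split_by_name(e, feature_cols, target_col="target"):
    """Return (X, y), selecting fields by name rather than by position."""
    fields = e._asdict()
    # Stack the per-column arrays into a (batch, n_features) matrix
    X = np.stack([fields[c] for c in feature_cols], axis=1)
    y = fields[target_col]
    return X, y


batch = Row(
    col1_nm=np.array([1.0, 2.0]),
    target=np.array([0.0, 1.0]),
    col2_nm=np.array([3.0, 4.0]),
)
X, y = split_by_name(batch, cols)
```

Here `batch[-1]` is the `col2_nm` column, not `target`, while `split_by_name` returns the right arrays regardless of field order.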

tensorflow

Source: https://stackoverflow.com/questions/76260072/most-efficient-way-to-parse-dataset-generated-using-petastorm-from-parquet

1 Answer

hyrbngr71#

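To build a TensorFlow dataset from Parquet stored in S3 with Petastorm in a more efficient and maintainable way, you can try tf.data.Dataset.from_generator. It lets you define a generator function that yields data examples and labels.
Here is an example of how the code could be modified to use tf.data.Dataset.from_generator: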

import tensorflow as tf
from petastorm import make_batch_reader

cols = ['col1_nm', 'col2_nm', ...]

def generator_fn():
    # make_batch_reader yields one namedtuple per batch of rows,
    # with one NumPy array of shape (batch,) per column
    with make_batch_reader(s3_paths, schema_fields=cols + ['target']) as reader:
        for e in reader:
            x_features = [getattr(e, c) for c in cols]
            y = getattr(e, 'target')
            yield x_features, y

def parse(x_features, y):
    # x_features arrives as a (n_features, batch) tensor; make it (batch, n_features)
    X = tf.transpose(x_features)
    return X, y

# The batch dimension varies between row groups, so it is declared as None
dataset = tf.data.Dataset.from_generator(generator_fn, output_signature=(
    tf.TensorSpec(shape=(len(cols), None), dtype=tf.float32),
    tf.TensorSpec(shape=(None,), dtype=tf.float32)
))

parsed_dataset = dataset.map(parse)

for e in parsed_dataset.take(3):
    print(e)

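1. generator_fn is defined as a generator function that yields data examples (x_features) and labels (y). make_batch_reader is called inside the generator to create the reader and iterate over the data.
2. parse takes x_features and y as arguments and processes them into the desired X and y layout.
3. tf.data.Dataset.from_generator creates the dataset from generator_fn. The output_signature argument specifies the shape and dtype of the generated elements.
4. The resulting dataset is then mapped with parse, which applies the required processing to each element.

Compared with calling map directly on the Petastorm dataset, this approach should give you more control and flexibility over the data-loading process, and may be more efficient and easier to maintain.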
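The from_generator pattern can be tried without S3 or petastorm. The following is a minimal sketch in which `fake_batches` is a hypothetical stand-in for iterating a batch reader: it yields whole batches of rows as NumPy arrays, with made-up column names and sizes. The batch dimension is declared as None because batches vary in size, and `unbatch().batch(3)` is one way to obtain a fixed batch size afterwards.

```python
import numpy as np
import tensorflow as tf

cols = ["a", "b", "c"]  # hypothetical feature names


def fake_batches():
    """Stand-in for a batch reader: yields (X, y) for batches of varying size."""
    rng = np.random.default_rng(0)
    for batch_size in (4, 2):
        X = rng.random((batch_size, len(cols)), dtype=np.float32)
        y = rng.random(batch_size, dtype=np.float32)
        yield X, y


dataset = tf.data.Dataset.from_generator(
    fake_batches,
    output_signature=(
        # None marks the variable batch dimension
        tf.TensorSpec(shape=(None, len(cols)), dtype=tf.float32),
        tf.TensorSpec(shape=(None,), dtype=tf.float32),
    ),
)

# Flatten the variable-size batches into rows, then regroup into batches of 3
dataset = dataset.unbatch().batch(3)

shapes = [(tuple(X.shape), tuple(y.shape)) for X, y in dataset]
```

The six generated rows are regrouped into two batches of three, so `shapes` holds two `((3, 3), (3,))` entries.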
P.S.: I have not tested this code, but it may give you some ideas.

2023-05-29
