How to combine or merge PCollections (multiple ParDo yields) in Apache Beam Python

Posted by thtygnil 12 months ago in Apache

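I have a custom ParDo function that fetches data from an API and yields a pandas DataFrame after each hit.
I do some data manipulation on each one, and finally I want to combine all of these DataFrames (the elements of the PCollection) into a single merged DataFrame and write it to disk as a CSV file.
Here is a basic representation of how my code works: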

import apache_beam as beam
import pandas as pd
import requests

class GetData(beam.DoFn):
    def __init__(self, hits):
        self.no_of_hits = hits

    def process(self, url):
        for i in range(self.no_of_hits):
            # requests.get returns a Response; decode its JSON body before normalizing
            response = requests.get(url + str(self.no_of_hits))
            df = pd.json_normalize(response.json())
            yield df

with beam.Pipeline() as pipeline:
    data = (pipeline
        | "url to start the pipeline" >> beam.Create([url])
        | "get data from api" >> beam.ParDo(GetData(hits)))
    wrangled = (... some basic manipulation to each dataframe)
    combine = ???

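But I am new to Apache Beam, so I don't understand how I can do this.
I tried using beam.Flatten(), but it requires an iterable as input.
The PCollection is not schema'd, and it is not a deferred Beam DataFrame either.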
Thanks, any help is appreciated.

pandas

Source: https://stackoverflow.com/questions/77447874/how-to-combine-or-merge-pcollections-multiple-pardo-yields-in-apache-beam-pyth