Use ImageSchema from PySpark to apply principal component analysis

Posted by mftmpeh8 on 2021-05-27 in Spark


I have three different PySpark DataFrames containing images. When I print their ImageSchema, I get:

root
 |-- image: struct (nullable = true)
 |    |-- origin: string (nullable = true)
 |    |-- height: integer (nullable = true)
 |    |-- width: integer (nullable = true)
 |    |-- nChannels: integer (nullable = true)
 |    |-- mode: integer (nullable = true)
 |    |-- data: binary (nullable = true)

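I would like to apply PCA (or another PySpark dimensionality-reduction method) to these images, but I don't know how. I thought of using pandas_udf, because I saw that Databricks' DeepImageFeaturizer is now deprecated and pandas_udf is the suggested replacement, but I don't understand how to use it with this type of data...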


# Different examples of lines I saw on tutorials to use pandas_udf

multiple_test_udf = pandas_udf(multiple_test_df['image.data'], returnType=?)

pandas_udf(return_type, PandasUDFType.SCALAR_ITER)

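I think image.data is the only part of the ImageSchema that matters for PCA, and that image.data is what needs to go through the UDF. I also think the UDF's result would be the input to the PCA. I "just" don't understand how to carry out these steps in practice...
Thanks for your help ;)
P.S. I am using:
Python 3.7
PySpark 3.0
pandas 0.24
I run the code in a Jupyter notebook on a server with Anaconda3.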
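
For concreteness, the plan described above (turn image.data into a feature vector, then feed that vector to PCA) might look roughly like the sketch below. It is only an illustration under several assumptions that are not stated in the question: every image has the same height, width and nChannels, and the pixel data is unsigned 8-bit. The names `image_bytes_to_floats`, `pca_on_images`, `features` and `pca_features` are hypothetical. The sketch uses a plain `udf` rather than `pandas_udf` because, as far as I know, Spark 3.0 pandas UDFs cannot return the ML vector type that `PCA` expects.

```python
def image_bytes_to_floats(data):
    """Flatten raw image bytes into floats in [0, 1].

    Assumption: one unsigned byte per channel value (8-bit image modes).
    """
    return [b / 255.0 for b in bytearray(data)]


def pca_on_images(df, k=2):
    """Hypothetical helper: project the images in `df` onto k principal components.

    `df` is assumed to have an `image` struct column like the schema above,
    with all images sharing the same height, width and nChannels.
    """
    # Imported lazily so the pure-Python helper above works without Spark.
    from pyspark.ml.feature import PCA
    from pyspark.ml.linalg import Vectors, VectorUDT
    from pyspark.sql.functions import col, udf

    # Row-at-a-time UDF: binary column -> dense ML vector.
    to_vector = udf(
        lambda data: Vectors.dense(image_bytes_to_floats(data)), VectorUDT()
    )
    features = df.withColumn("features", to_vector(col("image.data")))

    # Fit PCA on the flattened pixels and keep the origin for reference.
    model = PCA(k=k, inputCol="features", outputCol="pca_features").fit(features)
    return model.transform(features).select("image.origin", "pca_features")
```

With a DataFrame such as `multiple_test_df`, this would be called as `pca_on_images(multiple_test_df, k=10)`. One caveat I believe applies: Spark's PCA handles only a limited number of feature columns (65535, to my knowledge), so large images would probably need to be resized or cropped first.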

apache-spark pyspark pandas jupyter-notebook python-3.x

Source: https://stackoverflow.com/questions/62662403/use-imageschema-from-pyspark-to-apply-principal-components-analysis