HDFS: How to get the numeric columns from a PySpark DataFrame and compute the z-score

mum43rcc  posted on 2022-12-09 in HDFS
from pyspark.sql import SparkSession
sparkSession = SparkSession.builder.appName("example").getOrCreate()
df = sparkSession.read.json('hdfs://localhost/abc/zscore/')

I am able to read the data from HDFS, and I want to calculate the z-score for the numeric columns only.

zf2sa74q  #1

You can convert the DataFrame to pandas and compute the z-score there:

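from pyspark.sql import SparkSession
from scipy.stats import zscore

sparkSession = SparkSession.builder.appName("example").getOrCreate()
# toPandas() collects the whole dataset onto the driver as a pandas DataFrame
df = sparkSession.read.json('hdfs://localhost/SmartRegression/zscore/').toPandas()
num_cols = df.select_dtypes(include='number').columns
# Apply scipy's zscore to each numeric column
results = df[num_cols].apply(zscore)
print(results)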
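
For reference, here is a minimal sketch of what the z-score works out to for a single column, using only the standard library. The helper name `zscores` and the sample values are made up for illustration. It assumes the population standard deviation (ddof=0), which is the default of `scipy.stats.zscore`.

```python
from statistics import fmean, pstdev

def zscores(values):
    """Return the z-score of each value: (x - mean) / population std."""
    mean = fmean(values)
    std = pstdev(values)  # population standard deviation (ddof=0)
    # Note: a constant column has std == 0 and would raise ZeroDivisionError here.
    return [(x - mean) / std for x in values]

# Hypothetical sample column: mean is 5.0, population std is 2.0
scores = zscores([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
print(scores)
```

Each value ends up expressed as the number of standard deviations it lies from the column mean.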

ufj5ltwl  #2

toPandas() is not suitable for large datasets, because it tries to load the entire dataset into the driver's memory.
