Handling org.apache.arrow.vector.util.OversizedAllocationException error

Posted by bgibtngc on 2021-07-13 in Spark


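I have a Spark DataFrame with two columns, probability and true label (from a binary classifier). The data was given to me as is, so I am not running any MLlib operations myself. The data size is ~500M, and I have a cluster of 8 workers with 112 GB of memory (1 GPU, 5 DBUs). I need to compute the roc_auc_score. I am using a pandas UDF together with the roc_auc_score metric from sklearn.
The code is as follows: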

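import pandas as pd
from pyspark.sql.functions import pandas_udf
from sklearn.metrics import roc_auc_score

@pandas_udf("double")
def pandas_auc(label: pd.Series, probability: pd.Series) -> float:
    score = roc_auc_score(y_true=label, y_score=probability)
    return score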

# execute

training_auc = self.df.select(pandas_auc('label','Average_Score')).first()[0]

This throws the following error:

org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 30.0 failed 4 times, most recent failure: Lost task 0.3 in stage 30.0 (TID 7923, 10.70.176.13, executor 3): org.apache.arrow.vector.util.OversizedAllocationException: Memory required for vector capacity 264305678 is (2147483648), which is more than max allowed (2147483647)
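
The figures in the message above are consistent with a single Arrow buffer of 8-byte doubles being rounded up to the next power of two. The sketch below only reproduces that arithmetic. The 8-byte value width and the power-of-two rounding are assumptions, not something the message itself states.

```python
# Arithmetic behind the error message, under two assumptions:
# each value is an 8-byte double, and the buffer size is rounded
# up to the next power of two before it is checked against the limit.
VALUE_COUNT = 264_305_678      # "vector capacity" from the error message
BYTES_PER_DOUBLE = 8           # assumption: float64 column
MAX_ALLOWED = 2_147_483_647    # "max allowed" from the error message (2**31 - 1)


def next_power_of_two(n: int) -> int:
    """Return the smallest power of two that is >= n (for n >= 1)."""
    return 1 << (n - 1).bit_length()


raw_bytes = VALUE_COUNT * BYTES_PER_DOUBLE
rounded_bytes = next_power_of_two(raw_bytes)

print("raw bytes:    ", raw_bytes)
print("rounded bytes:", rounded_bytes)
print("over limit:   ", rounded_bytes > MAX_ALLOWED)
```

If this reading is right, the failure comes from too many rows landing in one Arrow vector (the UDF is called without a groupBy, so it likely receives a whole column at once), and adding cluster memory alone would not help.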

I noticed that it works with less data (~100M), but how can I handle it when my data is ~500M?
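
One possible way around the limit is to avoid sending every row to a single pandas UDF call, and instead count positives and negatives per score bucket, which Spark could do with an ordinary groupBy aggregation. The plain-Python sketch below shows only the logic of that idea. `bucket_counts`, `auc_from_buckets`, and the default of 1000 buckets are hypothetical names and choices, not part of the original question. It assumes scores lie in [0, 1] and labels are 0 or 1, and the result is an approximation because scores that fall in the same bucket are treated as ties.

```python
from collections import defaultdict
from typing import Iterable, List, Sequence, Tuple


def bucket_counts(labels: Sequence[int], scores: Sequence[float],
                  n_buckets: int = 1000) -> List[Tuple[int, int, int]]:
    """Return (bucket_id, positives, negatives) per non-empty bucket.

    In Spark this step would be the distributed aggregation; here it is
    done locally to keep the sketch self-contained.
    """
    counts = defaultdict(lambda: [0, 0])
    for y, s in zip(labels, scores):
        # Clamp so that a score of exactly 1.0 falls in the last bucket.
        b = min(int(s * n_buckets), n_buckets - 1)
        counts[b][0 if y == 1 else 1] += 1
    return [(b, pos, neg) for b, (pos, neg) in counts.items()]


def auc_from_buckets(buckets: Iterable[Tuple[int, int, int]]) -> float:
    """Approximate ROC AUC from per-bucket counts (rank-based formulation).

    A positive counts as 1 against every negative in a lower bucket and
    as 0.5 against every negative in its own bucket.
    """
    rows = sorted(buckets, key=lambda r: r[0])
    total_pos = sum(pos for _, pos, _ in rows)
    total_neg = sum(neg for _, _, neg in rows)
    if total_pos == 0 or total_neg == 0:
        raise ValueError("AUC is undefined unless both classes are present")

    neg_below = 0       # negatives seen in strictly lower buckets
    concordant = 0.0    # (positive, negative) pairs ranked correctly
    for _, pos, neg in rows:
        concordant += pos * (neg_below + 0.5 * neg)
        neg_below += neg
    return concordant / (total_pos * total_neg)


# Small usage example with hand-picked values.
demo_labels = [0, 0, 1, 1]
demo_scores = [0.125, 0.5, 0.375, 0.875]
demo_auc = auc_from_buckets(bucket_counts(demo_labels, demo_scores, n_buckets=8))
print(demo_auc)
```

Only the per-bucket counts need to reach the driver, so the amount of data collected is bounded by the number of buckets instead of the number of rows. More buckets give a closer approximation at the cost of a larger aggregated result.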

apache-spark pyspark apache-spark-sql

Source: https://stackoverflow.com/questions/66093310/handling-org-apache-arrow-vector-util-oversizedallocationexception-error
