pyspark AWS Glue DataFrame constructor warning

8mmmxcuj asked 12 months ago in Spark

When I run this method,

## Read incoming data from Amazon S3
def readFromRawBucket(bucket_name, bucket_prefix, schema_name, table_name):
    return glue_context.create_dynamic_frame_from_options(
        connection_type = 's3', 
        connection_options = {
            'paths': [f's3://{bucket_name}/{bucket_prefix}/{schema_name}/{table_name}'], 
            'groupFiles': 'none', 
            'recurse': True
        }, 
        format = 'parquet',
        transformation_ctx = table_name
    ).toDF()

I get the following warning:

/home/glue_user/spark/python/pyspark/sql/dataframe.py:127: UserWarning: DataFrame constructor is internal. Do not directly use it.
  warnings.warn("DataFrame constructor is internal. Do not directly use it.")

Why does this happen, and what changes are needed to get rid of this warning?


1aaf6o9v 1#

If you look at the Spark source code (version 3.4.1, the latest at the time of writing), you can see where this warning is raised:

def __init__(
    self,
    jdf: JavaObject,
    sql_ctx: Union["SQLContext", "SparkSession"],
):
    from pyspark.sql.context import SQLContext

    self._sql_ctx: Optional["SQLContext"] = None

    if isinstance(sql_ctx, SQLContext):
        assert not os.environ.get("SPARK_TESTING")  # Sanity check for our internal usage.
        assert isinstance(sql_ctx, SQLContext)
        # We should remove this if-else branch in the future release, and rename
        # sql_ctx to session in the constructor. This is an internal code path but
        # was kept with a warning because it's used intensively by third-party libraries.
        warnings.warn("DataFrame constructor is internal. Do not directly use it.")
        self._sql_ctx = sql_ctx
        session = sql_ctx.sparkSession
    else:
        session = sql_ctx
    self._session: "SparkSession" = session

In fact, this parameter, named sql_ctx, is supposed to be a SparkSession rather than a SQLContext. As the comment makes clear, this is a transitional issue. There is no need to worry, though, because session = sql_ctx.sparkSession creates the necessary SparkSession from the SQLContext.
The reason you see this warning is that aws-glue passes its glue_ctx in its .toDF method:

def toDF(self, options = None):
    ...
    return DataFrame(self._jdf.toDF(self.glue_ctx._jvm.PythonUtils.toSeq(scala_options)), self.glue_ctx)

This glue_ctx is a GlueContext, which inherits from Spark's SQLContext.
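You can check that inheritance yourself; a minimal sketch, assuming an environment with aws-glue-libs installed:

from awsglue.context import GlueContext
from pyspark.sql import SQLContext

# GlueContext subclasses SQLContext, which is why DataFrame.__init__ above
# takes the SQLContext branch and emits the warning.
print(issubclass(GlueContext, SQLContext))  # expected: True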
So in the end, if Spark changes its API to accept only a SparkSession in this constructor, Glue will have to update its .toDF method. But that is not the case for now. You won't be able to remove the warning without overriding the .toDF method. You don't need to worry about it, though; it's just a bit annoying to see these warnings :)
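If the noise bothers you, one possible workaround (a sketch of my own, not part of the Glue API) is to suppress only this specific warning around the .toDF() call:

import warnings

from pyspark.sql import DataFrame


def to_df_quietly(dynamic_frame) -> DataFrame:
    # Hypothetical helper: silence only the internal-constructor UserWarning
    # for the duration of this single .toDF() call; other warnings still show.
    with warnings.catch_warnings():
        warnings.filterwarnings(
            "ignore",
            message="DataFrame constructor is internal",
            category=UserWarning,
        )
        return dynamic_frame.toDF()

This only hides the message; the DataFrame that .toDF() returns is perfectly usable either way.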


nzk0hqpo 2#

I assume it is caused by the call to .toDF(), which converts the proprietary Glue DynamicFrame into a regular Spark DataFrame.
