Spark read from Redshift is very slow

nom7f22z · posted 2023-02-09 · in Apache

I have a requirement to read data from AWS Redshift and, using Apache Spark on an EC2 instance, write the result as CSV to an AWS S3 bucket.
I am using the io.github.spark_redshift_community.spark.redshift driver to read the data from Redshift with a query. The driver executes the query and stores the result, in CSV format, in a temporary location on S3.

Due to certain constraints, I do not want to use Athena or the UNLOAD command.

I was able to get this working, but the read from the S3 temp_directory is very slow.

Reading the 10k records (about 2 MB) from the S3 temp_directory and then writing them to the S3 destination takes minutes.
From the logs I can tell that unloading the Redshift data into the S3 temp_directory is quite fast; the delay occurs while reading from the temp_directory.
The EC2 instance running Spark has IAM role access to the S3 bucket.
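One standard way to make Spark's S3A connector use the EC2 instance profile is via hadoop-aws settings, e.g. in spark-defaults.conf. This is a sketch of an assumed setup, not something stated in the post; `forward_spark_s3_credentials` in the reader below then forwards whatever credentials Spark resolves for S3 on to Redshift:

```properties
# spark-defaults.conf -- a sketch, assuming S3 access should come from the
# EC2 instance profile (these are standard hadoop-aws S3A settings,
# not spark-redshift options)
spark.hadoop.fs.s3a.aws.credentials.provider  com.amazonaws.auth.InstanceProfileCredentialsProvider
```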
Below is the code that reads from Redshift:

spark.read()
            .format("io.github.spark_redshift_community.spark.redshift")
            .option("url", URL)
            .option("query", QUERY)
            .option("user", USER_ID)
            .option("password", PASSWORD)
            .option("tempdir", TEMP_DIR)
            .option("forward_spark_s3_credentials", "true")
            .load();
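The write side of the job (CSV out to S3) is not shown in the post. A minimal sketch of that step, here writing a small stand-in Dataset to a local path for illustration; in the real job `df` would be the Dataset returned by the reader above and the path would be the s3a:// destination (both are assumptions, not from the post):

```java
import java.util.Arrays;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

// Local session only for this sketch; the real job already has `spark`.
SparkSession spark = SparkSession.builder()
        .master("local[*]")
        .appName("csv-write-sketch")
        .getOrCreate();

// Stand-in for the Dataset<Row> the Redshift reader returns.
Dataset<Row> df = spark
        .createDataset(Arrays.asList("a", "b", "c"), Encoders.STRING())
        .toDF("col1");

// Hypothetical target; in the real job an "s3a://bucket/prefix" path.
String outputPath = "/tmp/csv-write-sketch-out";

df.write()
        .mode(SaveMode.Overwrite)   // replace any previous output
        .option("header", "true")   // emit a CSV header row
        .csv(outputPath);

spark.stop();
```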

Below are the pom.xml dependencies:

<dependencies>

    <dependency>
        <groupId>com.eclipsesource.minimal-json</groupId>
        <artifactId>minimal-json</artifactId>
        <version>0.9.5</version>
    </dependency>
    <dependency>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-assembly-plugin</artifactId>
        <version>3.3.0</version>
    </dependency>
    <dependency>
        <groupId>org.ini4j</groupId>
        <artifactId>ini4j</artifactId>
        <version>0.5.4</version>
    </dependency>
    <dependency>
        <groupId>org.projectlombok</groupId>
        <artifactId>lombok</artifactId>
        <version>1.18.2</version>
    </dependency>
    <dependency>
        <groupId>org.slf4j</groupId>
        <artifactId>slf4j-api</artifactId>
        <version>1.7.26</version>
    </dependency>

    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-avro_2.12</artifactId>
        <version>3.3.1</version>
    </dependency>

    <dependency>
        <groupId>io.github.spark-redshift-community</groupId>
        <artifactId>spark-redshift_2.12</artifactId>
        <version>4.2.0</version>
    </dependency>

    <dependency>
        <groupId>io.delta</groupId>
        <artifactId>delta-core_2.12</artifactId>
        <version>2.2.0</version>
    </dependency>

    <dependency>
        <groupId>org.scala-lang</groupId>
        <artifactId>scala-library</artifactId>
        <version>2.12.15</version>
    </dependency>

    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-aws</artifactId>
        <version>3.3.1</version>
        <scope>provided</scope>
    </dependency>

    <dependency>
        <groupId>com.amazonaws</groupId>
        <artifactId>aws-java-sdk-s3</artifactId>
        <version>1.12.389</version>
        <scope>provided</scope>
    </dependency>

    <dependency>
        <groupId>com.amazonaws</groupId>
        <artifactId>aws-java-sdk-bundle</artifactId>
        <version>1.12.389</version>
        <scope>provided</scope>
    </dependency>

    <!-- https://mvnrepository.com/artifact/org.apache.spark/spark-hadoop-cloud -->
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-hadoop-cloud_2.12</artifactId>
        <version>3.3.1</version>
        <scope>provided</scope>
    </dependency>

    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-sql_2.12</artifactId>
        <version>3.3.1</version>
        <scope>provided</scope>
    </dependency>

    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_2.12</artifactId>
        <version>3.3.1</version>
        <scope>provided</scope>
    </dependency>

    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-common</artifactId>
        <version>3.3.1</version>
        <scope>provided</scope>
    </dependency>

    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-client</artifactId>
        <version>3.3.1</version>
        <scope>provided</scope>
    </dependency>

    <dependency>
        <groupId>junit</groupId>
        <artifactId>junit</artifactId>
        <version>3.8.1</version>
        <scope>test</scope>
    </dependency>
</dependencies>
mwg9r5ms1#

I found the solution to this problem.
It turned out that version 4.2.0 of the io.github.spark_redshift_community.spark.redshift driver I was using was causing the issue.
When I switched to the latest version, 5.1.0, the problem was resolved and the same job completed within 10 seconds.
Thanks!
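For reference, the fix amounts to bumping the connector version in the pom.xml shown in the question (same coordinates, only the version changes):

```xml
<dependency>
    <groupId>io.github.spark-redshift-community</groupId>
    <artifactId>spark-redshift_2.12</artifactId>
    <version>5.1.0</version>
</dependency>
```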
