I have a requirement to read data from AWS Redshift and, using Apache Spark on an EC2 instance, write the result as CSV to an AWS S3 bucket.
I am using the io.github.spark_redshift_community.spark.redshift driver to read the data from Redshift via a query. This driver executes the query and stores the result in CSV format in a temporary space on S3.
Due to certain constraints I do not want to use Athena or the UNLOAD command.
I was able to get this working, but the read from the S3 temp_directory is very slow. Reading those 10k records (about 2MB) from the S3 temp_directory and then writing them to the S3 destination takes minutes.
From the logs I can tell that storing the Redshift data into the S3 temp_directory is quite fast; the delay occurs while reading from the temp_directory.
The EC2 instance running Spark has IAM role access to the S3 bucket.
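For reference, the Spark session is created roughly as follows. This is a minimal sketch; the explicit credentials-provider setting is an assumption about how s3a picks up the instance-profile (IAM role) credentials, not something from the original setup:

import org.apache.spark.sql.SparkSession;

// Assumption: let s3a resolve credentials from the EC2 instance profile (IAM role).
SparkSession spark = SparkSession.builder()
        .appName("redshift-to-s3-csv")
        .config("spark.hadoop.fs.s3a.aws.credentials.provider",
                "com.amazonaws.auth.InstanceProfileCredentialsProvider")
        .getOrCreate();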
Below is the code that reads from Redshift:

Dataset<Row> dataset = spark.read()
        .format("io.github.spark_redshift_community.spark.redshift")
        .option("url", URL)
        .option("query", QUERY)
        .option("user", USER_ID)
        .option("password", PASSWORD)
        .option("tempdir", TEMP_DIR)
        .option("forward_spark_s3_credentials", "true")
        .load();
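For completeness, the write side of the job looks along these lines. This is a sketch only; the output path is a placeholder:

import org.apache.spark.sql.SaveMode;

// Write the Redshift result out as CSV; "s3a://my-output-bucket/output/" is a placeholder path.
dataset.write()
        .mode(SaveMode.Overwrite)        // assumption: overwrite any previous run's output
        .option("header", "true")        // include a header row in the CSV files
        .csv("s3a://my-output-bucket/output/");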
Below are the pom.xml dependencies:
<dependencies>
    <dependency>
        <groupId>com.eclipsesource.minimal-json</groupId>
        <artifactId>minimal-json</artifactId>
        <version>0.9.5</version>
    </dependency>
    <dependency>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-assembly-plugin</artifactId>
        <version>3.3.0</version>
    </dependency>
    <dependency>
        <groupId>org.ini4j</groupId>
        <artifactId>ini4j</artifactId>
        <version>0.5.4</version>
    </dependency>
    <dependency>
        <groupId>org.projectlombok</groupId>
        <artifactId>lombok</artifactId>
        <version>1.18.2</version>
    </dependency>
    <dependency>
        <groupId>org.slf4j</groupId>
        <artifactId>slf4j-api</artifactId>
        <version>1.7.26</version>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-avro_2.12</artifactId>
        <version>3.3.1</version>
    </dependency>
    <dependency>
        <groupId>io.github.spark-redshift-community</groupId>
        <artifactId>spark-redshift_2.12</artifactId>
        <version>4.2.0</version>
    </dependency>
    <dependency>
        <groupId>io.delta</groupId>
        <artifactId>delta-core_2.12</artifactId>
        <version>2.2.0</version>
    </dependency>
    <dependency>
        <groupId>org.scala-lang</groupId>
        <artifactId>scala-library</artifactId>
        <version>2.12.15</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-aws</artifactId>
        <version>3.3.1</version>
        <scope>provided</scope>
    </dependency>
    <dependency>
        <groupId>com.amazonaws</groupId>
        <artifactId>aws-java-sdk-s3</artifactId>
        <version>1.12.389</version>
        <scope>provided</scope>
    </dependency>
    <dependency>
        <groupId>com.amazonaws</groupId>
        <artifactId>aws-java-sdk-bundle</artifactId>
        <version>1.12.389</version>
        <scope>provided</scope>
    </dependency>
    <!-- https://mvnrepository.com/artifact/org.apache.spark/spark-hadoop-cloud -->
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-hadoop-cloud_2.12</artifactId>
        <version>3.3.1</version>
        <scope>provided</scope>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-sql_2.12</artifactId>
        <version>3.3.1</version>
        <scope>provided</scope>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_2.12</artifactId>
        <version>3.3.1</version>
        <scope>provided</scope>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-common</artifactId>
        <version>3.3.1</version>
        <scope>provided</scope>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-client</artifactId>
        <version>3.3.1</version>
        <scope>provided</scope>
    </dependency>
    <dependency>
        <groupId>junit</groupId>
        <artifactId>junit</artifactId>
        <version>3.8.1</version>
        <scope>test</scope>
    </dependency>
</dependencies>
1 Answer
I found the solution to this problem. It turned out that version 4.2.0 of the io.github.spark_redshift_community.spark.redshift driver I was using was causing the issue. When I switched to the latest version, 5.1.0, the problem was resolved and the same job completed within 10 seconds. Thanks!
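Concretely, the fix amounts to bumping the connector version in pom.xml, for example:

<dependency>
    <groupId>io.github.spark-redshift-community</groupId>
    <artifactId>spark-redshift_2.12</artifactId>
    <version>5.1.0</version>
</dependency>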