I am trying to read a Redshift table into an EMR cluster using PySpark. For now I am running the code in the pyspark shell, but eventually I want to turn it into a script that I can submit with spark-submit. I use 4 JAR files to let PySpark connect to Redshift and read data from it.
I start pyspark with: pyspark --jars minimal-json-0.9.5.jar,RedshiftJDBC4-no-awssdk-1.2.41.1065.jar,spark-avro_2.11-3.0.0.jar,spark-redshift_2.10-2.0.1.jar
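For the eventual spark-submit version, I expect the same JARs can be passed with the same flag (read_redshift.py is just a placeholder name for my script):

spark-submit --jars minimal-json-0.9.5.jar,RedshiftJDBC4-no-awssdk-1.2.41.1065.jar,spark-avro_2.11-3.0.0.jar,spark-redshift_2.10-2.0.1.jar read_redshift.py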
Then I run the following code:
key = "<key>"
secret = "<secret>"
redshift_url = "jdbc:redshift://<cluster>:<port>/<dbname>?user=<username>&password=<password>"
redshift_query = "select * from test"
redshift_temp_s3 = "s3a://{}:{}@<bucket-name>/".format(key, secret)
data = (spark.read.format("com.databricks.spark.redshift")
    .option("url", redshift_url)
    .option("query", redshift_query)
    .option("tempdir", redshift_temp_s3)
    .option("forward_spark_s3_credentials", "true")
    .load())
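For what it's worth, here is a minimal sketch of an alternative I could use, assuming the S3A credentials can be supplied through the Hadoop configuration instead of the key:secret@bucket form in the tempdir URL (secrets containing characters like / or + are known to break that URL form):

# Assumed alternative: pass the S3 credentials via the Hadoop configuration
# rather than embedding them in the tempdir URL.
sc._jsc.hadoopConfiguration().set("fs.s3a.access.key", key)
sc._jsc.hadoopConfiguration().set("fs.s3a.secret.key", secret)

data = (spark.read.format("com.databricks.spark.redshift")
    .option("url", redshift_url)
    .option("query", redshift_query)
    .option("tempdir", "s3a://<bucket-name>/")  # no credentials in the URL
    .option("forward_spark_s3_credentials", "true")
    .load())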
Error stack trace:
WARN Utils$: An error occurred while trying to read the S3 bucket lifecycle configuration
com.amazonaws.services.s3.model.AmazonS3Exception: The specified bucket is not valid. (Service: Amazon S3; Status Code: 400; Error Code: InvalidBucketName; Request ID: FS6MDX8P2MBG5T0G; S3 Extended Request ID: qH1q9y1C2EWIozr3WH2Qt7ujoBCpwLuJW6W77afE2SKrDiLOnKvhGvPC8mSWxDKmR6Dx0AlyoB4=; Proxy: null), S3 Extended Request ID: qH1q9y1C2EWIozr3WH2Qt7ujoBCpwLuJW6W77afE2SKrDiLOnKvhGvPC8mSWxDKmR6Dx0AlyoB4=
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleErrorResponse(AmazonHttpClient.java:1828)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleServiceErrorResponse(AmazonHttpClient.java:1412)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1374)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1145)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:802)
It then waits a few seconds and prints the correct output. I can also see the folder that was created in the S3 bucket. I have not enabled bucket versioning, but it did create a lifecycle configuration. I don't understand why it first shows the error and then the correct output.
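To double-check what the lifecycle warning refers to, this is a small check I can run (a sketch assuming boto3 and the same <bucket-name> placeholder):

import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")
try:
    # Returns the bucket's lifecycle rules, if any are configured
    print(s3.get_bucket_lifecycle_configuration(Bucket="<bucket-name>")["Rules"])
except ClientError as err:
    # "NoSuchLifecycleConfiguration" means no lifecycle rules exist on the bucket
    print(err.response["Error"]["Code"])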
Any help would be appreciated.