Dataproc:使用PySpark从BigQuery阅读和写入数据时出错

pn9klfpd  于 2022-11-01  发布在  Spark
关注(0)|答案(1)|浏览(165)

我正在尝试读取一些BigQuery数据,(ID:my-project.mydatabase.mytable [original names protected])从一个用户管理的Jupyter Notebook示例(在Dataproc Workbench中)中创建的。我正在尝试的是从这个示例中得到的灵感,更具体地说,代码是(请阅读一些关于代码本身的附加注解):

from pyspark.sql import SparkSession
from pyspark.sql.functions import udf, col
from pyspark.sql.types import IntegerType, ArrayType, StringType
from google.cloud import bigquery

# UPDATE (2022-08-10): BQ conector added

spark = SparkSession.builder.appName('SpacyOverPySpark') \
                    .config('spark.jars.packages', 'com.google.cloud.spark:spark-bigquery-with-dependencies_2.12:0.24.2') \
                    .getOrCreate()

# ------------------ IMPORTING DATA FROM BIG QUERY --------------------------

# UPDATE (2022-08-10): This line now runs...

df = spark.read.format('bigquery').option('table', 'my-project.mydatabase.mytable').load()

# But imports the whole table, which could become expensive and not optimal

print("DataFrame shape: ", (df.count(), len(df.columns)) # 109M records & 9 columns; just need 1M records and one column: "posting"

# I tried the following, BUT with NO success:

# sql = """

# SELECT `posting`

# FROM `mentor-pilot-project.indeed.indeed-data-clean`

# LIMIT 1000000

# """

# df = spark.read.format("bigquery").load(sql)

# print("DataFrame shape: ", (df.count(), len(df.columns)))

# ------- CONTINGENCY PLAN: IMPORTING DATA FROM CLOUD STORAGE ---------------

# This section WORKS (just to enable the following sections)

# HINT: This dataframe contains 1M rows of text, under a single column: "posting"

df = spark.read.csv("gs://hidden_bucket/1M_samples.csv", header=True)

# ---------------------- EXAMPLE CUSTOM PROCESSING --------------------------

# Example Python UDF Python

def split_text(text:str) -> list:
    return text.split()

# Turning Python UDF into Spark UDF

textsplitUDF = udf(lambda z: split_text(z), ArrayType(StringType()))

# "Applying" a UDF on a Spark Dataframe (THIS WORKS OK)

df.withColumn("posting_split", textsplitUDF(col("posting")))

# ------------------ EXPORTING DATA TO BIG QUERY ----------------------------

# UPDATE (2022-08-10) The code causing the error:

# df.write.format('bigquery') \

# .option('table', 'wordcount_dataset.wordcount_output') \

# .save()

# has been replace by a code that successfully stores data in BQ:

df.write \
  .format('bigquery') \
  .option("temporaryGcsBucket", "my_temp_bucket_name") \
  .mode("overwrite") \
  .save("my-project.mynewdatabase.mytable")

当使用SQL查询从BigQuery阅读数据时,触发的错误为:

Py4JJavaError: An error occurred while calling o195.load.
: com.google.cloud.spark.bigquery.repackaged.com.google.inject.ProvisionException: Unable to provision, see the following errors:

1) Error in custom provider, java.lang.IllegalArgumentException: 'dataset' not parsed or provided.
  at com.google.cloud.spark.bigquery.SparkBigQueryConnectorModule.provideSparkBigQueryConfig(SparkBigQueryConnectorModule.java:65)
  while locating com.google.cloud.spark.bigquery.SparkBigQueryConfig

1 error
    at com.google.cloud.spark.bigquery.repackaged.com.google.inject.internal.InternalProvisionException.toProvisionException(InternalProvisionException.java:226)
    at com.google.cloud.spark.bigquery.repackaged.com.google.inject.internal.InjectorImpl$1.get(InjectorImpl.java:1097)
    at com.google.cloud.spark.bigquery.repackaged.com.google.inject.internal.InjectorImpl.getInstance(InjectorImpl.java:1131)
    at com.google.cloud.spark.bigquery.BigQueryRelationProvider.createRelationInternal(BigQueryRelationProvider.scala:75)
    at com.google.cloud.spark.bigquery.BigQueryRelationProvider.createRelation(BigQueryRelationProvider.scala:46)
    at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:332)
    at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:242)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:230)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:197)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.lang.Thread.run(Thread.java:750)
Caused by: java.lang.IllegalArgumentException: 'dataset' not parsed or provided.
    at com.google.cloud.bigquery.connector.common.BigQueryUtil.lambda$parseTableId$2(BigQueryUtil.java:153)
    at java.util.Optional.orElseThrow(Optional.java:290)
    at com.google.cloud.bigquery.connector.common.BigQueryUtil.parseTableId(BigQueryUtil.java:153)
    at com.google.cloud.spark.bigquery.SparkBigQueryConfig.from(SparkBigQueryConfig.java:237)
    at com.google.cloud.spark.bigquery.SparkBigQueryConnectorModule.provideSparkBigQueryConfig(SparkBigQueryConnectorModule.java:67)
    at com.google.cloud.spark.bigquery.SparkBigQueryConnectorModule$$FastClassByGuice$$db983008.invoke(<generated>)
    at com.google.cloud.spark.bigquery.repackaged.com.google.inject.internal.ProviderMethod$FastClassProviderMethod.doProvision(ProviderMethod.java:264)
    at com.google.cloud.spark.bigquery.repackaged.com.google.inject.internal.ProviderMethod.doProvision(ProviderMethod.java:173)
    at com.google.cloud.spark.bigquery.repackaged.com.google.inject.internal.InternalProviderInstanceBindingImpl$CyclicFactory.provision(InternalProviderInstanceBindingImpl.java:185)
    at com.google.cloud.spark.bigquery.repackaged.com.google.inject.internal.InternalProviderInstanceBindingImpl$CyclicFactory.get(InternalProviderInstanceBindingImpl.java:162)
    at com.google.cloud.spark.bigquery.repackaged.com.google.inject.internal.ProviderToInternalFactoryAdapter.get(ProviderToInternalFactoryAdapter.java:40)
    at com.google.cloud.spark.bigquery.repackaged.com.google.inject.internal.SingletonScope$1.get(SingletonScope.java:168)
    at com.google.cloud.spark.bigquery.repackaged.com.google.inject.internal.InternalFactoryToProviderAdapter.get(InternalFactoryToProviderAdapter.java:39)
    at com.google.cloud.spark.bigquery.repackaged.com.google.inject.internal.InjectorImpl$1.get(InjectorImpl.java:1094)
    ... 18 more

将数据写入BigQuery时,错误为:

Py4JJavaError: An error occurred while calling o167.save.
: java.lang.ClassNotFoundException: Failed to find data source: bigquery. Please find packages at http://spark.apache.org/third-party-projects.html

UPDATE:(2022-09-10)将数据写入BigQuery时的错误已解决,请参考上面的代码,以及下面的注解部分。

我做错了什么?

jw5wzhpr

jw5wzhpr1#

讨论中发现的要点:
1.通过spark.jars=<gcs-uri>spark.jars.packages=com.google.cloud.spark:spark-bigquery-with-dependencies_<scala-version>:<version>将BigQuery连接器添加为依赖项。
1.请以<project>.<dataset>.<table>指定正确的表名。

  1. Dataframe 写入器的默认模式为errorifexists。写入不存在的表时,数据集必须存在,表将自动创建。写入现有表时,模式需要在df.write.mode(<mode>)...save()中设置为"append""overwrite"
    1.写入BQ表时,执行以下操作之一
    a)直接写入(自0.26.0起支持)
df.write \
  .format("bigquery") \
  .option("writeMethod", "direct") \
  .save("dataset.table")

B)或间接写入

df.write \
  .format("bigquery") \
  .option("temporaryGcsBucket","some-bucket") \
  .save("dataset.table")

请参见本文档。
1.通过SQL查询阅读BigQuery时,添加强制属性viewsEnabled=truematerializationDataset=<dataset>

spark.conf.set("viewsEnabled","true")
spark.conf.set("materializationDataset","<dataset>")

sql = """
  SELECT tag, COUNT(*) c
  FROM (
    SELECT SPLIT(tags, '|') tags
    FROM `bigquery-public-data.stackoverflow.posts_questions` a
    WHERE EXTRACT(YEAR FROM creation_date)>=2014
  ), UNNEST(tags) tag
  GROUP BY 1
  ORDER BY 2 DESC
  LIMIT 10
  """
df = spark.read.format("bigquery").load(sql)
df.show()

请参见本文档。

相关问题