Spark Hive query join error

Posted by xjreopfe on 2021-06-26 in Hive

I have a Cloudera cluster with the following specification:

I created a simple Spark SQL application to join Hive tables. Both tables are external tables. The data for the healtpersonalcare_reviews table is stored as JSON files, and the data for the healtpersonalcare_ratings table is stored as CSV (115 MB). Here is my code:

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

val warehouseLocation = "/hive/warehouse"
val args_list = args.toList

// Point Spark SQL at the Hive warehouse and raise the Kryo buffer limit.
val conf = new SparkConf()
  .set("spark.sql.warehouse.dir", warehouseLocation)
  .set("spark.kryoserializer.buffer.max", "1024m")

// Hive-enabled session so spark.sql() can see the external Hive tables.
val spark = SparkSession
  .builder()
  .appName("Spark Hive Example")
  .config(conf)
  .enableHiveSupport()
  .getOrCreate()

val table_view_name = args_list(0)
val limit = args_list(1)

// Register the HCatalog SerDe jar needed to read the JSON-backed table.
val df_addjar = spark.sql("ADD JAR /opt/cloudera/parcels/CDH/lib/hive-hcatalog/share/hcatalog/hive-hcatalog-core.jar")

val df_use = spark.sql("use testing")

// Left join the reviews with the ratings on reviewerid.
val df = spark.sql("SELECT hp.asin, hp.helpful, hp.overall, hp.reviewerid, hp.reviewername, hp.reviewtext, hp.reviewtime, hp.summary, hp.unixreviewtime FROM testing.healtpersonalcare_reviews hp LEFT JOIN testing.health_ratings hr ON (hp.reviewerid = hr.reviewerid)")

// Target table for the join result (created here but not populated).
val df_create_join_table = spark.sql("CREATE TABLE IF NOT EXISTS healtpersonalcare_joins (asin string, helpful array<int>, overall double, reviewerid string, reviewername string, reviewtext string, reviewtime string, summary string, unixreviewtime int)")

// collect() pulls the entire join result into the driver JVM before printing.
df.cache()
df.collect().foreach(println)

System.exit(0)
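
The DDL for the two external tables is not shown in the question. For context, a minimal sketch of how such tables could be declared is below; the LOCATION paths and the ratings column layout are assumptions, and the JSON-backed table is what the ADD JAR of hive-hcatalog-core.jar above is for (it provides the JsonSerDe).

// Hypothetical external-table DDL, issued through the same Hive-enabled session.
// Paths and the ratings schema are assumptions, not taken from the question.
spark.sql("""
  CREATE EXTERNAL TABLE IF NOT EXISTS testing.healtpersonalcare_reviews (
    asin string, helpful array<int>, overall double, reviewerid string,
    reviewername string, reviewtext string, reviewtime string,
    summary string, unixreviewtime int)
  ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
  LOCATION '/data/healtpersonalcare_reviews'
""")

spark.sql("""
  CREATE EXTERNAL TABLE IF NOT EXISTS testing.healtpersonalcare_ratings (
    reviewerid string, asin string, overall double, reviewtime string)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
  LOCATION '/data/healtpersonalcare_ratings'
""")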

I run the application with the following command:
spark-submit --class org.sia.chapter03app.App --master yarn --deploy-mode client --executor-memory 1024m --driver-memory 1024m --conf spark.driver.maxResultSize=2g --verbose /root/sparktest/original-chapter03app-0.0.1-SNAPSHOT.jar name 10
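
For reference, the two memory flags correspond to the properties spark.executor.memory and spark.driver.memory. A small sketch of the equivalent programmatic settings is below (values mirror the 1024m run); note that in yarn-client mode spark.driver.memory only takes effect when supplied via the flag or spark-defaults.conf, because the driver JVM is already running by the time the SparkConf is built.

import org.apache.spark.SparkConf

// spark-submit flags expressed as Spark properties (sketch, values from the 1024m run above).
val submitEquivalent = new SparkConf()
  .set("spark.executor.memory", "1024m")    // --executor-memory 1024m
  .set("spark.driver.memory", "1024m")      // --driver-memory 1024m (effective only via flag/spark-defaults in client mode)
  .set("spark.driver.maxResultSize", "2g")  // --conf spark.driver.maxResultSize=2g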
I tried varying the values of --executor-memory and --driver-memory:
With "--executor-memory 1024m --driver-memory 1024m" I get the error "java.lang.OutOfMemoryError: Java heap space".
With "--executor-memory 2048m --driver-memory 2048m" I get "Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded".
Has anyone encountered a problem like this? What is the solution? Thank you.
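
For context on where the memory pressure comes from: df.collect() brings every row of the join back into the driver JVM before printing, so the whole result (on top of the cached data) has to fit in driver memory and within spark.driver.maxResultSize. A minimal sketch of alternatives that keep the result on the executors, assuming the healtpersonalcare_joins table created above is the intended destination:

// Print only a sample on the driver instead of the full result.
df.show(limit.toInt, truncate = false)

// Or write the join result into the table created earlier; the work stays on the executors.
df.write.mode("append").insertInto("testing.healtpersonalcare_joins")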

No answers yet.

