Running Spark SQL in spark-shell throws an exception [Caused by: java.lang.IllegalArgumentException: Field "id" does not exist]

3gtaxfhh posted on 2021-06-27 in Hive

First, I create the dataset with a Spark SQL command in spark-shell:

spark.sql("select id ,a.userid,regexp_replace(b.tradeno,',','|') as TradeNo
,Amount ,TradeType ,TxTypeId
,regexp_replace(title,',','|') as title
,status ,tradetime ,TradeStatus
,regexp_replace(otherside,',','') as otherside
from
(
    select userid 
    from tableA
    where daykey='2018-10-30'
    group by userid
) a 
left join tableb b
on a.userid=b.userid 
where b.userid is not null")

The result is:

dataset: org.apache.spark.sql.DataFrame = [id: bigint, userid: int ... 9 more fields]
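For reference, the column can also be confirmed in the DataFrame's logical schema before writing (a minimal sketch; it assumes the dataset value bound above is still in scope in the same spark-shell session):

// Print the schema Spark derived for the query result; per the output above it
// should list "id" (bigint/long), "userid" (int) and the remaining 9 fields.
dataset.printSchema()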

Then I export the dataset to CSV with the following command:

dataset.coalesce(40).write.option("delimiter", ",").option("charset", "utf-8").csv("/binlog_test/mycsv.excel")

When the Spark job runs, the following error is thrown:

Driver stacktrace:
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1430)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1418)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1417)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1417)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:797)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:797)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:797)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1645)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1600)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1589)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:623)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1930)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1943)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1963)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply$mcV$sp(FileFormatWriter.scala:127)
... 69 more
Caused by: java.lang.IllegalArgumentException: Field "id" does not exist.
at org.apache.spark.sql.types.StructType$$anonfun$fieldIndex$1.apply(StructType.scala:290)
at org.apache.spark.sql.types.StructType$$anonfun$fieldIndex$1.apply(StructType.scala:290)
at scala.collection.MapLike$class.getOrElse(MapLike.scala:128)
at scala.collection.AbstractMap.getOrElse(Map.scala:59)
at org.apache.spark.sql.types.StructType.fieldIndex(StructType.scala:289)
at org.apache.spark.sql.hive.orc.OrcRelation$$anonfun$6.apply(OrcFileFormat.scala:308)
at org.apache.spark.sql.hive.orc.OrcRelation$$anonfun$6.apply(OrcFileFormat.scala:308)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
at org.apache.spark.sql.types.StructType.foreach(StructType.scala:96)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at org.apache.spark.sql.types.StructType.map(StructType.scala:96)
at org.apache.spark.sql.hive.orc.OrcRelation$.setRequiredColumns(OrcFileFormat.scala:308)
at org.apache.spark.sql.hive.orc.OrcFileFormat$$anonfun$buildReader$2.apply(OrcFileFormat.scala:140)
at org.apache.spark.sql.hive.orc.OrcFileFormat$$anonfun$buildReader$2.apply(OrcFileFormat.scala:129)
at org.apache.spark.sql.execution.datasources.FileFormat$$anon$1.apply(FileFormat.scala:138)
at org.apache.spark.sql.execution.datasources.FileFormat$$anon$1.apply(FileFormat.scala:122)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:168)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:109)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:126)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:325)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
However, when I run the join directly in Hive, create a new table from the join result, and then export that table with the same Spark SQL write command, everything works fine; a sketch of that workflow is shown below.
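For context, the workaround described above looks roughly like this (a minimal sketch under assumptions: the intermediate table name tableC and the output path /binlog_test/mycsv_from_hive.excel are hypothetical, and the join query is the same one passed to spark.sql above):

-- Hive side (hive CLI or beeline): materialize the join result into a new table.
CREATE TABLE tableC AS
SELECT ...;  -- the same join query used in spark.sql(...) above

// Back in spark-shell: read the pre-joined Hive table and export it the same way.
val joinedFromHive = spark.table("tableC")
joinedFromHive.coalesce(40)
  .write
  .option("delimiter", ",")
  .option("charset", "utf-8")
  .csv("/binlog_test/mycsv_from_hive.excel")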
