我是新来使用pyspark运行vertica的sql的。我已经导入了vertica\u python模块,我已经将sql转换为vertica表,我需要将查询结果转换为pyspark dataframe。我使用spark submit命令运行,但出现了错误,错误为“memoryerror”。有人能帮忙吗?
我试图在spark submit命令中增加和减少驱动器和执行器内存,但没有工作,得到了相同的错误消息。
from pyspark.sql import SQLContext
from pyspark.sql import HiveContext
sc=SparkContext(appName="build_history_pyspark")
sqlContext=SQLContext(sc)
start_load_time_UTC='2019-06-01 00:00:00'
start_load_time_UTC=datetime.strptime(start_load_time_UTC,'%Y-%m-%d %H:%M:%S')
url="jdbc:vertica://144.60.100.141:5433/SRVVERTICA" . #This is the vertica database host IP and port
properties={"user":"hhhhh","password":"yyyyy","driver": "com.vertica.jdbc.Driver"}
sql_for_kafka="SELECT STARTTIME, LONGITUDE,LATITUDE FROM my_table WHERE (VERTICA_LOAD_TIME >'%s')" %(start_load_time_UTC)
sdf_orig=sqlContext.read.format("JDBC").options(url=url, query=sql_for_kafka,**properties).load()
sdf_orig.show(10)
错误说明:memoryerror
我的命令:
spark submit--driver memory 10g--num executors 10--executor cores 4--executor memory 10g build\u history\u pyspark.py
我做了进一步的调试,我认为错误是由于变量start\u load\u time\u utc引起的。我真的需要在这里使用一个变量,而不是直接值。但不知道正确的语法。
我还将查询语句更改为:sql\u for \u kafka=“从我的\u表中选择starttime,longitude,latitude where vertica\u load\u time>$start\u load\u time\u utc”
我有不同的错误:
Py4JJavaError: An error occurred while calling o57.load.
: java.sql.SQLDataException: [Vertica][VJDBC](3679) ERROR: Invalid
input syntax for timestamp: "$start_load_time_UTC"
at com.vertica.util.ServerErrorData.buildException(Unknown Source)
at com.vertica.io.ProtocolStream.readExpectedMessage(Unknown Source)
at com.vertica.dataengine.VDataEngine.prepareImpl(Unknown Source)
at com.vertica.dataengine.VDataEngine.prepare(Unknown Source)
at com.vertica.dataengine.VDataEngine.prepare(Unknown Source)
at com.vertica.jdbc.common.SPreparedStatement.<init>(Unknown Source)
at com.vertica.jdbc.jdbc4.S4PreparedStatement.<init>(Unknown Source)
at com.vertica.jdbc.VerticaJdbc4PreparedStatementImpl.<init>(Unknown Source)
at com.vertica.jdbc.VJDBCObjectFactory.createPreparedStatement(Unknown Source)
at com.vertica.jdbc.common.SConnection.prepareStatement(Unknown Source)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:58)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation.<init>(JDBCRelation.scala:115)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:52)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:340)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:239)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:227)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:164)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:745)
原因:com.vertica.support.exceptions.dataexception:[vertica]vjdbc错误:时间戳的输入语法无效:“$start\u load\u time\u utc”。。。28个以上
我最终解决了这个问题,将query语句改为:
sql_for_kafka="(SELECT STARTTIME, LONGITUDE,LATITUDE FROM my_table WHERE (VERTICA_LOAD_TIME >'{0}') )temp" .format(start_load_time_UTC)
暂无答案!
目前还没有任何答案,快来回答吧!