Schema error after altering a Hive table with PySpark

Posted by yyhrrdl8 on 2021-06-26 in Hive


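I have a Hive table test with columns id and name. I also have another Hive table, mysql, with columns id, name and city.
I want to compare the schemas of the two tables and add the columns that are missing to the Hive table test.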

hive_df= sqlContext.table("testing.test")

mysql_df= sqlContext.table("testing.mysql")

hive_df.dtypes

[('id', 'int'), ('name', 'string')]

mysql_df.dtypes

[('id', 'int'), ('name', 'string'), ('city', 'string')]

hive_dtypes=hive_df.dtypes

hive_dtypes

[('id', 'int'), ('name', 'string')]

mysql_dtypes= mysql_df.dtypes

diff = set(mysql_dtypes) ^ set(hive_dtypes)

diff

set([('city', 'string')])

for col_name, col_type in diff:
...  sqlContext.sql("ALTER TABLE testing.test ADD COLUMNS ({0} {1})".format(col_name, col_type))
...
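
Note that `^` is a symmetric difference, so it would also return columns that exist only in test, not just the ones missing from it. The comparison step above can be sketched in plain Python as a one-directional diff by column name. This is a minimal sketch, assuming dtypes lists shaped like the ones shown above; the helper names `missing_columns` and `add_columns_statement` are hypothetical and are not part of PySpark.

```python
def missing_columns(source_dtypes, target_dtypes):
    """Return (name, type) pairs found in source but not in target, in source order."""
    target_names = {name for name, _ in target_dtypes}
    return [(name, col_type) for name, col_type in source_dtypes
            if name not in target_names]


def add_columns_statement(table, columns):
    """Build a single ALTER TABLE statement that adds every given column."""
    column_list = ", ".join(
        "{0} {1}".format(name, col_type) for name, col_type in columns
    )
    return "ALTER TABLE {0} ADD COLUMNS ({1})".format(table, column_list)


# Same values as df.dtypes returned in the session above.
hive_dtypes = [('id', 'int'), ('name', 'string')]
mysql_dtypes = [('id', 'int'), ('name', 'string'), ('city', 'string')]

to_add = missing_columns(mysql_dtypes, hive_dtypes)
statement = add_columns_statement("testing.test", to_add)
print(statement)
```

If `to_add` is empty, the statement should not be run, because an empty column list is not valid DDL. Otherwise the resulting string can be passed to `sqlContext.sql(...)` exactly like the loop above does.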

After doing this, the Hive table test has the new column city added, filled with null values as expected.
Now, when I close the Spark session, open a new one and run

hive_df= sqlContext.table("testing.test")

and then

hive_df

I expect to get

DataFrame[id: int, name: string, city: string]

but I get

DataFrame[id: int, name: string]

When I describe the Hive table test, the new column is there:
hive> desc test;
OK
id int
name string
city string

Why is the schema change not reflected in the PySpark DataFrame after the corresponding Hive table has been altered?
FYI, I am using Spark 1.6.

Hive apache-spark pyspark spark-dataframe

Source: https://stackoverflow.com/questions/42983607/schema-error-after-altering-hive-table-with-pyspark

1 answer


zazmityj1#

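It looks like there is a JIRA for this issue, https://issues.apache.org/jira/browse/spark-9764, which has been fixed in Spark 2.0.
For those using Spark 1.6, try creating the table through sqlContext.
That is, first register the DataFrame as a temp table and then run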

sqlContext.sql("create table table as select * from temptable")

This way, after you alter the Hive table and recreate the Spark DataFrame, df will also have the newly added columns.
This issue was resolved with the help of @zero323.
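
The workaround described above can be sketched as a small helper. This is only a sketch under assumptions: it assumes `sqlContext` is a `HiveContext` in a PySpark 1.6 shell, the function name `recreate_table_via_sqlcontext` and the table name `testing.test_v2` are hypothetical, and it has not been checked against the asker's cluster.

```python
def recreate_table_via_sqlcontext(sql_context, df, new_table, temp_name="temptable"):
    """Register df as a temp table, then create new_table from it with CTAS."""
    # registerTempTable is the Spark 1.6 DataFrame method (deprecated in 2.0).
    df.registerTempTable(temp_name)
    sql_context.sql(
        "CREATE TABLE {0} AS SELECT * FROM {1}".format(new_table, temp_name)
    )
    # Re-read the table so the returned DataFrame is built from the new table.
    return sql_context.table(new_table)


# Usage in a PySpark 1.6 shell (table names are placeholders):
# hive_df = sqlContext.table("testing.test")
# new_df = recreate_table_via_sqlcontext(sqlContext, hive_df, "testing.test_v2")
```

The helper takes the context as a parameter instead of relying on a global, so the same code can be exercised with a stub object outside a Spark session.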

