Delimiter problem when reading from an external partitioned Hive table

qvk1mo1f · posted 2021-06-28 · in Hive

I have an external partitioned Hive table whose underlying files are stored with ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'. Reading the data directly through Hive works fine, but when I use Spark's DataFrame API, the '|' delimiter is not taken into account.
Create the external partitioned table:

hive> create external table external_delimited_table(value1 string, value2 string)
partitioned by (year string, month string, day string) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
location '/client/edb/poc_database/external_delimited_table';

Create a data file containing a single row and place it in the external table's partition location:

shell>echo "one|two" >> table_data.csv
shell>hadoop fs -mkdir -p /client/edb/poc_database/external_delimited_table/year=2016/month=08/day=20
shell>hadoop fs -copyFromLocal table_data.csv /client/edb/poc_database/external_delimited_table/year=2016/month=08/day=20

Add the partition:

hive> alter table external_delimited_table add partition (year='2016',month='08',day='20');

Sanity check:

hive> select * from external_delimited_table;
+----------------------------------+----------------------------------+--------------------------------+---------------------------------+-------------------------------+--+
| external_delimited_table.value1  | external_delimited_table.value2  | external_delimited_table.year  | external_delimited_table.month  | external_delimited_table.day  |
+----------------------------------+----------------------------------+--------------------------------+---------------------------------+-------------------------------+--+
| one                              | two                              | 2016                           | 08                              | 20                            |
+----------------------------------+----------------------------------+--------------------------------+---------------------------------+-------------------------------+--+

Spark code:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.{SparkContext, SparkConf}

object TestHiveContext {

  def main(args: Array[String]): Unit = {

    val conf = new SparkConf().setAppName("Test Hive Context")

    // Spark 1.x style: a SparkContext plus a HiveContext for metastore access
    val spark = new SparkContext(conf)
    val hiveContext = new HiveContext(spark)

    val dataFrame: DataFrame = hiveContext.sql("SELECT * FROM external_delimited_table")
    dataFrame.show()

    spark.stop()
  }
}

Output of dataFrame.show() — the entire input line "one|two" lands in value1 and value2 comes back null:

+-------+------+----+-----+---+
| value1|value2|year|month|day|
+-------+------+----+-----+---+
|one|two|  null|2016|   08| 20|
+-------+------+----+-----+---+

Answer 1 (by 9rnv2umw):

This is an issue in Spark version 1.5.0. In version 1.6.0 the problem no longer occurs:

scala> sqlContext.sql("select * from external_delimited_table")
res2: org.apache.spark.sql.DataFrame = [value1: string, value2: string, year: string, month: string, day: string]

scala> res2.show
+------+------+----+-----+---+
|value1|value2|year|month|day|
+------+------+----+-----+---+
|   one|   two|2016|   08| 20|
+------+------+----+-----+---+
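
If upgrading is not an option, one possible workaround (a minimal sketch of my own, not part of the original answer) is to bypass the Hive read path and parse the partition files manually, reusing the SparkContext (spark) and HiveContext from the code above; the schema and path are copied from the table definition earlier in the question:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Schema for the two delimited columns of the table.
val schema = StructType(Seq(
  StructField("value1", StringType),
  StructField("value2", StringType)
))

// Read the partition directory as plain text and split on '|'
// (escaped because '|' is a regex metacharacter). Note that this
// bypasses the metastore, so the partition columns (year, month, day)
// are not populated; they would have to be parsed from the file path.
val rows = spark
  .textFile("/client/edb/poc_database/external_delimited_table/year=2016/month=08/day=20")
  .map(_.split("\\|"))
  .map(fields => Row(fields(0), fields(1)))

hiveContext.createDataFrame(rows, schema).show()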

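As an aside for readers on current Spark versions: HiveContext was deprecated in Spark 2.0. A minimal sketch of the equivalent read through SparkSession (assuming Spark was built with Hive support and hive-site.xml is on the classpath):

import org.apache.spark.sql.SparkSession

// Spark 2.x+ replacement for the SparkContext + HiveContext pair above.
val session = SparkSession.builder()
  .appName("Test Hive Context")
  .enableHiveSupport()
  .getOrCreate()

session.sql("SELECT * FROM external_delimited_table").show()

session.stop()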