I have a large DataFrame (around 1.2 GB) with the following structure:
+---------+--------------+------------------------------------------------------------------------------------------------------+
| country | date_data | text |
+---------+--------------+------------------------------------------------------------------------------------------------------+
| "EEUU" | "2016-10-03" | "T_D: QQWE\nT_NAME: name_1\nT_IN: ind_1\nT_C: c1ws12\nT_ADD: Sec_1_P\n ...........\nT_R: 45ee" |
| "EEUU" | "2016-10-03" | "T_D: QQAA\nT_NAME: name_2\nT_IN: ind_2\nT_C: c1ws12\nT_ADD: Sec_1_P\n ...........\nT_R: 46ee" |
| . | . | . |
| . | . | . |
| "EEUU" | "2016-10-03" | "T_D: QQWE\nT_NAME: name_300000\nT_IN: ind_65\nT_C: c1ws12\nT_ADD: Sec_1_P\n ...........\nT_R: 47aa" |
+---------+--------------+------------------------------------------------------------------------------------------------------+
There are 300,000 rows, and the "text" field is a string of roughly 5,000 characters.
I want to split the "text" field into new fields:
+---------+------------+------+-------------+--------+--------+---------+--------+------+
| country | date_data | t_d | t_name | t_in | t_c | t_add | ...... | t_r |
+---------+------------+------+-------------+--------+--------+---------+--------+------+
| EEUU | 2016-10-03 | QQWE | name_1 | ind_1 | c1ws12 | Sec_1_P | ...... | 45ee |
| EEUU    | 2016-10-03 | QQAA | name_2      | ind_2  | c1ws12 | Sec_1_P | ...... | 46ee |
| . | . | . | . | . | . | . | . | |
| . | . | . | . | . | . | . | . | |
| . | . | . | . | . | . | . | . | |
| EEUU | 2016-10-03 | QQWE | name_300000 | ind_65 | c1ws12 | Sec_1_P | ...... | 47aa |
+---------+------------+------+-------------+--------+--------+---------+--------+------+
Currently I am solving this with regular expressions. First, I write the regexes and create a function that extracts a single field from the text (90 regexes in total):
val D_text = "((?<=T_D: ).*?(?=\\\\n))".r
val NAME_text = "((?<=nT_NAME: ).*?(?=\\\\n))".r
val IN_text = "((?<=T_IN: ).*?(?=\\\\n))".r
val C_text = "((?<=T_C: ).*?(?=\\\\n))".r
val ADD_text = "((?<=T_ADD: ).*?(?=\\\\n))".r
// ... (one val per remaining field; 90 regexes in total) ...
val R_text = "((?<=T_R: ).*?(?=\\\\n))".r
import org.apache.spark.sql.functions.udf

// UDF: returns the first match of the given regex in the text, or "NULL" when there is none
def getFirst(pattern: scala.util.matching.Regex) = udf(
  (text: String) => pattern.findFirstIn(text) match {
    case Some(matched) => matched
    case None          => "NULL"
  }
)
Then I create a new DataFrame (tbl_separate_fields) as the result of applying these regex functions to extract each new field from the text:
val tbl_separate_fields = hiveDF.select(
  hiveDF("country"),
  hiveDF("date_data"),
  getFirst(D_text)(hiveDF("text")).alias("t_d"),
  getFirst(NAME_text)(hiveDF("text")).alias("t_name"),
  getFirst(IN_text)(hiveDF("text")).alias("t_in"),
  getFirst(C_text)(hiveDF("text")).alias("t_c"),
  getFirst(ADD_text)(hiveDF("text")).alias("t_add"),
  // ... (one column per remaining regex) ...
  getFirst(R_text)(hiveDF("text")).alias("t_r")
)
Finally, I insert this DataFrame into a Hive table:
tbl_separate_fields.registerTempTable("tbl_separate_fields")
hiveContext.sql("INSERT INTO TABLE TABLE_INSERT PARTITION (date_data) SELECT * FROM tbl_separate_fields")
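One note on this step: inserting into a partition column without giving it an explicit value relies on Hive's dynamic partitioning, which is often disabled by default. A minimal sketch of the settings usually needed, assuming the cluster has not already enabled them (these are standard Hive properties, not taken from the question):

// Dynamic-partition inserts generally require these Hive settings;
// without them the INSERT ... PARTITION (date_data) statement may fail.
hiveContext.sql("SET hive.exec.dynamic.partition = true")
hiveContext.sql("SET hive.exec.dynamic.partition.mode = nonstrict")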
This solution takes about an hour to run over the entire DataFrame, so I would like to optimize it and reduce the execution time. Is there any way to do that?
We are using Hadoop 2.7.1 and Apache Spark 1.5.1. Spark is configured as:
val conf = new SparkConf().set("spark.storage.memoryFraction", "0.1")
val sc = new SparkContext(conf)
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
Thanks in advance.
Edit — more representative sample data (note that the parts do not always appear in the same order, and some rows are missing parts):
+---------+--------------+------------------------------------------------------------------------------------------------------+
| country | date_data | text |
+---------+--------------+------------------------------------------------------------------------------------------------------+
| "EEUU" | "2016-10-03" | "T_D: QQWE\nT_NAME: name_1\nT_IN: ind_1\nT_C: c1ws12\nT_ADD: Sec_1_P\n ...........\nT_R: 45ee" |
| "EEUU" | "2016-10-03" | "T_NAME: name_2\nT_D: QQAA\nT_IN: ind_2\nT_C: c1ws12 ...........\nT_R: 46ee" |
| . | . | . |
| . | . | . |
| "EEUU" | "2016-10-03" | "T_NAME: name_300000\nT_ADD: Sec_1_P\nT_IN: ind_65\nT_C: c1ws12\n ...........\nT_R: 47aa" |
+---------+--------------+------------------------------------------------------------------------------------------------------+
1 Answer
Using regular expressions in this case is both slow and fragile.
If you know that all records have the same structure, i.e. all "text" values contain the same parts in the same number and order, the following approach works (for any number of columns), relying mainly on the split function in org.apache.spark.sql.functions; a sketch follows below.
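The answer's original code block was not captured on this page, so here is a minimal sketch of the split-based idea under explicit assumptions: hiveDF and the "text" column are taken from the question, the parts are separated by the literal two-character sequence \n (which is what the question's \\\\n regex patterns match), and the (name, index) pairs are illustrative, not the real 90-field list.

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, split}

// Split the text once into an array of "KEY: value" parts.
// Assumption: parts are separated by the literal characters '\' and 'n',
// hence the pattern "\\\\n"; use "\n" instead if the rows contain real newlines.
val withParts = hiveDF.withColumn("parts", split(col("text"), "\\\\n"))

// Illustrative (column name, position) pairs; extend to all 90 fields.
val fields: Seq[(String, Int)] = Seq(
  "t_d" -> 0, "t_name" -> 1, "t_in" -> 2, "t_c" -> 3, "t_add" -> 4 // ..., "t_r" -> 89
)

// For each field, take its part by position and keep the text after ": ".
val tbl_separate_fields: DataFrame = fields.foldLeft(withParts) {
  case (df, (name, idx)) =>
    df.withColumn(name, split(col("parts").getItem(idx), ": ").getItem(1))
}.drop("parts").drop("text")

This replaces 90 regex UDF invocations per row with a single split plus cheap array lookups, which is where the speedup comes from.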
Alternatively, if that assumption does not hold, you can use a UDF that converts the text column into a Map, and then perform a similar reduceLeft over a hard-coded list of the desired columns; a sketch of this variant follows as well.
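As before, this is a hedged reconstruction rather than the answer's actual code: asMap, text_map, and wanted are invented names, the separator assumption is the same literal \n, and foldLeft stands in for the reduceLeft the answer mentions, since it plays the same accumulating role here.

import org.apache.spark.sql.functions.{col, udf}

// Parse "K1: v1\nK2: v2\n..." into a Map, tolerating missing or reordered parts.
// Same assumption as above: the literal separator is the two characters "\n".
val asMap = udf { (text: String) =>
  Option(text).getOrElse("").split("""\\n""").flatMap { part =>
    part.split(": ", 2) match {
      case Array(k, v) => Some(k.trim -> v.trim)
      case _           => None // skip malformed parts
    }
  }.toMap
}

// Hard-coded keys to extract; only a few shown here, extend to all 90.
val wanted = Seq("T_D", "T_NAME", "T_IN", "T_C", "T_ADD" /* ..., "T_R" */)

val withMap = hiveDF.withColumn("text_map", asMap(col("text")))

// One column per key; a key absent from a row simply yields null there.
val result = wanted.foldLeft(withMap) { (df, key) =>
  df.withColumn(key.toLowerCase, col("text_map").getItem(key))
}.drop("text_map").drop("text")

Because a missing key produces null rather than an error, this variant handles the reordered and incomplete rows shown in the question's edit.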