Hive: using regexp to create groups and assign them to an array

Posted by pdsfdshx on 2021-06-02 in Hadoop

I have a regular expression for parsing some raw data in a table, something like: ',?([\w]*|\d*)' .

select regexp_extract(raw_line,',?([\w]*|\d*){1}',1) as field1
      , regexp_extract(raw_line,',?([\w]*|\d*){2}',1) as field2
      , ...
      , regexp_extract(raw_line,',?([\w]*|\d*){n}',1) as fieldn
 from table

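This creates groups that parse the CSV fields. The problem is that the table is large and every row is very long, so this is an expensive operation.
I wonder whether I could use the regular expression (without the group {n}) to split the line into an array. Then I could use array indexes to get each field. Something like: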

select arr_raw[0] as field1,
          ...,
        arr_raw[n] as fieldn
 from (
     split(
         select regexp_extract(raw_line, ',?([\w]*|\d*)'
     ,  ) as arr_raw -- ??
 )t

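I know how many groups there are in the table, but I'm not sure whether this is feasible, or what the correct syntax/approach would be.
Thanks.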

Answer #1 (ne5o7dgx)

with raw_sample as (
select 'field1,field2,fiend3,123,456,"http://some.domain/abc/Player.aspx?playerID=111&BrowseIds=2221,423062611,423870887,424044345,...,",THIS_IS_MY,en,20 294 998 1001,end' as raw_line
)

select  fields[0] as field0
       ,fields[1] as field1
       ,fields[2] as field2
       ,fields[3] as field3
       ,fields[4] as field4
       ,fields[5] as field5
       ,fields[6] as field6
       ,fields[7] as field7
       ,fields[8] as field8
       ,fields[9] as field9

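from   (
        -- Replace the comma that ends each field with the control character
        -- \x01, then split on \x01. A field is either a double-quoted string
        -- (which may itself contain commas) or a run of non-comma characters.
        -- This assumes raw_line never contains \x01 itself.
        select  split(regexp_replace(raw_line,'(".*?"|[^,]*),',concat('$1',unhex(1))),'\\x01')   as fields
        from    raw_sample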
        ) t
;
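
To make it easier to see what the split produces, here is a minimal sketch that runs the same expression on a much shorter line. The sample value 'a,"b,c",d', the column aliases, and the quote-stripping step in the last column are made up for illustration and are not part of the question. The result comments follow from Java regex semantics, which Hive's regexp_replace and split use.

```sql
-- Hypothetical three-field sample: the middle field is quoted and contains a comma.
with raw_sample as (
select 'a,"b,c",d' as raw_line
)

select  fields[0]                            as field0           -- a
       ,fields[1]                            as field1           -- "b,c" (quotes are kept)
       ,fields[2]                            as field2           -- d
       ,size(fields)                         as n_fields         -- 3
       ,regexp_replace(fields[1],'^"|"$','') as field1_unquoted  -- b,c

from   (
        -- Same expression as in the answer: each field's trailing comma
        -- becomes \x01, and the line is then split on \x01.
        select  split(regexp_replace(raw_line,'(".*?"|[^,]*),',concat('$1',unhex(1))),'\\x01')   as fields
        from    raw_sample
        ) t
;
```

The comma inside the quoted field is not treated as a separator, so the array has three elements rather than four. Quoted fields come back with their surrounding quotes; the last column shows one possible way to strip them if that matters for your data.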
