python 3.x - Extract all entries as a list based on pattern matching in PySpark

Posted by gab6jxml on 2021-05-29 in Spark

I have a field named tags. It contains one or more values that start with size.
The pattern is size_ followed by digits.
For example:

+---------------------------------------------+
|                tags                         |
+---------------------------------------------+
|The size available are size_10 and size_100. |
|                                             |
|The size available are size_10               |
|The size available are size_20               |
+---------------------------------------------+

I want to extract just the values as an array, i.e.:

+---------------------------------------------+-----------+
|                tags                         |size       |
+---------------------------------------------+-----------+
|The size available are size_10 and size_100. |[10, 100]  |
|                                             |[]         |
|The size available are size_10               |[10]       |
|The size available are size_20               |[20]       |
+---------------------------------------------+-----------+

Could you help me solve this?


yzuktlbb1#

The Python equivalent of the Scala code in the other answer is:

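import pyspark.sql.functions as f

# Tokens that disappear once size_<digits> is removed are the matches; keep only their digits.
matches = f.array_except(f.split('data', ' '), f.split(f.regexp_replace('data', r'(size_\d+)', ''), ' '))
df.withColumn('values', f.split(f.regexp_replace(f.concat_ws(',', matches), '[^0-9$,]', ''), ',')).show(20, False)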
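
To make the nested expression easier to follow, here is a plain-Python sketch that mimics each step on an ordinary string. It is only an illustration of the logic, not Spark code; the function name `sizes_via_array_except` is made up for this example, and `str.split` and `re.sub` stand in for the corresponding Spark column functions.

```python
import re

def sizes_via_array_except(text):
    """Plain-Python mimic of the split / array_except pipeline (illustrative only)."""
    tokens = text.split(" ")                              # f.split('data', ' ')
    remaining = re.sub(r"size_\d+", "", text).split(" ")  # same split after removing matches
    # array_except: tokens absent from `remaining`, de-duplicated, original order kept
    matched = list(dict.fromkeys(t for t in tokens if t not in remaining))
    joined = ",".join(matched)                            # f.concat_ws(',', ...)
    digits = re.sub(r"[^0-9$,]", "", joined)              # keep only digits, '$' and ','
    return digits.split(",")                              # f.split(..., ',')

print(sizes_via_array_except("The size available are size_10 and size_100."))  # ['10', '100']
print(sizes_via_array_except("The size available are size_10"))                # ['10']
```

Note that in this mimic an input with no match returns `['']`, a list holding one empty string, rather than an empty list. Spark's `split` may well behave the same way, so if a truly empty array matters it is worth checking the result with `f.size`.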

If your dataset is not that large, you can also do it with a UDF:

import re
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, StringType

# Capture only the digits after "size_"; a null or empty input yields an empty list.
extract = udf(lambda s: re.findall(r'size_(\d+)', s) if s else [], ArrayType(StringType()))

df.withColumn('values', extract('data')).show()

Output in both cases:

+--------------------+---------+
|                data|   values|
+--------------------+---------+
|The size availabl...|[10, 100]|
|The size availabl...|     [10]|
|                    |       []|
|The size availabl...|     [20]|
|             size_10|     [10]|
+--------------------+---------+
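
The extraction logic behind the UDF can be tried out as an ordinary Python function before it is wrapped for Spark. The sketch below assumes the same `size_<digits>` pattern; the name `extract_sizes` is hypothetical and the null handling is an addition for illustration.

```python
import re

def extract_sizes(s):
    """Return the digits following each 'size_' in s; empty list for None or no match."""
    if not s:
        return []
    # With one capture group, findall returns just the captured digits.
    return re.findall(r"size_(\d+)", s)

print(extract_sizes("The size available are size_10 and size_100."))  # ['10', '100']
print(extract_sizes(""))                                              # []
```

Once it behaves as expected, it could be registered with `udf(extract_sizes, ArrayType(StringType()))` and used in `withColumn` in place of the lambda.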

xkrw2x1b2#

The same in Scala; the Python version is almost identical:

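import org.apache.spark.sql.functions._
import spark.implicits._  // already in scope in spark-shell

val df = Seq("The size available are size_10 and size_100."," ","The size available are size_10","The size available are size_20").toDF()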
df.show(false)
+--------------------------------------------+
|value                                       |
+--------------------------------------------+
|The size available are size_10 and size_100.|
|                                            |
|The size available are size_10              |
|The size available are size_20              |
+--------------------------------------------+

df.select('value,split(regexp_replace('value, "(?:size_?)[^\\s]+","")," ").as("a"),split('value," ").as("b"))
  .select('value,split(regexp_replace(concat_ws(",",array_except('b,'a)),"[^0-9$,]",""),",").as("size"))
  .show(false)

+--------------------------------------------+---------+
|value                                       |size     |
+--------------------------------------------+---------+
|The size available are size_10 and size_100.|[10, 100]|
|                                            |[]       |
|The size available are size_10              |[10]     |
|The size available are size_20              |[20]     |
+--------------------------------------------+---------+
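
The Scala snippet and the Python snippet use slightly different patterns, which matters on messier input. The following sketch compares the two with Python's `re` module; the sample sentence is invented for illustration, and it assumes Java and Python regexes treat these simple constructs the same way.

```python
import re

text = "sizes vary: size_10 and size_100."

# Pattern from the Scala snippet: "size", an optional "_", then any non-space run.
loose = re.findall(r"(?:size_?)[^\s]+", text)
# Pattern from the Python snippet: "size_" followed by digits only.
strict = re.findall(r"size_\d+", text)

print(loose)   # ['sizes', 'size_10', 'size_100.']
print(strict)  # ['size_10', 'size_100']
```

With the looser pattern, a word such as "sizes" is also treated as a match, so after the non-digit characters are stripped it would likely contribute an empty string to the resulting array.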
