I have data in the following format and I am trying to extract the id part from it:

    {"memberurn"=urn:li:member:10000012}

This is my code:

    CAST(regexp_extract(key.memberurn, 'urn:li:member:(\\d+)', 1) AS BIGINT) AS member_id

In the output, member_id is null. What am I doing wrong?
kxkpmulp1#
Try this:

    from pyspark.sql import SparkSession
    import pyspark.sql.functions as F
    from pyspark.sql.types import LongType

    spark = SparkSession.builder \
        .appName('practice') \
        .getOrCreate()
    sc = spark.sparkContext

    # One-row DataFrame holding the raw string from the question
    df = sc.parallelize([[""" {"memberurn"=urn:li:member:10000012}"""]]).toDF(["a"])
    df.show(truncate=False)
    # +-------------------------------------+
    # |a                                    |
    # +-------------------------------------+
    # | {"memberurn"=urn:li:member:10000012}|
    # +-------------------------------------+

    # Group 2 of the pattern is the numeric id; the raw string keeps \d intact
    df1 = df.withColumn("id", F.regexp_extract(F.col('a'), r'(urn:li:member:)(\d+)', 2))
    # Cast the extracted string to a long
    df2 = df1.withColumn("id", df1["id"].cast(LongType()))
    df2.show(truncate=False)
    # +-------------------------------------+--------+
    # |a                                    |id      |
    # +-------------------------------------+--------+
    # | {"memberurn"=urn:li:member:10000012}|10000012|
    # +-------------------------------------+--------+

    df2.printSchema()
    # root
    #  |-- a: string (nullable = true)
    #  |-- id: long (nullable = true)
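To rule out the pattern itself, it can be checked outside Spark. The following is a minimal sketch using only Python's standard `re` module; the helper name `extract_member_id` is hypothetical and not part of any Spark API. It also illustrates one possible cause of the null, assuming the backslash is lost somewhere in string escaping: the pattern then becomes `(d+)`, which no longer matches. In Spark, `regexp_extract` returns an empty string when nothing matches, and casting an empty string to BIGINT typically yields null.

```python
import re
from typing import Optional

# Same pattern as in the question; the raw string keeps \d intact.
MEMBER_URN = re.compile(r"urn:li:member:(\d+)")


def extract_member_id(value: str) -> Optional[int]:
    """Hypothetical helper: return the numeric member id, or None if absent."""
    match = MEMBER_URN.search(value)
    return int(match.group(1)) if match else None


sample = '{"memberurn"=urn:li:member:10000012}'

# The intended pattern matches the sample string.
print(extract_member_id(sample))

# Assumption for illustration: if the backslash is dropped, the pattern
# looks for literal 'd' characters and finds no match.
print(re.search("urn:li:member:(d+)", sample))
```

If the first call returns the id while the Spark query still produces null, the escaping of `\\d` in the surrounding query string, or the `key.memberurn` column reference, is worth checking next.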
velaa5lx2#
In Scala:

    import org.apache.spark.sql.functions.{expr, lit, substring_index}
    import spark.implicits._  // enables the $"column" syntax

    val df = spark.range(1).withColumn("memberurn", lit("urn:li:member:10000012"))

Using a regular expression:

    df.withColumn("member_id", expr("""CAST(regexp_extract(memberurn, 'urn:li:member:(\\d+)', 1) AS BIGINT)"""))
      .show(false)
    /**
     * +---+----------------------+---------+
     * |id |memberurn             |member_id|
     * +---+----------------------+---------+
     * |0  |urn:li:member:10000012|10000012 |
     * +---+----------------------+---------+
     */

Using substring_index:

    df.withColumn("member_id", substring_index($"memberurn", ":", -1).cast("bigint"))
      .show(false)
    /**
     * +---+----------------------+---------+
     * |id |memberurn             |member_id|
     * +---+----------------------+---------+
     * |0  |urn:li:member:10000012|10000012 |
     * +---+----------------------+---------+
     */
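The substring approach takes everything after the last `:`, so it only works when the column holds the bare URN. A minimal plain-Python sketch of that behaviour follows; `last_segment` is a hypothetical stand-in for `substring_index(value, ":", -1)`, not a Spark function.

```python
def last_segment(value: str, delim: str = ":") -> str:
    """Hypothetical stand-in for substring_index(value, delim, -1):
    return the text after the last occurrence of delim."""
    return value.rsplit(delim, 1)[-1]


# Bare URN: the last segment is purely numeric and can be cast.
print(last_segment("urn:li:member:10000012"))  # 10000012

# Raw string from the question: the trailing brace is kept,
# so a numeric cast of this value would not succeed.
print(last_segment('{"memberurn"=urn:li:member:10000012}'))  # 10000012}
```

If the column may contain the surrounding braces, as in the question's sample, the regular-expression variant is the safer choice.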