pyspark SparkIllegalArgumentException:Regex组计数为0,但指定的组索引为1错误

f87krz0w  于 12个月前  发布在  Spark
关注(0)|答案(1)|浏览(219)

我正在使用一些复杂的正则表达式在databricks notebook上运行一个查询。

%sql
    SELECT * from (
    SELECT
        DISTINCT
        md.asset_identity_serial AS serial_number,
        md.pigeon_processed_timestamp AS ir_date,
        ir.pagecount AS lifetime_page_count,
        ir.recordid,
        TRIM(ir.serialnum) AS supply_serial,
        CASE
            WHEN panelmessage RLIKE '\\[.+\\]' AND REGEXP_EXTRACT(panelmessage, '\\[([0-9.]+)([A-Za-z]+)?\\]', 1) IS NOT NULL
                THEN CAST(REGEXP_EXTRACT(panelmessage, '\\[([0-9.]+)([A-Za-z]+)?\\]', 1) AS DECIMAL(10,2))
            WHEN panelmessage RLIKE '^\\d+\\.\\d+[[:alpha:]]*' THEN CAST(REGEXP_EXTRACT(panelmessage, '^\\d+\\.\\d+[[:alpha:]]*') AS DECIMAL(10,2))
            WHEN panelmessage RLIKE '\\d+\\.\\d+[[:alpha:]]*$' THEN CAST(REGEXP_EXTRACT(panelmessage, '\\d+\\.\\d+[[:alpha:]]*$') AS DECIMAL(10,2))
            WHEN (NOT panelmessage RLIKE '\\[.+\\]' OR panelmessage IS NULL) AND eventname RLIKE '\\[.+\\]' AND REGEXP_EXTRACT(eventname, '\\[([0-9.]+)([A-Za-z]+)?\\]', 1) IS NOT NULL
                THEN CAST(REGEXP_EXTRACT(eventname, '\\[([0-9.]+)([A-Za-z]+)?\\]', 1) AS DECIMAL(10,2))
            WHEN (NOT panelmessage RLIKE '^\\d+\\.\\d+[[:alpha:]]*' OR panelmessage IS NULL) AND eventname RLIKE '^\\d+\\.\\d+[[:alpha:]]*' THEN CAST(REGEXP_EXTRACT(eventname, '^\\d+\\.\\d+[[:alpha:]]*') AS DECIMAL(10,2))
            WHEN (NOT panelmessage RLIKE '\\d+\\.\\d+[[:alpha:]]*$' OR panelmessage IS NULL) AND eventname RLIKE '\\d+\\.\\d+[[:alpha:]]*$' THEN CAST(REGEXP_EXTRACT(eventname, '\\d+\\.\\d+[[:alpha:]]*$') AS DECIMAL(10,2))
            ELSE NULL
        END AS error_code,
        CASE
            WHEN error_code IS NULL THEN NULL
            WHEN error_code = '000.00' THEN error_code
            ELSE REPLACE(LTRIM(REPLACE(error_code, '0', ' ')), ' ', '0')
        END AS error_code_trimmed,
        CASE
            WHEN POSITION('.' IN error_code_trimmed) > 0 THEN REGEXP_EXTRACT(error_code_trimmed, '\\d+\\.\\d+[^[:alpha:]]*', 0)
            ELSE error_code_trimmed
        END AS error_code_simplified,
        CASE
            WHEN POSITION('.' IN error_code_simplified) > 0 THEN SUBSTRING(error_code_simplified, 1, POSITION('.' IN error_code_simplified) - 1)
            ELSE error_code_simplified
        END AS yyy,
        CASE
            WHEN LENGTH(yyy) = 3 THEN SUBSTRING(yyy, 1, 2)
            ELSE yyy
        END AS yy,
        CASE
            WHEN LENGTH(yy) = 2 THEN SUBSTRING(yy, 1, 1)
            ELSE yy
        END AS y
    FROM
        qa.pigeon_silver.ael_events_supply_ir_warning AS ir
    INNER JOIN
        qa.pigeon_silver.ael_events_device_collector_metadata AS md
    ON
        md.s3_file_location = ir.s3_file_location
    WHERE
        md.pigeon_processed_timestamp >= DATEADD(DAY, -365, CURRENT_DATE)
        AND TRIM(ir.serialnum) IS NOT NULL AND TRIM(ir.serialnum) <> ''
        AND SUBSTRING(TRIM(ir.serialnum), 1, 2) IN ('CA')
        AND LENGTH(ir.serialnum) = 12
        AND (
            panelmessage RLIKE '\\[\\d+.*]|^\\d+\\.\\d+[[:alpha:]]*|\\d+\\.\\d+[[:alpha:]]*$'
            OR eventname RLIKE '\\[\\d+.*]|^\\d+\\.\\d+[[:alpha:]]*|\\d+\\.\\d+[[:alpha:]]*$'
        )
        AND (
            (LENGTH(md.asset_identity_serial) = 7
            AND SUBSTRING(md.asset_identity_serial, 1, 2) IN ('01','02','04','05','06','07','08','09','11','23','24','25','26','27','32','35','41','58','59','62','67','68','69','72','79','89','94','98','99'))
            OR (LENGTH(md.asset_identity_serial) = 13 AND SUBSTRING(md.asset_identity_serial, 1, 2) IN ('14','15','34','35','40','44','46','50','70','71','74','75'))
        )) WHERE yyy in ('32')

字符串
一旦我运行单元格,它抛出错误“SparkIllegalArgumentException:Regex组计数为0,但指定的组索引为1。”有人可以帮助我纠正这个错误吗?

chhkpiq4

chhkpiq41#

看看documentation

regexp_extract function

论点

  • str:要匹配的STRING表达式。
  • regexp:一个匹配模式的STRING表达式。
  • idx:大于或等于0的可选整数表达式,默认值为1

这意味着,如果要获得整个匹配,而不仅仅是部分匹配,则必须指定0idx值。
范例:

  • 你有REGEXP_EXTRACT(panelmessage, '^\\d+\\.\\d+[[:alpha:]]*')
  • 更改为REGEXP_EXTRACT(panelmessage, '^\\d+\\.\\d+[[:alpha:]]*', 0)

对所有其他REGEXP_EXTRACT案例执行此操作。

相关问题