我正在使用一些复杂的正则表达式在databricks notebook上运行一个查询。
%sql
SELECT * from (
SELECT
DISTINCT
md.asset_identity_serial AS serial_number,
md.pigeon_processed_timestamp AS ir_date,
ir.pagecount AS lifetime_page_count,
ir.recordid,
TRIM(ir.serialnum) AS supply_serial,
CASE
WHEN panelmessage RLIKE '\\[.+\\]' AND REGEXP_EXTRACT(panelmessage, '\\[([0-9.]+)([A-Za-z]+)?\\]', 1) IS NOT NULL
THEN CAST(REGEXP_EXTRACT(panelmessage, '\\[([0-9.]+)([A-Za-z]+)?\\]', 1) AS DECIMAL(10,2))
WHEN panelmessage RLIKE '^\\d+\\.\\d+[[:alpha:]]*' THEN CAST(REGEXP_EXTRACT(panelmessage, '^\\d+\\.\\d+[[:alpha:]]*') AS DECIMAL(10,2))
WHEN panelmessage RLIKE '\\d+\\.\\d+[[:alpha:]]*$' THEN CAST(REGEXP_EXTRACT(panelmessage, '\\d+\\.\\d+[[:alpha:]]*$') AS DECIMAL(10,2))
WHEN (NOT panelmessage RLIKE '\\[.+\\]' OR panelmessage IS NULL) AND eventname RLIKE '\\[.+\\]' AND REGEXP_EXTRACT(eventname, '\\[([0-9.]+)([A-Za-z]+)?\\]', 1) IS NOT NULL
THEN CAST(REGEXP_EXTRACT(eventname, '\\[([0-9.]+)([A-Za-z]+)?\\]', 1) AS DECIMAL(10,2))
WHEN (NOT panelmessage RLIKE '^\\d+\\.\\d+[[:alpha:]]*' OR panelmessage IS NULL) AND eventname RLIKE '^\\d+\\.\\d+[[:alpha:]]*' THEN CAST(REGEXP_EXTRACT(eventname, '^\\d+\\.\\d+[[:alpha:]]*') AS DECIMAL(10,2))
WHEN (NOT panelmessage RLIKE '\\d+\\.\\d+[[:alpha:]]*$' OR panelmessage IS NULL) AND eventname RLIKE '\\d+\\.\\d+[[:alpha:]]*$' THEN CAST(REGEXP_EXTRACT(eventname, '\\d+\\.\\d+[[:alpha:]]*$') AS DECIMAL(10,2))
ELSE NULL
END AS error_code,
CASE
WHEN error_code IS NULL THEN NULL
WHEN error_code = '000.00' THEN error_code
ELSE REPLACE(LTRIM(REPLACE(error_code, '0', ' ')), ' ', '0')
END AS error_code_trimmed,
CASE
WHEN POSITION('.' IN error_code_trimmed) > 0 THEN REGEXP_EXTRACT(error_code_trimmed, '\\d+\\.\\d+[^[:alpha:]]*', 0)
ELSE error_code_trimmed
END AS error_code_simplified,
CASE
WHEN POSITION('.' IN error_code_simplified) > 0 THEN SUBSTRING(error_code_simplified, 1, POSITION('.' IN error_code_simplified) - 1)
ELSE error_code_simplified
END AS yyy,
CASE
WHEN LENGTH(yyy) = 3 THEN SUBSTRING(yyy, 1, 2)
ELSE yyy
END AS yy,
CASE
WHEN LENGTH(yy) = 2 THEN SUBSTRING(yy, 1, 1)
ELSE yy
END AS y
FROM
qa.pigeon_silver.ael_events_supply_ir_warning AS ir
INNER JOIN
qa.pigeon_silver.ael_events_device_collector_metadata AS md
ON
md.s3_file_location = ir.s3_file_location
WHERE
md.pigeon_processed_timestamp >= DATEADD(DAY, -365, CURRENT_DATE)
AND TRIM(ir.serialnum) IS NOT NULL AND TRIM(ir.serialnum) <> ''
AND SUBSTRING(TRIM(ir.serialnum), 1, 2) IN ('CA')
AND LENGTH(ir.serialnum) = 12
AND (
panelmessage RLIKE '\\[\\d+.*]|^\\d+\\.\\d+[[:alpha:]]*|\\d+\\.\\d+[[:alpha:]]*$'
OR eventname RLIKE '\\[\\d+.*]|^\\d+\\.\\d+[[:alpha:]]*|\\d+\\.\\d+[[:alpha:]]*$'
)
AND (
(LENGTH(md.asset_identity_serial) = 7
AND SUBSTRING(md.asset_identity_serial, 1, 2) IN ('01','02','04','05','06','07','08','09','11','23','24','25','26','27','32','35','41','58','59','62','67','68','69','72','79','89','94','98','99'))
OR (LENGTH(md.asset_identity_serial) = 13 AND SUBSTRING(md.asset_identity_serial, 1, 2) IN ('14','15','34','35','40','44','46','50','70','71','74','75'))
)) WHERE yyy in ('32')
字符串
一旦我运行单元格,它抛出错误“SparkIllegalArgumentException:Regex组计数为0,但指定的组索引为1。”有人可以帮助我纠正这个错误吗?
1条答案
按热度按时间chhkpiq41#
看看documentation:
regexp_extract function
论点
str
:要匹配的STRING表达式。regexp
:一个匹配模式的STRING表达式。idx
:大于或等于0的可选整数表达式,默认值为1。这意味着,如果要获得整个匹配,而不仅仅是部分匹配,则必须指定
0
idx
值。范例:
REGEXP_EXTRACT(panelmessage, '^\\d+\\.\\d+[[:alpha:]]*')
REGEXP_EXTRACT(panelmessage, '^\\d+\\.\\d+[[:alpha:]]*', 0)
对所有其他
REGEXP_EXTRACT
案例执行此操作。