REGEX:从连接字符串提取表信息

dddzy1tm  于 2022-12-01  发布在  其他
关注(0)|答案(2)|浏览(131)

我正在尝试从连接字符串数据中提取架构和表信息。架构和表信息的格式为“Schema.Table”(例如下面字符串中的FROM EDWP_D2PM.SN_INC_RPTG_SCRUBBED)。连接字符串中可以存在多个方案和表,并且它们始终跟在FROM或JOIN后面(包括INNER JOIN、LEFT JOIN等)。我只想从特定的数据库中提取连接的模式和表,在所附的示例中,这是DB2。

"let
    Source = DB2.Database(""69.69.69.69"", ""bcudb"", [HierarchicalNavigation=true, Implementation=""Microsoft"", Query=""SELECT i.ASSIGNMENT_GROUP, i.BUSINESS_SERVICE, i.CATEGORY, i.CAUSED_BY, i.CLOSE_CODE, i.CLOSED_ON, i.COMPANY, i.CONTACT_TYPE, i.DESCRIPTION, i.NUMBER, i.PARENT_INCIDENT, i.PRIORITY, i.RESOLVED_ON, i.SHORT_DESCRIPTION, i.CREATED_ON, i.CAUSE_CODE, i.CLOSED, i.CREATED, i.RESOLVED, i.RESOLUTION_MET, p.problem_id FROM EDWP_D2PM.SN_INC_RPTG_SCRUBBED i LEFT OUTER JOIN EDWP_D2PM.SN_INCIDENTS_CUST_RPTG p on p.number = i.number  WHERE i.PRIORITY INLEFT OUTER JOIN EDWP_D2PM.SN_INCIDENTS_CUST_RPTG ('1 - Critical', '2 - High') AND TO_CHAR(i.created_on,'YYYY-MM') > (select to_char((CURRENT DATE - 12 MONTHS),'YYYY-MM') from SYSIBM.SYSDUMMY1) AND (i.CATEGORY <> 'Alert' OR (i.CATEGORY IS NULL)) AND i.PARENT_INCIDENT IS NULL AND i.EXCLUSIONS <> 'R' AND i.CLOSE_CODE <> 'Duplicate - No Action Taken'  with ur""]), RIGHT JOIN EDWP.TEMP  blaha blah
    #""Changed Type"" = Table.TransformColumnTypes(Source,{{""RESOLVED_ON"", type datetime}})
in
    #""Changed Type"""

我有一个REGEX表达式,它正确地返回了示例文本中的4个模式.表(link to regex 101 with working example

(?:\s+JOIN\s+)(\w+\.\w+)|(?:\s+FROM\s+)(\w+\.\w+)

我想改进REGEX,这样我就可以只获取以文本“DB2.Database”开头的字符串的模式和表。我该怎么做呢?
我尝试添加前缀:

(?:DB2.Database)(?:\s|\S)*

但这会阻止返回4Schema.Table。
有没有人能建议一个修复方法?如果你确实提供了答案,对REGEX逻辑的解释将会很感激。
一旦我有了REGEX的工作,我将运行它在Python中使用re模块。

afdcj2ne

afdcj2ne1#

您可以尝试在前缀后放置?,这将导致表达式与尽可能多的文本不匹配

(?:DB2.Database)(?:\s|\S)*?

然后下面

(?:DB2.Database)(((?:\s|\S)*?(?:\s+(JOIN|FROM)\s+)(\w+\.\w+))+)

几乎可以工作,但re module doesn't support repeated captures
作为一种解决办法,运行带有第一个匹配组的原始regex (?:\s+(FROM|JOIN)\s+)(\w+\.\w+)应该会得到所需的结果。

iyr7buue

iyr7buue2#

由@radof提出的解决方案的简单实现

import re
import pandas

text ="""let
    Source = DB2.Database(""69.69.69.69"", ""bcudb"", [HierarchicalNavigation=true, Implementation=""Microsoft"", Query=""SELECT i.ASSIGNMENT_GROUP, i.BUSINESS_SERVICE, i.CATEGORY, i.CAUSED_BY, i.CLOSE_CODE, i.CLOSED_ON, i.COMPANY, i.CONTACT_TYPE, i.DESCRIPTION, i.NUMBER, i.PARENT_INCIDENT, i.PRIORITY, i.RESOLVED_ON, i.SHORT_DESCRIPTION, i.CREATED_ON, i.CAUSE_CODE, i.CLOSED, i.CREATED, i.RESOLVED, i.RESOLUTION_MET, p.problem_id FROM EDWP_D2PM.SN_INC_RPTG_SCRUBBED i LEFT OUTER JOIN EDWP_D2PM.SN_INCIDENTS_CUST_RPTG p on p.number = i.number  WHERE i.PRIORITY INLEFT OUTER JOIN EDWP_D2PM.SN_INCIDENTS_CUST_RPTG ('1 - Critical', '2 - High') AND TO_CHAR(i.created_on,'YYYY-MM') > (select to_char((CURRENT DATE - 12 MONTHS),'YYYY-MM') from SYSIBM.SYSDUMMY1) AND (i.CATEGORY <> 'Alert' OR (i.CATEGORY IS NULL)) AND i.PARENT_INCIDENT IS NULL AND i.EXCLUSIONS <> 'R' AND i.CLOSE_CODE <> 'Duplicate - No Action Taken'  with ur""]), RIGHT JOIN EDWP.TEMP  blaha blah
    #""Changed Type"" = Table.TransformColumnTypes(Source,{{""RESOLVED_ON"", type datetime
    #filler 
    filler
    filler
    filler
    }})
in
    #""Changed Type"""

DB2_pattern = re.compile(r'(DB2.Database)(?:\s|\S)*', re.IGNORECASE)
Schema_Table = re.compile(r'(?:\s+JOIN\s+)(\w+\.\w+)|(?:\s+FROM\s+)(\w+\.\w+)', re.IGNORECASE)

DB2_matches = DB2_pattern.finditer(text)
for constring in DB2_matches:
    db2=constring.group()
    #print(db2)
    matches = Schema_Table.finditer(db2)
    for i in matches:
      tables=i.group()
      print(tables)

相关问题