Regular expression in Python does not work for one element

6yjfywim  posted on 2023-02-26 in Python

I am trying to pick out only specific pieces of text from the following list and output the result to a dataframe:

test = [
  'bbb', 'soup test0:88', 'axx', 'xzz', 'one test4: 34',
  'COPYSUCCESSFUL: https://test.test2.nugget.com/f02/01/test1.csv',
  'COPYSUCCESSFUL: https://test.test3.nugget.com/f02/01/test3.csv',
  'one test4: 66'
]

The code I am using:

import re
import pandas as pd
import numpy as np

test = ['bbb', 'soup test0:88', 'axx', 'xzz', 'one test4: 34','COPYSUCCESSFUL: https://test.test2.nugget.com/f02/01/test1.csv',
        'COPYSUCCESSFUL: https://test.test3.nugget.com/f02/01/test3.csv', 'one test4: 66']

# regex pattern to extract the text after "COPYSUCCESSFUL:" and before "'"
pattern1 = re.compile(r"COPYSUCCESSFUL:\s*(.*?)(?=')")

# regex pattern to extract the value after "one test4:"
pattern2 = re.compile(r"one test4:\s*(\d+)")

# regex pattern to extract the value after "soup test0:"
pattern3 = re.compile(r"soup test0:\s*(\d+)")

# create empty lists to store the extracted data
copysuccessful = []
one_test4 = []
soup_test0 = []

# iterate through the list and extract the required data using regular expressions
for item in test:
    match1 = pattern1.search(item)
    match2 = pattern2.search(item)
    match3 = pattern3.search(item)
    
    if match1:
        copysuccessful.append(match1.group(1))
    else:
        copysuccessful.append(np.nan)
    if match2:
        one_test4.append(match2.group(1))
    else:
        one_test4.append(np.nan)
    if match3:
        soup_test0.append(match3.group(1))
    else:
        soup_test0.append(np.nan)

# create a dictionary to store the extracted data
data = {'COPYSUCCESSFUL': copysuccessful, 'one test4': one_test4, 'soup test0': soup_test0}

# create a pandas dataframe from the dictionary
df = pd.DataFrame(data)

# print the dataframe
print(df)

But the output I get is:

   COPYSUCCESSFUL one test4 soup test0
0             NaN       NaN        NaN
1             NaN       NaN         88
2             NaN       NaN        NaN
3             NaN       NaN        NaN
4             NaN        34        NaN
5             NaN       NaN        NaN
6             NaN       NaN        NaN
7             NaN        66        NaN

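So the COPYSUCCESSFUL column has no output. I tried a few regex testers and everything looked fine, so I don't understand why nothing shows up in that column. I expect both "https://test.test2.nugget.com/f02/01/test1.csv" and "https://test.test3.nugget.com/f02/01/test3.csv" to appear in the column.
Any help would be greatly appreciated!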

e0bqpujr1#

"The COPYSUCCESSFUL column has no output. I tried a few regex testers and everything looked fine, so I don't understand why nothing shows up in that column."
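Judging by your regex COPYSUCCESSFUL:\s*(.*?)(?=') you seem to assume that your string ends with a ' character, but it does not. When you write 'abc' in Python, you define a string whose content is abc; the quotes are only syntax, not part of the data.
The lookahead (?=') requires a ' somewhere after the captured text, and the string contains none, so the pattern never matches.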
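The point about quotes can be checked directly. The sketch below uses one item from the question; the variable names are chosen for illustration. The last step, matching against repr(item), is only a guess at why a regex tester might have reported a match: it assumes the list literal, quotes included, was pasted into the tester.

```python
import re

item = 'COPYSUCCESSFUL: https://test.test2.nugget.com/f02/01/test1.csv'

# The quotes delimit the literal in source code; they are not stored in the string.
has_quote = "'" in item

# The question's pattern needs a literal ' after the captured text,
# so it cannot match the real string content.
original = re.compile(r"COPYSUCCESSFUL:\s*(.*?)(?=')")
original_match = original.search(item)

# repr() puts the quotes back for display. Assumption: this resembles
# what a regex tester sees if the list literal is pasted in as text.
repr_match = original.search(repr(item))

print(has_quote)       # False
print(original_match)  # None
print(repr_match.group(1))
```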
Looking at your sample data, I think you can use the regex ^COPYSUCCESSFUL:\s*(.*) instead.
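As a quick check, the suggested pattern can be run against the list from the question. This is a minimal sketch; the names pattern and urls are illustrative and not part of the original code.

```python
import re

test = [
    'bbb', 'soup test0:88', 'axx', 'xzz', 'one test4: 34',
    'COPYSUCCESSFUL: https://test.test2.nugget.com/f02/01/test1.csv',
    'COPYSUCCESSFUL: https://test.test3.nugget.com/f02/01/test3.csv',
    'one test4: 66',
]

# No lookahead for a quote: capture everything after the prefix up to the end.
pattern = re.compile(r"^COPYSUCCESSFUL:\s*(.*)")

urls = []
for item in test:
    match = pattern.search(item)
    if match:
        urls.append(match.group(1))

print(urls)
```

Only the two COPYSUCCESSFUL items should contribute an entry, each holding the full URL.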

As an aside

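Since you are looking for several keywords (COPYSUCCESSFUL, one test4, and so on), a more flexible approach may be easier and faster than a long chain of if/else statements: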

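# assumes `test` and `np` are defined as in the question's code

def all_nan():
  # one NaN placeholder per input item, so every column keeps the same length
  return [np.nan] * len(test)

keywords = {
  'COPYSUCCESSFUL': all_nan(),
  'one test4':      all_nan(),
  'soup test0':     all_nan()
}
for i, item in enumerate(test):
  # str.partition splits on the first ':' only, so the URL in the value stays intact
  key, _, value = item.partition(':')
  if key in keywords:
    # overwrite the placeholder at this row instead of appending
    keywords[key][i] = value.lstrip()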
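If the keys ever need real pattern matching rather than an exact prefix before the colon, one alternative is to keep the regexes but drive them from a dictionary. This is a sketch of that different technique, not the approach above: the patterns mapping and the columns name are hypothetical, and float("nan") stands in for np.nan so the snippet runs without NumPy.

```python
import re

test = [
    'bbb', 'soup test0:88', 'axx', 'xzz', 'one test4: 34',
    'COPYSUCCESSFUL: https://test.test2.nugget.com/f02/01/test1.csv',
    'COPYSUCCESSFUL: https://test.test3.nugget.com/f02/01/test3.csv',
    'one test4: 66',
]

# Hypothetical mapping: column name -> compiled pattern with one capture group.
patterns = {
    'COPYSUCCESSFUL': re.compile(r"^COPYSUCCESSFUL:\s*(.*)"),
    'one test4': re.compile(r"one test4:\s*(\d+)"),
    'soup test0': re.compile(r"soup test0:\s*(\d+)"),
}

NAN = float("nan")  # plain Python float NaN, used here in place of np.nan

# Every item adds exactly one entry to every column, so lengths stay equal.
columns = {name: [] for name in patterns}
for item in test:
    for name, pattern in patterns.items():
        match = pattern.search(item)
        columns[name].append(match.group(1) if match else NAN)

print(columns['COPYSUCCESSFUL'][5])
```

Passing columns to pd.DataFrame, as the question does with its data dictionary, should then give the same table with the first column filled in for rows 5 and 6.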
