I am trying to select only specific pieces of text from the following list and write the result to a DataFrame:

test = [
  'bbb', 'soup test0:88', 'axx', 'xzz', 'one test4: 34',
  'COPYSUCCESSFUL: https://test.test2.nugget.com/f02/01/test1.csv',
  'COPYSUCCESSFUL: https://test.test3.nugget.com/f02/01/test3.csv',
  'one test4: 66'
]

The code I am using:

import re
import pandas as pd
import numpy as np

test = ['bbb', 'soup test0:88', 'axx', 'xzz', 'one test4: 34','COPYSUCCESSFUL: https://test.test2.nugget.com/f02/01/test1.csv',
        'COPYSUCCESSFUL: https://test.test3.nugget.com/f02/01/test3.csv', 'one test4: 66']

# regex pattern to extract the text after "COPYSUCCESSFUL:" and before "'"
pattern1 = re.compile(r"COPYSUCCESSFUL:\s*(.*?)(?=')")

# regex pattern to extract the value after "one test4:"
pattern2 = re.compile(r"one test4:\s*(\d+)")

# regex pattern to extract the value after "soup test0:"
pattern3 = re.compile(r"soup test0:\s*(\d+)")

# create empty lists to store the extracted data
copysuccessful = []
one_test4 = []
soup_test0 = []

# iterate through the list and extract the required data using regular expressions
for item in test:
    match1 = pattern1.search(item)
    match2 = pattern2.search(item)
    match3 = pattern3.search(item)
    
    if match1:
        copysuccessful.append(match1.group(1))
    else:
        copysuccessful.append(np.nan)
    if match2:
        one_test4.append(match2.group(1))
    else:
        one_test4.append(np.nan)
    if match3:
        soup_test0.append(match3.group(1))
    else:
        soup_test0.append(np.nan)

# create a dictionary to store the extracted data
data = {'COPYSUCCESSFUL': copysuccessful, 'one test4': one_test4, 'soup test0': soup_test0}

# create a pandas dataframe from the dictionary
df = pd.DataFrame(data)

# print the dataframe
print(df)

But the output I get is:

   COPYSUCCESSFUL one test4 soup test0
0             NaN       NaN        NaN
1             NaN       NaN         88
2             NaN       NaN        NaN
3             NaN       NaN        NaN
4             NaN        34        NaN
5             NaN       NaN        NaN
6             NaN       NaN        NaN
7             NaN        66        NaN

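So the COPYSUCCESSFUL column is empty. I tried a few regex testers and everything looked fine, so I don't understand why nothing shows up in that column. I expected both "https://test.test2.nugget.com/f02/01/test1.csv" and "https://test.test3.nugget.com/f02/01/test3.csv" to appear in it.
Any help is very welcome!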

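# This snippet appears to belong to the answer's "By the way" note below.
# It reuses np and test from the code above.
def all_nan():
    # One NaN placeholder per input item.
    return [np.nan] * len(test)

keywords = {
    'COPYSUCCESSFUL': all_nan(),
    'one test4': all_nan(),
    'soup test0': all_nan(),
}

for i, item in enumerate(test):
    # Split at the first colon only, so the colon in "https:" is left alone.
    key, _, value = item.partition(':')
    if key in keywords:
        keywords[key][i] = value.lstrip()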

1 answer

e0bqpujr1#

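"The COPYSUCCESSFUL column is empty. I tried a few regex testers and everything looked fine, so I don't understand why nothing shows up in that column."
Judging by your regex COPYSUCCESSFUL:\s*(.*?)(?='), you seem to expect each string to end with a ' character, but it does not. When you write 'abc' in Python, you define a string whose content is abc; the quotes are syntax, not part of the data.
The regex requires a ' and the string does not contain one, so nothing ever matches.
Looking at the sample data, I think you can use the regex ^COPYSUCCESSFUL:\s*(.*) instead.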
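The point about quotes can be checked directly. The following is only a minimal sketch, not part of the original answer: `item` is one string copied from the question's list, and the names `with_lookahead` and `anchored` are made up for the illustration.

```python
import re

# One entry from the question's list. The surrounding quotes are Python
# syntax and are not part of the string's contents.
item = 'COPYSUCCESSFUL: https://test.test2.nugget.com/f02/01/test1.csv'

# Pattern from the question: the lookahead (?=') needs a literal ' after the match.
with_lookahead = re.compile(r"COPYSUCCESSFUL:\s*(.*?)(?=')")
# Pattern suggested in the answer: anchor at the start and capture the rest.
anchored = re.compile(r"^COPYSUCCESSFUL:\s*(.*)")

print("'" in item)                     # False
print(with_lookahead.search(item))     # None
print(anchored.search(item).group(1))  # https://test.test2.nugget.com/f02/01/test1.csv

# repr() puts the quotes back into the text, so the lookahead pattern matches.
print(with_lookahead.search(repr(item)).group(1))
```

The last line hints at one possible reason a regex tester looked fine: if the list was pasted into the tester together with its quotes, the ' was present there even though it is absent from the actual string.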

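By the way

Since you are looking for several keywords, such as COPYSUCCESSFUL, one test4 and so on, a more flexible approach may be easier and faster than a long run of if/else statements, for example splitting each item at its first colon and looking the key up in a dict.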
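One possible shape for such a table-driven approach is sketched below, using only the standard library. This is an illustration under assumptions rather than the answerer's own code: the `patterns` and `columns` names are made up, the patterns assume each keyword sits at the start of an item, and `math.nan` stands in for `np.nan`.

```python
import math
import re

test = [
    'bbb', 'soup test0:88', 'axx', 'xzz', 'one test4: 34',
    'COPYSUCCESSFUL: https://test.test2.nugget.com/f02/01/test1.csv',
    'COPYSUCCESSFUL: https://test.test3.nugget.com/f02/01/test3.csv',
    'one test4: 66',
]

# Hypothetical pattern table: column name -> regex with one capture group.
patterns = {
    'COPYSUCCESSFUL': re.compile(r'^COPYSUCCESSFUL:\s*(.*)'),
    'one test4': re.compile(r'^one test4:\s*(\d+)'),
    'soup test0': re.compile(r'^soup test0:\s*(\d+)'),
}

# Start every column as NaN, one slot per input item.
columns = {name: [math.nan] * len(test) for name in patterns}

for i, item in enumerate(test):
    for name, pattern in patterns.items():
        match = pattern.search(item)
        if match:
            columns[name][i] = match.group(1)
            break  # in this data an item matches at most one keyword

print(columns['COPYSUCCESSFUL'][5])
```

Adding a keyword then means adding one entry to `patterns` instead of another if/else pair, and passing `columns` to `pd.DataFrame` should give the table the question asks for.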

2023-02-26
