I am trying to select only specific chunks of text from the list below and output the results to a DataFrame:
test = [
'bbb', 'soup test0:88', 'axx', 'xzz', 'one test4: 34',
'COPYSUCCESSFUL: https://test.test2.nugget.com/f02/01/test1.csv',
'COPYSUCCESSFUL: https://test.test3.nugget.com/f02/01/test3.csv',
'one test4: 66'
]
The code I am using:
import re
import pandas as pd
import numpy as np
test = ['bbb', 'soup test0:88', 'axx', 'xzz', 'one test4: 34','COPYSUCCESSFUL: https://test.test2.nugget.com/f02/01/test1.csv',
'COPYSUCCESSFUL: https://test.test3.nugget.com/f02/01/test3.csv', 'one test4: 66']
# regex pattern to extract the text after "COPYSUCCESSFUL:" and before "'"
pattern1 = re.compile(r"COPYSUCCESSFUL:\s*(.*?)(?=')")
# regex pattern to extract the value after "one test4:"
pattern2 = re.compile(r"one test4:\s*(\d+)")
# regex pattern to extract the value after "soup test0:"
pattern3 = re.compile(r"soup test0:\s*(\d+)")
# create empty lists to store the extracted data
copysuccessful = []
one_test4 = []
soup_test0 = []
# iterate through the list and extract the required data using regular expressions
for item in test:
    match1 = pattern1.search(item)
    match2 = pattern2.search(item)
    match3 = pattern3.search(item)
    if match1:
        copysuccessful.append(match1.group(1))
    else:
        copysuccessful.append(np.nan)
    if match2:
        one_test4.append(match2.group(1))
    else:
        one_test4.append(np.nan)
    if match3:
        soup_test0.append(match3.group(1))
    else:
        soup_test0.append(np.nan)
# create a dictionary to store the extracted data
data = {'COPYSUCCESSFUL': copysuccessful, 'one test4': one_test4, 'soup test0': soup_test0}
# create a pandas dataframe from the dictionary
df = pd.DataFrame(data)
# print the dataframe
print(df)
But the output I get is:
   COPYSUCCESSFUL  one test4  soup test0
0             NaN        NaN         NaN
1             NaN        NaN          88
2             NaN        NaN         NaN
3             NaN        NaN         NaN
4             NaN         34         NaN
5             NaN        NaN         NaN
6             NaN        NaN         NaN
7             NaN         66         NaN
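So the COPYSUCCESSFUL column has no output. I have tried a few regex testers and everything looks fine, so I don't understand why nothing shows up in that column. I expected both "https://test.test2.nugget.com/f02/01/test1.csv" and "https://test.test3.nugget.com/f02/01/test3.csv" to appear in the column.
Any help is greatly appreciated!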
1 Answer
"The COPYSUCCESSFUL column has no output. I have tried a few regex testers and everything looks fine, so I don't understand why nothing shows up in that column."
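Judging from your regular expression
COPYSUCCESSFUL:\s*(.*?)(?=')
it looks like you expect your strings to end with a ' character, but they do not. When you write 'abc' in Python, you define a string whose content is abc; the quotes are only syntax, not part of the data. The lookahead (?=') requires a ' right after the captured text, and since none of the strings contains one, the pattern never matches. Looking at the sample data, I think you can use the regex
^COPYSUCCESSFUL:\s*(.*)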
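To make the point about quotes concrete, here is a small check on one item from the question's list. It is only a sketch: the variable names (`item`, `has_quote`, `original`, `fixed`, and so on) are mine, not from the question, and the two patterns are the ones quoted above.

```python
import re

item = 'COPYSUCCESSFUL: https://test.test2.nugget.com/f02/01/test1.csv'

# The quotes in the source code delimit the literal; they are not in the data.
has_quote = "'" in item

# Original pattern: the lookahead (?=') demands a literal ' after the capture.
original = re.compile(r"COPYSUCCESSFUL:\s*(.*?)(?=')")
original_match = original.search(item)

# Suggested pattern: capture everything after the prefix up to the end of the string.
fixed = re.compile(r"^COPYSUCCESSFUL:\s*(.*)")
fixed_url = fixed.search(item).group(1)

print(has_quote)       # False
print(original_match)  # None
print(fixed_url)       # https://test.test2.nugget.com/f02/01/test1.csv
```

A regex tester would report a match only if the quotes were pasted in along with the text, which may be why the pattern looked fine there.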
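By the way, since you are looking for several keywords, such as COPYSUCCESSFUL, one test4, and so on, a more flexible approach may be easier and quicker than a long run of if/else statements.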
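The answer's own code is not included here, so the following is only one possible sketch of such a flexible approach, not necessarily what the answerer had in mind. It assumes a dictionary that maps each column name to its compiled pattern; the names `patterns` and `extract` are hypothetical, and the COPYSUCCESSFUL pattern is the corrected one suggested above.

```python
import re

import numpy as np
import pandas as pd

test = [
    'bbb', 'soup test0:88', 'axx', 'xzz', 'one test4: 34',
    'COPYSUCCESSFUL: https://test.test2.nugget.com/f02/01/test1.csv',
    'COPYSUCCESSFUL: https://test.test3.nugget.com/f02/01/test3.csv',
    'one test4: 66',
]

# Hypothetical lookup table: column name -> pattern whose group 1 is the value to keep.
patterns = {
    'COPYSUCCESSFUL': re.compile(r"^COPYSUCCESSFUL:\s*(.*)"),
    'one test4': re.compile(r"one test4:\s*(\d+)"),
    'soup test0': re.compile(r"soup test0:\s*(\d+)"),
}


def extract(item):
    """Build one row: the captured text per column, or NaN when the pattern does not match."""
    row = {}
    for column, pattern in patterns.items():
        match = pattern.search(item)
        row[column] = match.group(1) if match else np.nan
    return row


# One row per list item, one column per entry in `patterns`.
df = pd.DataFrame([extract(item) for item in test])
print(df)
```

With this layout, supporting another keyword means adding one entry to `patterns` instead of another pattern variable, list, and if/else branch. The captured values are still strings, as in the original code.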