How do I find URLs that match a specific pattern?

d4so4syb  posted on 2021-09-08  in  Java

I have a list that contains several different kinds of URLs. An example is given below:

url_list = [

 'https://tabelog.com/aomori/rstLst/cond05-03-00/',
 'https://tabelog.com/aomori/rstLst/MC11/',
 'https://tabelog.com/aomori/A0203/A020301/',
 'https://tabelog.com/aomori/C2401/rstLst/',
 'https://tabelog.com/aomori/rstLst/MC/',
 'https://tabelog.com/aomori/rstLst/cond04-00-01/',
 'https://tabelog.com/aomori/A0203/A020301/R11609/rstLst/',
 'https://tabelog.com/aomori/rstLst/MC21/',
 'https://tabelog.com/aomori/A0203/A020301/2011528/',
 'https://tabelog.com/aomori/A0202/A020201/2008713/dtlrvwlst/B432614271/',
 'https://tabelog.com/aomori/A0205/A020502/2008632/dtlrvwlst/B106889387/',
 'https://tabelog.com/aomori/C2343/rstLst/',
 'https://tabelog.com/aomori/C2202/rstLst/',
 'https://tabelog.com/aomori/A0205/',
 'https://tabelog.com/aomori/C2208/rstLst/',
 'https://tabelog.com/aomori/rstLst/unagi/',
 'https://tabelog.com/aomori/C2361/rstLst/',
 'https://tabelog.com/aomori/A0201/A020101/2005741/dtlrvwlst/',
 'https://tabelog.com/aomori/A0201/A020101/2010629/dtlrvwlst/',
 'https://tabelog.com/aomori/C2443/rstLst/',
 'https://tabelog.com/aomori/A0202/A020201/2008713/',
 'https://tabelog.com/aomori/rstLst/CC06/',
 'https://tabelog.com/aomori/A0202/A020201/2011530/',
]

From this list, I want to separate out the URLs that have the following pattern:

'https://tabelog.com/aomori/A0202/A020201/2008713/'

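Here the https://tabelog.com/aomori/ part is always the same. After this part, there are always three /-separated values ( A0202/A020201/2008713/ ). The A0202 and A020201 parts always start with A, but the number of digits varies.
So, if I separate the desired URLs from url_list, I should get: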
final_url_list = [
'https://tabelog.com/aomori/A0203/A020301/2011528/',
'https://tabelog.com/aomori/A0202/A020201/2008713/',
'https://tabelog.com/aomori/A0202/A020201/2011530/',
]

Does anyone know how to separate these URLs using Python?

ztyzrc3y1#

You can use a regular expression for this. For example, the following pattern:

'https:\/\/tabelog\.com\/aomori\/A\d+\/A\d+\/\d+\/'

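It matches every line that ends with /' right after the third value, which means it ignores the URLs that continue beyond that point. Note that the quote characters are part of the pattern as written, so it is meant to be run against the list's source text rather than against the bare URL strings.
You can debug it here: https://regex101.com/r/zhk534/1
You can also generate Python sample code here: https://regex101.com/r/zhk534/1/codegen?language=python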
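To make the role of the quote characters concrete, here is a minimal sketch that runs the quoted pattern over a short excerpt of the question's list kept as raw text, roughly the way regex101 sees it. The names source_text and quoted_pattern are made up for this illustration, and the pattern is rewritten as a raw string without the optional \/ escapes.

```python
import re

# A short excerpt of the question's list as raw source text, quotes included.
source_text = """
 'https://tabelog.com/aomori/A0203/A020301/',
 'https://tabelog.com/aomori/A0203/A020301/2011528/',
 'https://tabelog.com/aomori/A0202/A020201/2008713/dtlrvwlst/B432614271/',
 'https://tabelog.com/aomori/A0202/A020201/2008713/',
"""

# The quotes are literal characters in the pattern, so the closing quote
# acts as an end marker directly after the third value.
quoted_pattern = re.compile(r"'https://tabelog\.com/aomori/A\d+/A\d+/\d+/'")

# findall returns the full quoted matches; strip the quotes to get the URLs.
matches = [m.strip("'") for m in quoted_pattern.findall(source_text)]
print(matches)
```

When the URLs are already Python strings, they carry no quote characters, so this end marker is not available and the pattern needs a different way to stop at the third value.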


uajslkp62#

Maybe something like this (I get three results when filtering your original list)...

import re

url_list = [

 'https://tabelog.com/aomori/rstLst/cond05-03-00/',
 'https://tabelog.com/aomori/rstLst/MC11/',
 'https://tabelog.com/aomori/A0203/A020301/',
 'https://tabelog.com/aomori/C2401/rstLst/',
 'https://tabelog.com/aomori/rstLst/MC/',
 'https://tabelog.com/aomori/rstLst/cond04-00-01/',
 'https://tabelog.com/aomori/A0203/A020301/R11609/rstLst/',
 'https://tabelog.com/aomori/rstLst/MC21/',
 'https://tabelog.com/aomori/A0203/A020301/2011528/',
 'https://tabelog.com/aomori/A0202/A020201/2008713/dtlrvwlst/B432614271/',
 'https://tabelog.com/aomori/A0205/A020502/2008632/dtlrvwlst/B106889387/',
 'https://tabelog.com/aomori/C2343/rstLst/',
 'https://tabelog.com/aomori/C2202/rstLst/',
 'https://tabelog.com/aomori/A0205/',
 'https://tabelog.com/aomori/C2208/rstLst/',
 'https://tabelog.com/aomori/rstLst/unagi/',
 'https://tabelog.com/aomori/C2361/rstLst/',
 'https://tabelog.com/aomori/A0201/A020101/2005741/dtlrvwlst/',
 'https://tabelog.com/aomori/A0201/A020101/2010629/dtlrvwlst/',
 'https://tabelog.com/aomori/C2443/rstLst/',
 'https://tabelog.com/aomori/A0202/A020201/2008713/',
 'https://tabelog.com/aomori/rstLst/CC06/',
 'https://tabelog.com/aomori/A0202/A020201/2011530/',
]

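# Raw string so that \. and \d reach the regex engine unchanged;
# the trailing $ rejects URLs that continue past the third value.
pattern = re.compile(r"https://tabelog\.com/aomori/A\d+/A\d+/\d+/$")
filtered_list = list(filter(pattern.match, url_list))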

print(filtered_list)

Output:

['https://tabelog.com/aomori/A0203/A020301/2011528/', 'https://tabelog.com/aomori/A0202/A020201/2008713/', 'https://tabelog.com/aomori/A0202/A020201/2011530/']

ccrfmcuu3#

Using nurio's regular expression:

import re

url_list = [

 'https://tabelog.com/aomori/rstLst/cond05-03-00/',
 'https://tabelog.com/aomori/rstLst/MC11/',
 'https://tabelog.com/aomori/A0203/A020301/',
 'https://tabelog.com/aomori/C2401/rstLst/',
 'https://tabelog.com/aomori/rstLst/MC/',
 'https://tabelog.com/aomori/rstLst/cond04-00-01/',
 'https://tabelog.com/aomori/A0203/A020301/R11609/rstLst/',
 'https://tabelog.com/aomori/rstLst/MC21/',
 'https://tabelog.com/aomori/A0203/A020301/2011528/',
 'https://tabelog.com/aomori/A0202/A020201/2008713/dtlrvwlst/B432614271/',
 'https://tabelog.com/aomori/A0205/A020502/2008632/dtlrvwlst/B106889387/',
 'https://tabelog.com/aomori/C2343/rstLst/',
 'https://tabelog.com/aomori/C2202/rstLst/',
 'https://tabelog.com/aomori/A0205/',
 'https://tabelog.com/aomori/C2208/rstLst/',
 'https://tabelog.com/aomori/rstLst/unagi/',
 'https://tabelog.com/aomori/C2361/rstLst/',
 'https://tabelog.com/aomori/A0201/A020101/2005741/dtlrvwlst/',
 'https://tabelog.com/aomori/A0201/A020101/2010629/dtlrvwlst/',
 'https://tabelog.com/aomori/C2443/rstLst/',
 'https://tabelog.com/aomori/A0202/A020201/2008713/',
 'https://tabelog.com/aomori/rstLst/CC06/',
 'https://tabelog.com/aomori/A0202/A020201/2011530/',
]

# Raw string avoids invalid-escape warnings; / needs no escaping in Python.
r = re.compile(r'https://tabelog\.com/aomori/A\d+/A\d+/\d+/')
newlist = list(filter(r.match, url_list))
print(newlist)

Sample output:

['https://tabelog.com/aomori/A0203/A020301/2011528/', 'https://tabelog.com/aomori/A0202/A020201/2008713/dtlrvwlst/B432614271/', 'https://tabelog.com/aomori/A0205/A020502/2008632/dtlrvwlst/B106889387/', 'https://tabelog.com/aomori/A0201/A020101/2005741/dtlrvwlst/', 'https://tabelog.com/aomori/A0201/A020101/2010629/dtlrvwlst/', 'https://tabelog.com/aomori/A0202/A020201/2008713/', 'https://tabelog.com/aomori/A0202/A020201/2011530/']
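The sample output above lists seven URLs rather than the three the question asks for, because match only anchors the pattern at the start of the string and accepts anything after it. A minimal sketch of the difference, using a four-URL subset of the question's list and the illustrative names urls, prefix_matches and exact_matches:

```python
import re

# A subset of the question's list: two wanted URLs and two longer ones.
urls = [
    'https://tabelog.com/aomori/A0203/A020301/2011528/',
    'https://tabelog.com/aomori/A0202/A020201/2008713/dtlrvwlst/B432614271/',
    'https://tabelog.com/aomori/A0201/A020101/2005741/dtlrvwlst/',
    'https://tabelog.com/aomori/A0202/A020201/2008713/',
]

pattern = re.compile(r'https://tabelog\.com/aomori/A\d+/A\d+/\d+/')

# match() only requires the pattern to match at the start of the string.
prefix_matches = list(filter(pattern.match, urls))

# fullmatch() requires the pattern to cover the whole string.
exact_matches = list(filter(pattern.fullmatch, urls))

print(len(prefix_matches), len(exact_matches))
```

Ending the pattern with $ and keeping match should give the same filtering here; fullmatch simply keeps the anchoring out of the pattern itself.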

1szpjjfi4#

Using re:

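import re

# url_list is the list from the question.
final_url_list = []

for url in url_list:
    # Keep the URL only if it ends with A<digits>/A<digits>/<digits>/
    m = re.findall(r'(.*A\d+/A\d+/\d+/$)', url)
    if len(m) > 0:
        final_url_list.append(m[0])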

This produces the following final_url_list:

['https://tabelog.com/aomori/A0203/A020301/2011528/',
 'https://tabelog.com/aomori/A0202/A020201/2008713/',
 'https://tabelog.com/aomori/A0202/A020201/2011530/']

I find https://pythex.org/ very useful when constructing regular expressions.
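As a cross-check on a regular expression, the same rule from the question (three values after aomori, the first two starting with A followed by digits, the last one all digits) can also be written without re. This is only a rough sketch and not something any of the answers propose; the helper name has_target_shape is invented here, and unlike the regular expressions it does not insist on the https scheme or the trailing slash.

```python
from urllib.parse import urlsplit


def has_target_shape(url):
    """Return True for tabelog.com/aomori/A<digits>/A<digits>/<digits>/ URLs."""
    parts = urlsplit(url)
    # '/aomori/A0202/A020201/2008713/' -> ['aomori', 'A0202', 'A020201', '2008713']
    segments = parts.path.strip('/').split('/')
    if parts.netloc != 'tabelog.com' or len(segments) != 4:
        return False
    region, area, sub_area, item_id = segments
    return (
        region == 'aomori'
        and area[:1] == 'A' and area[1:].isdigit()
        and sub_area[:1] == 'A' and sub_area[1:].isdigit()
        and item_id.isdigit()
    )


sample = [
    'https://tabelog.com/aomori/A0203/A020301/',
    'https://tabelog.com/aomori/A0203/A020301/R11609/rstLst/',
    'https://tabelog.com/aomori/A0202/A020201/2008713/',
    'https://tabelog.com/aomori/C2401/rstLst/',
]
selected = [url for url in sample if has_target_shape(url)]
print(selected)
```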
