How to find the URLs that match a certain pattern?

Posted by d4so4syb on 2021-09-08 in Java


I have a list of URLs that contains several different kinds of URL. A sample is given below:

url_list = [

 'https://tabelog.com/aomori/rstLst/cond05-03-00/',
 'https://tabelog.com/aomori/rstLst/MC11/',
 'https://tabelog.com/aomori/A0203/A020301/',
 'https://tabelog.com/aomori/C2401/rstLst/',
 'https://tabelog.com/aomori/rstLst/MC/',
 'https://tabelog.com/aomori/rstLst/cond04-00-01/',
 'https://tabelog.com/aomori/A0203/A020301/R11609/rstLst/',
 'https://tabelog.com/aomori/rstLst/MC21/',
 'https://tabelog.com/aomori/A0203/A020301/2011528/',
 'https://tabelog.com/aomori/A0202/A020201/2008713/dtlrvwlst/B432614271/',
 'https://tabelog.com/aomori/A0205/A020502/2008632/dtlrvwlst/B106889387/',
 'https://tabelog.com/aomori/C2343/rstLst/',
 'https://tabelog.com/aomori/C2202/rstLst/',
 'https://tabelog.com/aomori/A0205/',
 'https://tabelog.com/aomori/C2208/rstLst/',
 'https://tabelog.com/aomori/rstLst/unagi/',
 'https://tabelog.com/aomori/C2361/rstLst/',
 'https://tabelog.com/aomori/A0201/A020101/2005741/dtlrvwlst/',
 'https://tabelog.com/aomori/A0201/A020101/2010629/dtlrvwlst/',
 'https://tabelog.com/aomori/C2443/rstLst/',
 'https://tabelog.com/aomori/A0202/A020201/2008713/',
 'https://tabelog.com/aomori/rstLst/CC06/',
 'https://tabelog.com/aomori/A0202/A020201/2011530/',
]

From this list, I want to separate the URLs that have the following pattern:

'https://tabelog.com/aomori/A0202/A020201/2008713/'

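Here the https://tabelog.com/aomori/ part is always the same. After it there are always three /-separated values (A0202/A020201/2008713/). The A0202 and A020201 parts always start with A, but the digits that follow vary.
So, if I separate the desired URLs from url_list, the result should be: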
final_url_list = [
'https://tabelog.com/aomori/A0203/A020301/2011528/',
'https://tabelog.com/aomori/A0202/A020201/2008713/',
'https://tabelog.com/aomori/A0202/A020201/2011530/',
]

Does anyone know how to separate these URLs using Python?

python regex URL

Source: https://stackoverflow.com/questions/68323502/how-to-find-the-urls-that-match-with-a-certain-pattern

4 answers


ztyzrc3y1#

You can use a regular expression for this. For example, the following pattern:

'https:\/\/tabelog\.com\/aomori\/A\d+\/A\d+\/\d+\/'

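As written, with the quote characters as part of the pattern, it matches only the lines that end with /' right after the digits, which means it ignores the URLs that continue past that point. If you use the pattern without the quotes in Python, note that re.match only anchors at the start of the string, so longer URLs also match unless you append $ or use re.fullmatch.
You can debug it here: https://regex101.com/r/zhk534/1
You can also generate Python sample code here: https://regex101.com/r/zhk534/1/codegen?language=python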
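
To make the pieces of the pattern easier to read, here is a minimal sketch that writes it with re.VERBOSE so each part can carry a comment. The trailing $ and the names URL_PATTERN and sample_urls are assumptions added for illustration; they are not part of the answer's pattern.

```python
import re

# Same pattern as above, spread over several lines with re.VERBOSE.
# The trailing $ is an addition so that longer URLs are rejected.
URL_PATTERN = re.compile(
    r"""
    https://tabelog\.com/aomori/   # fixed prefix shared by every URL
    A\d+/                          # first segment: 'A' followed by one or more digits
    A\d+/                          # second segment: same shape
    \d+/                           # third segment: digits only
    $                              # nothing may follow the final slash
    """,
    re.VERBOSE,
)

# Illustrative subset of the question's list
sample_urls = [
    "https://tabelog.com/aomori/A0202/A020201/2008713/",
    "https://tabelog.com/aomori/A0202/A020201/2008713/dtlrvwlst/B432614271/",
    "https://tabelog.com/aomori/A0203/A020301/",
]

matches = [url for url in sample_urls if URL_PATTERN.match(url)]
print(matches)
```

Only the first sample URL should survive, because the second continues past the third segment and the third has only two segments.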


uajslkp62#

Maybe this (I get three results when filtering your original list)...

import re

url_list = [

 'https://tabelog.com/aomori/rstLst/cond05-03-00/',
 'https://tabelog.com/aomori/rstLst/MC11/',
 'https://tabelog.com/aomori/A0203/A020301/',
 'https://tabelog.com/aomori/C2401/rstLst/',
 'https://tabelog.com/aomori/rstLst/MC/',
 'https://tabelog.com/aomori/rstLst/cond04-00-01/',
 'https://tabelog.com/aomori/A0203/A020301/R11609/rstLst/',
 'https://tabelog.com/aomori/rstLst/MC21/',
 'https://tabelog.com/aomori/A0203/A020301/2011528/',
 'https://tabelog.com/aomori/A0202/A020201/2008713/dtlrvwlst/B432614271/',
 'https://tabelog.com/aomori/A0205/A020502/2008632/dtlrvwlst/B106889387/',
 'https://tabelog.com/aomori/C2343/rstLst/',
 'https://tabelog.com/aomori/C2202/rstLst/',
 'https://tabelog.com/aomori/A0205/',
 'https://tabelog.com/aomori/C2208/rstLst/',
 'https://tabelog.com/aomori/rstLst/unagi/',
 'https://tabelog.com/aomori/C2361/rstLst/',
 'https://tabelog.com/aomori/A0201/A020101/2005741/dtlrvwlst/',
 'https://tabelog.com/aomori/A0201/A020101/2010629/dtlrvwlst/',
 'https://tabelog.com/aomori/C2443/rstLst/',
 'https://tabelog.com/aomori/A0202/A020201/2008713/',
 'https://tabelog.com/aomori/rstLst/CC06/',
 'https://tabelog.com/aomori/A0202/A020201/2011530/',
]

# Raw string so the regex escapes reach the re module unchanged; $ rejects longer URLs
pattern = re.compile(r"https://tabelog\.com/aomori/A\d+/A\d+/\d+/$")
filtered_list = list(filter(pattern.match, url_list))

print(filtered_list)

Output:

['https://tabelog.com/aomori/A0203/A020301/2011528/', 'https://tabelog.com/aomori/A0202/A020201/2008713/', 'https://tabelog.com/aomori/A0202/A020201/2011530/']


ccrfmcuu3#

Using nurio's regex:

import re

url_list = [

 'https://tabelog.com/aomori/rstLst/cond05-03-00/',
 'https://tabelog.com/aomori/rstLst/MC11/',
 'https://tabelog.com/aomori/A0203/A020301/',
 'https://tabelog.com/aomori/C2401/rstLst/',
 'https://tabelog.com/aomori/rstLst/MC/',
 'https://tabelog.com/aomori/rstLst/cond04-00-01/',
 'https://tabelog.com/aomori/A0203/A020301/R11609/rstLst/',
 'https://tabelog.com/aomori/rstLst/MC21/',
 'https://tabelog.com/aomori/A0203/A020301/2011528/',
 'https://tabelog.com/aomori/A0202/A020201/2008713/dtlrvwlst/B432614271/',
 'https://tabelog.com/aomori/A0205/A020502/2008632/dtlrvwlst/B106889387/',
 'https://tabelog.com/aomori/C2343/rstLst/',
 'https://tabelog.com/aomori/C2202/rstLst/',
 'https://tabelog.com/aomori/A0205/',
 'https://tabelog.com/aomori/C2208/rstLst/',
 'https://tabelog.com/aomori/rstLst/unagi/',
 'https://tabelog.com/aomori/C2361/rstLst/',
 'https://tabelog.com/aomori/A0201/A020101/2005741/dtlrvwlst/',
 'https://tabelog.com/aomori/A0201/A020101/2010629/dtlrvwlst/',
 'https://tabelog.com/aomori/C2443/rstLst/',
 'https://tabelog.com/aomori/A0202/A020201/2008713/',
 'https://tabelog.com/aomori/rstLst/CC06/',
 'https://tabelog.com/aomori/A0202/A020201/2011530/',
]

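# Raw string; the pattern has no end anchor, so URLs that continue past the third segment also match
r = re.compile(r'https:\/\/tabelog\.com\/aomori\/A\d+\/A\d+\/\d+\/')
newlist = list(filter(r.match, url_list))
print(newlist)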

Sample output:

['https://tabelog.com/aomori/A0203/A020301/2011528/', 'https://tabelog.com/aomori/A0202/A020201/2008713/dtlrvwlst/B432614271/', 'https://tabelog.com/aomori/A0205/A020502/2008632/dtlrvwlst/B106889387/', 'https://tabelog.com/aomori/A0201/A020101/2005741/dtlrvwlst/', 'https://tabelog.com/aomori/A0201/A020101/2010629/dtlrvwlst/', 'https://tabelog.com/aomori/A0202/A020201/2008713/', 'https://tabelog.com/aomori/A0202/A020201/2011530/']
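
The sample output includes the longer dtlrvwlst URLs because match only anchors at the start of the string. A small sketch of the difference, assuming the same pattern with the optional backslashes before / dropped and a shortened list (the names urls, prefix_matches and exact_matches are illustrative):

```python
import re

# Same pattern as above; '/' does not need escaping in Python regexes.
pattern = re.compile(r"https://tabelog\.com/aomori/A\d+/A\d+/\d+/")

# Illustrative subset of the question's list
urls = [
    "https://tabelog.com/aomori/A0203/A020301/2011528/",
    "https://tabelog.com/aomori/A0202/A020201/2008713/dtlrvwlst/B432614271/",
    "https://tabelog.com/aomori/A0201/A020101/2005741/dtlrvwlst/",
    "https://tabelog.com/aomori/A0202/A020201/2008713/",
]

# match: anchored at the start only, so a matching prefix is enough
prefix_matches = [u for u in urls if pattern.match(u)]

# fullmatch: the pattern must cover the whole string
exact_matches = [u for u in urls if pattern.fullmatch(u)]

print(len(prefix_matches))
print(exact_matches)
```

With fullmatch (or an explicit $ at the end of the pattern), only the URLs that stop after the third segment remain.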


1szpjjfi4#

Using re:

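import re

# url_list is the list from the question
final_url_list = []

for url in url_list:
    # The trailing $ keeps only URLs that end right after the third segment
    m = re.findall(r'(.*A\d+/A\d+/\d+/$)', url)
    if m:
        final_url_list.append(m[0])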

This yields a final_url_list of:

['https://tabelog.com/aomori/A0203/A020301/2011528/',
 'https://tabelog.com/aomori/A0202/A020201/2008713/',
 'https://tabelog.com/aomori/A0202/A020201/2011530/']

I find https://pythex.org/ very useful when constructing regular expressions.
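
The same filter can probably be written more compactly. This is a sketch rather than the answer's own code: it assumes a shortened url_list, and the name tail is hypothetical. Because re.search scans the whole string, the leading .* and the capture group are not needed.

```python
import re

# Shortened input for illustration; in the answer, url_list is the full list from the question.
url_list = [
    "https://tabelog.com/aomori/A0203/A020301/",
    "https://tabelog.com/aomori/A0203/A020301/2011528/",
    "https://tabelog.com/aomori/A0201/A020101/2010629/dtlrvwlst/",
    "https://tabelog.com/aomori/A0202/A020201/2011530/",
]

# Matches only when the three segments sit at the very end of the URL
tail = re.compile(r"A\d+/A\d+/\d+/$")

final_url_list = [url for url in url_list if tail.search(url)]
print(final_url_list)
```

Compiling the pattern once also avoids passing the pattern string to re on every loop iteration.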

