Regular expression to match and extract parts from a URL in Python

Posted by m1m5dgzv on 2023-10-14 in Python


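I am trying to extract the Artifactory instance name, the repository name and the artifact name from a full artifact URL into 3 variables, as shown below.
"https://artifactory.intuit.veg.com:443/artifactory/annual-budget-local/manifests-approved/1.0.0/annual-chart/po09ij/annual-f3c.tgz"
"https://artifactory.skopeo.marvel.org/artifactory/bulletins_virtual/manifests-approved/po09ij/annual-f3c.tgz"
Artifactory instance -> artifactory.intuit.veg.com and artifactory.skopeo.marvel.org
Repository name -> annual-budget-local and bulletins_virtual
Artifact name -> manifests-approved/1.0.0/annual-chart/po09ij/annual-f3c.tgz and manifests-approved/po09ij/annual-f3c.tgz
I could do this with several combinations of split, but I would like to understand how to use Python regex effectively here. Any guidance would be very helpful.
Should I match the strings before and after the word artifactory and then do an additional split to get the artifact name?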

python

Source: https://stackoverflow.com/questions/77271952/regular-expression-to-match-and-extract-parts-from-an-url-in-python

4 Answers

mepcadol1#

It is worth pointing out that there is also the urllib.parse module for splitting URLs. There is no reason to reinvent the wheel.

from urllib.parse import urlparse

urls = [ "https://artifactory.intuit.veg.com:443/artifactory/annual-budget-local/manifests-approved/1.0.0/annual-chart/po09ij/annual-f3c.tgz", "https://artifactory.skopeo.marvel.org/artifactory/bulletins_virtual/manifests-approved/po09ij/annual-f3c.tgz"]

for url in urls:
    o = urlparse(url)
    instance, repo = o.hostname, o.path.split('/')[2]
    print(instance, repo)
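
The snippet above prints only the instance and the repository. A minimal sketch of how the same urlparse approach could also return the artifact name is shown here; the helper name split_artifact_url and the check that the path starts with /artifactory/ are assumptions made for illustration, not part of the original answer.

```python
from urllib.parse import urlparse


def split_artifact_url(url):
    """Return (instance, repo, artifact), or None if the path has another shape."""
    parsed = urlparse(url)
    # The path looks like "/artifactory/<repo>/<artifact...>". maxsplit=3 keeps
    # the slashes inside the artifact name intact.
    parts = parsed.path.split("/", 3)
    if len(parts) < 4 or parts[1] != "artifactory" or not parts[2] or not parts[3]:
        return None
    # hostname drops the port, so ":443" is not part of the instance name.
    return parsed.hostname, parts[2], parts[3]


urls = [
    "https://artifactory.intuit.veg.com:443/artifactory/annual-budget-local/manifests-approved/1.0.0/annual-chart/po09ij/annual-f3c.tgz",
    "https://artifactory.skopeo.marvel.org/artifactory/bulletins_virtual/manifests-approved/po09ij/annual-f3c.tgz",
]

for url in urls:
    print(split_artifact_url(url))
```

Limiting the split to three cuts is the design choice that matters here: everything after the repository segment stays together as one string.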

Answered on 2023-10-14

syqv5f0l2#

Try this:

import re

def extract_artifactory_data(url):
    pattern = r"https://(?P<instance>[^:/]+)(?::\d+)?/artifactory/(?P<repo>[^/]+)/(?P<artifact>.+)"
    match = re.match(pattern, url)
    
    if not match:
        return None
    
    return match.group("instance"), match.group("repo"), match.group("artifact")

url1 = "https://artifactory.intuit.veg.com:443/artifactory/annual-budget-local/manifests-approved/1.0.0/annual-chart/po09ij/annual-f3c.tgz"
url2 = "https://artifactory.skopeo.marvel.org/artifactory/bulletins_virtual/manifests-approved/po09ij/annual-f3c.tgz"

instance1, repo1, artifact1 = extract_artifactory_data(url1)
instance2, repo2, artifact2 = extract_artifactory_data(url2)

print(instance1, repo1, artifact1)
print(instance2, repo2, artifact2)
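
One thing to watch with the function above: it returns None for a URL that does not match, and unpacking None into three variables raises a TypeError. The following is a sketch of one way to guard against that, assuming a precompiled pattern and a hypothetical helper named extract_or_none that returns the named groups as a dict.

```python
import re

# Same pattern as above, compiled once so it can be reused across calls.
PATTERN = re.compile(
    r"https://(?P<instance>[^:/]+)(?::\d+)?/artifactory/(?P<repo>[^/]+)/(?P<artifact>.+)"
)


def extract_or_none(url):
    """Return a dict of the named groups, or None when the URL does not match."""
    match = PATTERN.match(url)
    return match.groupdict() if match else None


candidates = [
    "https://artifactory.skopeo.marvel.org/artifactory/bulletins_virtual/manifests-approved/po09ij/annual-f3c.tgz",
    "https://example.com/not-an-artifact",  # made-up URL that should not match
]

for url in candidates:
    data = extract_or_none(url)
    if data is None:
        print("no match:", url)
        continue
    print(data["instance"], data["repo"], data["artifact"])
```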

Answered on 2023-10-14

qvsjd97n3#

Here is a code example that splits the URL as you described:

import re

# Sample URLs
urls = [
    "https://artifactory.intuit.veg.com:443/artifactory/annual-budget-local/manifests-approved/1.0.0/annual-chart/po09ij/annual-f3c.tgz",
    "https://artifactory.skopeo.marvel.org/artifactory/bulletins_virtual/manifests-approved/po09ij/annual-f3c.tgz"
]

for url in urls:
    match = re.search(r'https://([^/]+).+?/([^/]+)/(.+)$', url)
    if match:
        instance_name, repository_name, artifact_name = match.groups()
    else:
        instance_name, repository_name, artifact_name = "N/A", "N/A", "N/A"

    print("Artifactory Instance:", instance_name)
    print("Repository Name:", repository_name)
    print("Artifact Name:", artifact_name)

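For the regular expression https://([^/]+).+?/([^/]+)/(.+)$:

https://: this part of the pattern matches the literal characters "https://" at the start of the URL.
([^/]+): a capturing group that matches one or more characters that are not a forward slash (/). Because it is enclosed in parentheses, the matched text is captured and can be extracted later. Here it captures the host together with any port, for example artifactory.intuit.veg.com:443 for the first URL.
.+?/: this part matches one or more characters (.+?) followed by a forward slash (/). .+? is a non-greedy match, meaning it matches as few characters as possible while still allowing the rest of the pattern to match. In these URLs it consumes /artifactory/.
([^/]+): like the first capturing group, it matches one or more characters that are not a forward slash and captures them.
(.+)$: this part matches one or more characters up to the end of the string ($) and captures them. This lets it capture everything after the second capturing group, up to the end of the URL.

re.search applies the pattern to the input string, and match.groups() returns the three captured values, which are unpacked into instance_name, repository_name and artifact_name.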
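
To make the behaviour of the first group concrete, here is a small sketch comparing the pattern above with a lightly adjusted variant. The adjusted pattern is an assumption added for illustration: it excludes ":" from the first group and matches an optional port separately, so the instance name comes out without ":443".

```python
import re

url = "https://artifactory.intuit.veg.com:443/artifactory/annual-budget-local/manifests-approved/1.0.0/annual-chart/po09ij/annual-f3c.tgz"

# Pattern from the answer: the first group stops only at "/", so it keeps the port.
original = re.search(r'https://([^/]+).+?/([^/]+)/(.+)$', url)
print(original.group(1))  # artifactory.intuit.veg.com:443

# Adjusted sketch: stop the first group at ":" or "/" and skip an optional port.
adjusted = re.search(r'https://([^/:]+)(?::\d+)?/.+?/([^/]+)/(.+)$', url)
instance_name, repository_name, artifact_name = adjusted.groups()
print(instance_name)    # artifactory.intuit.veg.com
print(repository_name)  # annual-budget-local
print(artifact_name)
```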

Answered on 2023-10-14

olmpazwi4#

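Very similar to the Aymen Azouis solution, but with a few small optimizations:
1. Uses the regex library, which IMHO should be preferred over re
2. Accepts both http:// and https://
3. Uses possessive quantifiers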

(?x)
^                                  # start of pattern
https?                             # http with an optional s
://
(?P<artifactory_instance>[^/:]++)  # capture everything up to the next ":" or "/"
(?::\d++)?                         # if you encounter a port match it (optional)
/artifactory/
(?P<repository>[^/]++)             # match repository by capturing everything up to next "/"
/
(?P<artifact_names>.++)            # match the rest of URL to artifact names
$

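On regex101 (https://regex101.com/r/7Ww4ui/1) the possessive quantifiers are left out, because the re module does not handle them (and re is what regex101 implements). Note that Python 3.11 and later do accept possessive quantifiers in re.
Or as executable code: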

import regex 

def extract_artifactory_data(url):
    pattern = r"^https?://(?P<artifactory_instance>[^/:]++)(?::\d++)?/artifactory/(?P<repository>[^/]++)/(?P<artifact_names>.++)$"
    match = regex.match(pattern, url)
    
    if not match:
        return None
    
    return match.group("artifactory_instance"), match.group("repository"), match.group("artifact_names")

url1 = "https://artifactory.intuit.veg.com:443/artifactory/annual-budget-local/manifests-approved/1.0.0/annual-chart/po09ij/annual-f3c.tgz"
url2 = "https://artifactory.skopeo.marvel.org/artifactory/bulletins_virtual/manifests-approved/po09ij/annual-f3c.tgz"

instance1, repo1, artifact1 = extract_artifactory_data(url1)
instance2, repo2, artifact2 = extract_artifactory_data(url2)

print(instance1, repo1, artifact1)
print(instance2, repo2, artifact2)
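
For readers who cannot install the third-party regex package, the commented pattern above can also be approximated with the standard library. This is only a sketch under that assumption: it swaps the possessive quantifiers (++) for ordinary greedy ones (+) and uses re.VERBOSE so the inline comments can stay. The name ARTIFACT_URL is made up for this example.

```python
import re

ARTIFACT_URL = re.compile(
    r"""
    ^https?://                          # http with an optional s
    (?P<artifactory_instance>[^/:]+)    # host, up to the next ":" or "/"
    (?::\d+)?                           # optional port
    /artifactory/
    (?P<repository>[^/]+)               # first path segment after /artifactory/
    /
    (?P<artifact_names>.+)              # the rest of the URL
    $
    """,
    re.VERBOSE,
)

url1 = "https://artifactory.intuit.veg.com:443/artifactory/annual-budget-local/manifests-approved/1.0.0/annual-chart/po09ij/annual-f3c.tgz"

match = ARTIFACT_URL.match(url1)
if match:
    print(match.group("artifactory_instance"))
    print(match.group("repository"))
    print(match.group("artifact_names"))
```

With greedy quantifiers the engine may backtrack on non-matching input, which possessive quantifiers avoid; for URLs of this size the difference is unlikely to matter.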

Answered on 2023-10-14
