Regular expression to match and extract parts of a URL in Python

m1m5dgzv  posted on 2023-10-14  in  Python

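I am trying to extract the Artifactory instance name, the repository name, and the artifact name from a full artifact URL into 3 variables, as shown below.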
"https://artifactory.intuit.veg.com:443/artifactory/annual-budget-local/manifests-approved/1.0.0/annual-chart/po09ij/annual-f3c.tgz"
"https://artifactory.skopeo.marvel.org/artifactory/bulletins_virtual/manifests-approved/po09ij/annual-f3c.tgz"
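Artifactory instance -> artifactory.intuit.veg.com and artifactory.skopeo.marvel.org
Repository name -> annual-budget-local and bulletins_virtual
Artifact name -> manifests-approved/1.0.0/annual-chart/po09ij/annual-f3c.tgz and manifests-approved/po09ij/annual-f3c.tgz
I can do this with various combinations of split, but I would like to understand how to use Python regex effectively here. Any guidance would be very helpful.
Should I match the strings before and after the word artifactory and then perform additional split operations to get the artifact name?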

mepcadol #1

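It is worth pointing out that the standard library also has the urllib.parse module for splitting URLs. There is no reason to reinvent the wheel.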

from urllib.parse import urlparse

urls = [ "https://artifactory.intuit.veg.com:443/artifactory/annual-budget-local/manifests-approved/1.0.0/annual-chart/po09ij/annual-f3c.tgz", "https://artifactory.skopeo.marvel.org/artifactory/bulletins_virtual/manifests-approved/po09ij/annual-f3c.tgz"]

for url in urls:
    o = urlparse(url)
    instance, repo = o.hostname, o.path.split('/')[2]
    print(instance, repo)
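
The snippet above prints only the instance and the repository. A minimal sketch of how the same urlparse approach could also return the artifact name is shown here; the helper name split_artifact_url is hypothetical, and it assumes the path always has the form /artifactory/<repo>/<artifact...>.

```python
from urllib.parse import urlparse


def split_artifact_url(url):
    """Return (instance, repo, artifact) for an Artifactory-style URL.

    Assumes the path looks like /artifactory/<repo>/<artifact...>;
    returns None when it does not.
    """
    parsed = urlparse(url)
    # maxsplit=3 keeps the remaining path (the artifact name) in one piece.
    parts = parsed.path.split("/", 3)
    if len(parts) < 4 or parts[1] != "artifactory":
        return None
    _, _, repo, artifact = parts
    # parsed.hostname drops the port, so ":443" never ends up in the instance.
    return parsed.hostname, repo, artifact


print(split_artifact_url(
    "https://artifactory.intuit.veg.com:443/artifactory/annual-budget-local/"
    "manifests-approved/1.0.0/annual-chart/po09ij/annual-f3c.tgz"
))
```

Limiting the split to three separators means the slashes inside the artifact name are left untouched, so no re-joining is needed afterwards.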

syqv5f0l #2

Try this:

import re

def extract_artifactory_data(url):
    pattern = r"https://(?P<instance>[^:/]+)(?::\d+)?/artifactory/(?P<repo>[^/]+)/(?P<artifact>.+)"
    match = re.match(pattern, url)
    
    if not match:
        return None
    
    return match.group("instance"), match.group("repo"), match.group("artifact")

url1 = "https://artifactory.intuit.veg.com:443/artifactory/annual-budget-local/manifests-approved/1.0.0/annual-chart/po09ij/annual-f3c.tgz"
url2 = "https://artifactory.skopeo.marvel.org/artifactory/bulletins_virtual/manifests-approved/po09ij/annual-f3c.tgz"

instance1, repo1, artifact1 = extract_artifactory_data(url1)
instance2, repo2, artifact2 = extract_artifactory_data(url2)

print(instance1, repo1, artifact1)
print(instance2, repo2, artifact2)
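
Note that extract_artifactory_data returns None for a URL that does not match, so unpacking its result into three variables would raise a TypeError in that case. A small sketch of one way to guard against this, using the same pattern; parse_or_none and the example.invalid URL are hypothetical and only serve as illustration.

```python
import re

# Same pattern as in the answer above, compiled once for reuse.
ARTIFACT_URL = re.compile(
    r"https://(?P<instance>[^:/]+)(?::\d+)?/artifactory/(?P<repo>[^/]+)/(?P<artifact>.+)"
)


def parse_or_none(url):
    """Return a dict of the named groups, or None when the URL does not match."""
    match = ARTIFACT_URL.match(url)
    return match.groupdict() if match else None


for url in [
    "https://artifactory.skopeo.marvel.org/artifactory/bulletins_virtual/manifests-approved/po09ij/annual-f3c.tgz",
    "https://example.invalid/not-an-artifact",  # hypothetical non-matching URL
]:
    parts = parse_or_none(url)
    if parts is None:
        print("no match:", url)
    else:
        print(parts["instance"], parts["repo"], parts["artifact"])
```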

qvsjd97n #3

Here is a code sample that splits the URL as you described:

import re

# Sample URLs
urls = [
    "https://artifactory.intuit.veg.com:443/artifactory/annual-budget-local/manifests-approved/1.0.0/annual-chart/po09ij/annual-f3c.tgz",
    "https://artifactory.skopeo.marvel.org/artifactory/bulletins_virtual/manifests-approved/po09ij/annual-f3c.tgz"
]

for url in urls:
    match = re.search(r'https://([^/]+).+?/([^/]+)/(.+)$', url)
    if match:
        instance_name, repository_name, artifact_name = match.groups()
    else:
        instance_name, repository_name, artifact_name = "N/A", "N/A", "N/A"

    print("Artifactory Instance:", instance_name)
    print("Repository Name:", repository_name)
    print("Artifact Name:", artifact_name)

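Breaking down the regular expression https://([^/]+).+?/([^/]+)/(.+)$

https://: this part of the pattern matches the literal characters "https://" at the start of the URL.
([^/]+): a capturing group that matches one or more characters that are not a forward slash (/). The parentheses mean the matched text is captured and can be extracted later. Because only "/" is excluded, this group also captures the port when one is present (for example artifactory.intuit.veg.com:443).
.+?/: this part matches one or more characters (.+?) followed by a forward slash (/). .+? is a non-greedy match, meaning it matches as few characters as possible while still allowing the rest of the pattern to match. For the sample URLs it consumes the "/artifactory" segment.
([^/]+): like the first capturing group, it matches one or more characters that are not a forward slash and captures them. This is the repository name.
(.+)$: this part matches one or more characters up to the end of the string ($) and captures them. It picks up everything after the second capturing group until the URL ends, which is the artifact name.

The search function uses the regular expression to pull the instance_name, repository_name, and artifact_name groups out of the input string.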
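
To see what the first group actually captures when the URL carries a port, the pattern can be run against the first sample URL. The second pattern below is a tightened variant offered as an assumption, not part of the answer: it excludes ":" from the host, consumes an optional port, and requires the literal /artifactory/ segment.

```python
import re

url = ("https://artifactory.intuit.veg.com:443/artifactory/annual-budget-local/"
       "manifests-approved/1.0.0/annual-chart/po09ij/annual-f3c.tgz")

# Pattern from the answer: group 1 stops only at "/", so it keeps the port.
loose = re.search(r'https://([^/]+).+?/([^/]+)/(.+)$', url)
print(loose.group(1))  # artifactory.intuit.veg.com:443

# Variant (an assumption): host without ":", optional port, literal "/artifactory/".
strict = re.search(r'https://([^/:]+)(?::\d+)?/artifactory/([^/]+)/(.+)$', url)
print(strict.group(1))  # artifactory.intuit.veg.com
print(strict.group(2))  # annual-budget-local
```

Both patterns yield the same repository and artifact name for the sample URLs; they differ only in whether the port stays attached to the instance name.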


olmpazwi #4

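Very similar to Aymen Azouis's solution, but with a few small optimizations:
1. Uses the third-party regex library, which IMHO should be preferred over re
2. Option to detect both http:// and https://
3. Possessive quantifiers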

(?x)
^                                  # start of pattern
https?                             # http with an optional s
://
(?P<artifactory_instance>[^/:]++)  # capture everything up to the next ":" or "/"
(?::\d++)?                         # if you encounter a port match it (optional)
/artifactory/
(?P<repository>[^/]++)             # match repository by capturing everything up to next "/"
/
(?P<artifact_names>.++)            # match the rest of URL to artifact names
$

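In the regex101 demo (https://regex101.com/r/7Ww4ui/1) the possessive quantifiers are omitted, because the re module, which is what regex101 implements, does not handle them (re gained possessive quantifiers only in Python 3.11).
Or as executable code: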

import regex 

def extract_artifactory_data(url):
    pattern = r"^https?://(?P<artifactory_instance>[^/:]++)(?::\d++)?/artifactory/(?P<repository>[^/]++)/(?P<artifact_names>.++)$"
    match = regex.match(pattern, url)
    
    if not match:
        return None
    
    return match.group("artifactory_instance"), match.group("repository"), match.group("artifact_names")

url1 = "https://artifactory.intuit.veg.com:443/artifactory/annual-budget-local/manifests-approved/1.0.0/annual-chart/po09ij/annual-f3c.tgz"
url2 = "https://artifactory.skopeo.marvel.org/artifactory/bulletins_virtual/manifests-approved/po09ij/annual-f3c.tgz"

instance1, repo1, artifact1 = extract_artifactory_data(url1)
instance2, repo2, artifact2 = extract_artifactory_data(url2)

print(instance1, repo1, artifact1)
print(instance2, repo2, artifact2)
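
If installing the regex package is not an option, the commented pattern from this answer can also be compiled with the standard re module via re.VERBOSE. This is a sketch under one assumption: the possessive ++ quantifiers are replaced by plain +, so that it also compiles on Python versions older than 3.11. The http:// URL is a hypothetical variant of the second sample, used to exercise the optional "s".

```python
import re

# Verbose form of the pattern; "++" is written as "+" for older Python versions.
ARTIFACT_URL = re.compile(
    r"""
    ^https?://                          # http with an optional s
    (?P<artifactory_instance>[^/:]+)    # host, up to the next ":" or "/"
    (?::\d+)?                           # optional port
    /artifactory/
    (?P<repository>[^/]+)               # repository, up to the next "/"
    /
    (?P<artifact_names>.+)              # rest of the URL
    $
    """,
    re.VERBOSE,
)

match = ARTIFACT_URL.match(
    "http://artifactory.skopeo.marvel.org/artifactory/bulletins_virtual/"
    "manifests-approved/po09ij/annual-f3c.tgz"
)
print(match.groupdict())
```

For these sample URLs the possessive and non-possessive forms capture the same text; the possessive quantifiers only prevent backtracking into the groups when a match fails.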
