Regex: need help extracting file names and file URLs from a string

bjg7j2ky  posted on 2023-05-19  in  Other

I have this string:

description = "----------------------------------------------\n\nTest Customer, May 11, 2023, 18:27\n\nDo you want to hear a construction joke? Oh, never mind, I'm still working on it!\n\nAttachment(s):\nistockphoto-1131743616-612x612.jpg - https://aaa.zendesk.com/attachments/token/jIUzlZ4ylKke4iP0kB/?name=istockphoto-1131743616-612x612.jpgzendesk_tickets_20230511_091412058891.json - https://aaa.zendesk.com/attachments/token/jIUzlZ4ylKke4iP0kB/?name=zendesk_tickets_20230511_091412058891.json"

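I want to extract the file names and file URLs from the *Attachment(s)* section.
The problem is that each file URL is concatenated with the name of the next file.
This is my current code: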

import re

description = "----------------------------------------------\n\nTest Customer, May 11, 2023, 18:27\n\nDo you want to hear a construction joke? Oh, never mind, I'm still working on it!\n\nAttachment(s):\nistockphoto-1131743616-612x612.jpg - https://aaa.zendesk.com/attachments/token/jIUzlZ4ylKke4iP0kB/?name=istockphoto-1131743616-612x612.jpgzendesk_tickets_20230511_091412058891.json - https://aaa.zendesk.com/attachments/token/jIUzlZ4ylKke4iP0kB/?name=zendesk_tickets_20230511_091412058891.json"

# Extract the attachment section from the description
attachment_section = re.search(r'Attachment\(s\):\n(.+)', description, re.DOTALL).group(1)

# Extract the filenames and URLs using regular expressions
attachments = re.findall(r'([^\s]+)\s-\s(https?://\S+?(?=\s-\s\S+|$))', attachment_section)

# Process each attachment to extract the filename and URL
attachment_urls = []
for i in range(len(attachments)):
    file_name = attachments[i][0]
    url = attachments[i][1]
    attachment_urls.append({
        'file_name': file_name,
        'url': url
    })

# Print the extracted file names and URLs
for attachment in attachment_urls:
    print(f"File Name: {attachment['file_name']}")
    print(f"URL: {attachment['url']}")
    print()

My expected result is:

  • File Name: istockphoto-1131743616-612x612.jpg

URL: https://aaa.zendesk.com/attachments/token/jIUzlZ4ylKke4iP0kB/?name=istockphoto-1131743616-612x612.jpg

  • File Name: zendesk_tickets_20230511_091412058891.json

URL: https://aaa.zendesk.com/attachments/token/jIUzlZ4ylKke4iP0kB/?name=zendesk_tickets_20230511_091412058891.json

The attachment section can contain any number of files.
Can anyone help me solve this? Thank you very much for your help.
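
For reference, a minimal sketch of why the pattern in the question returns a single, glued result. It uses a shortened stand-in for the attachment section (the host `example.com` and the short file names are placeholders, not the real data). Because there is no whitespace between the first URL and the next file name, `\S+?` has nowhere to stop before the following ` - `, so the first URL swallows the second file name and the second attachment is never matched on its own.

```python
import re

# Shortened stand-in for the attachment section (placeholder data); note that
# nothing separates the first URL from the second file name.
section = (
    "a.jpg - https://example.com/?name=a.jpg"
    "b.json - https://example.com/?name=b.json"
)

# The pattern from the question, unchanged.
pattern = r'([^\s]+)\s-\s(https?://\S+?(?=\s-\s\S+|$))'
attachments = re.findall(pattern, section)

# Only one pair comes back, and its URL still has "b.json" glued to the end.
for file_name, url in attachments:
    print(file_name, "->", url)
```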

mspsb9vt1#

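You can capture the extension from the file name and use a backreference to group 2 to match the same extension in the URL.
First check that the attachment section was found, then use re.finditer and read the values of capture groups 1 and 3.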

(\S+(\.[^\s.]+))\s+-\s+(\S+?\2)

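The pattern matches:

  • ( Capture group 1
  • \S+ Match 1+ non-whitespace characters
  • (\.[^\s.]+) Capture group 2, match a . followed by 1+ non-whitespace characters other than a dot
  • ) Close group 1
  • \s+-\s+ Match - between 1+ whitespace characters
  • ( Capture group 3
  • \S+?\2 Match 1+ non-whitespace characters, as few as possible, followed by a backreference to what group 2 captured
  • ) Close group 3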
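
The breakdown above can be sketched with named groups and `re.VERBOSE`, which may make the role of the backreference easier to see. This is an illustrative rewrite of the same pattern, not a different technique; the group names (`file_name`, `ext`, `url`) and the `example.com` sample are assumptions made for the example.

```python
import re

# Same pattern as above, written with named groups and inline comments.
PATTERN = re.compile(
    r"""
    (?P<file_name> \S+ (?P<ext> \.[^\s.]+ ))   # file name; ext is its last dot-suffix
    \s+-\s+                                     # dash surrounded by whitespace
    (?P<url> \S+? (?P=ext) )                    # shortest run that ends in the same ext
    """,
    re.VERBOSE,
)

# Placeholder data: two attachments with no separator between them.
sample = (
    "a.jpg - https://example.com/?name=a.jpg"
    "b.json - https://example.com/?name=b.json"
)

pairs = [(m.group("file_name"), m.group("url")) for m in PATTERN.finditer(sample)]
for file_name, url in pairs:
    print(file_name, "->", url)
```

Note the assumption this relies on: the URL must end with the same extension as the file name, and that extension must not appear earlier in the URL, otherwise the lazy `\S+?` stops too soon.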


import re

description = "----------------------------------------------\n\nTest Customer, May 11, 2023, 18:27\n\nDo you want to hear a construction joke? Oh, never mind, I'm still working on it!\n\nAttachment(s):\nistockphoto-1131743616-612x612.jpg - https://aaa.zendesk.com/attachments/token/jIUzlZ4ylKke4iP0kB/?name=istockphoto-1131743616-612x612.jpgzendesk_tickets_20230511_091412058891.json - https://aaa.zendesk.com/attachments/token/jIUzlZ4ylKke4iP0kB/?name=zendesk_tickets_20230511_091412058891.json"

attachment_section = re.search(r'Attachment\(s\):\n(.+)', description, re.DOTALL)
attachment_urls = []
if attachment_section:
    matches = re.finditer(r"(\S+(\.[^\s.]+))\s+-\s+(\S+?\2)", attachment_section.group(1))
    for match in matches:
        attachment_urls.append({
            'file_name': match.group(1),
            'url': match.group(3)
        })

for attachment in attachment_urls:
    print(f"File Name: {attachment['file_name']}")
    print(f"URL: {attachment['url']}")
    print()

Output

File Name: istockphoto-1131743616-612x612.jpg
URL: https://aaa.zendesk.com/attachments/token/jIUzlZ4ylKke4iP0kB/?name=istockphoto-1131743616-612x612.jpg

File Name: zendesk_tickets_20230511_091412058891.json
URL: https://aaa.zendesk.com/attachments/token/jIUzlZ4ylKke4iP0kB/?name=zendesk_tickets_20230511_091412058891.json
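
A possible variation on the same idea, not taken from the answer itself: if every URL ends in `name=` followed by the exact file name, as in the sample data, the whole file name can serve as the backreference. That would also cover files without an extension. This is only a sketch under that assumption (URL-encoded names, for example, would not match); the function name `extract_attachments` and the `example.com` URLs are hypothetical.

```python
import re


def extract_attachments(section):
    """Return a list of {'file_name', 'url'} dicts from an attachment section.

    Assumes each URL ends with 'name=' followed by the exact file name.
    """
    # \1 repeats the captured file name, so the lazy \S+? stops right after it.
    pattern = re.compile(r"(\S+)\s+-\s+(\S+?name=\1)")
    return [
        {"file_name": m.group(1), "url": m.group(2)}
        for m in pattern.finditer(section)
    ]


# Placeholder data: three attachments run together, one without an extension.
section = (
    "a.jpg - https://example.com/t/?name=a.jpg"
    "b.json - https://example.com/t/?name=b.json"
    "README - https://example.com/t/?name=README"
)

for attachment in extract_attachments(section):
    print(attachment["file_name"], "->", attachment["url"])
```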

368yc8dk2#

A *simple/naive* approach is to hardcode the file extensions:

import re

# `description` is the string from the question
f_exts = "jpg|json"  # add more extensions here

pat = fr"(\S+) - (\S+?\.(?:{f_exts}))"
attachments = re.findall(pat, description)

for file_name, url in attachments:
    print("File Name:", file_name)
    print("URL:", url, "\n")

Output:

File Name: istockphoto-1131743616-612x612.jpg
URL: https://aaa.zendesk.com/attachments/token/jIUzlZ4ylKke4iP0kB/?name=istockphoto-1131743616-612x612.jpg 

File Name: zendesk_tickets_20230511_091412058891.json
URL: https://aaa.zendesk.com/attachments/token/jIUzlZ4ylKke4iP0kB/?name=zendesk_tickets_20230511_091412058891.json
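
If the extension list grows, one detail is worth watching: regex alternation tries options left to right, so with `js|json` the lazy match would stop at `.js` and cut a `.json` URL short. A minimal sketch of building the alternation from a list, longest extension first, follows. The helper name `build_pattern`, the extension list, and the `example.com` data are assumptions for illustration.

```python
import re


def build_pattern(extensions):
    """Build the hardcoded-extensions pattern from a list of extensions."""
    # Longest first, so "json" is tried before "js"; re.escape guards against
    # regex metacharacters inside an extension.
    alternation = "|".join(
        re.escape(ext) for ext in sorted(extensions, key=len, reverse=True)
    )
    return re.compile(rf"(\S+) - (\S+?\.(?:{alternation}))")


pattern = build_pattern(["js", "jpg", "json"])

# Placeholder data: two attachments with no separator between them.
section = (
    "a.jpg - https://example.com/?name=a.jpg"
    "b.json - https://example.com/?name=b.json"
)

attachments = pattern.findall(section)
for file_name, url in attachments:
    print("File Name:", file_name)
    print("URL:", url, "\n")
```

Like the hardcoded version, this still depends on none of the listed extensions appearing earlier in the URL.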
