I have a string:
description = "----------------------------------------------\n\nTest Customer, May 11, 2023, 18:27\n\nDo you want to hear a construction joke? Oh, never mind, I'm still working on it!\n\nAttachment(s):\nistockphoto-1131743616-612x612.jpg - https://aaa.zendesk.com/attachments/token/jIUzlZ4ylKke4iP0kB/?name=istockphoto-1131743616-612x612.jpgzendesk_tickets_20230511_091412058891.json - https://aaa.zendesk.com/attachments/token/jIUzlZ4ylKke4iP0kB/?name=zendesk_tickets_20230511_091412058891.json"
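I want to extract the file names and file URLs from the Attachment(s) section.
The problem is that each file URL is concatenated directly with the file name of the next file, with no separator between them.
This is my current code: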
import re
description = "----------------------------------------------\n\nTest Customer, May 11, 2023, 18:27\n\nDo you want to hear a construction joke? Oh, never mind, I'm still working on it!\n\nAttachment(s):\nistockphoto-1131743616-612x612.jpg - https://aaa.zendesk.com/attachments/token/jIUzlZ4ylKke4iP0kB/?name=istockphoto-1131743616-612x612.jpgzendesk_tickets_20230511_091412058891.json - https://aaa.zendesk.com/attachments/token/jIUzlZ4ylKke4iP0kB/?name=zendesk_tickets_20230511_091412058891.json"
# Extract the attachment section from the description
attachment_section = re.search(r'Attachment\(s\):\n(.+)', description, re.DOTALL).group(1)
# Extract the filenames and URLs using regular expressions
attachments = re.findall(r'([^\s]+)\s-\s(https?://\S+?(?=\s-\s\S+|$))', attachment_section)
# Process each attachment to extract the filename and URL
attachment_urls = []
for file_name, url in attachments:
    attachment_urls.append({
        'file_name': file_name,
        'url': url
    })
# Print the extracted file names and URLs
for attachment in attachment_urls:
    print(f"File Name: {attachment['file_name']}")
    print(f"URL: {attachment['url']}")
    print()
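My expected result is:
- File Name: istockphoto-1131743616-612x612.jpg
URL: https://aaa.zendesk.com/attachments/token/jIUzlZ4ylKke4iP0kB/?name=istockphoto-1131743616-612x612.jpg
- File Name: zendesk_tickets_20230511_091412058891.json
URL: https://aaa.zendesk.com/attachments/token/jIUzlZ4ylKke4iP0kB/?name=zendesk_tickets_20230511_091412058891.json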
The attachment section can contain an unlimited number of files.
Can anyone help me solve this? Thanks a lot for your help.
2 Answers
mspsb9vt1#
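You can capture the extension from the file name and use a backreference to group 2 to match the same extension at the end of the URL.
You can then use re.finditer and, for each match, read the values of capture group 1 (the file name) and capture group 3 (the URL).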
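The pattern is (\S+(\.[^\s.]+))\s+-\s+(\S+?\2) and it matches:
- ( Capture group 1
  - \S+ Match 1+ non-whitespace characters
  - (\.[^\s.]+) Capture group 2, match a . followed by 1+ non-whitespace characters other than a dot
- ) Close group 1
- \s+-\s+ Match a - between 1+ whitespace characters on each side
- ( Capture group 3
  - \S+?\2 Match 1+ non-whitespace characters, as few as possible, followed by a backreference to what was captured in group 2
- ) Close group 3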
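As a minimal sketch of how this pattern can be applied with re.finditer: the sample text below is shortened to the attachment part of the string from the question, and the variable names (text, pattern, attachments) are illustrative assumptions rather than part of the original answer.

```python
import re

# Shortened sample: only the attachment part of the question's string.
text = (
    "Attachment(s):\n"
    "istockphoto-1131743616-612x612.jpg - "
    "https://aaa.zendesk.com/attachments/token/jIUzlZ4ylKke4iP0kB/"
    "?name=istockphoto-1131743616-612x612.jpg"
    "zendesk_tickets_20230511_091412058891.json - "
    "https://aaa.zendesk.com/attachments/token/jIUzlZ4ylKke4iP0kB/"
    "?name=zendesk_tickets_20230511_091412058891.json"
)

# Group 1: file name, group 2: its extension (with the dot),
# group 3: URL, matched lazily up to the first repeat of that extension.
pattern = r"(\S+(\.[^\s.]+))\s+-\s+(\S+?\2)"

attachments = [
    {"file_name": m.group(1), "url": m.group(3)}
    for m in re.finditer(pattern, text)
]

for attachment in attachments:
    print(f"File Name: {attachment['file_name']}")
    print(f"URL: {attachment['url']}")
    print()
```

Note that this relies on every URL ending in the same extension as its file name, and that the backreference comparison is case-sensitive.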
368yc8dk2#
A simple/naive approach is to hardcode the file extensions.
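One way this could look, as a sketch only: the extension list (jpg, json) is an assumption taken from the two files in the question, and the names EXTENSIONS, pattern and pairs are hypothetical. Each URL is matched lazily and stops at the first hardcoded extension it reaches.

```python
import re

# Shortened sample: only the attachment part of the question's string.
attachment_section = (
    "istockphoto-1131743616-612x612.jpg - "
    "https://aaa.zendesk.com/attachments/token/jIUzlZ4ylKke4iP0kB/"
    "?name=istockphoto-1131743616-612x612.jpg"
    "zendesk_tickets_20230511_091412058891.json - "
    "https://aaa.zendesk.com/attachments/token/jIUzlZ4ylKke4iP0kB/"
    "?name=zendesk_tickets_20230511_091412058891.json"
)

# Assumed list of known extensions; extend it as needed.
EXTENSIONS = ("jpg", "json")
ext_group = "|".join(re.escape(ext) for ext in EXTENSIONS)

# File name, " - ", then a URL that ends at the first known extension.
pattern = re.compile(rf"(\S+)\s-\s(https?://\S+?\.(?:{ext_group}))")

pairs = pattern.findall(attachment_section)

for file_name, url in pairs:
    print(f"File Name: {file_name}")
    print(f"URL: {url}")
    print()
```

The trade-off is that any extension missing from the list is not matched, and a URL that contains one of the listed extensions earlier in its path would be cut short.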