How to extract specific links from a web page using Python?

r1wp621o  posted on 2023-06-04  in Python

I want to pull specific links from a web page using Python. In my example below, I am looking at an 8-K form on the SEC website, which contains several links: a link to a press release and also a link to a directory.
Here, I only want the links that are considered exhibits. On any 8-K form, all exhibits should fall under "Item 9.01. Financial Statements and Exhibits".
The code below gets all of the links on the 8-K, but I only want the links in the exhibits section.

import requests
from bs4 import BeautifulSoup

# Provide the URL and Headers
url = "https://www.sec.gov/Archives/edgar/data/707549/000070754923000005/lrcx-20230123.htm"
headers = {"User-Agent":"INSERT YOUR USER AGENT INFO HERE"}

# Send a GET request to retrieve the HTML content
response = requests.get(url,headers=headers)
html_content = response.text

# Parse the HTML content using BeautifulSoup
soup = BeautifulSoup(html_content, "html.parser")

# Find all the links in the HTML
all_links = soup.find_all("a")

# Extract the URLs from the links and print them
for link in all_links:
    url = link.get("href")
    print(url)

carvr3hs 1#

I couldn't find any filter field such as a class or id that would let me select the exhibit a tags specifically.
However, I noticed that the exhibit URLs contain the word "exhibit", so the code below finds all of those exhibit URLs.

# Extract the exhibit URLs from the links and print them.
# Restate the page URL here: the question's loop reassigned `url`,
# so it no longer holds the page address at this point.
url = "https://www.sec.gov/Archives/edgar/data/707549/000070754923000005/lrcx-20230123.htm"
base_endpoint = '/'.join(url.split('/')[:-1])
for link in all_links:
    a_url = link.get("href")
    if a_url and 'exhibit' in a_url:
        print(f'{base_endpoint}/{a_url}')
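Joining the directory and the href by hand works for this page, but the standard library's `urllib.parse.urljoin` handles relative hrefs more robustly (and leaves absolute hrefs untouched). A minimal sketch against inline HTML standing in for the filing page (the snippet and its anchor texts are illustrative, not the page's actual markup):

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

# Hypothetical snippet standing in for the filing page's HTML
html = '''
<a href="lrcx_exhibitx991xq2x2023.htm">Press Release</a>
<a href="https://www.sec.gov/cgi-bin/browse-edgar">EDGAR search</a>
<a href="lrcx-20230123.htm">Cover</a>
'''
page_url = "https://www.sec.gov/Archives/edgar/data/707549/000070754923000005/lrcx-20230123.htm"

soup = BeautifulSoup(html, "html.parser")
exhibit_urls = [
    urljoin(page_url, a["href"])            # relative hrefs resolve against the page URL
    for a in soup.find_all("a", href=True)  # href=True skips anchors without an href
    if "exhibit" in a["href"].lower()
]
print(exhibit_urls)
```

`urljoin` replaces the last path segment of the page URL with the relative href, which is exactly the `base_endpoint` logic above without the string surgery.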

pexxcrt2 2#

Looking at the page, you can search for all links that contain the word exhibit in their href=:

import requests
from bs4 import BeautifulSoup

url = 'https://www.sec.gov/Archives/edgar/data/707549/000070754923000005/lrcx-20230123.htm'
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/113.0'}
soup = BeautifulSoup(requests.get(url, headers=headers).content, 'html.parser')

for link in soup.select('[href*="exhibit"]'):
    print(link.text)
    print(url.rsplit('/', maxsplit=1)[0] + '/' + link['href'])
    print()

Output:

Press Release dated January 25, 2023 announcing financial results for the fiscal quarter ended December 25, 2022
https://www.sec.gov/Archives/edgar/data/707549/000070754923000005/lrcx_exhibitx991xq2x2023.htm
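The CSS attribute selector above matches case-sensitively. soupsieve, the selector engine behind `BeautifulSoup.select`, also accepts the CSS case-insensitivity flag `i`, which guards against `Exhibit`/`EXHIBIT` spellings. A sketch on inline HTML (the hrefs are made up for illustration):

```python
from bs4 import BeautifulSoup

# Inline HTML with a mixed-case href (hypothetical, for illustration)
html = '<a href="lrcx_Exhibitx991.htm">EX-99.1</a><a href="cover.htm">Cover</a>'
soup = BeautifulSoup(html, "html.parser")

# [href*="exhibit" i] matches regardless of case
matches = [a["href"] for a in soup.select('[href*="exhibit" i]')]
print(matches)
```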

Edit: to remove duplicates, you can use e.g. set():

import requests
from bs4 import BeautifulSoup

url = 'https://www.sec.gov/Archives/edgar/data/707549/000070754923000005/lrcx-20230123.htm'
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/113.0'}
soup = BeautifulSoup(requests.get(url, headers=headers).content, 'html.parser')

base_url = url.rsplit('/', maxsplit=1)[0] + '/'

out = set()
for link in soup.select('[href*="exhibit"]'):
    out.add(base_url + link['href'])

print(*out, sep='\n')

Output:

https://www.sec.gov/Archives/edgar/data/707549/000070754923000005/lrcx_exhibitx991xq2x2023.htm
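Note that `set()` discards ordering, so exhibits may print in arbitrary order. If the document order of the exhibits matters, `dict.fromkeys()` deduplicates while preserving first-seen order (dicts keep insertion order in Python 3.7+). A small sketch on a hypothetical list of hrefs:

```python
# Hypothetical list of hrefs with a repeat, as a filing page might produce
hrefs = ["ex991.htm", "ex991.htm", "ex101.htm"]

# dict keys are unique and keep first-seen order
unique = list(dict.fromkeys(hrefs))
print(unique)  # -> ['ex991.htm', 'ex101.htm']
```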
