How to extract specific links from a web page using Python?

r1wp621o  posted on 2023-06-04  in Python

I want to pull specific links from a web page using Python. In my example below, I am looking at an 8-K form on the SEC website, which contains several links: a link to a press release and also a link to a directory.
Here, I only want the links that are considered exhibits. On any 8-K form, all exhibits should fall under "Item 9.01. Financial Statements and Exhibits".
The code below gets all of the links on the 8-K, but I only want the links in the exhibits section.

import requests
from bs4 import BeautifulSoup

# Provide the URL and Headers
url = "https://www.sec.gov/Archives/edgar/data/707549/000070754923000005/lrcx-20230123.htm"
headers = {"User-Agent":"INSERT YOUR USER AGENT INFO HERE"}

# Send a GET request to retrieve the HTML content
response = requests.get(url,headers=headers)
html_content = response.text

# Parse the HTML content using BeautifulSoup
soup = BeautifulSoup(html_content, "html.parser")

# Find all the links in the HTML
all_links = soup.find_all("a")

# Extract the URLs from the links and print them
for link in all_links:
    url = link.get("href")
    print(url)

carvr3hs 1#

I couldn't find any filter field such as a class or id that would let me select the exhibit a tags specifically.
However, I noticed that the exhibit URLs contain the word "exhibit", so the code below finds all of those exhibit URLs.

# Extract the exhibit URLs from the links and print them.
# Restate the page URL here: the question's loop reassigned `url`,
# so it no longer holds the page address at this point.
url = "https://www.sec.gov/Archives/edgar/data/707549/000070754923000005/lrcx-20230123.htm"
base_endpoint = '/'.join(url.split('/')[:-1])
for link in all_links:
    a_url = link.get("href")
    if a_url and 'exhibit' in a_url:
        print(f'{base_endpoint}/{a_url}')
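Joining the directory and the href by hand works for this page, but the standard library's `urllib.parse.urljoin` handles relative hrefs more robustly (and leaves absolute hrefs untouched). A minimal sketch against inline HTML standing in for the filing page (the snippet and its anchor texts are illustrative, not the page's actual markup):

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

# Hypothetical snippet standing in for the filing page's HTML
html = '''
<a href="lrcx_exhibitx991xq2x2023.htm">Press Release</a>
<a href="https://www.sec.gov/cgi-bin/browse-edgar">EDGAR search</a>
<a href="lrcx-20230123.htm">Cover</a>
'''
page_url = "https://www.sec.gov/Archives/edgar/data/707549/000070754923000005/lrcx-20230123.htm"

soup = BeautifulSoup(html, "html.parser")
exhibit_urls = [
    urljoin(page_url, a["href"])            # relative hrefs resolve against the page URL
    for a in soup.find_all("a", href=True)  # href=True skips anchors without an href
    if "exhibit" in a["href"].lower()
]
print(exhibit_urls)
```

`urljoin` replaces the last path segment of the page URL with the relative href, which is exactly the `base_endpoint` logic above without the string surgery.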

pexxcrt2 2#

Looking at the page, you can search for all links that contain the word exhibit in their href=:

import requests
from bs4 import BeautifulSoup

url = 'https://www.sec.gov/Archives/edgar/data/707549/000070754923000005/lrcx-20230123.htm'
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/113.0'}
soup = BeautifulSoup(requests.get(url, headers=headers).content, 'html.parser')

for link in soup.select('[href*="exhibit"]'):
    print(link.text)
    print(url.rsplit('/', maxsplit=1)[0] + '/' + link['href'])
    print()

Output:

Press Release dated January 25, 2023 announcing financial results for the fiscal quarter ended December 25, 2022
https://www.sec.gov/Archives/edgar/data/707549/000070754923000005/lrcx_exhibitx991xq2x2023.htm
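The CSS attribute selector above matches case-sensitively. soupsieve, the selector engine behind `BeautifulSoup.select`, also accepts the CSS case-insensitivity flag `i`, which guards against `Exhibit`/`EXHIBIT` spellings. A sketch on inline HTML (the hrefs are made up for illustration):

```python
from bs4 import BeautifulSoup

# Inline HTML with a mixed-case href (hypothetical, for illustration)
html = '<a href="lrcx_Exhibitx991.htm">EX-99.1</a><a href="cover.htm">Cover</a>'
soup = BeautifulSoup(html, "html.parser")

# [href*="exhibit" i] matches regardless of case
matches = [a["href"] for a in soup.select('[href*="exhibit" i]')]
print(matches)
```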

Edit: to remove duplicates, you can use e.g. set():

import requests
from bs4 import BeautifulSoup

url = 'https://www.sec.gov/Archives/edgar/data/707549/000070754923000005/lrcx-20230123.htm'
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/113.0'}
soup = BeautifulSoup(requests.get(url, headers=headers).content, 'html.parser')

base_url = url.rsplit('/', maxsplit=1)[0] + '/'

out = set()
for link in soup.select('[href*="exhibit"]'):
    out.add(base_url + link['href'])

print(*out, sep='\n')

Output:

https://www.sec.gov/Archives/edgar/data/707549/000070754923000005/lrcx_exhibitx991xq2x2023.htm
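Note that `set()` discards ordering, so exhibits may print in arbitrary order. If the document order of the exhibits matters, `dict.fromkeys()` deduplicates while preserving first-seen order (dicts keep insertion order in Python 3.7+). A small sketch on a hypothetical list of hrefs:

```python
# Hypothetical list of hrefs with a repeat, as a filing page might produce
hrefs = ["ex991.htm", "ex991.htm", "ex101.htm"]

# dict keys are unique and keep first-seen order
unique = list(dict.fromkeys(hrefs))
print(unique)  # -> ['ex991.htm', 'ex101.htm']
```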
