Beautiful Soup is not giving me the expected results when extracting titles from a web page in Python. What should I do?

Posted by ivqmmu1c on 2023-06-07 in Python

I have been trying to extract all the titles from this web page: https://www.ycombinator.com/companies?industry=B2B%20Software%20and%20Services&status=Active&status=Public&status=Inactive&tags=Fintech&tags=Developer%20Tools&tags=Artificial%20Intelligence&tags=Analytics (for example GitLab, Deel, Fivetran, and so on). I am using the Beautiful Soup package for this, but it is not giving me the correct results.

import requests
from bs4 import BeautifulSoup

url = 'https://www.ycombinator.com/companies?industry=B2B%20Software%20and%20Services&status=Active&status=Public&status=Inactive&tags=Fintech&tags=Developer%20Tools&tags=Artificial%20Intelligence&tags=Analytics'

response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')

a_tags = soup.find_all('a')
for a_tag in a_tags:
    div_tag = a_tag.find('div', class_='right')
    if div_tag:
        span_tags = div_tag.find_all('span')
        for span_tag in span_tags:
            text = span_tag.get_text(strip=True)
            print(text)

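This is the code I am using. I first find all the a tags on the page, then go into the div with class right inside each one, and finally into its span tags, which is where I think the titles I want are. However, it still prints nothing. Does anyone know how to fix this?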

Answer #1 (jtw3ybtb)

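You get no results because the titles you are looking for (GitLab, Deel, Fivetran, etc.) are part of the search results, which are rendered by JavaScript and take a few seconds to load on the page. Plain requests cannot do this, because it does not execute JavaScript: it only returns the HTML the server sends, before any script has run. You can, however, achieve the same thing with Selenium.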
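Before switching tools, it can help to confirm that the names really are missing from the HTML that requests receives. The following is a minimal sketch using only the standard library; the helper names (TextCollector, missing_from_html) and both sample HTML strings are hypothetical stand-ins written for illustration, not the real page's markup.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is not inside a script or style block.
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def missing_from_html(html, expected):
    """Return the expected strings that do not appear in the page's text."""
    parser = TextCollector()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.chunks)
    return [name for name in expected if name not in text]


# Hypothetical samples: an empty application shell, as a server might send
# it before JavaScript runs, and the same page after rendering.
STATIC_SHELL = (
    '<html><body><div id="app"></div>'
    '<script>var boot = "GitLab";</script></body></html>'
)
RENDERED_DOM = (
    '<html><body><a href="/companies/gitlab"><div class="right">'
    "<div><span>GitLab</span></div></div></a></body></html>"
)

print(missing_from_html(STATIC_SHELL, ["GitLab"]))  # ['GitLab']
print(missing_from_html(RENDERED_DOM, ["GitLab"]))  # []
```

In the question's script, the same check would be missing_from_html(response.text, ["GitLab", "Deel"]). If the names come back as missing, the content is being added in the browser and no Beautiful Soup selector will find it in that response.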

Here is how to solve this with Selenium:

from selenium.webdriver import Chrome
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
import selenium.webdriver.support.expected_conditions as EC

driver = Chrome()
url = "https://www.ycombinator.com/companies?industry=B2B%20Software%20and%20Services&status=Active&status=Public&status=Inactive&tags=Fintech&tags=Developer%20Tools&tags=Artificial%20Intelligence&tags=Analytics"
driver.get(url)

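# Wait up to 20 seconds for the JavaScript-rendered result links to become visible.
# The class name below looks auto-generated, so it may change when the site is updated.
results = WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, 'a.WxyYeI15LZ5U_DOM0z8F')))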

print(f"total search results: {len(results)}")
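for result in results:
    # find_element returns the first span under div.right > div in each result link.
    print(result.find_element(By.CSS_SELECTOR, 'div.right>div>span').text)

driver.quit()  # close the browser once the names have been printed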

Output:

total search results: 40
GitLab
Deel
Fivetran
Checkr
Retool
Podium
Algolia
Modern Treasury
Sift
Pave
Mux
Jasper.ai
Apollo
Mashgin
Veriff
Airbyte
Teleport
People.ai
Sendbird
Mixpanel
Human Interest
Heap
Frubana Inc
Replit
QuickNode
Middesk
Supabase
TRM Labs
InfluxData
Bitrise
Hightouch
Instabug
Mezmo
Routable
HackerRank
Duffel
Deepgram
RevenueCat
TrueNorth
Nuvocargo
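If you would rather keep the parsing in Beautiful Soup, as in the question, one option is to let Selenium render the page and then hand the resulting HTML to Beautiful Soup. The sketch below shows only the parsing half so that it can run without a browser. The function name extract_titles and the SAMPLE fragment are hypothetical, and the selector is the one used in the answer above, which is assumed to still match the live page.

```python
from bs4 import BeautifulSoup


def extract_titles(html):
    """Return the first span under div.right > div for each link in html."""
    soup = BeautifulSoup(html, "html.parser")
    titles = []
    for a_tag in soup.find_all("a"):
        # select_one returns the first match inside this link, or None.
        span = a_tag.select_one("div.right > div > span")
        if span is not None:
            titles.append(span.get_text(strip=True))
    return titles


# Hypothetical fragment shaped like the rendered search results.
SAMPLE = """
<a href="/companies/gitlab"><div class="right"><div>
  <span>GitLab</span><span>other text</span>
</div></div></a>
<a href="/companies/deel"><div class="right"><div>
  <span>Deel</span>
</div></div></a>
<a href="/about">About</a>
"""

print(extract_titles(SAMPLE))  # ['GitLab', 'Deel']
```

With a Selenium session like the one above, this would be called as extract_titles(driver.page_source) after the wait has completed and before the browser is closed. Taking one span per link mirrors the find_element call in the Selenium loop, which also returns only the first match.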
