I can't get all the HTML data from Beautiful Soup

Posted by wswtfjt7 on 2022-11-20 in Other


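I'm new to web scraping, and I want to get just one piece of text from a Google page (basically the date of a football match), but the soup doesn't contain all of the HTML (I'm guessing because of how the request is made), so I can't find it. I know this may be because Google uses JavaScript and that I should use Selenium with chromedriver, but the problem is that the code needs to run on another computer, so I can't really use it.
Here is the code: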

import pandas as pd
from bs4 import BeautifulSoup
import requests

a = "Newcastle"
url ="https://www.google.com/search?q=" + a + "+next+match"

response = requests.get(url)
soup = BeautifulSoup(response.text,"html.parser")

print(soup)

# Print the text of each <div> (not the whole soup on every iteration)
for div in soup.find_all('div'):
    print(div.get_text())

What I'm looking for is

"<span class="imso_mh__lr-dt-ds">17/12, 13:30</span>"

which has

"//*[@id="sports-app"]/div/div[3]/div[1]/div/div/div/div/div[1]/div/div[1]/div/span[2]"

as its XPath.
Is this possible?


Source: https://stackoverflow.com/questions/74466121/i-cant-get-all-the-html-data-from-beautiful-soup

1 Answer


jdzmm42g1#

Try setting the User-Agent header when requesting the page from Google:

import requests
from bs4 import BeautifulSoup

a = "Newcastle"
url = "https://www.google.com/search?q=" + a + "+next+match&hl=en"

headers = {
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:106.0) Gecko/20100101 Firefox/106.0"
}

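soup = BeautifulSoup(requests.get(url, headers=headers).content, "html.parser")

# Locate the match header block; Google's markup can change, so check for None
next_match = soup.select_one('[data-entityname="Match Header"]')
if next_match is None:
    raise SystemExit("Match header not found in the response")

# Remove elements marked aria-hidden so their text is excluded
for t in next_match.select('[aria-hidden="true"]'):
    t.extract()

text = next_match.get_text(strip=True, separator=" ")
print(text)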

Prints:

Club Friendlies · Dec 17, 13:30 Newcastle VS Vallecano
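
If you want to reuse this for other teams, the same steps can be split into small functions. The following is only a sketch under a few assumptions: the names `build_search_url`, `extract_match_text`, and `fetch_next_match` are made up for illustration, `SAMPLE_HTML` is a hypothetical stand-in for Google's real markup (which can change at any time), and the `data-entityname="Match Header"` selector is taken from the code above rather than from any documented API. It URL-encodes the query with `urllib.parse.urlencode` and returns `None` instead of raising when the header block is missing.

```python
from urllib.parse import urlencode

from bs4 import BeautifulSoup

# Browser-like User-Agent, copied from the answer above.
HEADERS = {
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:106.0) Gecko/20100101 Firefox/106.0"
}


def build_search_url(team):
    """Return a Google search URL with the query safely URL-encoded."""
    params = {"q": f"{team} next match", "hl": "en"}
    return "https://www.google.com/search?" + urlencode(params)


def extract_match_text(html):
    """Return the match header text, or None if the block is not present."""
    soup = BeautifulSoup(html, "html.parser")
    header = soup.select_one('[data-entityname="Match Header"]')
    if header is None:
        return None
    # Remove elements marked aria-hidden so their text is excluded.
    for hidden in header.select('[aria-hidden="true"]'):
        hidden.extract()
    return header.get_text(strip=True, separator=" ")


def fetch_next_match(team):
    """Download the search page and extract the match text (needs network)."""
    import requests  # imported here so the parsing helpers work offline

    response = requests.get(build_search_url(team), headers=HEADERS, timeout=10)
    response.raise_for_status()
    return extract_match_text(response.text)


# Hypothetical sample markup, used only to exercise the parser offline.
SAMPLE_HTML = """
<div data-entityname="Match Header">
  <span>Club Friendlies</span>
  <span aria-hidden="true">decorative</span>
  <span>Dec 17, 13:30</span>
</div>
"""

print(extract_match_text(SAMPLE_HTML))
```

Calling `fetch_next_match("Newcastle")` would run the full request. Keeping the parsing in its own function means the selector can be checked against saved HTML without hitting Google each time.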
