Parsing HTML using BeautifulSoup select()

Posted by 7gcisfzg on 2021-08-20 in Java

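I am trying to get the content of the latest post using BeautifulSoup.
Sometimes the latest post has a tag, and sometimes it does not.
I want to get the tag if it is there; if it is not, I just want the rest of the text.
My code is below.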

import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0'}
url = "https:// "
req = requests.get(url, headers=headers)
html = req.text
soup = BeautifulSoup(html, 'html.parser')
link = soup.select('#flagList > div.clear.ab-webzine > div > a')
title = soup.select('#flagList > div.clear.ab-webzine > div > div.wz-item-header > a > span')
latest_link = link[0]['href']  # link of latest post (assumes the href is an absolute URL)
latest_title = title[0].text  # title of latest post

# to get the text of latest post

t_url = latest_link
t_req = requests.get(t_url, headers=headers)
t_html = t_req.text
t_soup = BeautifulSoup(t_html, 'html.parser')
maintext = t_soup.select('#flagArticle > div.rhymix_content.xe_content')
# select_one() returns None when nothing matches, so this line fails on posts without a tag
tag = t_soup.select_one('div.rd.clear > div.rd_body.clear > ul > li > a').get_text()

print(maintext)
print(tag)

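The problem is that if the latest post has no tag, the code raises the following error:

AttributeError: 'NoneType' object has no attribute 'get_text'

If I remove .get_text() from the code, it returns None when the latest post has no tag. When the tag exists, it returns:

<a href="/posts?search_target=tag&search_keyword=ABC">ABC</a>

But I want to get just ABC. How can I solve this?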

python beautifulsoup

Source: https://stackoverflow.com/questions/68329123/parsing-html-using-beautifulsoup-select

1 Answer

pxiryf3j #1

Try this:

import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0'}
url = "https:// "
req = requests.get(url, headers=headers)
html = req.text
soup = BeautifulSoup(html, 'html.parser')
link = soup.select('#flagList > div.clear.ab-webzine > div > a')
title = soup.select('#flagList > div.clear.ab-webzine > div > div.wz-item-header > a > span')
latest_link = link[0]['href']  # link of latest post (assumes the href is an absolute URL)
latest_title = title[0].text  # title of latest post

# to get the text of latest post

t_url = latest_link
t_req = requests.get(t_url, headers=headers)
t_html = t_req.text
t_soup = BeautifulSoup(t_html, 'html.parser')
maintext = t_soup.select('#flagArticle > div.rhymix_content.xe_content')
try:
    # .text returns only the link text (e.g. ABC), not the whole <a> element
    tag = t_soup.select_one('div.rd.clear > div.rd_body.clear > ul > li > a').text
    print(tag)
except AttributeError:
    # select_one() returned None: this post has no tag
    print("Sure the tag exists on this page??")

print(maintext)
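
As an alternative to try/except, the missing tag can be handled with an explicit None check, since select_one() returns None when the selector matches nothing. The following is only a minimal sketch: the helper name extract_tag, its default parameter, and the two sample HTML strings are hypothetical and merely mimic the selector used in the question, so it runs without any network request.

```python
from bs4 import BeautifulSoup

# Hypothetical sample pages; the markup only mimics the selector from the question.
WITH_TAG = """
<div class="rd clear"><div class="rd_body clear">
  <ul><li><a href="/posts?search_target=tag&amp;search_keyword=ABC">ABC</a></li></ul>
</div></div>
"""
WITHOUT_TAG = """
<div class="rd clear"><div class="rd_body clear">
  <p>No tags here.</p>
</div></div>
"""

TAG_SELECTOR = "div.rd.clear > div.rd_body.clear > ul > li > a"


def extract_tag(html, default=None):
    """Return the text of the first tag link, or `default` if there is none."""
    soup = BeautifulSoup(html, "html.parser")
    node = soup.select_one(TAG_SELECTOR)  # None when nothing matches
    if node is None:
        return default
    return node.get_text(strip=True)  # only the link text, not the <a> element


print(extract_tag(WITH_TAG))
print(extract_tag(WITHOUT_TAG, default=""))
```

Compared with catching the exception, the explicit check keeps unrelated errors from being silenced and lets the caller choose what a missing tag should become.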

Answered on 2021-08-20
