How do I find the "Description" with BeautifulSoup using div class_='css-gz8dae'?

Posted by 7kjnsjlb on 2023-03-24

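I'm new to Python and I'm learning it for web scraping. I'm using BeautifulSoup to collect the descriptions from job offers such as this one: https://justjoin.it/offers/itds-net-fullstack-developer-angular
On another job site, the same code with a different div class finds what I need, but I can't get it to work on justjoin.it.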

import requests
from bs4 import BeautifulSoup

link="https://justjoin.it/offers/jungle-devops-engineer"

response_IDs=requests.get(link)
soup=BeautifulSoup(response_IDs.text, 'html.parser')
Search_part = soup.find(id='root')
description= Search_part.find_all('div', class_='css-gz8dae')

for i in description:
    print(i)

Could you please help me with working code for this?

Answer #1 (wa7juj8i)

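As Pawel Kam and cconsta1 explained, the site has to run a fair amount of JavaScript before it is fully rendered. If you want the complete HTML of the page, use Selenium (cconsta1 covers this in detail in their answer). If you only want the description part of the job posting, the solution below may be a better fit.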
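To make the point about JavaScript rendering concrete, here is a minimal offline sketch. Both HTML strings are hypothetical stand-ins (they are not copied from justjoin.it): one imitates what `requests.get(link).text` might return before any script runs, the other imitates the DOM after rendering. The helper name `count_matches` is also just for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the raw response of a JavaScript-rendered page:
# an empty mount point plus a script tag, and no description markup yet.
RAW_HTML = """
<html><body>
  <div id="root"></div>
  <script src="/static/js/main.js"></script>
</body></html>
"""

# Hypothetical stand-in for the same page after a browser has run the script.
RENDERED_HTML = """
<html><body>
  <div id="root"><div class="css-gz8dae">Job description text</div></div>
</body></html>
"""

def count_matches(html, css_class):
    """Return how many <div> elements carry the given class."""
    soup = BeautifulSoup(html, "html.parser")
    return len(soup.find_all("div", class_=css_class))

raw_count = count_matches(RAW_HTML, "css-gz8dae")
rendered_count = count_matches(RENDERED_HTML, "css-gz8dae")
print(raw_count, rendered_count)
```

If the real response looks like the first string, `find_all` has nothing to match no matter which class name is passed, which would explain an empty result from the code in the question.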

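Fetch the JSON that contains the job description.

Using my browser's developer tools, I found that the site sends a GET request to an API endpoint (the URL used in the script below) to fetch all the information you see in the job posting. The response to that request is JSON.
So if you only want the data shown in the job posting, all you have to do is request that JSON and then use BeautifulSoup to parse the HTML stored inside it to get the specific data you want.
I found this article useful when I was first learning to scrape by "reverse engineering" the network requests a site makes.
The following script fetches the JSON and parses the HTML of the Description section: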

import requests
import json
from bs4 import BeautifulSoup

def pretty_print_json(json_obj):
    json_string = json.dumps(json_obj, indent=4)
    print(json_string)

def get_json(url, req_headers):
    response = requests.get(url, headers=req_headers)

    # parse the JSON response body into a dict
    return response.json()

def find_first_element(html, tag):
    soup = BeautifulSoup(html, 'html.parser')

    # find the first occurrence of the given element
    element = soup.find(tag)
    return element

def pretty_print_html(html):
    soup = BeautifulSoup(html, 'html.parser')
    print(soup.prettify())

if __name__ == "__main__":

    url = "https://justjoin.it/api/offers/itds-net-fullstack-developer-angular"
    api_headers = {
        "X-CSRF-Token": "/w2ocZnRs5LN43gzQsi8zWYcdAOVmhjBEpB/dduBn5rnhzjqOnvlo7SsrEdf5Rht3Aa2x/+/00OZJuh3tgmaDA=="
    }
    json_obj = get_json(url, api_headers)

    # view the entire JSON (in a readable format)
    # to familiarize yourself with its structure
    pretty_print_json(json_obj)

    # access HTML that makes up Description section of job posting
    job_description_html = json_obj['body']

    # look at job description html
    pretty_print_html(job_description_html)

    # get the job summary (i.e. the opening paragraph of Description section) 
    job_summary = find_first_element(job_description_html, 'div').text
    print(job_summary)

The other printouts are fairly large, so I'm only showing the output of print(job_summary):

As a .NET FullStack Developer (Angular) you will be working on implementing innovative 
architectural solutions for our client in the banking sector. Our client is the first 
fully online bank in Poland, setting directions for the development of mobile and online 
banking. It is one of the strongest and fastest growing financial brands in Poland. Your 
key responsibilities: 

You'll have to experiment with it to get exactly the information you want. Let me know if you need me to clarify anything.
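As one possible starting point for that experimenting, the sketch below pulls the plain text and the bullet points out of a description. `SAMPLE_BODY` is a made-up string that only imitates the kind of HTML stored under `json_obj['body']`; the real markup may be structured differently, and the helper names are hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical sample shaped like the HTML string stored under json_obj['body'].
SAMPLE_BODY = """
<div>As a developer you will build banking software. Your key responsibilities:</div>
<ul>
  <li>Design new features</li>
  <li>Review code</li>
</ul>
"""

def description_to_text(html):
    """Flatten the description HTML into plain text, one text node per line."""
    soup = BeautifulSoup(html, "html.parser")
    return soup.get_text(separator="\n", strip=True)

def list_items(html):
    """Return the text of every <li> element, e.g. the responsibilities list."""
    soup = BeautifulSoup(html, "html.parser")
    return [li.get_text(strip=True) for li in soup.find_all("li")]

text = description_to_text(SAMPLE_BODY)
items = list_items(SAMPLE_BODY)
print(text)
print(items)
```

With the real data you would pass `job_description_html` from the script above instead of `SAMPLE_BODY`.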

Answer #2 (fykwrbwg)

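As mentioned in the comments, the problem is that the content on this site is rendered with JavaScript, so requests alone cannot fetch the dynamic content. Selenium solves this because it drives a real browser through a web driver, and the browser executes the JavaScript.
First, make sure you have Selenium installed: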

pip install selenium

On Google Colab, put ! in front of pip install (see below).
As I mentioned, I run all my Python on Google Colab, and there I use Firefox. This works for me:

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.firefox.options import Options

link = "https://justjoin.it/offers/jungle-devops-engineer"

# Set up headless browser (no GUI)
options = Options()
# options.headless = True is deprecated in recent Selenium 4 releases;
# passing the command-line flag works across versions
options.add_argument("-headless")
browser = webdriver.Firefox(options=options)

# Use Selenium to get the page source after JavaScript has executed
browser.get(link)
page_source = browser.page_source
browser.quit()

# Use BeautifulSoup to parse the HTML
soup = BeautifulSoup(page_source, 'html.parser')
description = soup.find_all('div', class_='css-gz8dae')

for i in description:
    print(i.text)

Here is the output:

Running a flexible Machine Learning engine at scale is hard. 
We must ingest and process large volumes of data 
uninterruptedly and store it in a scalable manner. 
The data needs to be prepared and served to hundreds of 
models constantly. All the predictions of the models, as well as other data pipelines, ...

If you use Chrome, replace this line

browser = webdriver.Firefox(options=options)

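with this one (and import Options from selenium.webdriver.chrome.options instead of the Firefox module)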

browser = webdriver.Chrome(options=options)
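Reading `page_source` immediately after `browser.get` can race with the page's JavaScript, so the sketch below adds an explicit wait and uses the Chrome classes. It is one possible way to arrange this, not a tested recipe for justjoin.it: the helper names `fetch_rendered_html` and `extract_descriptions` and the 15-second timeout are assumptions, and the class name `css-gz8dae` looks auto-generated, so it may change when the site is rebuilt.

```python
from bs4 import BeautifulSoup

def extract_descriptions(page_source, css_class="css-gz8dae"):
    """Return the text of every <div> with the given class in rendered HTML."""
    soup = BeautifulSoup(page_source, "html.parser")
    return [div.get_text(" ", strip=True)
            for div in soup.find_all("div", class_=css_class)]

def fetch_rendered_html(url, css_class="css-gz8dae", timeout=15):
    """Load url in headless Chrome and return the HTML once the div exists."""
    # Imported here so the parsing helper above works without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = Options()
    options.add_argument("--headless")
    browser = webdriver.Chrome(options=options)
    try:
        browser.get(url)
        # Block until at least one matching element is present in the DOM;
        # raises TimeoutException if it never appears within `timeout` seconds.
        WebDriverWait(browser, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, f"div.{css_class}"))
        )
        return browser.page_source
    finally:
        # Always close the browser, even if the wait times out.
        browser.quit()
```

A call might look like `extract_descriptions(fetch_rendered_html("https://justjoin.it/offers/jungle-devops-engineer"))`. Keeping the parsing step separate from the browser step also lets you test the parsing on saved HTML without launching a browser.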

To run the whole program on Google Colab, first install Selenium and Firefox as follows:

!pip install selenium
!apt-get update
!apt install -y firefox
!apt install -y wget
!apt install -y unzip

You will then also need GeckoDriver, which must be on the system PATH:

!wget https://github.com/mozilla/geckodriver/releases/download/v0.30.0/geckodriver-v0.30.0-linux64.tar.gz
!tar -xvf geckodriver-v0.30.0-linux64.tar.gz
!chmod +x geckodriver
!mv geckodriver /usr/local/bin/
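A quick way to confirm that the driver ended up on the PATH is to ask Python where it would find the executable. This is only a small sanity check using the standard library; the helper name `driver_on_path` is made up for this example.

```python
import shutil

def driver_on_path(name="geckodriver"):
    """Return the full path of the driver executable, or None if it is not on PATH."""
    return shutil.which(name)

path = driver_on_path()
print(path if path else "geckodriver not found on PATH")
```

If this prints a path such as one under /usr/local/bin, Selenium should be able to locate the driver without further configuration.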

Once the installation has finished, run the code above.
