HTML tags change during web scraping of LinkedIn using Selenium and BeautifulSoup

ne5o7dgx posted on 2023-03-23 in Other


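I'm having trouble scraping the Education and Experience sections of a LinkedIn profile using Selenium and BeautifulSoup.
So far I have successfully scraped the name, headline, and location. For the Education and Experience sections, however, I noticed when I open the inspector that the HTML tags change, which makes it hard to identify the sections and extract them with BeautifulSoup. Does anyone have a solution? Here is an example of the code: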

# Find the list inside the experience section
experience = soup.find("section", {"id": "experience-section"}).find('ul')
print(experience)

# Take the first entry and the link that wraps its details
li_tags = experience.find('div')
a_tags = li_tags.find("a")

job_title = a_tags.find("h3").get_text().strip()
print(job_title)

company_name = a_tags.find_all("p")[1].get_text().strip()
print(company_name)

# The two <h4> elements hold the joining date and the duration
joining_date = a_tags.find_all("h4")[0].find_all("span")[1].get_text().strip()
employment_duration = a_tags.find_all("h4")[1].find_all("span")[1].get_text().strip()
print(joining_date + ", " + employment_duration)

Here you can see the section id, where the number keeps changing.
The markup I expected to see in the inspector should look like this.
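
When the numeric part of an id changes between page loads, one option is to match only the stable part of the id, or to identify the section by its visible heading instead. The following is a minimal sketch on made-up HTML, not LinkedIn's real markup: the `profile-section-` id prefix, the `<h2>` headings, and the helper name `find_section_by_heading` are all assumptions for illustration.

```python
import re
from bs4 import BeautifulSoup

# Made-up HTML: the numeric suffix of each id is assumed to change per page load.
SAMPLE_HTML = """
<main>
  <section id="profile-section-4821"><h2>Education</h2>
    <ul><li>Example University</li></ul>
  </section>
  <section id="profile-section-4822"><h2>Experience</h2>
    <ul><li>Data Analyst</li><li>Intern</li></ul>
  </section>
</main>
"""


def find_section_by_heading(soup, heading_text):
    """Return the first <section> whose <h2> text equals heading_text, else None."""
    # Match the stable id prefix only; whatever digits follow are accepted.
    id_pattern = re.compile(r"^profile-section-\d+$")
    for section in soup.find_all("section", id=id_pattern):
        heading = section.find("h2")
        if heading and heading.get_text(strip=True) == heading_text:
            return section
    return None


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
experience = find_section_by_heading(soup, "Experience")
items = [li.get_text(strip=True) for li in experience.find_all("li")] if experience else []
print(items)
```

Returning `None` for a missing section lets the caller decide how to handle profiles that lack it, instead of failing with an `AttributeError` on a chained `.find()`.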

selenium

Source: https://stackoverflow.com/questions/75773779/html-tags-changes-during-web-scraping-linkedin-using-selenium-and-beautifulsoup

2 Answers


smdncfj31#

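You may find this useful. The script below first logs in to LinkedIn with an email and password, then opens the profile by clicking the profile avatar, and finally takes the profile's page source and parses it with BeautifulSoup.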

import time
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver import ChromeOptions, Keys
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.wait import WebDriverWait

options = ChromeOptions()

# start maximized and disable the automation infobar
options.add_argument("--start-maximized")
options.add_experimental_option("useAutomationExtension", False)
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option(
    "prefs",
    {
        "credentials_enable_service": False,
        "profile.password_manager_enabled": False,
        "profile.default_content_setting_values.notifications": 2
        # 2 blocks notifications, 1 allows them
    },
)

driver = webdriver.Chrome(options=options)

url = "https://www.linkedin.com/uas/login"
driver.get(url)
WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.ID,"organic-div")))
container = driver.find_element(By.ID, "organic-div")

# login: fill the email account, password
email = container.find_element(By.ID, 'username')
password = container.find_element(By.ID, 'password')
email.send_keys("xxxxxxxxxxxxxxxx")
password.send_keys("xxxxxxxxxxxxxx")
password.send_keys(Keys.ENTER)
time.sleep(2)

WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.CLASS_NAME, "authentication-outlet")))
driver.find_element(By.CLASS_NAME, 'share-box-feed-entry__avatar').click()

time.sleep(2)

soup = BeautifulSoup(driver.page_source, 'lxml')

experience_div = soup.find('div', {"id": "experience"})
exp_list = experience_div.findNext('div').findNext('div', {"class": "pvs-list__outer-container"}).findChild('ul').findAll('li')

experiences = []

for each_exp in exp_list:

    company_logo = each_exp.findNext('img').get('src')
    col = each_exp.findNext("div", {"class": "display-flex flex-column full-width"})
    profile_title = col.findNext('div').findNext('span').findNext('span').text
    company_name = col.findNext('span', {"class": "t-14 t-normal"}).findNext('span').text
    timeframe = col.findAll('span', {"class": "t-14 t-normal t-black--light"})[0].findNext('span').text
    location = col.findAll('span', {"class": "t-14 t-normal t-black--light"})[1].findNext('span').text

    experiences.append({
        "company_logo": company_logo,
        "profile_title": profile_title.replace('\n', '').strip(),
        "company_name": company_name.replace('\n', '').strip(),
        "timeframe": timeframe.replace('\n', '').strip(),
        "location": location.replace('\n', '').strip(),
    })

print(experiences)
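
The fixed `time.sleep(2)` pauses in the script above can parse the page before the experience block has rendered, or wait longer than needed. Selenium's own `WebDriverWait`, which the script already uses for the login page, is the usual tool for this. The plain-Python sketch below only illustrates the underlying polling idea; `wait_until` and `FakeDriver` are hypothetical names, and `FakeDriver` is a stand-in so the example runs without a browser.

```python
import time


def wait_until(predicate, timeout=10.0, poll_interval=0.5):
    """Call predicate() until it returns a truthy value or the timeout elapses.

    Returns the truthy value; raises TimeoutError if the deadline passes first.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = predicate()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll_interval)


class FakeDriver:
    """Stand-in for a browser whose page gains the section on the third read."""

    def __init__(self):
        self.polls = 0

    @property
    def page_source(self):
        self.polls += 1
        if self.polls < 3:
            return "<html><body>loading</body></html>"
        return '<html><body><div id="experience"></div></body></html>'


driver = FakeDriver()
found = wait_until(
    lambda: 'id="experience"' in driver.page_source,
    timeout=2.0,
    poll_interval=0.01,
)
print(found, driver.polls)
```

With a real Selenium driver the same lambda would read `driver.page_source`, and the parse step would run only once the marker is present.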

You can parse the other sections, such as education and certifications, the same way as the experience section.
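
Reusing the experience logic for other sections could look like the sketch below. It assumes every section follows the pattern the script relies on: an anchor `div` with the section id, followed by a `pvs-list__outer-container` div holding a `ul`. The sample HTML is made up, `section_items` is a hypothetical helper, and the `education` id is an assumption by analogy with `experience`.

```python
from bs4 import BeautifulSoup

# Made-up HTML that mimics the anchor-then-list layout assumed above.
SAMPLE_HTML = """
<section>
  <div id="education"></div>
  <div class="pvs-list__outer-container">
    <ul>
      <li><span>Example University</span></li>
      <li><span>Example College</span></li>
    </ul>
  </div>
</section>
<section>
  <div id="experience"></div>
  <div class="pvs-list__outer-container">
    <ul>
      <li><span>Data Analyst</span></li>
    </ul>
  </div>
</section>
"""


def section_items(soup, section_id):
    """Return the top-level <li> tags of the list after the anchor div, or []."""
    anchor = soup.find("div", {"id": section_id})
    if anchor is None:
        return []
    # find_next scans forward through the rest of the document, so a section
    # without its own container would pick up the next section's list.
    container = anchor.find_next("div", {"class": "pvs-list__outer-container"})
    if container is None:
        return []
    item_list = container.find("ul")
    return item_list.find_all("li", recursive=False) if item_list else []


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
education = [li.get_text(" ", strip=True) for li in section_items(soup, "education")]
experience = [li.get_text(" ", strip=True) for li in section_items(soup, "experience")]
print(education, experience)
```

Returning an empty list for a missing section keeps the calling loop simple when a profile has no certifications or education entries.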

2023-03-23

z8dt9xmd2#

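I don't have an answer, but I'm in the same situation. Did you have any luck doing this with Python? I'd love to scrape my résumé from LinkedIn into other formats and be able to adjust the records accordingly. Thanks a lot.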

2023-03-23