Hey folks, I need to build a bot that collects some information from a set of links. It should first go to https://www.mediktor.com/pt-br/glossario and grab the links to all the diseases, then visit them one by one and extract their information: description, epidemiology, symptoms, and so on. I came up with the code below; it does collect the links, but it returns nothing when I try to access them and pull the information.
import time

import scrapy
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.options import Options


class DicionarioSpider(scrapy.Spider):
    name = 'dicionario'
    allowed_domains = ['www.mediktor.com']

    def start_requests(self):
        # The glossary page is rendered client-side, so load it with
        # Selenium before collecting the per-disease links.
        url = "https://www.mediktor.com/pt-br/glossario"
        options = Options()
        options.add_argument('--headless')
        driver = webdriver.Chrome(options=options)
        driver.get(url)
        time.sleep(10)  # give the JavaScript-built list time to render
        doencas = driver.find_elements(
            By.XPATH, "//a[@class='mdk-dictionary-list__glossary-item']")
        for doenca in doencas:
            # Send each disease page straight to parse_info.
            yield scrapy.Request(doenca.get_attribute('href'),
                                 callback=self.parse_info)
        driver.quit()

    def parse_info(self, response):
        # Description paragraphs, scoped to each description block.
        for desc in response.css('div.mdk-conclusion-detail__main-description'):
            yield {'desc': desc.css('p ::text').getall()}
        # Name and specialty from the main content block.
        for content in response.css('div.page-glossary-detail__main-content'):
            name = content.css(
                'div.mdk-conclusion-detail__main-title ::text').get()
            espec = content.css(
                'div.mdk-ui-list-item__text .mdc-list-item__text span::text').get()
            yield {
                'name': name.strip() if name else None,
                'espec': espec.strip() if espec else None,
            }
1 Answer
The data is loaded dynamically, so you can collect all the possible IDs and then go through them one by one to fetch each entry's information. In my case, I take the first ID from the list and get back a huge JSON that contains all the necessary data. Because of the character limit I had to cut part of it out.
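The answer's code itself was not preserved in this capture, so here is a minimal sketch of the approach it describes. The endpoint paths (glossary_ids_url, glossary_detail_url) are placeholders, not Mediktor's real API; the actual URLs would have to be copied from the browser's DevTools network tab while the glossary page loads.

import requests

# NOTE: both URLs below are hypothetical placeholders, assumed for
# illustration. Find the real endpoints in the browser's network tab.
glossary_ids_url = "https://www.mediktor.com/api/glossary/ids"    # assumed
glossary_detail_url = "https://www.mediktor.com/api/glossary/{}"  # assumed

def fetch_glossary():
    # First request: the full list of entry IDs.
    ids = requests.get(glossary_ids_url, timeout=30).json()
    for entry_id in ids:
        # One request per entry; each returns a large JSON payload.
        data = requests.get(glossary_detail_url.format(entry_id),
                            timeout=30).json()
        yield {
            'name': data.get('name'),
            'description': data.get('description'),
        }

for item in fetch_glossary():
    print(item)

Fetching the JSON directly avoids Selenium entirely, which is why this route tends to be much faster than rendering each page in a browser.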
Output: