Scrapy extracts the entire HTML element instead of the link when following links

7dl7o3gd, posted on 2022-11-09 in Other

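I am trying to follow every link for the commercial contractors listed on this site: https://lslbc.louisiana.gov/contractor-search/search-type-contractor/ and then extract the email address from the page each link points to. When I run this script, however, Scrapy appends the entire HTML element to the base URL instead of only the link inside that element.
Does anyone know how I can get the desired result, or what I am doing wrong?
Here is the code I have so far: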

from urllib import request
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    #user_agent = 'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36'
    #start_urls= ['https://lslbc.louisiana.gov/contractor-search/search-type-contractor/']

    def start_requests(self):
        start_urls = [
            'https://lslbc.louisiana.gov/contractor-search/search-type-contractor/',
        ]
        #request = scrapy.Request(url=urls, callback=self.parse, method="GET", cookies=[{'domain': 'lslbc.louisiana.gov','path': '/wp-admin/admin-ajax.php?api_action=advanced&contractor_type=Commercial+License&classification=&action=api_actions'}], )
        #yield request
        for url in start_urls:
            yield scrapy.Request(url=url, callback=self.parse, cookies=[{'name': 'test', 'value': '', 'domain': 'lslbc.louisiana.gov','path': '/wp-admin/admin-ajax.php?api_action=advanced&contractor_type=Commercial+License&classification=&action=api_actions'}],)

    def parse(self, response):
        links = response.xpath('//*[@id="search-results"]/table/tbody/tr/td/a')
        for link in links:
            yield response.follow(link.get(), callback=self.parse)

    def parse_links(self, response):
        contractors = response.css()
        for contractor in contractors:
            yield {
                'name': contractor.css('').get().strip(),
                'email': contractor.css('td.[email_address]').get().strip(),
            }

This is the output:

2022-08-13 16:53:13 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://lslbc.louisiana.gov/contractor-search/search-type-contractor/> (referer: None)
2022-08-13 16:53:13 [scrapy.core.engine] DEBUG: Crawled (404) <GET https://lslbc.louisiana.gov/contractor-search/search-type-contractor/%3Ca%20data-bind=%22attr:%20%7B%20href:%20$row.showURL%20%7D,%20text:%20$row.company_name%22%20target=%22_blank%22%3E%3C/a%3E> (referer: https://lslbc.louisiana.gov/contractor-search/search-type-contractor/)
2022-08-13 16:53:13 [scrapy.core.engine] DEBUG: Crawled (404) <GET https://lslbc.louisiana.gov/contractor-search/search-type-contractor/%3Ca%20data-bind=%22attr:%20%7B%20href:%20$row.showURL%20%7D,%20text:%20$row.qualifying_party%22%20target=%22_blank%22%3E%3C/a%3E> (referer: https://lslbc.louisiana.gov/contractor-search/search-type-contractor/)

Answer 1 (zf9nrax1)

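The page has a built-in search. Whenever you search by selecting commercial contractors, the data is loaded dynamically by JavaScript from an API that returns JSON. That is why you cannot get the data you want from the plain HTML DOM: the <a> elements in the static HTML are only unfilled templates (note the data-bind attributes in the 404 URLs above), so they carry no href to follow.

Complete working code example: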
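
To see where the 404 URLs in the question's log come from, here is a minimal sketch using only the standard library. It assumes that following a plain string resolves it against the page URL and then percent-encodes the result; the `follow_target` helper and the `safe` character set are illustrative choices made to match the logged output, not Scrapy's actual implementation. The anchor HTML is what you get by decoding the first 404 URL in the log.

```python
from urllib.parse import quote, urljoin

BASE_URL = "https://lslbc.louisiana.gov/contractor-search/search-type-contractor/"

# What link.get() returns for a matched anchor: the element's full HTML,
# not a URL. (Reconstructed by decoding the first 404 URL in the log.)
ANCHOR_HTML = (
    '<a data-bind="attr: { href: $row.showURL }, text: $row.company_name" '
    'target="_blank"></a>'
)


def follow_target(base_url: str, link_text: str) -> str:
    """Approximate the URL requested when a plain string is followed.

    The string is treated as a relative URL, joined to the base URL and
    percent-encoded. The `safe` characters are an assumption chosen so the
    result matches the log; Scrapy's own encoding logic may differ.
    """
    joined = urljoin(base_url, link_text)
    return quote(joined, safe=":/=$,")


if __name__ == "__main__":
    print(follow_target(BASE_URL, ANCHOR_HTML))
```

Because the template anchor has no href attribute in the static HTML, selecting `@href` instead of calling `link.get()` would not help here either; the links only exist after the JavaScript has filled them in from the JSON API.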

import json

import scrapy


class TestSpider(scrapy.Spider):
    name = 'test'

    def start_requests(self):
        # Send the headers a browser would send with an AJAX (XHR) request.
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36',
            'x-requested-with': 'XMLHttpRequest',
        }

        # API endpoint that returns the commercial contractor search results as JSON.
        url = 'https://lslbc.louisiana.gov/wp-admin/admin-ajax.php?api_action=advanced&contractor_type=Commercial+License&classification=&action=api_actions'
        yield scrapy.Request(
            url=url,
            headers=headers,
            callback=self.parse,
            method="GET",
        )

    def parse(self, response):
        # The body is JSON, not HTML; each entry in 'results' is one contractor.
        resp = json.loads(response.body)
        for item in resp['results']:
            # Build the details URL for this company; the f-string also works if 'id' is numeric.
            api_url = f"https://lslbc.louisiana.gov/wp-admin/admin-ajax.php?action=api_actions&api_action=company_details&company_id={item['id']}"

            yield scrapy.Request(
                url=api_url,
                callback=self.parse_email,
                method="GET",
            )

    def parse_email(self, response):
        # The details response is a JSON object that includes the email address.
        resp2 = json.loads(response.body)
        yield {
            'Email': resp2['email_address'],
        }
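
If you prefer not to assemble the query strings by hand, or want the spider to tolerate records with missing fields, the following is a small sketch of helpers that could be called from the callbacks. The function names (`build_api_url`, `company_ids`, `company_email`) are hypothetical; the keys `results`, `id` and `email_address` are the ones used in the code above, and the sketch assumes both responses are JSON objects at the top level.

```python
import json
from typing import List, Optional
from urllib.parse import urlencode

API_ENDPOINT = "https://lslbc.louisiana.gov/wp-admin/admin-ajax.php"


def build_api_url(**params) -> str:
    """Build an admin-ajax.php URL and let urlencode handle the escaping."""
    return f"{API_ENDPOINT}?{urlencode(params)}"


def company_ids(body: str) -> List[str]:
    """Return the 'id' of every entry in 'results', skipping entries without one."""
    data = json.loads(body)
    return [str(item["id"]) for item in data.get("results", []) if "id" in item]


def company_email(body: str) -> Optional[str]:
    """Return 'email_address', or None when the field is missing or empty."""
    return json.loads(body).get("email_address") or None


if __name__ == "__main__":
    # Reproduces the search URL used in start_requests above.
    print(build_api_url(
        api_action="advanced",
        contractor_type="Commercial License",
        classification="",
        action="api_actions",
    ))
```

In `parse` you could then loop over `company_ids(response.text)` and request `build_api_url(action="api_actions", api_action="company_details", company_id=cid)` for each one, and in `parse_email` yield an item only when `company_email(response.text)` is not None.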
