Scrapy: correct headers and payload for scraping a website that uses AJAX

gkn4icbw  posted on 2022-11-09  in  Other

I am trying to simulate an AJAX request with Scrapy's FormRequest in order to fetch the next page of this site: https://www.the-academy.nl/trainingen

headers = {
        'path': 'https://www.the-academy.nl/Page?$$ajaxid=view:_id1:_id2:_id3:_id4:_id5:6:_id114:_id116:tblView',
        'authority': 'www.the-academy.nl',
        'accept-encoding': 'gzip, deflate, br',
        'content-length': '1225',
        'content-type': 'multipart/form-data'
    }

and form data like this:

formdata = {
        '$$viewid': '!1rjej6ewgse3x0h6r86gfzlst!',
        '$$xspsubmitid': 'view:_id1:_id2:_id3:_id4:_id5:6:_id114:_id116:viewPager__Group__lnk__1',
        '$$xspexecid': 'view:_id1:_id2:_id3:_id4:_id5:6:_id114:_id116:viewPager',
        '$$xspsubmitvalue':'',
        '$$xspsubmitscroll': '0|1272',
    }

I do get a response, but it is a 404 page. Thanks in advance.

cetgtptt #1

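1. I used java as the search term and kept only the form data entries that come as key-value pairs.
2. Do not set the 'content-length' header.
3. Add method: "POST".
4. Call FormRequest.from_response.
5. Below is an example that returns a 200 response status.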
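To illustrate the advice about 'content-length', here is a minimal standard-library sketch. It assumes the form is sent URL-encoded, which is how Scrapy's FormRequest encodes `formdata`, and it reuses the field values from the question. The point is only that the length follows from the body actually sent, so a value copied from the browser's request will generally not match.

```python
from urllib.parse import urlencode

# Form fields copied from the question.
formdata = {
    '$$viewid': '!1rjej6ewgse3x0h6r86gfzlst!',
    '$$xspsubmitid': 'view:_id1:_id2:_id3:_id4:_id5:6:_id114:_id116:viewPager__Group__lnk__1',
    '$$xspexecid': 'view:_id1:_id2:_id3:_id4:_id5:6:_id114:_id116:viewPager',
    '$$xspsubmitvalue': '',
    '$$xspsubmitscroll': '0|1272',
}

# URL-encode the fields the way an HTML form POST would.
body = urlencode(formdata).encode('utf-8')

# The real content length is the size of this body, not the
# hardcoded '1225' taken from the browser's multipart request.
print(len(body))
```

Leaving the header out lets the HTTP client fill in the length that matches the encoded body.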

Script:

import scrapy
from scrapy.crawler import CrawlerProcess


class AspSpider(scrapy.Spider):
    name = 'asp'

    def start_requests(self):
        yield scrapy.FormRequest(
            url='https://www.the-academy.nl/zoekresultatenpagina?text=java',
            # Only the key-value pairs taken from the browser's form data.
            formdata={
                'view:_id1:_id2:_id3:_id4:_id5:2:_id86:_id88:query': "",
                'view:_id1:_id2:_id3:_id4:_id5:3:_id94:_id96:query': "",
                # Field name written as '$$viewid', matching the question.
                '$$viewid': '!eaie1cfxpuckx0dbjrxsxrw60!',
                '$$xspsubmitid': 'view:_id1:_id2:_id3:_id199:_id200:0:_id201:_id203:viewPager__Next',
                '$$xspexecid': 'view:_id1:_id2:_id3:_id199:_id200:0:_id201:_id203:viewPager',
                '$$xspsubmitscroll': '0|1500',
                'view:_id1': 'view:_id1',
                '$$xspsubmitvalue': "",
            },
            callback=self.parse_item,
            # No 'content-length' header is set here (see step 2).
            headers={
                'accept': '*/*',
                'accept-encoding': 'gzip, deflate, br',
                'accept-language': 'en-US,en;q=0.9',
                'content-type': 'multipart/form-data; boundary=----WebKitFormBoundary2aCYMIdAcbwx4FjO',
                'referer': 'https://www.the-academy.nl/zoekresultatenpagina?text=java',
            },
            method='POST',
        )

    def parse_item(self, response):
        pass


if __name__ == "__main__":
    # Create the process first, then pass the spider class to crawl().
    process = CrawlerProcess()
    process.crawl(AspSpider)
    process.start()

Output:

DEBUG: Crawled (200) <POST https://www3.hkexnews.hk/sdw/search/searchsdw.aspx> (referer: https://www.the-academy.nl/zoekresultatenpagina?text=java)
