How do I use scrapy_splash with a Lua script and the Chromium engine?

pprl5pva · asked 12 months ago in Other

Hi, I'm trying to build a bot that scrapes a website which uses JavaScript. I have about 20 URLs from the site and want to scale up to hundreds, and the URLs need to be scraped fairly often, so I tried using a Lua script to get a "dynamic" wait time. With the default WebKit engine, the site's HTML output is just text saying the browser is not supported, which is why I use the Chromium engine. Without the Lua script, scraping only yields output items with the Chromium engine, but it does work. Once I tried Lua, I got errors with the Chromium engine; with WebKit it ran without errors but produced no output items. This is my start_requests using Lua:

def start_requests(self):
        lua_script = """
        function main(splash, args)
            assert(splash:go(args.url))

            while not splash:select('div.o-matchRow')
                splash:wait(1)
                print('waiting...')
            end
            return {html=splash:html()}
        end    
        """

        for url in self.start_urls:
            yield SplashRequest(
                url=url,
                callback=self.parse,
                endpoint='execute',
                args={'engine': 'chromium', 'lua_source': lua_script}
            )

This is just something simple I wanted to test out. Does anyone know how to handle Lua with the Chromium engine, or how I could use WebKit when the site doesn't support it? (By the way, sorry for my English, I'm not a native speaker.) These are the errors with the Chromium engine:

2023-12-04 21:23:54 [scrapy.utils.log] INFO: Scrapy 2.11.0 started (bot: tipsport_scraper)
2023-12-04 21:23:54 [scrapy.utils.log] INFO: Versions: lxml 4.9.3.0, libxml2 2.10.3, cssselect 1.2.0, parsel 1.8.1, w3lib 2.1.2, Twisted 22.10.0, Python 3.11.4 (tags/v3.11.4:d2340ef, Jun  7 2023, 05:45:37) [MSC v.1934 64 bit (AMD64)
], pyOpenSSL 23.3.0 (OpenSSL 3.1.4 24 Oct 2023), cryptography 41.0.5, Platform Windows-10-10.0.19045-SP0
2023-12-04 21:23:54 [scrapy.addons] INFO: Enabled addons:                                                               
[]                                                                                                                      
2023-12-04 21:23:54 [asyncio] DEBUG: Using selector: SelectSelector                                                     
2023-12-04 21:23:54 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.asyncioreactor.AsyncioSelectorReactor     
2023-12-04 21:23:54 [scrapy.utils.log] DEBUG: Using asyncio event loop: asyncio.windows_events._WindowsSelectorEventLoop
2023-12-04 21:23:54 [scrapy.extensions.telnet] INFO: Telnet Password: **************
2023-12-04 21:23:54 [py.warnings] WARNING: C:\Users\Kryštof\AppData\Local\Programs\Python\Python311\Lib\site-packages\scrapy\extensions\feedexport.py:406: ScrapyDeprecationWarning: The `FEED_URI` and `FEED_FORMAT` settings have been
 deprecated in favor of the `FEEDS` setting. Please see the `FEEDS` setting docs for more details
  exporter = cls(crawler)

2023-12-04 21:23:54 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.feedexport.FeedExporter',
 'scrapy.extensions.logstats.LogStats']
2023-12-04 21:23:54 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'tipsport_scraper',
 'CONCURRENT_REQUESTS': 5,
 'DOWNLOAD_DELAY': 5,
 'DUPEFILTER_CLASS': 'scrapy_splash.SplashAwareDupeFilter',
 'FEED_EXPORT_ENCODING': 'utf-8',
 'HTTPCACHE_STORAGE': 'scrapy_splash.SplashAwareFSCacheStorage',
 'NEWSPIDER_MODULE': 'tipsport_scraper.spiders',
 'REQUEST_FINGERPRINTER_IMPLEMENTATION': '2.7',
 'SPIDER_MODULES': ['tipsport_scraper.spiders'],
 'TWISTED_REACTOR': 'twisted.internet.asyncioreactor.AsyncioSelectorReactor',
 'USER_AGENT': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
               '(KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'}
2023-12-04 21:23:54 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy_splash.SplashCookiesMiddleware',
 'scrapy_splash.SplashMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2023-12-04 21:23:54 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy_splash.SplashDeduplicateArgsMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2023-12-04 21:23:54 [scrapy.middleware] INFO: Enabled item pipelines:
['tipsport_scraper.pipelines.TipsportScraperPipeline']
2023-12-04 21:23:54 [scrapy.core.engine] INFO: Spider opened
2023-12-04 21:23:54 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2023-12-04 21:23:54 [scrapy.extensions.telnet] INFO: Telnet console listening on **********
2023-12-04 21:23:54 [scrapy.core.scraper] ERROR: Error downloading <GET https://www.tipsport.cz/kurzy/fotbal-16?limit=1000 via http://localhost:8050/execute>
Traceback (most recent call last):
  File "C:\Users\Kryštof\AppData\Local\Programs\Python\Python311\Lib\site-packages\twisted\internet\defer.py", line 1697, in _inlineCallbacks
    result = context.run(gen.send, result)
  File "C:\Users\Kryštof\AppData\Local\Programs\Python\Python311\Lib\site-packages\scrapy\core\downloader\middleware.py", line 68, in process_response
    method(request=request, response=response, spider=spider)
  File "C:\Users\Kryštof\AppData\Local\Programs\Python\Python311\Lib\site-packages\scrapy_splash\middleware.py", line 412, in process_response
    response = self._change_response_class(request, response)
  File "C:\Users\Kryštof\AppData\Local\Programs\Python\Python311\Lib\site-packages\scrapy_splash\middleware.py", line 433, in _change_response_class
    response = response.replace(cls=respcls, request=request)
  File "C:\Users\Kryštof\AppData\Local\Programs\Python\Python311\Lib\site-packages\scrapy\http\response\__init__.py", line 125, in replace
    return cls(*args, **kwargs)
  File "C:\Users\Kryštof\AppData\Local\Programs\Python\Python311\Lib\site-packages\scrapy_splash\response.py", line 120, in __init__
    self._load_from_json()
  File "C:\Users\Kryštof\AppData\Local\Programs\Python\Python311\Lib\site-packages\scrapy_splash\response.py", line 174, in _load_from_json
    error = self.data['info']['error']
TypeError: string indices must be integers, not 'str'
2023-12-04 21:23:54 [scrapy.core.engine] INFO: Closing spider (finished)
2023-12-04 21:23:54 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 1045,
 'downloader/request_count': 1,
 'downloader/request_method_count/POST': 1,
 'downloader/response_bytes': 255,
 'downloader/response_count': 1,
 'downloader/response_status_count/400': 1,
 'elapsed_time_seconds': 0.233518,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2023, 12, 4, 20, 23, 54, 847285, tzinfo=datetime.timezone.utc),
 'log_count/DEBUG': 3,
 'log_count/ERROR': 1,
 'log_count/INFO': 10,
 'log_count/WARNING': 1,
 'scheduler/dequeued': 2,
 'scheduler/dequeued/memory': 2,
 'scheduler/enqueued': 2,
 'scheduler/enqueued/memory': 2,
 'splash/execute/request_count': 1,
 'splash/execute/response_count/400': 1,
 'start_time': datetime.datetime(2023, 12, 4, 20, 23, 54, 613767, tzinfo=datetime.timezone.utc)}
2023-12-04 21:23:54 [scrapy.core.engine] INFO: Spider closed (finished)


I removed the Telnet password and what looks like an IP address from the log and replaced them with * in case they are sensitive.

hrirmatl (answer 1):

For Chromium, make sure your Splash instance is actually set up to handle Chromium requests; Splash's Chromium engine is experimental and supports only part of the feature set, so scripting methods such as splash:select may not be available with it. If it still doesn't work, updating Splash may help. A quick way to check whether the Chromium engine works at all in your instance is a plain render.html request with no Lua involved, as sketched below.
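This sketch reuses the spider from the question and assumes a Splash version (3.5+) whose render endpoints accept the engine argument; the wait value is arbitrary:

def start_requests(self):
    for url in self.start_urls:
        # No Lua here: if this request also fails, the problem is the
        # Chromium engine setup itself, not your script.
        yield SplashRequest(
            url=url,
            callback=self.parse,
            endpoint='render.html',
            args={'engine': 'chromium', 'wait': 2},
        )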
For WebKit, the site appears to block it, so try changing the User-Agent to a more common one, either via Scrapy's USER_AGENT setting or inside the Lua script itself. Also check that the div.o-matchRow you wait for in the Lua script actually exists on the page. If it does and you still have problems, put a limit on how long the script waits so it can't get stuck; both ideas are combined in the sketch below.
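This again reuses the question's spider; splash:set_user_agent is part of Splash's Lua API, the user-agent string is the one already in your Scrapy settings, and the 10-attempt cap on the wait loop is an arbitrary choice:

def start_requests(self):
    lua_script = """
    function main(splash, args)
        -- present a common desktop browser instead of Splash's default UA
        splash:set_user_agent('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36')
        assert(splash:go(args.url))

        -- wait for the element, but give up after ~10 seconds
        -- instead of looping forever
        local attempts = 0
        while not splash:select('div.o-matchRow') and attempts < 10 do
            splash:wait(1)
            attempts = attempts + 1
        end
        return {html=splash:html()}
    end
    """
    for url in self.start_urls:
        yield SplashRequest(
            url=url,
            callback=self.parse,
            endpoint='execute',
            args={'lua_source': lua_script, 'timeout': 60},
        )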
The TypeError in the log is a symptom rather than the root cause: Splash answered with HTTP 400 (see splash/execute/response_count/400 in the stats), and scrapy_splash then tripped over the error body while wrapping the response. A 400 from /execute usually means the Lua script itself failed to load. In your script, the while loop is missing Lua's do keyword, which is a syntax error on its own, and print may not be available in Splash's sandboxed Lua; fix those and re-test.
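To see Splash's actual error message instead of the TypeError, you can also post the script to the /execute HTTP API by hand. This is a standalone sketch; it assumes Splash is on localhost:8050 as in your log, and uses the requests library, which is not part of the spider:

import requests

lua_script = """
function main(splash, args)
    assert(splash:go(args.url))
    -- note the 'do' keyword, which the original script was missing
    while not splash:select('div.o-matchRow') do
        splash:wait(1)
    end
    return {html=splash:html()}
end
"""

resp = requests.post(
    'http://localhost:8050/execute',
    json={
        'lua_source': lua_script,
        'url': 'https://www.tipsport.cz/kurzy/fotbal-16?limit=1000',
        'timeout': 60,
    },
)
print(resp.status_code)
print(resp.text)  # on HTTP 400 Splash returns a JSON description of the Lua error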
