Scrapy crawler not able to crawl data [closed]

Posted by ozxc1zmp on 2023-05-07 in Other


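**Closed.** This question needs debugging details. It is not currently accepting answers.

Edit the question to include desired behavior, a specific problem or error, and the shortest code necessary to reproduce the problem. This will help others answer the question.
Closed 4 days ago.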
This code does not collect any data from the pages, and I don't know why.
In VSCode I get SyntaxError: 'yield' outside function, but there is no error in Jupyter notebook.

import scrapy

class multiSpider(scrapy.Spider):
        name='multiple'
        start_url = [
            'https://forum.moshaver.co/f232/',
            'https://forum.moshaver.co/f233/',
            'https://forum.moshaver.co/f241/',
            'https://forum.moshaver.co/f231/',
        ]       
def parse(self, response):
    for data in response.css('h3.threadtitle'):
        yield {
               'title': data.css('h3.threadtitle :: text').get(),
               'answers' : data.css('threadstats td alt :: text').get(),
               'writer' : data.css ('a.username offline popupctrl :: text').get(),
               'date_time' : data.css('span.label a::text').get(),
           } 
        next_page = response.css('span.selected pageitem a::attr(href)').get()
        if next_page:
            next_page = response.urljoin('next_page')
            yield scrapy.Request (url= next_page, callback = self.parse)

scrapy

Source: https://stackoverflow.com/questions/76140829/scrapy-crawler-not-able-to-crawl-data

2 Answers


zte4gxcn1#

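Your mistakes are:

1. start_url should be start_urls.
2. In the CSS selectors, remove the spaces around ::, for example: 'title': data.css('h3.threadtitle :: text').get() -> 'title': data.css('h3.threadtitle::text').get(). (This is where the error came from. By the way, you shouldn't use Scrapy in Jupyter.)
3. The CSS selector for the next page is wrong.
4. You used urljoin incorrectly: you passed next_page as a string literal instead of the variable.
5. The selectors for the content you are trying to scrape are also wrong.

I fixed all of them except number 5, since it is not relevant to the solution.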

import scrapy

class multiSpider(scrapy.Spider):
    name = 'multiple'
    start_urls = [
        'https://forum.moshaver.co/f232/',
        'https://forum.moshaver.co/f233/',
        'https://forum.moshaver.co/f241/',
        'https://forum.moshaver.co/f231/',
    ]

    def parse(self, response):
        # parse must be a method of the spider class so Scrapy can use it as the callback
        for data in response.css('h3.threadtitle'):
            # the content selectors are unchanged from the question (not fixed here)
            yield {
                'title': data.css('h3.threadtitle::text').get(),
                'answers': data.css('threadstats td alt::text').get(),
                'writer': data.css('a.username offline popupctrl::text').get(),
                'date_time': data.css('span.label a::text').get(),
            }
        # follow pagination once per page, after all items have been yielded
        next_page = response.css('span.prev_next a[rel="next"]::attr(href)').get()
        if next_page:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(url=next_page)     # parse is the default callback
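
As a small illustration of the urljoin mistake described in this answer, the sketch below uses the standard library's urllib.parse.urljoin, which, as far as I know, is what response.urljoin relies on with the page URL as the base. The href value index2.html is a made-up placeholder, not something taken from the forum.

```python
from urllib.parse import urljoin

# Base URL taken from the spider's start_urls
base = "https://forum.moshaver.co/f232/"

# Hypothetical href, standing in for whatever the next-page selector returns
href = "index2.html"

# Passing the literal string 'next_page' joins that text onto the base URL
wrong = urljoin(base, "next_page")

# Passing the variable joins the extracted href onto the base URL
right = urljoin(base, href)

print(wrong)
print(right)
```

With the quoted string, every page would request the same nonexistent URL, so the crawl never moves past the first page of each forum.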

2023-05-07

cetgtptt2#

The only problem I can see is that the closing brace of this object is indented by 3 spaces instead of 4, which may be why you get the error in VSCode but not in Jupyter.

yield {
           'title': data.css('h3.threadtitle :: text').get(),
           'answers' : data.css('threadstats td alt :: text').get(),
           'writer' : data.css ('a.username offline popupctrl :: text').get(),
           'date_time' : data.css('span.label a::text').get(),
    } 
 next_page = response.css('span.selected pageitem a::attr(href)').get()

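If you are looking for more precise advice, please share the full error message. Also, you seem to have two separate problems:
1. Your code has a syntax problem, which you see when you use VSCode (note: VSCode is only the editor; the actual Python interpreter is what you see on the command line when you run the script).
2. Your code does not scrape anything, or does not even start -> this is more likely a problem with the code itself, and it is hard to say more without seeing the actual page you are scraping.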
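
To see where the 'yield' outside function message comes from, here is a minimal sketch that asks Python to compile three small snippets without running them. The helper name compiles and the snippet contents are invented for illustration. The point is only that yield is accepted inside a def body and rejected at class or module level, which could happen if the body of parse ends up dedented out of its function in the file being run.

```python
def compiles(source):
    """Return True if the source compiles, False if it raises SyntaxError."""
    try:
        compile(source, "<sketch>", "exec")
        return True
    except SyntaxError as exc:
        print("SyntaxError:", exc.msg)
        return False

# yield inside a function body: a valid generator function
inside_function = "def parse(response):\n    yield {'title': 'x'}\n"

# yield directly in a class body: not inside any function
class_level = "class Spider:\n    yield {'title': 'x'}\n"

# yield at module level: not inside any function
module_level = "yield {'title': 'x'}\n"

for name, src in [("inside_function", inside_function),
                  ("class_level", class_level),
                  ("module_level", module_level)]:
    print(name, "->", compiles(src))
```

Running the real script from the command line and reading the line number in the traceback should show which yield the interpreter considers to be outside a function.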

2023-05-07
