Scrapy returns no items and closes without scraping [Closing spider (finished)]

Posted by 8tntrjer on 2022-11-09 in Other

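The code below gets a 200 response, but the spider closes without returning the requested data.
I understand this could be an XPath problem, but I have checked the expressions over and over in the Scrapy shell and I believe they are correct.
Very similar code has worked for me many times, and I can't see what I'm missing this time. The data is present in the site's page source, so it does not appear to be a dynamic-loading issue.
Thanks for your help.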

from folium import Link
from scrapy.item import Field
from scrapy.item import Item
from scrapy.spiders import CrawlSpider, Rule
from scrapy.selector import Selector
from scrapy.loader.processors import MapCompose
from scrapy.linkextractors import LinkExtractor
from scrapy.loader import ItemLoader
from scrapy.crawler import CrawlerProcess

class Articulo(Item):
    nombre = Field()
    direccion = Field()
    telefono = Field()
    comunaregion = Field()

class SeccionAmarillaCrawler(CrawlSpider):
    name = 'scraperfunerarias'

    custom_settings = {
        'USER_AGENT': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/71.0.3578.80 Chrome/71.0.3578.80 Safari/537.36'}

    allowed_domains = ['paginasamarillas.com']
    start_urls = ["https://www.paginasamarillas.com.ar/buscar/q/funerarias/"]

    download_delay = 3

    rules = (
        Rule(
            LinkExtractor(
                allow=r'https://www.paginasamarillas.com.ar/buscar/q/funerarias/p-\d+/?tieneCobertura=true'
            ), follow=True, callback="parseador"
        ),
    )

    def parseador(self, response):
        sel = Selector(response)
        funerarias = sel.xpath('//div[contains(@class, "figBox")]')

        for funeraria in funerarias:
            item = ItemLoader(Articulo(), funeraria)
            item.add_xpath('nombre', './/span[@class="semibold"]/text()', MapCompose(lambda i: i.replace('\n','').replace('\r','').replace('\t','').strip()))
            item.add_xpath('direccion', './/span[@class="directionFig"]/text()', MapCompose(lambda i: i.replace('\n','').replace('\r','').replace('\t','').strip()))
            item.add_xpath('telefono', './/span[@itemprop="telephone"]/text()', MapCompose(lambda i: i.replace('\n','').replace('\r','').replace('\t','').strip()))
            item.add_xpath('comunaregion', './/span[@class="city"]/text()', MapCompose(lambda i: i.replace('\n','').replace('\r','').replace('\t','').strip()))

            yield item.load_item()

process = CrawlerProcess({
     'FEED_FORMAT': 'csv',
     'FEED_URI': 'datos_scrapeados.csv'
})
process.crawl(SeccionAmarillaCrawler)
process.start()

Output

2022-03-16 16:30:23 [scrapy.utils.log] INFO: Scrapy 2.5.1 started (bot: scrapybot)
2022-03-16 16:30:23 [scrapy.utils.log] INFO: Versions: lxml 4.5.0.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 22.1.0, Python 3.8.10 (default, Nov 26 2021, 20:14:08) - [GCC 9.3.0], pyOpenSSL 22.0.0 (OpenSSL 1.1.1m  14 Dec 2021), cryptography 36.0.1, Platform Linux-5.13.0-35-generic-x86_64-with-glibc2.29
2022-03-16 16:30:23 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.epollreactor.EPollReactor
2022-03-16 16:30:23 [scrapy.crawler] INFO: Overridden settings:
{'USER_AGENT': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, '
               'like Gecko) Ubuntu Chromium/71.0.3578.80 Chrome/71.0.3578.80 '
               'Safari/537.36'}
2022-03-16 16:30:23 [scrapy.extensions.telnet] INFO: Telnet Password: 9eb59ae51c5aae24
2022-03-16 16:30:23 [py.warnings] WARNING: /home/maka/.local/lib/python3.8/site-packages/scrapy/extensions/feedexport.py:247: ScrapyDeprecationWarning: The `FEED_URI` and `FEED_FORMAT` settings have been deprecated in favor of the `FEEDS` setting. Please see the `FEEDS` setting docs for more details
  exporter = cls(crawler)

2022-03-16 16:30:23 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.feedexport.FeedExporter',
 'scrapy.extensions.logstats.LogStats']
2022-03-16 16:30:23 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2022-03-16 16:30:23 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2022-03-16 16:30:23 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2022-03-16 16:30:23 [scrapy.core.engine] INFO: Spider opened
2022-03-16 16:30:23 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-03-16 16:30:23 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2022-03-16 16:30:25 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.paginasamarillas.com.ar/buscar/q/funerarias/> (referer: None)
2022-03-16 16:30:25 [scrapy.core.engine] INFO: Closing spider (finished)
2022-03-16 16:30:25 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 346,
 'downloader/request_count': 1,
 'downloader/request_method_count/GET': 1,
 'downloader/response_bytes': 34763,
 'downloader/response_count': 1,
 'downloader/response_status_count/200': 1,
 'elapsed_time_seconds': 1.26536,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2022, 3, 16, 15, 30, 25, 219139),
 'httpcompression/response_bytes': 367033,
 'httpcompression/response_count': 1,
 'log_count/DEBUG': 1,
 'log_count/INFO': 10,
 'log_count/WARNING': 1,
 'memusage/max': 103419904,
 'memusage/startup': 103419904,
 'response_received_count': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2022, 3, 16, 15, 30, 23, 953779)}
2022-03-16 16:30:25 [scrapy.core.engine] INFO: Spider closed (finished)

Answer #1 (pxq42qpu)

I found two problems.
1. Typo: you forgot the .ar in allowed_domains.

allowed_domains = ['paginasamarillas.com.ar']

2. The ? character has a special meaning in regular expressions, so you have to use \? instead of ? to match it literally.

allow=r'https://www.paginasamarillas.com.ar/buscar/q/funerarias/p-\d+/\?tieneCobertura=true'

But you can also use a simpler pattern, such as allow=r'funerarias/p-\d+'.
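
To see why the unescaped ? matters, here is a minimal sketch using only Python's re module. It assumes LinkExtractor applies its allow patterns as regular-expression searches against each candidate URL; the pagination URL below is hypothetical, chosen to follow the shape the rule is aiming for.

```python
import re

# Pattern from the question: the bare "?" makes the preceding "/" optional
# instead of matching a literal question mark.
UNESCAPED = r'https://www.paginasamarillas.com.ar/buscar/q/funerarias/p-\d+/?tieneCobertura=true'

# Pattern from the answer: "\?" matches the literal "?" that starts the query string.
ESCAPED = r'https://www.paginasamarillas.com.ar/buscar/q/funerarias/p-\d+/\?tieneCobertura=true'

# Shorter alternative from the answer.
SHORT = r'funerarias/p-\d+'

# Hypothetical pagination link (an assumption, not taken from the site).
url = 'https://www.paginasamarillas.com.ar/buscar/q/funerarias/p-2/?tieneCobertura=true'


def matches(pattern, candidate):
    """Return True if the pattern is found anywhere in the candidate URL."""
    return re.search(pattern, candidate) is not None


results = {
    'unescaped': matches(UNESCAPED, url),
    'escaped': matches(ESCAPED, url),
    'short': matches(SHORT, url),
}

for name, ok in results.items():
    print(f'{name}: {ok}')
```

With the unescaped pattern, the regex expects tieneCobertura right after the optional slash, so a URL that actually contains the ? is rejected and the rule never extracts it.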
Now your code works for me.
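
On the first point, the sketch below approximates how a domain allow-list treats the two spellings. The is_onsite helper is hypothetical and only a rough stand-in for Scrapy's offsite filtering, on the assumption that a host is accepted when it equals an allowed domain or is a subdomain of one. The URL is the same hypothetical pagination link.

```python
from urllib.parse import urlparse


def is_onsite(url, allowed_domains):
    """Hypothetical helper: accept a URL whose host equals an allowed
    domain or is a subdomain of one. This only approximates Scrapy's
    offsite filtering; it is not Scrapy's own code."""
    host = (urlparse(url).hostname or '').lower()
    return any(host == d or host.endswith('.' + d) for d in allowed_domains)


# Hypothetical pagination link (an assumption, not taken from the site).
url = 'https://www.paginasamarillas.com.ar/buscar/q/funerarias/p-2/?tieneCobertura=true'

before = is_onsite(url, ['paginasamarillas.com'])     # value from the question
after = is_onsite(url, ['paginasamarillas.com.ar'])   # value from the answer

print('question:', before)
print('answer:', after)
```

The host www.paginasamarillas.com.ar does not end in .paginasamarillas.com, so with the original setting every followed link would be treated as offsite. The start URL is probably still fetched because start requests are not subject to this filter, which would be consistent with the single 200 response in the log.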
