Getting past a 403 error when using Scrapy

3lxsmp7m, posted on 2022-11-09 in Other

I am following the Scrapy tutorial here, and I am trying to adapt it to my own project.
I first create a project by running the following command:

scrapy startproject idealistaScraper

Next, I go to the spiders folder and create a new Python file with the following code:

import scrapy

print("\n", "-"*145, "\n", "-"*60, "Starting the Scrapy bot", "-"*60, "\n", "-"*145, "\n")
class QuotesSpider(scrapy.Spider):
    name = "idealistaCollector"

    def start_requests(self):
        urls = [
            'https://www.idealista.com/inmueble/97010777/'
            #'https://www.idealista.com/inmueble/97010777/',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = f'quotes-{page}.html'
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log(f'Saved file {filename}')

I save it as connection_spider.py.
Finally, I run the following command:

scrapy crawl idealistaCollector

where idealistaCollector is the name I assigned to the spider in the connection_spider.py file.
The output I get is as follows:

------------------------------------------------------------------------------------------------------------------------------------------------- 
 ------------------------------------------------------------ Starting the Scrapy bot ------------------------------------------------------------ 
 ------------------------------------------------------------------------------------------------------------------------------------------------- 

2022-04-08 18:42:51 [scrapy.utils.log] INFO: Scrapy 2.6.1 started (bot: idealistaScraper)
2022-04-08 18:42:51 [scrapy.utils.log] INFO: Versions: lxml 4.8.0.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 22.2.0, Python 3.9.12 | packaged by conda-forge | (main, Mar 24 2022, 23:25:59) - [GCC 10.3.0], pyOpenSSL 22.0.0 (OpenSSL 1.1.1n  15 Mar 2022), cryptography 36.0.2, Platform Linux-5.3.18-150300.59.49-default-x86_64-with-glibc2.31
2022-04-08 18:42:51 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'idealistaScraper',
 'NEWSPIDER_MODULE': 'idealistaScraper.spiders',
 'ROBOTSTXT_OBEY': True,
 'SPIDER_MODULES': ['idealistaScraper.spiders']}
2022-04-08 18:42:51 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.epollreactor.EPollReactor
2022-04-08 18:42:51 [scrapy.extensions.telnet] INFO: Telnet Password: 3ca0ebf8976d6291
2022-04-08 18:42:51 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.logstats.LogStats']
2022-04-08 18:42:51 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2022-04-08 18:42:51 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2022-04-08 18:42:51 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2022-04-08 18:42:51 [scrapy.core.engine] INFO: Spider opened
2022-04-08 18:42:51 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-04-08 18:42:51 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2022-04-08 18:42:52 [filelock] DEBUG: Attempting to acquire lock 140245041823504 on /home/bscuser/.cache/python-tldextract/3.9.12.final__miniconda3__36b6b0__tldextract-3.2.0/publicsuffix.org-tlds/de84b5ca2167d4c83e38fb162f2e8738.tldextract.json.lock
2022-04-08 18:42:52 [filelock] DEBUG: Lock 140245041823504 acquired on /home/bscuser/.cache/python-tldextract/3.9.12.final__miniconda3__36b6b0__tldextract-3.2.0/publicsuffix.org-tlds/de84b5ca2167d4c83e38fb162f2e8738.tldextract.json.lock
2022-04-08 18:42:52 [filelock] DEBUG: Attempting to acquire lock 140245032917552 on /home/bscuser/.cache/python-tldextract/3.9.12.final__miniconda3__36b6b0__tldextract-3.2.0/urls/62bf135d1c2f3d4db4228b9ecaf507a2.tldextract.json.lock
2022-04-08 18:42:52 [filelock] DEBUG: Lock 140245032917552 acquired on /home/bscuser/.cache/python-tldextract/3.9.12.final__miniconda3__36b6b0__tldextract-3.2.0/urls/62bf135d1c2f3d4db4228b9ecaf507a2.tldextract.json.lock
2022-04-08 18:42:52 [filelock] DEBUG: Attempting to release lock 140245032917552 on /home/bscuser/.cache/python-tldextract/3.9.12.final__miniconda3__36b6b0__tldextract-3.2.0/urls/62bf135d1c2f3d4db4228b9ecaf507a2.tldextract.json.lock
2022-04-08 18:42:52 [filelock] DEBUG: Lock 140245032917552 released on /home/bscuser/.cache/python-tldextract/3.9.12.final__miniconda3__36b6b0__tldextract-3.2.0/urls/62bf135d1c2f3d4db4228b9ecaf507a2.tldextract.json.lock
2022-04-08 18:42:52 [filelock] DEBUG: Attempting to release lock 140245041823504 on /home/bscuser/.cache/python-tldextract/3.9.12.final__miniconda3__36b6b0__tldextract-3.2.0/publicsuffix.org-tlds/de84b5ca2167d4c83e38fb162f2e8738.tldextract.json.lock
2022-04-08 18:42:52 [filelock] DEBUG: Lock 140245041823504 released on /home/bscuser/.cache/python-tldextract/3.9.12.final__miniconda3__36b6b0__tldextract-3.2.0/publicsuffix.org-tlds/de84b5ca2167d4c83e38fb162f2e8738.tldextract.json.lock
2022-04-08 18:42:52 [scrapy.core.engine] DEBUG: Crawled (403) <GET https://www.idealista.com/robots.txt> (referer: None)
2022-04-08 18:42:52 [scrapy.core.engine] DEBUG: Crawled (403) <GET https://www.idealista.com/inmueble/97010777/> (referer: None)
2022-04-08 18:42:52 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <403 https://www.idealista.com/inmueble/97010777/>: HTTP status code is not handled or not allowed
2022-04-08 18:42:52 [scrapy.core.engine] INFO: Closing spider (finished)
2022-04-08 18:42:52 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 617,
 'downloader/request_count': 2,
 'downloader/request_method_count/GET': 2,
 'downloader/response_bytes': 2461,
 'downloader/response_count': 2,
 'downloader/response_status_count/403': 2,
 'elapsed_time_seconds': 0.353292,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2022, 4, 8, 16, 42, 52, 274906),
 'httperror/response_ignored_count': 1,
 'httperror/response_ignored_status_count/403': 1,
 'log_count/DEBUG': 11,
 'log_count/INFO': 11,
 'memusage/max': 68902912,
 'memusage/startup': 68902912,
 'response_received_count': 2,
 'robotstxt/request_count': 1,
 'robotstxt/response_count': 1,
 'robotstxt/response_status_count/403': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2022, 4, 8, 16, 42, 51, 921614)}
2022-04-08 18:42:52 [scrapy.core.engine] INFO: Spider closed (finished)

So my question is: how do I get around the 403 errors I am getting?

2022-04-08 18:42:52 [scrapy.core.engine] DEBUG: Crawled (403) <GET https://www.idealista.com/robots.txt> (referer: None)
2022-04-08 18:42:52 [scrapy.core.engine] DEBUG: Crawled (403) <GET https://www.idealista.com/inmueble/97010777/> (referer: None)
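A note on the related log line "Ignoring response <403 ...>: HTTP status code is not handled or not allowed": it is emitted by Scrapy's HttpErrorMiddleware, which drops non-2xx responses before they reach parse. The fragment below is a minimal sketch of a settings.py entry that lets 403 responses through so that the returned body can be inspected. It does not unblock the site, and allowing only 403 is an assumption made for illustration.

```python
# settings.py (sketch): lets the spider callback see 403 responses.
# This only helps with debugging; it does not make the server accept the request.

# Status codes that HttpErrorMiddleware should pass on to the spider
# instead of ignoring them. The value [403] is an illustrative choice.
HTTPERROR_ALLOWED_CODES = [403]
```

The same effect can be scoped to a single spider with a handle_httpstatus_list = [403] class attribute.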

I have also tried adding the following custom headers to the connection_spider.py file, but still without any luck.


#### own defined functions ###

    # requires `import random` at the top of the file
    # a list rather than a set, so that random.choice can index into it
    desktop_agents = ["Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36",
                    "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36",
                    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36",
                    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_1) AppleWebKit/602.2.14 (KHTML, like Gecko) Version/10.0.1 Safari/602.2.14",
                    "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.71 Safari/537.36",
                    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.98 Safari/537.36",
                    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.98 Safari/537.36",
                    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.71 Safari/537.36",
                    "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36",
                    "Mozilla/5.0 (Windows NT 10.0; WOW64; rv:50.0) Gecko/20100101 Firefox/50.0"
                    ]

    # pick one User-Agent string; the header value must be a str, not a list
    userAGENT = random.choice(desktop_agents)

    headers = {
        'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
        'accept-encoding': 'gzip, deflate, br',
        'accept-language': 'es-ES,es;q=0.9,en;q=0.8',
        'cache-control': 'max-age=0',
        'referer': 'https://www.idealista.com/en/',
        'sec-fetch-mode': 'navigate',
        'sec-fetch-site': 'same-origin',
        'sec-fetch-user': '?1',
        'upgrade-insecure-requests': '1',
        'user-agent': userAGENT
    }
    print("-"*30, "Using User Agent ", userAGENT)

    custom_settings = {
        'CONCURRENT_REQUESTS_PER_DOMAIN': 1, # sensitivity to collect data - 1 request per domain
        'DOWNLOAD_DELAY': 10 # 10 second download delay
    }
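One thing worth checking with the snippet above: as far as I know, Scrapy does not read a headers attribute from the spider class, so a dict defined this way is only sent if it is passed to each scrapy.Request (or configured through settings). Below is a minimal sketch of that idea, assuming a hypothetical helper build_headers and a shortened agent list copied from the snippet. Whether the site accepts requests carrying these headers is not guaranteed.

```python
import random

# Shortened pool for illustration; the strings come from the snippet above.
DESKTOP_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; WOW64; rv:50.0) Gecko/20100101 Firefox/50.0",
]


def build_headers():
    """Return a new header dict with one randomly chosen User-Agent string.

    build_headers is a hypothetical helper name, not part of Scrapy.
    """
    return {
        "accept-language": "es-ES,es;q=0.9,en;q=0.8",
        "referer": "https://www.idealista.com/en/",
        # random.choice returns a single str, which is what the header needs
        "user-agent": random.choice(DESKTOP_AGENTS),
    }


# Inside start_requests the dict would then be passed explicitly, for example:
#     yield scrapy.Request(url=url, headers=build_headers(), callback=self.parse)
```

Calling the helper per request means each request can carry a different User-Agent, instead of one value fixed when the class is defined.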

Edit:
Also, when I run:

wget https://www.idealista.com/buscar/venta-oficinas/240/
--2022-04-08 17:30:01--  https://www.idealista.com/buscar/venta-oficinas/240/
Resolviendo www.idealista.com (www.idealista.com)... 151.101.18.137
Conectando con www.idealista.com (www.idealista.com)[151.101.18.137]:443... conectado.
Petición HTTP enviada, esperando respuesta... 403 Forbidden
2022-04-08 17:30:01 ERROR 403: Forbidden.

rfbsl7qr  #1

I also get a 403 with Scrapy for both URLs: here and here. But when I use the Python requests module it works, meaning the response status is 200.
Here is an example you can test:

from bs4 import BeautifulSoup
import requests

url = 'https://www.idealista.com/venta-viviendas/barcelona/sant-marti/el-parc-i-la-llacuna-del-poblenou/pagina-3.htm'

# url='https://www.idealista.com/inmueble/97010777/'

r = requests.get(url)

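# print(r)  # uncomment to see the response status

soup = BeautifulSoup(r.text, 'html.parser')
# each listing links to its detail page from inside .item-info-container
for link in soup.select('.item-info-container >a'):
    abs_url = 'https://www.idealista.com' + link.get('href')
    print(abs_url)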

Output:

https://www.idealista.com/inmueble/97265546/
https://www.idealista.com/inmueble/96143763/
https://www.idealista.com/inmueble/95881655/
https://www.idealista.com/inmueble/97242873/
https://www.idealista.com/inmueble/95933278/
https://www.idealista.com/inmueble/93040808/
https://www.idealista.com/inmueble/97219129/
https://www.idealista.com/inmueble/96348689/
https://www.idealista.com/inmueble/96348679/
https://www.idealista.com/inmueble/96348658/
https://www.idealista.com/inmueble/96348663/
https://www.idealista.com/inmueble/94336217/
https://www.idealista.com/inmueble/96348506/
https://www.idealista.com/inmueble/96348546/
https://www.idealista.com/inmueble/95839055/
https://www.idealista.com/inmueble/96348623/
https://www.idealista.com/inmueble/97202829/
https://www.idealista.com/inmueble/96154622/
https://www.idealista.com/inmueble/96069543/
https://www.idealista.com/inmueble/95776046/
https://www.idealista.com/inmueble/94933084/
https://www.idealista.com/inmueble/95776049/
https://www.idealista.com/inmueble/95776021/
https://www.idealista.com/inmueble/97277519/
https://www.idealista.com/inmueble/96933133/
https://www.idealista.com/inmueble/96287437/
https://www.idealista.com/inmueble/97272643/
https://www.idealista.com/inmueble/90782924/
https://www.idealista.com/inmueble/96151505/
https://www.idealista.com/inmueble/97136190/
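A small aside on building the absolute URLs: the string concatenation in the example relies on every href being a root-relative path such as /inmueble/97265546/ (an inference from the output above, not something the page guarantees). A minimal sketch using urljoin from the standard library, which also copes with hrefs that are already absolute; to_absolute and BASE_URL are hypothetical names introduced here:

```python
from urllib.parse import urljoin

# Site root used to resolve relative links (same host as in the example above).
BASE_URL = "https://www.idealista.com"


def to_absolute(href):
    """Resolve an href taken from the listing page against the site root."""
    return urljoin(BASE_URL, href)


# A root-relative path is prefixed with the site root.
print(to_absolute("/inmueble/97265546/"))
# An href that is already absolute is returned unchanged.
print(to_absolute("https://www.idealista.com/inmueble/96143763/"))
```

In the loop above, the concatenation could then be replaced with to_absolute(link.get('href')).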
