Scrapy in Python - saving inside the parse function - is it thread safe?

zc0qhyus asked on 2023-04-21 in Python

I'm using Scrapy to download pages, and I want to save all of the downloaded pages into a single file. I have the following code for the constructor and for parse:

def __init__(self):
    self.time = time_utils.get_current_time_hr()
    self.folder = f"{ROOT_DIR}/data/tickers/scrapy/{self.time}/"
    os.makedirs(self.folder, exist_ok=True)
    filename = self.folder + "bigfile.txt"
    self.f = open(filename, 'w')

def parse(self, response):
    buffer = list()
    buffer.append(response.body.decode("utf-8"))
    self.f.write("".join(buffer))
    self.f.flush()

In the bigfile.txt that I'm writing, is it possible for content from different HTML pages to end up mixed together?

x6yk4ghg

Scrapy is single-threaded, so even though the data won't get corrupted, this is still a bad idea, because writing to a file is a blocking operation.
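If you do want to keep writing one big file yourself, a minimal sketch (not part of this answer; the PageWriter name and its methods are just illustrative) of one way to keep the blocking write off the reactor thread is a single background writer thread that drains a queue:

import queue
import threading

class PageWriter:
    # Hypothetical helper: one writer thread owns the file, parse() only enqueues.
    def __init__(self, path):
        self.q = queue.Queue()
        self.f = open(path, "w", encoding="utf-8")
        self.thread = threading.Thread(target=self._drain, daemon=True)
        self.thread.start()

    def _drain(self):
        # Only this thread ever touches the file, so each page is written whole.
        while True:
            page = self.q.get()
            if page is None:  # sentinel pushed by close()
                break
            self.f.write(page)
        self.f.close()

    def write(self, page):
        # Called from parse(); enqueuing is cheap, so the reactor is not blocked.
        self.q.put(page)

    def close(self):
        self.q.put(None)
        self.thread.join()

In the spider, parse() would call something like self.writer.write(response.body.decode("utf-8")), and the spider's closed() method (which Scrapy calls when the crawl ends) would call self.writer.close().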
The simpler option, though, is to use FEEDS and let Scrapy handle this for you.
Try this example and see whether it fits your needs:

main.py:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

if __name__ == "__main__":
    spider = 'example_spider'
    settings = get_project_settings()
    settings['USER_AGENT'] = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36'
    settings['FEEDS'] = {
        'bigfile.csv': {
            'format': 'csv',
            # adding 'overwrite': True here would overwrite an existing file instead of appending to it.
        }
    }
    process = CrawlerProcess(settings)
    process.crawl(spider, start_urls=['https://scrapingclub.com/exercise/list_basic/?page=2'])
    process.crawl(spider, start_urls=['https://scrapingclub.com/exercise/list_basic/?page=4'])
    process.start()

spider.py:

import scrapy

class ProductSpider(scrapy.Spider):
    name = "example_spider"

    def parse(self, response, **kwargs):
        # fixed field names so the CSV exporter writes consistent columns for every page
        yield {"url": response.url, "html": response.body.decode("utf-8")}
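
Whole HTML documents don't fit CSV columns very naturally, so a JSON Lines feed may suit this better. As a rough variation on the FEEDS setting above (the file name is just an example), using Scrapy's built-in 'jsonlines' exporter:

settings['FEEDS'] = {
    'bigfile.jl': {
        'format': 'jsonlines',   # one JSON object per scraped page, one object per line
        'encoding': 'utf8',
    }
}

Each yielded item then becomes one self-contained line, regardless of which keys it contains.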
