I built a scraper for a website, and the scraping itself works perfectly fine. I used to store the data in two separate JSON files, raw_data.json and cleaned_data.json. But now I'm trying to integrate the scraper into my company's framework, and there is no local storage in the framework's process. So instead I'm trying to export the data through an in-memory data structure that passes the raw_data and cleaned_data variables on to the next step. I've turned the scraper into a package, so the import works fine, but when I do a test run, the output is just empty. I've been stuck on this for days; is there a way to do this?
All I want is to export the two variables raw_data and cleaned_data into run_spider.py once they are ready, using the dispatcher.
Here is the relevant part of my pipelines.py:
from itemadapter import ItemAdapter


class RawDataPipeline:
    def __init__(self):
        self.raw_data = []

    def process_item(self, item, spider):
        # Basic data validation: check that the scraped item is not empty
        adapter = ItemAdapter(item)
        if adapter.get('project_source'):
            self.raw_data.append(adapter.asdict())
        return item

    def close_spider(self, spider):
        # The data used to be written to disk here:
        # with open('raw_data.json', 'w', encoding='utf-8') as file:
        #     json.dump(self.raw_data, file, indent=2, ensure_ascii=False)
        spider.crawler.signals.send_catch_log(signal=spider.custom_close_signal, raw_data=self.raw_data)
        return self.raw_data


class CleanedDataPipeline:
    def __init__(self):
        self.cleaned_data = []
        self.list_dic = {}

    def process_item(self, item, spider):
        cleaned_item = self.clean_item(item)  # clean_item is defined elsewhere in the pipeline
        self.cleaned_data.append(cleaned_item)
        return item

    def close_spider(self, spider):
        # Convert values to lists for the keys in list_dic
        for key in self.list_dic:
            for cleaned_item in self.cleaned_data:
                self.convert_to_list(cleaned_item, key)
        # The data used to be written to disk here:
        # with open('cleaned_data.json', 'w', encoding='utf-8') as file:
        #     json.dump(self.cleaned_data, file, indent=2, ensure_ascii=False)
        # Log list_dic:
        # spider.log("List_dic: %s" % json.dumps(self.list_dic, indent=2, ensure_ascii=False))
        spider.crawler.signals.send_catch_log(signal=spider.custom_close_signal, cleaned_data=self.cleaned_data)
        return self.cleaned_data
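Note that both pipelines send spider.custom_close_signal, which is not shown in the question; for the sends to work, the spider class has to declare such an attribute. The snippet below is a minimal sketch of what that declaration might look like; it is an assumption about code not shown in the post, though in Scrapy a signal really is just a unique object:

import scrapy


class NieuwbouwspiderSpider(scrapy.Spider):
    name = "nieuwbouwspider"  # assumed name, not shown in the question
    # Any unique object can serve as a custom Scrapy signal; the pipelines
    # reach it through spider.custom_close_signal.
    custom_close_signal = object()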
And here is the Python script where I start the spider and try to grab the data once the spider closes:
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
from scrapy import signals
from pydispatch import dispatcher

# NieuwbouwspiderSpider, RawDataPipeline and CleanedDataPipeline are
# imported from the scraper package (import paths omitted)

settings = get_project_settings()


def spider_closed(signal, sender, **kwargs):
    # Access the data after the spider is closed
    raw_data = RawDataPipeline().raw_data
    cleaned_data = CleanedDataPipeline().cleaned_data
    print("Raw Data:", raw_data)
    print("Cleaned Data:", cleaned_data)


def run_spider():
    # Create a CrawlerProcess
    process = CrawlerProcess(settings)
    # Connect the spider_closed signal to the callback function
    dispatcher.connect(spider_closed, signal=signals.spider_closed)
    # Add the spider to the process
    process.crawl(NieuwbouwspiderSpider)
    # Start the crawling process (blocks until the crawl finishes)
    process.start()


run_spider()
This doesn't work; what gets printed is just empty. Is there a solution?
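The empty output follows directly from the two constructor calls inside spider_closed: RawDataPipeline() and CleanedDataPipeline() build brand-new pipeline objects whose __init__ resets the lists to [], so they are not the instances Scrapy created and filled during the crawl. A minimal illustration of the effect, independent of Scrapy (used_by_scrapy is a hypothetical stand-in for the instance the framework actually runs):

used_by_scrapy = RawDataPipeline()  # the instance that collected items
used_by_scrapy.raw_data.append({"project_source": "example"})

print(RawDataPipeline().raw_data)  # [] ... a fresh instance, hence the empty prints
print(used_by_scrapy.raw_data)     # [{'project_source': 'example'}]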
1 Answer
bqucvtff1#
I found a solution using dispatcher.send together with the pipelines' close_spider hooks.
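The answer above gives no code, so here is only a sketch of what that dispatcher.send approach might look like. The custom signal objects raw_data_ready and cleaned_data_ready, the results dict, and the handler names are illustrative assumptions, not names from the original post:

# signal_defs.py (hypothetical module); in Scrapy/pydispatch a signal
# is just any unique sentinel object
raw_data_ready = object()
cleaned_data_ready = object()


# pipelines.py: broadcast the in-memory list when the spider closes
from pydispatch import dispatcher


class RawDataPipeline:
    def __init__(self):
        self.raw_data = []

    # process_item unchanged from the question

    def close_spider(self, spider):
        dispatcher.send(signal=raw_data_ready, sender=spider, raw_data=self.raw_data)

# CleanedDataPipeline does the same with cleaned_data_ready


# run_spider.py: connect the handlers BEFORE starting the crawl
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
from pydispatch import dispatcher

results = {}


def on_raw_data(raw_data):
    # pydispatch only passes the keyword arguments this handler accepts
    results["raw_data"] = raw_data


def on_cleaned_data(cleaned_data):
    results["cleaned_data"] = cleaned_data


def run_spider():
    dispatcher.connect(on_raw_data, signal=raw_data_ready)
    dispatcher.connect(on_cleaned_data, signal=cleaned_data_ready)
    process = CrawlerProcess(get_project_settings())
    process.crawl(NieuwbouwspiderSpider)
    process.start()  # blocks until the crawl, and every close_spider, has run
    print("Raw Data:", results.get("raw_data"))
    print("Cleaned Data:", results.get("cleaned_data"))

Connecting the handlers before process.start() matters: close_spider fires while start() is still blocking, so by the time start() returns, results is populated and both lists can be handed to the next step of the framework without touching the disk.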