I would like to get the same result as this command line: scrapy crawl linkedin_anonymous -a first=James -a last=Bond -o output.json
My script is as follows:
import scrapy
from linkedin_anonymous_spider import LinkedInAnonymousSpider
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
spider = LinkedInAnonymousSpider(None, "James", "Bond")
process = CrawlerProcess(get_project_settings())
process.crawl(spider) ## <-------------- (1)
process.start()
I found that process.crawl() at (1) creates another LinkedInAnonymousSpider whose first and last are both None (as printed at (2)). If that is the case, there is no point in creating the spider object; and how can I pass the arguments first and last to process.crawl()?
linkedin_anonymous_spider.py:
from logging import INFO

import scrapy

class LinkedInAnonymousSpider(scrapy.Spider):
    name = "linkedin_anonymous"
    allowed_domains = ["linkedin.com"]
    start_urls = []
    base_url = "https://www.linkedin.com/pub/dir/?first=%s&last=%s&search=Search"

    def __init__(self, input=None, first=None, last=None):
        self.input = input  # source file name
        self.first = first
        self.last = last

    def start_requests(self):
        print(self.first)  ## <------------- (2)
        if self.first and self.last:  # taking input from command line parameters
            url = self.base_url % (self.first, self.last)
            yield self.make_requests_from_url(url)

    def parse(self, response): ...
4 Answers
Answer 1:
Pass the spider arguments in the process.crawl method:
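The code block for this answer did not survive extraction. A minimal sketch of the approach, reusing the names from the question (the FEED_* lines are an assumption about how to replicate -o output.json on older Scrapy; Scrapy 2.1+ uses the FEEDS setting instead):

from linkedin_anonymous_spider import LinkedInAnonymousSpider
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

settings = get_project_settings()
# Assumed equivalent of "-o output.json" on older Scrapy versions.
settings.set("FEED_URI", "output.json")
settings.set("FEED_FORMAT", "json")

process = CrawlerProcess(settings)
# Pass the spider *class*, not an instance; CrawlerProcess
# instantiates it and forwards these arguments to __init__.
process.crawl(LinkedInAnonymousSpider, input=None, first="James", last="Bond")
process.start()

Passing the class instead of an instance is the key point: CrawlerProcess.crawl() creates the spider itself and hands the extra positional/keyword arguments to __init__, which is why the instance built by hand in the question is never used.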
Answer 2:
You can do it the easy way:
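The snippet is missing here as well; a likely reconstruction is to run the very same command line from Python via scrapy.cmdline:

from scrapy import cmdline

cmdline.execute("scrapy crawl linkedin_anonymous -a first=James -a last=Bond -o output.json".split())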
Answer 3:
If you have Scrapyd and you want to schedule the spider, do this:
curl http://localhost:6800/schedule.json -d project=projectname -d spider=spidername -d first='James' -d last='Bond'
Answer 4:
Try this:
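The snippet itself was lost in extraction; given the note below about spaces around =, it was presumably the plain crawl command with -a arguments:

scrapy crawl linkedin_anonymous -a first=James -a last=Bond -o output.json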
Do not put any spaces around the =.