I have a CSV file containing about 900 TikTok user URLs and I have managed to scrape the info, but since Scrapy is single-threaded I'm trying to divide the work into at least 20 concurrent processes with Scrapy and concurrent.futures, with different parameters for each (process no. 1 scrapes users by index from the CSV file from 1-20, process no. 2 scrapes users from 20-40, etc.).
Here the GetTikTokFrontPageHTMLSpider would be called from crawl(). This only works with the traditional approach of calling crawl() as a normal function followed by reactor.run(), but not with ProcessPoolExecutor(). When I run with method 1 (please see def run_concurrency() in the code) I get this output:
```
Finished in 0.0 second(s)
[[0, 2], [2, 4], [4, 6], [6, 8], [8, 10], [10, 12], [12, 14], [14, 16], [16, 18], [18, 20], [20, 20]]
Finished in 0.0 second(s)
Finished in 0.0 second(s)
```
And when I run with method 2 I get:
```
Finished in 0.0 second(s)
[[0, 2], [2, 4], [4, 6], [6, 8], [8, 10], [10, 12], [12, 14], [14, 16], [16, 18], [18, 20], [20, 20]]
[0, 2]
[2, 4]
[4, 6]
[6, 8]
[8, 10]
[10, 12]
[12, 14]
[14, 16]
[16, 18]
[18, 20]
[20, 20]
Finished in 0.0 second(s)
Finished in 0.0 second(s)
Finished in 0.0 second(s)
Finished in 0.0 second(s)
Finished in 0.0 second(s)
Finished in 0.0 second(s)
Finished in 0.0 second(s)
Finished in 0.0 second(s)
```
The process seems to go through every executor.submit call (I'm not sure about the map method), but it doesn't seem to be working as intended, as the GetTikTokFrontPageHTML spider never gets executed even once. Here are the full run_concurrency methods:
run_concurrency, Method 1 (with map):

```python
def run_concurrency():
    with concurrent.futures.ProcessPoolExecutor() as executor:
        user_idx = get_user_idx(2)
        print(user_idx)
        executor.map(crawl, user_idx)  # Method 1 (With map)
```
run_concurrency, Method 2 (with submit):

```python
def run_concurrency():
    with concurrent.futures.ProcessPoolExecutor() as executor:
        user_idx = get_user_idx(2)
        print(user_idx)
        for idx in user_idx:  # Method 2 (With submit)
            print(idx)
            executor.submit(crawl, idx)
```
And here's the full code (I only include the part of GetTikTokFrontPageHTMLSpider that is needed for this particular question):
main.py
```python
from twisted.internet import reactor, defer
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
from scrapy.utils.project import get_project_settings
import concurrent.futures
import math
import time
from TikTokUser import get_users_count
from ScrapeTikTok.spiders.GetTikTokFrontPageHTMLSpider import GettiktokfrontpageHTMLSpider

configure_logging()
settings = get_project_settings()
runner = CrawlerRunner(settings)


@defer.inlineCallbacks
def crawl(user_idx):
    yield runner.crawl(GettiktokfrontpageHTMLSpider, user_start=user_idx[0], user_end=user_idx[1])
    reactor.stop()


def get_user_idx(batch_count):
    user_idx = []
    users_count = 20  # get_users_count()  # Total number of users to be scraped
    whole_batch_count = math.floor(users_count / batch_count)
    for i in range(0, batch_count * whole_batch_count, batch_count):
        append_values = user_idx.append([i, i + batch_count])
    last_values = [batch_count * whole_batch_count, users_count]
    append_last_values = user_idx.append(last_values)
    return user_idx


def run_concurrency():
    with concurrent.futures.ProcessPoolExecutor() as executor:
        user_idx = get_user_idx(2)
        print(user_idx)
        executor.map(crawl, user_idx[0], user_idx[1])  # Method 1 (With map)
        # for idx in user_idx:  # Method 2 (With submit)
        #     print(idx)
        #     executor.submit(crawl, idx)
        # crawl(0, 1)  # If these two are un-commented it goes through to start_requests in GetTikTokFrontPageHTMLSpider.py
        # reactor.run()


if __name__ == '__main__':
    run_concurrency()
    reactor.run()
```
GetTikTokFrontPageHTMLSpider.py
```python
import scrapy
import requests
from TikTokUser import get_user_urls


class GettiktokfrontpageHTMLSpider(scrapy.Spider):
    name = 'GetTikTokFrontPageHTMLSpider'
    allowed_domains = ['smartproxy.com']

    def __init__(self, user_start=None, user_end=None):
        self.user_start = user_start
        self.user_end = user_end

    def start_requests(self):
        print("START REQUEST")
        user_urls = get_user_urls()
        if self.user_end > len(user_urls):
            self.user_end = len(user_urls)
        for user_url in user_urls[self.user_start:self.user_end]:
            yield self.parse(user_url)

    def parse(self, user_url):
        ...
```
How do I write main.py so that crawl() is called in multiple processes (in this test case batch_count=2, so 2 users are processed per process and 20 users need 11 processes), with different parameters passed to crawl(), and hence to the spider, in each process?
1 Answer
I would try reading in the URLs prior to starting the concurrent processes and feeding a list of URLs to each process, so that they are not all trying to extract data from the same file. I would also use scrapy.CrawlerProcess for each process, and leave as much of the Scrapy logic as possible inside the GetTikTokFrontPageHTMLSpider.py module. For example, in GetTikTokFrontPageHTMLSpider.py:
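A minimal sketch of that idea, assuming the spider accepts a `urls` keyword argument and a `run_spider()` helper lives in the same module (both names are illustrative, not from the original answer):

```python
import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings


class GettiktokfrontpageHTMLSpider(scrapy.Spider):
    # Each process gets its own slice of URLs instead of re-reading the CSV
    name = 'GetTikTokFrontPageHTMLSpider'

    def __init__(self, urls=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.urls = urls or []  # assumed keyword argument, not in the original post

    def start_requests(self):
        for url in self.urls:
            yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        # Placeholder: extract whatever fields are needed from the front-page HTML
        yield {'url': response.url, 'html': response.text}


def run_spider(urls):
    # One CrawlerProcess per OS process; start() blocks until the crawl finishes
    process = CrawlerProcess(get_project_settings())
    process.crawl(GettiktokfrontpageHTMLSpider, urls=urls)
    process.start()
```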
Then, in your main.py:
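A matching sketch for main.py, assuming a hypothetical `read_user_urls()` helper that loads the ~900 URLs from the CSV once, and a batch size of 2 URLs per process as in the test case above:

```python
import concurrent.futures
import csv

from ScrapeTikTok.spiders.GetTikTokFrontPageHTMLSpider import run_spider


def read_user_urls(path='users.csv'):
    # Assumed helper: read every user URL from the CSV up front (path and column are guesses)
    with open(path, newline='') as f:
        return [row[0] for row in csv.reader(f) if row]


def chunk(items, size):
    # Split the full URL list into slices of `size`, one slice per process
    return [items[i:i + size] for i in range(0, len(items), size)]


def main():
    urls = read_user_urls()
    batches = chunk(urls, 2)  # 2 users per process, as in the question's test case
    with concurrent.futures.ProcessPoolExecutor(max_workers=20) as executor:
        # Each worker process receives its own URL slice and runs its own CrawlerProcess
        executor.map(run_spider, batches)


if __name__ == '__main__':
    main()
```

Because each worker runs a separate CrawlerProcess (and therefore its own reactor) entirely inside the child process, there is no shared Twisted reactor to coordinate in the parent, which is what the Deferred-based crawl()/CrawlerRunner approach in the question was relying on.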