I previously built a small project that scraped a real estate website with BeautifulSoup, but scraping roughly 5,000 data points took a very long time. I wanted to learn multithreading and implement it with BS, but someone told me that scraping with Scrapy would probably be faster and easier. I have also switched from Spyder to PyCharm as my IDE. It is still a jarring experience, but I am working on getting used to it.
I have read through the documentation once and followed a few examples of scraping with Scrapy, but I am still having trouble. I plan to use my previously created BS scraping script as a base and build a new Scrapy project to scrape the real estate data. However, I do not know how or where to start. Any and all help is greatly appreciated. Thank you.
**Desired result:** scrape multiple pages across multiple URLs with Scrapy. Feed in the apartment listing links and pull the data from each one, i.e. scrape several values per listing.
Scrapy script (so far):
```python
# -*- coding: utf-8 -*-
import scrapy

# the item class has to come from the project's items.py;
# this import assumes an ApartmentsItem is defined there
from ..items import ApartmentsItem


# Create Spider class
class UneguiApartmentSpider(scrapy.Spider):
    name = 'apartments'
    allowed_domains = ['www.unegui.mn']
    start_urls = [
        'https://www.unegui.mn/l-hdlh/l-hdlh-zarna/oron-suuts-zarna/'
    ]
    # headers
    headers = {
        "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.63 Safari/537.36"
    }

    def parse(self, response):
        for listing in response.xpath("//div[@class='list-announcement']"):
            item = ApartmentsItem()
            # text() and href have to be read from the anchor inside the block,
            # and href is an attribute, so it needs the @ prefix
            item['name'] = listing.xpath('.//a/text()').getall()
            item['link'] = listing.xpath('.//a/@href').getall()
            yield item
```
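Since the desired result is to follow each listing link and pull several values from it, here is a minimal sketch of what that could look like in Scrapy. The selectors (`list-announcement-block`, `a[itemprop=name]`, `value-chars`) are carried over from the BeautifulSoup script below, so treat them as assumptions about the current page markup rather than verified ones:

```python
import scrapy


class UneguiListingsSpider(scrapy.Spider):
    name = 'unegui_listings'
    allowed_domains = ['www.unegui.mn']
    start_urls = ['https://www.unegui.mn/l-hdlh/l-hdlh-zarna/oron-suuts-zarna/']

    def parse(self, response):
        # same anchors the BeautifulSoup script keys on: <a itemprop="name">
        for anchor in response.xpath(
                "//div[@class='list-announcement-block']//a[@itemprop='name']"):
            href = anchor.xpath('@href').get()
            if href:
                yield response.follow(href, callback=self.parse_listing,
                                      cb_kwargs={'name': anchor.xpath('@content').get()})

    def parse_listing(self, response, name):
        # the detail page keeps its attributes in value-chars spans/anchors,
        # mirroring the _spanlist/_alist lookups in the BeautifulSoup script
        yield {
            'name': name,
            'link': response.url,
            'span_values': [v.strip() for v in response.css('span.value-chars::text').getall()],
            'link_values': [v.strip() for v in response.css('a.value-chars::text').getall()],
        }
```

`response.follow()` resolves the relative hrefs and schedules the detail requests concurrently, which is where most of the speedup over the sequential `requests` loop comes from.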
BeautifulSoup script:
This script still has a few issues I am trying to fix, such as scraping the city and the price. For example, for the 4-bedroom apartment URL (`/4-r/`) it raises an error or produces empty values because of the VIP listings.
```python
# -*- coding: utf-8 -*-
import requests
from bs4 import BeautifulSoup as BS
from datetime import datetime, timedelta
from dateutil.relativedelta import relativedelta
from timeit import default_timer as timer
import pandas as pd
import re
import csv

dt_today = datetime.today()
date_today = dt_today.strftime('%Y-%m-%d')
date_today2 = dt_today.strftime('%Y%m%d')
# days=1, not day=1: relativedelta(day=1) would snap to the first of the month
date_yesterday = (dt_today - relativedelta(days=1)).strftime('%Y-%m-%d')


def main():
    page = 0
    name = []
    date = []
    address = []
    district = []
    city = []
    price = []
    area_sqm = []
    rooms = []
    floor = []
    commission_year = []
    building_floors = []
    garage = []
    balcony = []
    windows = []
    window_type = []
    floor_type = []
    door_type = []
    leasing = []
    description = []
    link = []
    for i in range(5, 6):
        BASE = 'https://www.unegui.mn'
        URL = f'{BASE}/l-hdlh/l-hdlh-zarna/oron-suuts-zarna/{i}-r/?page='
        COLUMNS = ['Name', 'Date', 'Address', 'District', 'City', 'Price', 'Area_sqm', 'Rooms',
                   'Floor', 'Commission_year', 'Building_floors', 'Garage', 'Balcony', 'Windows',
                   'Window_type', 'Floor_type', 'door_type', 'Leasing', 'Description', 'Link']
        with requests.Session() as session:
            while True:
                (r := session.get(f'{URL}{page + 1}')).raise_for_status()
                # past the last page the site redirects back, so the page number
                # in the final URL stops advancing
                m = re.search(r'.*page=(\d+)$', r.url)
                if m and int(m.group(1)) == page:
                    break
                page += 1
                start = timer()
                print(f'Scraping {i} bedroom apartments page {page}')
                soup = BS(r.text, 'lxml')
                for tag in soup.find_all('div', class_='list-announcement-block'):
                    _name = tag.find('a', attrs={'itemprop': 'name'})
                    name.append(_name.get('content', 'N/A'))
                    if (_link := _name.get('href', None)):
                        link.append(f'{BASE}{_link}')
                        (_r := session.get(link[-1])).raise_for_status()
                        # VIP listings lay these fields out differently, so the
                        # fixed indices below can raise IndexError or misalign
                        _spanlist = BS(_r.text, 'lxml').find_all('span', class_='value-chars')
                        floor_type.append(_spanlist[0].get_text().strip())
                        balcony.append(_spanlist[1].get_text().strip())
                        garage.append(_spanlist[2].get_text().strip())
                        window_type.append(_spanlist[3].get_text().strip())
                        door_type.append(_spanlist[4].get_text().strip())
                        windows.append(_spanlist[5].get_text().strip())
                        _alist = BS(_r.text, 'lxml').find_all('a', class_='value-chars')
                        commission_year.append(_alist[0].get_text().strip())
                        building_floors.append(_alist[1].get_text().strip())
                        area_sqm.append(_alist[2].get_text().strip())
                        floor.append(_alist[3].get_text().strip())
                        leasing.append(_alist[4].get_text().strip())
                        district.append(_alist[5].get_text().strip())
                        address.append(_alist[6].get_text().strip())
                    rooms.append(tag.find('div', class_='announcement-block__breadcrumbs').get_text().split('»')[1].strip())
                    description.append(tag.find('div', class_='announcement-block__description').get_text().strip())
                    date.append(tag.find('div', class_='announcement-block__date').get_text().split(',')[0].strip())
                    city.append(tag.find('div', class_='announcement-block__date').get_text().split(',')[1].strip())
                    # price: premium listings use a separate div, everything else
                    # carries it in the meta tag (this block was commented out before,
                    # which left price empty, so zip() truncated the frame to 0 rows)
                    if (_price := tag.find('div', class_='announcement-block__price _premium')) is not None:
                        price.append(_price.get_text().strip())
                    else:
                        _meta = tag.find('meta', attrs={'itemprop': 'price'})
                        price.append(_meta['content'] if _meta else 'N/A')
                end = timer()
                print(timedelta(seconds=end - start))
    df = pd.DataFrame(zip(name, date, address, district, city,
                          price, area_sqm, rooms, floor, commission_year,
                          building_floors, garage, balcony, windows, window_type,
                          floor_type, door_type, leasing, description, link), columns=COLUMNS)
    # clean-up has to run before the return; it used to sit after it as dead code
    df['Date'] = df['Date'].replace('Өнөөдөр', date_today)
    df['Date'] = df['Date'].replace('Өчигдөр', date_yesterday)
    df['Area_sqm'] = df['Area_sqm'].str.replace('м²', '').str.strip()
    df['Balcony'] = df['Balcony'].str.replace('тагттай', '').str.strip()
    return df


if __name__ == '__main__':
    df = main()
    df.to_csv(f'{date_today2}HPD.csv', index=False)
```
2 Answers
**Answer 1:**
This is an example of scraping multiple URLs on the same website; say the site is Amazon, the first URL is for the baby category and the second URL is for another category.
If you want to process each URL differently, you should use a separate callback for each request, along the lines of the sketch below.
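The code block that originally accompanied this answer did not survive, but a hedged reconstruction of the idea looks like this; the two Amazon URLs are placeholders, not tested endpoints:

```python
import scrapy


class TwoCategoriesSpider(scrapy.Spider):
    name = 'two_categories'

    def start_requests(self):
        # placeholder URLs: one per category, each routed to its own callback
        yield scrapy.Request('https://www.amazon.com/gp/bestsellers/baby-products/',
                             callback=self.parse_baby)
        yield scrapy.Request('https://www.amazon.com/gp/bestsellers/books/',
                             callback=self.parse_books)

    def parse_baby(self, response):
        # handling specific to the baby category
        self.logger.info('baby category: %s', response.url)

    def parse_books(self, response):
        # different handling for the other category
        self.logger.info('books category: %s', response.url)
```

If both URLs should be processed identically, listing them in `start_urls` and letting the default `parse()` handle everything is enough.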
**Answer 2:**
Scrapy is an asynchronous, callback-driven framework. The `parse()` method is the default callback for all `start_urls`. Every callback can yield one of two things: an item, or a `scrapy.Request` object to continue scraping. So if you have a multi-page scraper and want to scrape all the items, your logic would look something like this:
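The code that originally followed this sentence was stripped out; here is a sketch of the pattern being described. The pagination selector (`a.page-number`) is a placeholder for whatever the site actually exposes, and the listing selectors are reused from the question's BeautifulSoup script:

```python
import scrapy


class ApartmentsSpider(scrapy.Spider):
    name = 'apartments'
    allowed_domains = ['www.unegui.mn']
    start_urls = ['https://www.unegui.mn/l-hdlh/l-hdlh-zarna/oron-suuts-zarna/']

    def parse(self, response):
        # yield the items found on the first page
        yield from self.parse_listings(response)
        # then schedule every remaining page in one go
        pages = response.css('a.page-number::text').getall()  # placeholder selector
        last_page = int(pages[-1]) if pages else 1
        for page in range(2, last_page + 1):
            yield scrapy.Request(f'{response.url}?page={page}',
                                 callback=self.parse_listings)

    def parse_listings(self, response):
        for anchor in response.xpath(
                "//div[@class='list-announcement-block']//a[@itemprop='name']"):
            yield {
                'name': anchor.xpath('@content').get(),
                'link': response.urljoin(anchor.xpath('@href').get('')),
            }
```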
Here the spider will request the first page, then schedule requests for all the remaining pages concurrently, which means you can take full advantage of Scrapy's speed to fetch all of the listings.
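How much concurrency you actually get is governed by the project settings. These are standard Scrapy settings; the values below are picked purely for illustration:

```python
# settings.py
CONCURRENT_REQUESTS = 32             # requests kept in flight across all domains
CONCURRENT_REQUESTS_PER_DOMAIN = 8   # per-domain cap, so one site is not hammered
DOWNLOAD_DELAY = 0.25                # optional politeness delay between requests
```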