Scraping multiple pages from a box office site
Here is the code that I believe is wrong:
import requests
from bs4 import BeautifulSoup as bs

page_number = 200
while True:
    try:
        url = f'https://www.boxofficemojo.com/chart/ww_top_lifetime_gross/?area=XWW200&offset={page_number}'
        response = requests.get(url)
        response = response.content
        soup = bs(response, 'html.parser')
        if len(page_number) == 600:
            print("finished")
            break
        movies_tank = soup.find("tbody")
        detail = soup.find_all("tr")
        All_movies = []
        for de in detail:
            title = de.find("td", {"class": "a-text-left mojo-field-type-title"})
            Worldwide_gross = de.find("td", {"class": "a-text-right mojo-field-type-money"})
            year = de.find("td", {"class": "a-text-left mojo-field-type-year"})
            if title is not None:
                title = title.text
            if Worldwide_gross is not None:
                Worldwide_gross = Worldwide_gross.text.strip('$')
            if year is not None:
                year = year.text
            All_movies.append([title, Worldwide_gross, year])
        page_number =+ 200
        print(url)
    except:
        break
I'm trying to scrape multiple pages, but I only get one page, and I can't find the error in the while loop.
2 answers

kkbh8khc 1#
This answer was written assuming your bs4 version is greater than 4.8. It relies on pseudo-CSS selectors in the script, a feature bs4 only supports in versions >= 4.8.
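The rest of answer 1 (its script) is not included in this excerpt. As a rough sketch only, not the answerer's actual code, a select()-based scrape of a single chart page could look like the following; the td class names are copied from the question's code, and the :has() pseudo-selector is the kind of feature that needs a recent bs4/soupsieve:

import requests
from bs4 import BeautifulSoup

# Sketch: fetch one page of the chart and pull title / worldwide gross / year
# using CSS selectors instead of find()/find_all().
url = 'https://www.boxofficemojo.com/chart/ww_top_lifetime_gross/?area=XWW200&offset=200'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

movies = []
# Keep only table rows that actually contain a title cell (this skips the header row).
for row in soup.select('tr:has(td.mojo-field-type-title)'):
    title = row.select_one('td.mojo-field-type-title').get_text(strip=True)
    gross = row.select_one('td.mojo-field-type-money').get_text(strip=True).lstrip('$')
    year = row.select_one('td.mojo-field-type-year').get_text(strip=True)
    movies.append([title, gross, year])

print(movies[:5])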
iovurdzv 2#
You have a problem with the page count and the increment.
Change:
len(page_number) == 600
to: page_number == 600
and change:
page_number =+ 200
to: page_number += 1
Full code:
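The answer's full code is not reproduced in this excerpt. Below is a minimal reconstruction, not the answerer's exact script, that applies the two fixes above; as an assumption of mine, it steps the offset by 200 per request (the chart serves 200 rows per page) rather than by 1:

import requests
from bs4 import BeautifulSoup as bs

page_number = 200
all_movies = []

while True:
    if page_number == 600:  # fix 1: compare the integer directly, not len(page_number)
        print("finished")
        break
    url = f'https://www.boxofficemojo.com/chart/ww_top_lifetime_gross/?area=XWW200&offset={page_number}'
    soup = bs(requests.get(url).content, 'html.parser')
    for row in soup.find_all("tr"):
        title = row.find("td", {"class": "a-text-left mojo-field-type-title"})
        gross = row.find("td", {"class": "a-text-right mojo-field-type-money"})
        year = row.find("td", {"class": "a-text-left mojo-field-type-year"})
        if title and gross and year:  # skip the header row and incomplete rows
            all_movies.append([title.text, gross.text.strip('$'), year.text])
    print(url)
    page_number += 200  # fix 2: += instead of =+ (200 = one page; the answer itself suggests += 1)

print(len(all_movies), "rows collected")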