Python web scraping can't get multiple pages

qgzx9mmu · posted on 2023-05-21 in Python

I'm trying to scrape multiple pages from a box-office website.
Below is the code I think is wrong:
page_number = 200
while True:
    try:
        url = f'https://www.boxofficemojo.com/chart/ww_top_lifetime_gross/?area=XWW200&offset={page_number}'
        response = requests.get(url)
        response = response.content
        soup = bs(response, 'html.parser')
        if len(page_number) == 600:
            print("finished")
            break
        movies_tank = soup.find("tbody")
        detail = soup.find_all("tr")
        All_movies = []
        for de in detail:
            title = de.find("td", {"class": "a-text-left mojo-field-type-title"})
            Worldwide_gross = de.find("td", {"class": "a-text-right mojo-field-type-money"})
            year = de.find("td", {"class": "a-text-left mojo-field-type-year"})
            if title is not None:
                title = title.text
            if Worldwide_gross is not None:
                Worldwide_gross = Worldwide_gross.text.strip('$')
            if year is not None:
                year = year.text
            All_movies.append([title, Worldwide_gross, year])
        page_number =+ 200
        print(url)
    except:
        break

I'm trying to scrape multiple pages but only get one, and I can't figure out where the error in the while loop is.

kkbh8khc

I wrote this answer assuming your bs4 version is greater than 4.8. It relies on pseudo-CSS selectors (:not(), :has()) in the script, which bs4 supports only from that version onward.

import requests
from bs4 import BeautifulSoup

link = 'https://www.boxofficemojo.com/chart/ww_top_lifetime_gross/'

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36',
}
params = {
    'area': 'XWW200',
    'offset': 1
}

with requests.Session() as s:
    s.headers.update(headers)

    all_movies = []

    while params['offset'] <= 601:   # the maximum offset you would like your script to iterate through
        res = s.get(link, params=params)
        soup = BeautifulSoup(res.content, "html.parser")
        for item in soup.select("#table table tr:not(:has(th))"):
            title = item.select_one("td.mojo-field-type-title").get_text(strip=True)
            wwl_gross = item.select_one("td.mojo-field-type-money").get_text(strip=True).strip("$")
            year = item.select_one("td.mojo-field-type-year").get_text(strip=True)
            all_movies.append([title, wwl_gross, year])
            print(title, wwl_gross, year)
        params['offset'] += 200

    print(all_movies)
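If you want to persist the scraped rows instead of just printing them, here is a minimal sketch using the standard-library csv module. The movies.csv filename and the sample rows are placeholders, not part of the original answer; in practice you would pass the all_movies list built above.

```python
import csv

# Hypothetical sample rows in the same [title, worldwide_gross, year]
# shape that all_movies collects above
all_movies = [
    ["Example Movie A", "1,000,000", "2001"],
    ["Example Movie B", "2,000,000", "2002"],
]

with open("movies.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["title", "worldwide_gross", "year"])  # header row
    writer.writerows(all_movies)
```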

iovurdzv

You have problems with the page-number check and the increment.
Change: len(page_number) == 600 → page_number == 600
And change: page_number =+ 200 → page_number += 1
Full code:

import requests
from bs4 import BeautifulSoup as bs
page_number = 1

while True:
    try:
        url = (f'https://www.boxofficemojo.com/chart/ww_top_lifetime_gross/?area=XWW200&offset={page_number}')

        response = requests.get(url)
        response = response.content

        soup = bs(response, 'html.parser')

        if page_number == 600:
            print("finished")
            break

        movies_tank = soup.find("tbody")
        detail = soup.find_all("tr")
        All_movies = []
        for de in detail:
            title = de.find("td", {"class": "a-text-left mojo-field-type-title"})
            Worldwide_gross = de.find("td", {"class": "a-text-right mojo-field-type-money"})
            year = de.find("td", {"class": "a-text-left mojo-field-type-year"})
            # if title is not None:
            if title is not None:
                title = (title.text)
            if Worldwide_gross is not None:
                Worldwide_gross = (Worldwide_gross.text.strip('$'))
            if year is not None:
                year = (year).text
            All_movies.append([title, Worldwide_gross, year])
        print(url)
        page_number += 1

    except:
        break
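For anyone wondering why the original loop stopped after a single page, a minimal sketch of the two bugs in isolation (plain Python, no scraping needed):

```python
page_number = 200
page_number =+ 200          # parsed as: page_number = (+200), not "+= 200"
assert page_number == 200   # the counter never advances, same offset forever

# len() on an int raises TypeError; the bare `except:` in the question
# silently swallows it and breaks out of the loop after the first request
caught = False
try:
    len(page_number)
except TypeError:
    caught = True
print("counter stuck at", page_number, "- TypeError swallowed:", caught)
```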
