python-3.x - Trying to scrape data from books.toscrape.com with a for loop, but only getting data from 1 page instead of all 50 pages

Asked by jjjwad0x on 2023-04-13 in Python

I'm new to web scraping with Python and have been trying to collect data from books.toscrape.com and export it to a CSV file. I want to use a for loop to collect the data from all 50 pages. This is what I have so far:

#What I need to extract from the page in order to perform my analysis:
# -Pages
# -Prices
# -Ratings
# -Title
# -URLs(images)

import bs4
from bs4 import BeautifulSoup
import requests
import pandas as pd

#Creating empty lists to append the extracted data to later.
pagesList=[]
pricesList=[]
ratingsList=[]
titleList=[]
urlsList=[]

#Dictionary containing all of the data scraped from the site.
book_data={'Title':titleList,'Price':pricesList,'Ratings':ratingsList,'URL':urlsList}

#Number of pages to be selected
no_of_pages = 50

#Looping through the required pages and selecting the pages accordingly.
for i in range(1,no_of_pages+1): #to include the last page
    #This url variable will follow the generic URL replacing the page number according to the output of the for loop.
    url=('https://books.toscrape.com/catalogue/page-{}.html'.format(i))
    pagesList.append(url) #adding all of the content of the respective pages to the pages list

print("Number of pages: ",len(pagesList))
print(pagesList)

#Using requests to get the items from the page and convert it from request object to beautiful soup object, then checking it.
for item in pagesList:
    page=requests.get(item)
    soup=bs4.BeautifulSoup(page.text,'html.parser') 
    
#Structures the output when printing the soup variable
print(soup.prettify())

#Get all of the titles for the title list and append them to the title list without the tags
for t in soup.findAll('h3'):
    titles=t.getText()
    titleList.append(titles)

#print(titleList)

#Find all of the prices (they are in the <p> tag) and display them to check that it worked.
for p in soup.find_all('p', class_='price_color'): #prices live in the <p> tag with class 'price_color'
    price=p.getText()
    pricesList.append(price)
    
#print(pricesList)

#Finding the ratings and adding them to the ratingsList out of the "star-rating" class
for s in soup.find_all('p', class_='star-rating'):
    for k,v in s.attrs.items(): #k is 'class' and v is the class list, e.g. ['star-rating', 'Three']
        star=v[1] #using indexing to get the string value of the star-rating value
        ratingsList.append(star) #appending to the list
        print(star) #the star list now contains all of the star-ratings of the books

#print(ratingsList)

#Finding all of the image URLs in the image_container class
divs=soup.find_all('div', class_='image_container') #fetching all of the divs in this image_container class
#print(divs)
for thumbs in divs:
    tags=thumbs.find('img', class_='thumbnail')
    #print(tags)
    links='https://books.toscrape.com/' + str(tags['src'])
    newlinks=links.replace('..','') #to get rid of the dots that appear and replace them with nothing
    urlsList.append(newlinks) #the URL list now contains all of the URLs of the book images

#print(urlsList)

#Dictionary containing all of the data scraped from the site.
web_data={'Title':titleList,'Price':pricesList,'Ratings':ratingsList,'URL':urlsList}

#Making sure all of the lists are the same length, otherwise the Pandas DataFrame construction will not work
print(len(titleList))
print(len(pricesList))
print(len(ratingsList))
print(len(urlsList))

#Converting dictionary to a Pandas dataframe
df=pd.DataFrame(web_data)
#Checking the dataframe conversion worked
df

#Changing the index to start from 1 instead of 0
df.index+=1

#Getting rid of the currency symbol in the Price column
#[x.strip('£') for x in df.Price]
df['Price']=df['Price'].str.replace('£','')

#Sort by the highest price
df.sort_values(by='Price',ascending=False, inplace = True) #so the changes reflect in the original sort

#Converting the Ratings column from string to the corresponding integer
df['Ratings']=df['Ratings'].replace({'Three':3,'One':1,'Two':2,'Four':4,'Five':5})
df

#Checking the data types to make sure price and ratings converted to correct dtypes
df.dtypes

#Converting the price column from object to float
df['Price']=df['Price'].astype(float)
df.dtypes

df.to_csv('bookstore.csv')

The exported CSV only seems to contain 20 rows of data, and I'm not sure exactly which page they come from. I've spent more time on this than I'd like to admit, and I'm embarrassed because I've probably missed something super simple. I'd really appreciate any help with this.
I suspect the problem is in this piece of code:

#Number of pages to be selected
no_of_pages = 50

#Looping through the required pages and selecting the pages accordingly.
for i in range(1,no_of_pages+1): #to include the last page
    #This url variable will follow the generic URL replacing the page number according to the output of the for loop.
    url=('https://books.toscrape.com/catalogue/page-{}.html'.format(i))
    pagesList.append(url) #adding all of the content of the respective pages to the pages list

I just can't see where the problem is.
I'm only getting 20 rows of data when I should be getting around 1000.
Thanks in advance.

Answer 1, by ergxz8rk:

You are iterating over pagesList, but you only ever process the last soup, because the indentation of the loops that follow it is wrong:

...
for item in pagesList:
    page=requests.get(item)
    soup=bs4.BeautifulSoup(page.text,'html.parser')

for ...

So you can fix that behaviour, and get closer to the result you want, by indenting those loops into the page loop:

...
for item in pagesList:
    page=requests.get(item)
    soup=bs4.BeautifulSoup(page.text,'html.parser')

    for ...
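
Not part of the original answer, but to make the fix concrete, here is a minimal sketch of the corrected layout using two of the question's extraction loops (titles and prices); the remaining loops for ratings and image URLs would be indented the same way:

import requests
from bs4 import BeautifulSoup

pagesList = ['https://books.toscrape.com/catalogue/page-{}.html'.format(i)
             for i in range(1, 51)]
titleList = []
pricesList = []

for item in pagesList:
    page = requests.get(item)
    soup = BeautifulSoup(page.text, 'html.parser')

    #Indented one level: these now run once per page instead of once after the loop ends.
    for t in soup.find_all('h3'):
        titleList.append(t.get_text())

    for p in soup.find_all('p', class_='price_color'):
        pricesList.append(p.get_text())

print(len(titleList)) #expect 1000 (50 pages x 20 books), not 20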

Or use the example below, which avoids several of the loops and lists.

Example
import requests
from bs4 import BeautifulSoup
import pandas as pd

book_data=[]

no_of_pages = 50

for i in range(1,no_of_pages+1):
    url=('https://books.toscrape.com/catalogue/page-{}.html'.format(i))
    page=requests.get(url)
    soup=BeautifulSoup(page.text,'html.parser')
    for e in soup.select('article'):
        book_data.append({
            'title':e.h3.get_text(strip=True),
            'price':e.select_one('.price_color').text[2:],
            'additional':'data'
        })

pd.DataFrame(book_data)
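
Also not part of the original answer, but for completeness, a sketch of how the same selector-based loop could collect the ratings and image URLs the question needs and write everything to a CSV. The selectors follow the page structure implied by the question's own code (star-rating, price_color, image_container); the column names and the rating mapping are assumptions, not something the answer specifies:

import requests
from bs4 import BeautifulSoup
import pandas as pd

rating_map = {'One': 1, 'Two': 2, 'Three': 3, 'Four': 4, 'Five': 5}
book_data = []

for i in range(1, 51):
    url = 'https://books.toscrape.com/catalogue/page-{}.html'.format(i)
    page = requests.get(url)
    page.encoding = 'utf-8' #the site is UTF-8; avoids the mojibake the [2:] slice works around
    soup = BeautifulSoup(page.text, 'html.parser')
    for e in soup.select('article.product_pod'):
        book_data.append({
            'Title': e.h3.a['title'], #the link's title attribute holds the full, untruncated title
            'Price': float(e.select_one('.price_color').text.lstrip('£')),
            'Ratings': rating_map.get(e.p['class'][1]), #second class on <p class="star-rating Three">
            'URL': 'https://books.toscrape.com/' + e.img['src'].replace('../', '')
        })

df = pd.DataFrame(book_data)
df.to_csv('bookstore.csv', index=False) #roughly 1000 rows: 50 pages x 20 books each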
