I'm new to web scraping with Python and have been trying to collect data from books.toscrape.com and export it to a CSV file. I want to use a for loop to collect all the data from all 50 pages. Here is what I have so far:
#What I need to extract from the page in order to perform my analysis:
# -Pages
# -Prices
# -Ratings
# -Title
# -URLs(images)
import bs4
from bs4 import BeautifulSoup
import requests
import pandas as pd
#Creating empty lists to append the extracted data to later.
pagesList=[]
pricesList=[]
ratingsList=[]
titleList=[]
urlsList=[]
#Dictionary containing all of the data scraped from the site.
book_data={'Title':titleList,'Price':pricesList,'Ratings':ratingsList,'URL':urlsList}
#Number of pages to be selected
no_of_pages = 50
#Looping through the required pages and selecting the pages accordingly.
for i in range(1,no_of_pages+1): #to include the last page
    #This url variable follows the generic URL, substituting in the page number on each pass of the loop.
    url='https://books.toscrape.com/catalogue/page-{}.html'.format(i)
    pagesList.append(url) #adding the URL of each page to the pages list
print("Number of pages: ",len(pagesList))
print(pagesList)
#Using requests to get the items from the page and convert it from request object to beautiful soup object, then checking it.
for item in pagesList:
    page=requests.get(item)
    soup=bs4.BeautifulSoup(page.text,'html.parser')
    #Structures the output when printing the soup variable
    print(soup.prettify())
#Get all of the titles for the title list and append them to the title list without the tags
for t in soup.findAll('h3'):
    titles=t.getText()
    titleList.append(titles)
#print(titleList)
#Find all of the prices (they are in the <p> tag) and display them to check that it worked.
for p in soup.find_all('p', class_='price_color'): #the prices live in <p> tags with the 'price_color' class
    price=p.getText()
    pricesList.append(price)
#print(pricesList)
#Finding the ratings and adding them to the ratingsList out of the "star-rating" class
for s in soup.find_all('p', class_='star-rating'):
    for k,v in s.attrs.items(): #k = 'class' and v = ['star-rating', '<rating word>']
        star=v[1] #using indexing to get the string value of the star rating
        ratingsList.append(star) #appending to the list
        print(star) #the ratings list now contains all of the star ratings of the books
#print(ratingsList)
#Finding all of the image URLs in the image_container class
divs=soup.find_all('div', class_='image_container') #fetching all of the divs in this image_container class
#print(divs)
for thumbs in divs:
    tags=thumbs.find('img', class_='thumbnail')
    #print(tags)
    links='https://books.toscrape.com/' + str(tags['src'])
    newlinks=links.replace('..','') #strip the relative '..' segments from the path
    urlsList.append(newlinks) #the URL list now contains all of the URLs of the book images
#print(urlsList)
#Dictionary containing all of the data scraped from the site.
web_data={'Title':titleList,'Price':pricesList,'Ratings':ratingsList,'URL':urlsList}
#Making sure all of the lists are the same length, otherwise the Pandas DataFrame construction will not work
print(len(titleList))
print(len(pricesList))
print(len(ratingsList))
print(len(urlsList))
#Converting dictionary to a Pandas dataframe
df=pd.DataFrame(web_data)
#Checking the dataframe conversion worked
df
#Changing the index to start from 1 instead of 0
df.index+=1
#Getting rid of the currency symbol in the Price column
#[x.strip('£') for x in df.Price]
df['Price']=df['Price'].str.replace('£','')
#Sort by the highest price
df.sort_values(by='Price',ascending=False, inplace = True) #so the changes reflect in the original sort
#Converting the Ratings column from string to the corresponding integer
df['Ratings']=df['Ratings'].replace({'Three':3,'One':1,'Two':2,'Four':4,'Five':5})
df
#Checking the data types to make sure price and ratings converted to correct dtypes
df.dtypes
#Converting the price column from object to float
df['Price']=df['Price'].astype(float)
df.dtypes
df.to_csv('bookstore.csv')
It looks like the exported CSV only contains 20 rows of data, and I'm not sure which pages they actually come from. I've spent more time on this than I'd like to admit, and I'm embarrassed because I'm probably missing something super simple. I'd really appreciate any help with this.
I suspect the problem is in this piece of code:
#Number of pages to be selected
no_of_pages = 50
#Looping through the required pages and selecting the pages accordingly.
for i in range(1,no_of_pages+1): #to include the last page
    #This url variable follows the generic URL, substituting in the page number on each pass of the loop.
    url='https://books.toscrape.com/catalogue/page-{}.html'.format(i)
    pagesList.append(url) #adding the URL of each page to the pages list
I just can't see what's wrong with it.
I'm only getting 20 rows of data, when I should be getting roughly 1000.
Thanks in advance.
1 Answer
You are iterating over pagesList, but you only ever operate on the last soup, because the indentation of the loops that follow it is wrong: they run once after the page loop finishes, instead of once per page inside it. Correct that and you will get much closer to the result you want:
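A minimal sketch of that fix, reusing the names from the question (pagesList, titleList, and so on); the key change is simply that every parsing loop is indented under for item in pagesList::

for item in pagesList:
    page = requests.get(item)
    soup = bs4.BeautifulSoup(page.text, 'html.parser')
    # All of the parsing now happens here, once per page, rather than
    # once at the end on the soup of the final page only.
    for t in soup.findAll('h3'):
        titleList.append(t.getText())
    for p in soup.find_all('p', class_='price_color'):
        pricesList.append(p.getText())
    for s in soup.find_all('p', class_='star-rating'):
        ratingsList.append(s['class'][1])  # ['star-rating', 'Three'] -> 'Three'
    for thumbs in soup.find_all('div', class_='image_container'):
        img = thumbs.find('img', class_='thumbnail')
        urlsList.append('https://books.toscrape.com/' + img['src'].replace('..', ''))

With the loops indented this way, each of the 50 pages contributes its 20 books, giving the roughly 1000 rows you expected.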
Or use the example below, which avoids several of the intermediate loops and lists.
Example
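Below is a sketch of one such single-pass approach (one possible implementation of that idea, not necessarily the answerer's exact code): build one dict per book as it is scraped and construct the DataFrame once at the end, so the four columns can never drift out of alignment.

import requests
import pandas as pd
from bs4 import BeautifulSoup

rating_map = {'One': 1, 'Two': 2, 'Three': 3, 'Four': 4, 'Five': 5}
books = []

for i in range(1, 51):
    url = 'https://books.toscrape.com/catalogue/page-{}.html'.format(i)
    resp = requests.get(url)
    resp.encoding = 'utf-8'  # the site serves UTF-8; prevents mojibake around the '£' sign
    soup = BeautifulSoup(resp.text, 'html.parser')
    # Each book on the page sits in its own <article class="product_pod">.
    for article in soup.find_all('article', class_='product_pod'):
        img = article.find('img')
        books.append({
            'Title': img['alt'],  # the alt text carries the full, untruncated title
            'Price': float(article.find('p', class_='price_color').get_text().lstrip('£')),
            'Ratings': rating_map[article.find('p', class_='star-rating')['class'][1]],
            'URL': 'https://books.toscrape.com/' + img['src'].replace('../', ''),
        })

df = pd.DataFrame(books)
df.index += 1
df.sort_values(by='Price', ascending=False, inplace=True)
df.to_csv('bookstore.csv')

Because the price and rating are converted to numbers as they are collected, the sort by Price is numeric rather than lexicographic, and the CSV comes out with 1000 rows (50 pages x 20 books).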