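This is my first time actively using StackOverflow, so please excuse any mistakes. I'm currently writing a Python 3 script that is supposed to scrape the icon, name, and price of items on the Steam Community Market. Extracting and formatting the data works as expected. The site uses pagination, so I have to issue multiple GET requests to cover all 169 pages. My approach is to use a for loop and insert the loop variable into the URL, since I noticed that the current page number is part of it.
My problem is that when I run the script and print the arrays that should hold the data, 90% of the data is identical (for example, the contents of page 2 are added to the array 7 times).
I'm not sure how to fix this and get the correct data from the requests.
I hope this description is clear enough; thanks in advance for any help.
Here is the source code: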
import requests
from bs4 import BeautifulSoup
import time
import json as json

def main():
    name_arr = []
    img_arr = []
    price_arr = []
    for i in range(1,11): # later change to 169 pages
        url = f"https://steamcommunity.com/market/search?q=&category_730_ItemSet%5B%5D=any&category_730_ProPlayer%5B%5D=any&category_730_StickerCapsule%5B%5D=any&category_730_TournamentTeam%5B%5D=any&category_730_Weapon%5B%5D=any&category_730_Exterior%5B%5D=tag_WearCategory2&category_730_Quality%5B%5D=tag_normal&category_730_Quality%5B%5D=tag_unusual&appid=730#p{i}_popular_desc"
        print(url)
        r = requests.get(url)
        print("----------------------------------- on : " + str(i) + "right now")
        print(r.status_code)
        soup = BeautifulSoup(r.content, "html.parser")
        images = soup.find_all("img", class_="market_listing_item_img")
        names = soup.find_all("span", class_="market_listing_item_name")
        prices = soup.find_all("span", class_="sale_price")

        def extract_text(list, list_arr):
            for x in list:
                name_only = x.text.replace("(Field-Tested)", "").strip()
                list_arr.append(name_only)

        def extract_src(list, list_arr):
            for x in list:
                list_arr.append(x["src"])

        extract_text(names, name_arr)
        extract_text(prices,price_arr)
        extract_src(images, img_arr)
        time.sleep(60)
    print(name_arr)
    print(price_arr)
    print(img_arr)
    with open('output.json', 'w') as f:
        # Write the array to file as JSON
        json.dump(name_arr, f)
    # amount = float(dollars.replace("$", "").strip())

if __name__ == "__main__":
    main()
Here is the terminal output; note how the same names appear in it multiple times:
❯ python3 webscrape.py
['P90 | Blind Spot', 'SCAR-20 | Cardiac', 'Five-SeveN | Contractor', 'PP-Bizon | Forest Leaves', 'XM1014 | Urban Perforated', 'Sawed-Off | Irradiated Alert', 'SG 553 | Tornado', 'P250 | Mehndi', 'FAMAS | Commemoration', 'XM1014 | Blaze Orange', 'P90 | Blind Spot', 'SCAR-20 | Cardiac', 'Five-SeveN | Contractor', 'PP-Bizon | Forest Leaves', 'XM1014 | Urban Perforated', 'Sawed-Off | Irradiated Alert', 'SG 553 | Tornado', 'P250 | Mehndi', 'FAMAS | Commemoration', 'XM1014 | Blaze Orange', 'P90 | Blind Spot', 'SCAR-20 | Cardiac', 'Five-SeveN | Contractor', 'PP-Bizon | Forest Leaves', 'XM1014 | Urban Perforated', 'Sawed-Off | Irradiated Alert', 'SG 553 | Tornado', 'P250 | Mehndi', 'FAMAS | Commemoration', 'XM1014 | Blaze Orange', 'Sawed-Off | Highwayman', 'Galil AR | Shattered', 'AUG | Torque', 'SG 553 | Tornado', 'Dual Berettas | Briar', 'SG 553 | Wave Spray', 'Five-SeveN | Kami', 'FAMAS | Contrast Spray', 'MAG-7 | Chainmail', 'Sawed-Off | Serenity', 'P90 | Blind Spot', 'SCAR-20 | Cardiac', 'Five-SeveN | Contractor', 'PP-Bizon | Forest Leaves', 'XM1014 | Urban Perforated', 'Sawed-Off | Irradiated Alert', 'SG 553 | Tornado', 'P250 | Mehndi', 'FAMAS | Commemoration', 'XM1014 | Blaze Orange', 'P90 | Blind Spot', 'SCAR-20 | Cardiac', 'Five-SeveN | Contractor', 'PP-Bizon | Forest Leaves', 'XM1014 | Urban Perforated', 'Sawed-Off | Irradiated Alert', 'SG 553 | Tornado', 'P250 | Mehndi', 'FAMAS | Commemoration', 'XM1014 | Blaze Orange', 'P90 | Blind Spot', 'SCAR-20 | Cardiac', 'Five-SeveN | Contractor', 'PP-Bizon | Forest Leaves', 'XM1014 | Urban Perforated', 'Sawed-Off | Irradiated Alert', 'SG 553 | Tornado', 'P250 | Mehndi', 'FAMAS | Commemoration', 'XM1014 | Blaze Orange', 'Sawed-Off | Highwayman', 'Galil AR | Shattered', 'AUG | Torque', 'SG 553 | Tornado', 'Dual Berettas | Briar', 'SG 553 | Wave Spray', 'Five-SeveN | Kami', 'FAMAS | Contrast Spray', 'MAG-7 | Chainmail', 'Sawed-Off | Serenity', 'Sawed-Off | Highwayman', 'Galil AR | Shattered', 'AUG | Torque', 'SG 553 | 
Tornado', 'Dual Berettas | Briar', 'SG 553 | Wave Spray', 'Five-SeveN | Kami', 'FAMAS | Contrast Spray', 'MAG-7 | Chainmail', 'Sawed-Off | Serenity', 'P90 | Blind Spot', 'SCAR-20 | Cardiac', 'Five-SeveN | Contractor', 'PP-Bizon | Forest Leaves', 'XM1014 | Urban Perforated', 'Sawed-Off | Irradiated Alert', 'SG 553 | Tornado', 'P250 | Mehndi', 'FAMAS | Commemoration', 'XM1014 | Blaze Orange']
1 Answer
c0vxltue (answer #1)
The data you see on the page is loaded from a different URL with the help of JavaScript. You can simulate that request with the requests module.
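The code from the answer is missing here, so what follows is only a minimal sketch of the idea, not the answerer's original snippet. Some background first: the `#p{i}_popular_desc` part of the question's URL is a fragment, and fragments are never sent to the server, so every request in the loop asks for the same page. The sketch assumes the listings come from a JSON endpoint at `/market/search/render/` that accepts `start`, `count`, `sort_column`, `sort_dir` and `norender` parameters, and that each result carries `name` and `sell_price_text` fields. The endpoint, the parameter names and the field names are all assumptions to be checked in the browser's network tab. The helper names (`build_params`, `parse_results`, `fetch_page`) are hypothetical.

```python
# Assumption: the market page fills its listing table from this endpoint.
# The URL and parameter names are not from the original post; verify them
# in the browser's network tab before relying on them.
RENDER_URL = "https://steamcommunity.com/market/search/render/"
PAGE_SIZE = 10  # the question's output shows 10 listings per page


def build_params(page, page_size=PAGE_SIZE):
    """Translate a 1-based page number into query parameters (assumed names)."""
    return {
        "query": "",
        "start": (page - 1) * page_size,  # offset of the first listing
        "count": page_size,
        "sort_column": "popular",
        "sort_dir": "desc",
        "appid": 730,
        # Filters copied from the query string of the question's URL.
        "category_730_Exterior[]": "tag_WearCategory2",
        "category_730_Quality[]": ["tag_normal", "tag_unusual"],
        "norender": 1,  # assumed flag: ask for plain JSON instead of HTML
    }


def parse_results(payload):
    """Extract (name, price) pairs from a decoded response (assumed fields)."""
    return [
        (
            item["name"].replace("(Field-Tested)", "").strip(),
            item["sell_price_text"],
        )
        for item in payload.get("results", [])
    ]


def fetch_page(session, page):
    """Fetch one page; `session` is expected to be a requests.Session."""
    response = session.get(RENDER_URL, params=build_params(page), timeout=30)
    response.raise_for_status()
    return parse_results(response.json())
```

To use it, create a session with `requests.Session()`, call `fetch_page(session, page)` for each page number in a loop, and keep a pause between requests as the original script does. Icon URLs are left out because the field that holds them is not known from the post.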