How do I use Scrapy to scrape player data from RealGM?

5jvtdoz2  posted on 2022-11-09  in Other

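First off, I am trying to pull fields from RealGM, for example: https://basketball.realgm.com/player/player/Summary/1 and https://basketball.realgm.com/player/player/Summary/160000
I am trying to extract every piece of information in the player profile box, so for the first example I would want to pull: Greg Oden; C; #20; Born: Jan 22, 1988 (33 years old); Birthplace/Hometown: Buffalo, New York; Nationality: United States; Height: 7-0 (213cm); Weight: 273 (124kg); Draft: 2007 NBA Draft; Pre-Draft Team: Ohio State (Fr); High School: Lawrence North High School [Indianapolis
I have not had much success. I was able to get the code below to work for pulling the hrefs, which is not perfect, but I can work with it. The problem is that I get an error, and I think it is because not every player has the same data fields. The example above is the maximum output I would want, but some players have no birth date and some have no pre-draft team, so for those I need the spider to return a blank for that field and keep scraping. For a field like height or weight, where there is no href and the value is wrapped inside other markup, I have had no luck: whenever I reference that section it pulls a blank.
Any help would be greatly appreciated! Here is what I have so far, but I am stuck: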

import scrapy


class RealGMSpider(scrapy.Spider):
    name = "players"

    start_urls = [
        'https://basketball.realgm.com/player/player/Summary/1',
        'https://basketball.realgm.com/player/player/Summary/2',
        'https://basketball.realgm.com/player/player/Summary/160000',
    ]

    def parse(self, response):
        for player in response.css('.profile-box .container , .level-1'):
            yield {
                # The original stored the bound method `.get` without calling it.
                # Slicing keeps a SelectorList, so a missing span gives '' instead
                # of raising IndexError.
                'name': player.css('span::text')[1:2].get(default=''),
                # `.attrib` is an empty dict when nothing matches, so
                # `.attrib['href']` raises KeyError for players that lack the
                # field; `.attrib.get('href', '')` returns a blank instead.
                'link': player.css('a.selected').attrib.get('href', ''),
                'bday': player.css('.half-column-left img+ p a').attrib.get('href', ''),
                'htwn': player.css('p:nth-child(4) a').attrib.get('href', ''),
                'ntion': player.css('.half-column-left p~ p+ p a').attrib.get('href', ''),
                'cteam': player.css('.half-column-right img+ p a').attrib.get('href', ''),
                'agent': player.css('.half-column-right p:nth-child(5) a').attrib.get('href', ''),
                'draftyr': player.css('p:nth-child(6) a').attrib.get('href', ''),
                'earlyen': player.css('p:nth-child(7) a').attrib.get('href', ''),
                'drafted': player.css('p:nth-child(8) a').attrib.get('href', ''),
                'predraft': player.css('p:nth-child(9) a').attrib.get('href', ''),
                'hs': player.css('p:nth-child(10) a').attrib.get('href', ''),
            }
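
For fields such as height and weight that carry no link, one possible approach is to take the plain text of each profile line and split it on the first colon, filling a blank for any field that never shows up. The following is only a minimal sketch: it assumes each line reads `Label: value`, and the `FIELDS` list, the `parse_profile_lines` helper, and the sample lines are hypothetical, modeled on the Greg Oden example rather than on real scraped output.

```python
# Hypothetical field labels, taken from the Greg Oden example in the question.
FIELDS = ["Born", "Birthplace/Hometown", "Nationality", "Height", "Weight",
          "Draft", "Pre-Draft Team", "High School"]


def parse_profile_lines(lines, fields=FIELDS):
    """Turn 'Label: value' lines into a dict with one key per expected field.

    Fields that never appear keep an empty string, so every player
    produces a record with the same keys.
    """
    record = {field: "" for field in fields}
    for line in lines:
        # Split on the first colon only; values may contain further colons.
        label, sep, value = line.partition(":")
        label = label.strip()
        if sep and label in record:
            record[label] = value.strip()
    return record


# Sample input (assumed layout): this player has no 'Born' or 'Pre-Draft Team'.
sample = [
    "Height: 7-0 (213cm)",
    "Weight: 273 (124kg)",
    "Nationality: United States",
]
record = parse_profile_lines(sample)
print(record["Height"])      # 7-0 (213cm)
print(repr(record["Born"]))  # ''
```

Inside the spider, the input lines could come from the full text of each `<p>` element in the profile box, for example via `p.xpath('string()').get()`; whether the real page text follows the `Label: value` layout would need to be checked against the live HTML.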
ryevplcw  #1

Never mind, I was able to do it with Beautiful Soup!

import csv  # imported in the original answer; not used in this excerpt
import re

import requests
from bs4 import BeautifulSoup

url_list = ['https://basketball.realgm.com/player/player/Summary/2',
            'https://basketball.realgm.com/player/player/Summary/1']

for url in url_list:
    r = requests.get(url)
    soup = BeautifulSoup(r.text, 'html.parser')

    # The first matching div holds the player profile box.
    player = soup.find_all('div', class_='wrapper clearfix container')[0]

    # Collapse runs of blank lines so the profile text reads one item per line.
    playerprofile = re.sub(
        r'\n\s*\n', r'\n', player.get_text().strip(), flags=re.M)

    output = playerprofile + "\n"
    print(output)  # printed here for inspection
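
The snippet above imports `csv` but stops before anything is written out. As a rough sketch of how per-player records could be saved with blanks for missing fields, `csv.DictWriter` accepts a `restval` that fills any column a row lacks. The `COLUMNS` list, the `write_players` helper, and the sample rows are assumptions for illustration, not part of the original answer.

```python
import csv
import io

# Hypothetical column names; adjust to whatever fields are actually scraped.
COLUMNS = ["url", "Height", "Weight", "Born"]


def write_players(rows, fileobj, columns=COLUMNS):
    """Write one CSV row per player dict, leaving missing fields blank."""
    writer = csv.DictWriter(fileobj, fieldnames=columns, restval="",
                            extrasaction="ignore")
    writer.writeheader()
    writer.writerows(rows)


# Sample rows (illustrative only): the second player has no profile fields.
rows = [
    {"url": "https://basketball.realgm.com/player/player/Summary/1",
     "Height": "7-0 (213cm)", "Weight": "273 (124kg)"},
    {"url": "https://basketball.realgm.com/player/player/Summary/2"},
]

buffer = io.StringIO()
write_players(rows, buffer)
print(buffer.getvalue())
```

When writing to a real file instead of `io.StringIO`, the `csv` module documentation recommends opening it with `newline=""` so that line endings are not doubled on some platforms.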
