pandas Python: scraping href from td, can't get it to work correctly

Posted by 9nvpjoqh on 2023-03-21 in Python

I'm fairly new to Python and have gone through earlier questions on SO, but I haven't been able to solve this. Here is my code:

import requests
import pandas as pd
from bs4 import BeautifulSoup
from urllib.parse import urlparse

url = "https://en.wikipedia.org/wiki/List_of_curling_clubs_in_the_United_States"
data = requests.get(url).text

soup = BeautifulSoup(data, 'lxml')
table = soup.find('table', class_='wikitable sortable')

df = pd.DataFrame(columns=['Club Name', 'City/Town', 'State', 'Type', 'Sheets', 'Memberships', 'Year Founded', 'Notes', 'URL'])

for row in table.tbody.find_all('tr'):    
    # Find all data for each column
    columns = row.find_all('td')
    
    if(columns != []):
        club_name = columns[0].text.strip()
        city = columns[1].text.strip()
        state = columns[2].text.strip()
        type_arena = columns[3].text.strip()
        sheets = columns[4].text.strip()
        memberships = columns[5].text.strip()
        year_founded = columns[6].text.strip()
        notes = columns[7].text.strip()
        club_url = columns[0].find('a').get('href')
        
        df = df.append({'Club Name': club_name,  'City/Town': city, 'State': state, 'Type': type_arena, 'Sheets': sheets, 'Memberships': memberships, 'Year Founded': year_founded, 'Notes': notes, 'URL': club_url}, ignore_index=True)

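My DataFrame works correctly except for the last column. It returns "None" even though the first column clearly contains links. How can I fix this?
I have successfully scraped hrefs from sites that don't use tables, but I'm struggling to find a solution for links inside a table.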

pandas

Source: https://stackoverflow.com/questions/75793727/python-scrape-href-from-td-cant-get-it-to-work-correctly

1 answer

f0ofjuux #1

Your script contains a typo:

club_url = cols[0].find('a').get('href')

cols should be columns, and you should check that the element exists before calling methods on it:

club_url = columns[0].find('a').get('href') if columns[0].find('a') else None
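
The existence check above can be sketched end to end as follows. This is a minimal illustration under stated assumptions, not the asker's exact script: the inline HTML is a hypothetical stand-in for the Wikipedia table so the example runs offline, and first_href and BASE_URL are names introduced here for illustration. Rows are collected in a list and the DataFrame is built once, because DataFrame.append was removed in pandas 2.0.

```python
from urllib.parse import urljoin

import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical stand-in for the Wikipedia table, so the sketch runs offline.
HTML = """
<table class="wikitable sortable">
  <tbody>
    <tr><th>Club Name</th><th>City/Town</th></tr>
    <tr><td><a href="/wiki/Example_Curling_Club">Example Curling Club</a></td><td>Springfield</td></tr>
    <tr><td>Unlinked Curling Club</td><td>Shelbyville</td></tr>
  </tbody>
</table>
"""

BASE_URL = "https://en.wikipedia.org"  # assumed base for resolving relative hrefs


def first_href(cell, base_url=BASE_URL):
    """Return the absolute URL of the first link in a cell, or None if there is none."""
    link = cell.find("a", href=True)  # None when the cell has no <a href=...>
    return urljoin(base_url, link["href"]) if link else None


soup = BeautifulSoup(HTML, "html.parser")
table = soup.find("table", class_="wikitable sortable")

rows = []
for row in table.find_all("tr"):
    columns = row.find_all("td")
    if not columns:  # header rows contain only <th> cells
        continue
    rows.append({
        "Club Name": columns[0].get_text(strip=True),
        "City/Town": columns[1].get_text(strip=True),
        "URL": first_href(columns[0]),
    })

# Build the DataFrame once from the collected rows instead of appending per row.
df = pd.DataFrame(rows, columns=["Club Name", "City/Town", "URL"])
print(df)
```

Looking up the link once and reusing the result avoids calling find('a') twice per cell, and urljoin turns relative hrefs such as /wiki/... into full URLs. Cells without a link get None instead of raising AttributeError.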

Answered 2023-03-21
