pandas: merging two pandas DataFrames extracted via web scraping

5lhxktic · asked 2022-11-27

I am trying to build a database of US universities. I have been using Beautiful Soup and pandas for this, but I ran into difficulty because there are several tables to scrape on each page. To recombine the data extracted from the two tables, I tried using .merge(), but with no success at all.
My code is as follows:

import requests
from bs4 import BeautifulSoup as bs
import pandas as pd

# Connecticut
url = 'https://en.wikipedia.org/wiki/List_of_colleges_and_universities_in_Connecticut'
soup = bs(requests.get(url).text, 'html.parser')
table = soup.find_all('table')

#Extracting a df for each table
df1 = pd.read_html(str(table))[0]
df1.rename(columns = {'Enrollment(2020)[4]': 'Enrollment', 'Founded[5]':'Founded'}, inplace = True)

df2 = pd.read_html(str(table))[1]
df2=df2.drop(['Type','Ref.'], axis=1)

df_Connecticut=df1.merge(df2, on=['School','Location','Control','Founded'])
df_Connecticut

I have tried to do the same with other states, but I run into the same problem:

# Maine
url = 'https://en.wikipedia.org/wiki/List_of_colleges_and_universities_in_Maine'
soup = bs(requests.get(url).text, 'html.parser')
table = soup.find_all('table')

#Extracting a df for each table
df1 = pd.read_html(str(table))[0]
df1=df1.drop(['Type[a]'], axis=1)
df1.rename(columns = {'Location(s)': 'Location', 'Enrollment (2019)[b]':'Enrollment'}, inplace = True)
df1 = df1.astype({'School':'string','Location':'string','Control':'string','Enrollment':'string','Founded':'string'})

df2 = pd.read_html(str(table))[1]
df2=df2.drop(['Cite'], axis=1)
df2.rename(columns = {'Location(s)': 'Location'}, inplace = True)
df2 = df2.astype({'School':'string','Location':'string','Founded':'string','Closed':'string'})

df_Maine = df1.merge(df2, on=['School', 'Location', 'Founded'])
df_Maine

I am a complete beginner in Python.
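One way to see why a merge comes back empty is to pass indicator=True on an outer merge. A minimal sketch with made-up rows, assuming (as often happens with Wikipedia tables) that one table's Founded values carry footnote markers like "[5]" and the other's do not:

```python
import pandas as pd

# Toy frames standing in for the two scraped tables (hypothetical data).
# The Founded values carry a Wikipedia footnote marker in one table only,
# so the merge keys never line up and an inner merge returns zero rows.
df1 = pd.DataFrame({'School': ['Bates College'], 'Founded': ['1855']})
df2 = pd.DataFrame({'School': ['Bates College'], 'Founded': ['1855[5]']})

print(len(df1.merge(df2, on=['School', 'Founded'])))  # 0 rows survive

# indicator=True labels each row's origin, making the mismatch visible:
# every row shows up as left_only or right_only, never "both".
diag = df1.merge(df2, on=['School', 'Founded'], how='outer', indicator=True)
print(diag[['School', 'Founded', '_merge']])
```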

qnakjoqk1#

I agree with @Clegane: you probably mean concat, since merge implies that both frames must contain values that can be matched.

pd.concat([df1,df2], ignore_index=True)

| | School | Location | Control | Carnegie Classification | Enrollment | Founded | Closed |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | Albertus Magnus College | New Haven | Private (Catholic) | Masters university | 1793 | 1925 | NaN |
| 1 | Asnuntuck Community College | Enfield | Public | Associates college | 2388 | 1969[6] | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 50 | Silvermine College of Art | New Canaan | Private | NaN | NaN | 7710 | 7710 |
| 51 | Wilcox College of Nursing | Middletown | Private | NaN | NaN | 1908 | 1997 |
Otherwise, you should set how to do an outer join:

df_Connecticut=df1.merge(df2, on=['School','Location','Control','Founded'], how='outer')
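Since the active and defunct tables list disjoint schools, an inner merge on the shared columns is empty while an outer merge keeps every row and fills the missing columns with NaN. A minimal sketch with made-up rows:

```python
import pandas as pd

# Hypothetical mini versions of the two tables: active schools carry an
# Enrollment column, defunct schools carry a Closed column.
df1 = pd.DataFrame({'School': ['School A'], 'Location': ['Town A'],
                    'Founded': ['1701'], 'Enrollment': [14910]})
df2 = pd.DataFrame({'School': ['School B'], 'Location': ['Town B'],
                    'Founded': ['1908'], 'Closed': ['1971']})

# Inner merge: no school appears in both tables, so the result is empty.
print(len(df1.merge(df2, on=['School', 'Location', 'Founded'])))  # 0

# Outer merge keeps both rows; Enrollment is NaN for the defunct school
# and Closed is NaN for the active one.
out = df1.merge(df2, on=['School', 'Location', 'Founded'], how='outer')
print(out)  # 2 rows
```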

sqserrrh2#

Here is a complete example for the two states you mentioned. Of course, you can include other states in the list as well.

import requests
from bs4 import BeautifulSoup as bs
import pandas as pd

pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36'
}
s = requests.Session()
s.headers.update(headers)
big_df = pd.DataFrame()
states = ['Maine', 'Connecticut']

for x in states:
    r = s.get(f'https://en.wikipedia.org/wiki/List_of_colleges_and_universities_in_{x}')
    soup = bs(r.text, 'html.parser')
    df = pd.read_html(str(soup))[0]
    df.columns = ['School', 'Location', 'Control', 'Type', 'Enrollment', 'Founded']
    df['State'] = x
    big_df = pd.concat([big_df, df], axis=0, ignore_index=True)
print(big_df)

Terminal output:

School  Location    Control Type    Enrollment  Founded State
0   Bates College   Lewiston    Private Baccalaureate college   1964    1855    Maine
1   Beal University Bangor  Private (for-profit)    Associates college  229 1891[5] Maine
2   Bowdoin College Brunswick   Private Baccalaureate college   1969    1794    Maine
3   Central Maine Community College Auburn  Public  Associates college  3746    1963[6] Maine
4   Colby College   Waterville  Private Baccalaureate college   2150    1813[7] Maine
... ... ... ... ... ... ... ...
61  University of New Haven West Haven  Private Masters university  7587    1920    Connecticut
62  University of Saint Joseph  West Hartford   Private (Catholic)  Masters university  2975    1932    Connecticut
63  Wesleyan University Middletown  Private Baccalaureate college   3599    1831    Connecticut
64  Western Connecticut State University    Danbury Public  Masters university  6463    1903    Connecticut
65  Yale University New Haven   Private Doctoral university 14910   1701    Connecticut
66 rows × 7 columns
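As an optional post-processing step (my addition, not part of the answer above), the Wikipedia footnote markers such as "[5]" can be stripped so that Enrollment and Founded become proper integers. A sketch on made-up values:

```python
import pandas as pd

# Hypothetical sample of scraped values with Wikipedia footnote markers.
df = pd.DataFrame({'Founded': ['1891[5]', '1963[6]', '1794'],
                   'Enrollment': ['229', '3746', '1969']})

# Remove any "[...]" footnote marker, then convert to integers.
for col in ['Founded', 'Enrollment']:
    df[col] = (df[col].astype(str)
                      .str.replace(r'\[\w+\]', '', regex=True)
                      .astype(int))

print(df.dtypes)  # both columns are now int64
```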
