如何在Pandas中将年份转换为间隔

snvhrwxg  于 2022-12-02  发布在  其他
关注(0)|答案(2)|浏览(107)

我有以下格式的数据:enter image description here
我想把年份列转换成时间间隔(十年),这样我就有了1950-1959,1960-1969等格式的十年列(不删除公司名称)。这样我就可以找到十年内收入最高的公司,然后用seborn绘制前5名公司的收入(所有时间间隔)。
我尝试了以下脚本。

df_Fortune.groupby(['Year', 'Company']).sum().sort_values(['Year', 'Revenue (in millions)'], ascending=[1, 0])

结果是一个多索引(我猜),我不知道如何将Year转换为decident。x1c 0d1x

2vuwiymt

2vuwiymt1#

创建演示示例。

import pandas as pd
import numpy as np
# create a dataframe with 100 rows random with column year random between 1950-2019
df = pd.DataFrame({'year': np.random.randint(1950, 2020, 100)})
df['revenue'] = np.random.randint(1000, 10000, 100)
df.sort_values(by='year', inplace=True)
df.reset_index(drop=True, inplace=True)
df['year_interval'] = pd.cut(df['year'], bins=range(1950, 2025, 5), labels=range(1950, 2020, 5), include_lowest=True)
df['year_interval'] = df['year_interval'].astype(str) + '-' + (df['year_interval'].astype(int) + 4).astype(str)
df['company'] =['Walmart', 'Amazon', 'Apple', 'CVS Health', 'UnitedHealth Group', 'Exxon Mobil', 'Berkshire Hathaway', 'Alphabet', 'McKesson', 'AmerisourceBergen', 'Costco Wholesale', 'Cigna', 'AT&T', 'Microsoft', 'Cardinal Health', 'Chevron', 'Home Depot', 'Walgreens Boots Alliance', 'Marathon Petroleum', 'Elevance Health', 'Kroger', 'Ford Motor', 'Verizon Communications', 'JPMorgan Chase', 'General Motors', 'Centene', 'Meta Platforms', 'Comcast', 'Phillips 66', 'Valero Energy', 'Dell Technologies', 'Target', 'Fannie Mae', 'United Parcel Service', 'Lowe\'s', 'Bank of America', 'Johnson & Johnson', 'Archer Daniels Midland', 'FedEx', 'Humana', 'Wells Fargo', 'State Farm Insurance', 'Pfizer', 'Citigroup', 'PepsiCo', 'Intel', 'Procter & Gamble', 'General Electric', 'IBM', 'MetLife', 'Prudential Financial', 'Albertsons', 'Walt Disney', 'Energy Transfer', 'Lockheed Martin', 'Freddie Mac', 'Goldman Sachs Group', 'Raytheon Technologies', 'HP', 'Boeing', 'Morgan Stanley', 'HCA Healthcare', 'AbbVie', 'Dow', 'Tesla', 'Allstate', 'American International Group', 'Best Buy', 'Charter Communications', 'Sysco', 'Merck', 'New York Life Insurance', 'Caterpillar', 'Cisco Systems', 'TJX', 'Publix Super Markets', 'ConocoPhillips', 'Liberty Mutual Insurance Group', 'Progressive', 'Nationwide', 'Tyson Foods', 'Bristol-Myers Squibb', 'Nike', 'Deere', 'American Express', 'Abbott Laboratories', 'StoneX Group', 'Plains GP Holdings', 'Enterprise Products Partners', 'TIAA', 'Oracle', 'Thermo Fisher Scientific', 'Coca-Cola', 'General Dynamics', 'CHS', 'USAA', 'Northwestern Mutual', 'Nucor', 'Exelon', 'Massachusetts Mutual Life Insurance']
df
###
    year  revenue year_interval                              company
0   1951     8951     1950-1954                              Walmart
1   1954     7270     1950-1954                               Amazon
2   1955     7148     1950-1954                                Apple
3   1955     5661     1950-1954                           CVS Health
4   1955     5179     1950-1954                   UnitedHealth Group
..   ...      ...           ...                                  ...
95  2016     4945     2015-2019                                 USAA
96  2016     6860     2015-2019                  Northwestern Mutual
97  2017     6535     2015-2019                                Nucor
98  2018     6235     2015-2019                               Exelon
99  2019     8624     2015-2019  Massachusetts Mutual Life Insurance

[100 rows x 4 columns]

查找具有每个year_interval的最大收入的公司

df_max = df.groupby('year_interval')['revenue'].max().reset_index()
df_result = df_max.merge(df, on=['year_interval', 'revenue'], how='left')
df_result
###
   year_interval  revenue  year                              company
0      1950-1954     8951  1951                              Walmart
1      1955-1959     8891  1959                             McKesson
2      1960-1964     9643  1962                                Cigna
3      1965-1969     9723  1970                      Elevance Health
4      1970-1974     9396  1973                       General Motors
5      1975-1979     7048  1978                              Comcast
6      1980-1984     9776  1982                United Parcel Service
7      1985-1989     9216  1986                 State Farm Insurance
8      1990-1994     8788  1994                       Morgan Stanley
9      1995-1999     7339  1997                             Best Buy
10     2000-2004     9750  2003       Liberty Mutual Insurance Group
11     2005-2009     9986  2008                                Deere
12     2010-2014     9438  2014                            Coca-Cola
13     2015-2019     8624  2019  Massachusetts Mutual Life Insurance

情节、

import matplotlib.pyplot as plt
import seaborn as sns

sns.set_theme(style="whitegrid")
plt.gcf().set_size_inches(15, 6)
ax = sns.barplot(x="year_interval", y="revenue", hue="company", data=df_result, dodge=False)
ax.set_xticklabels(ax.get_xticklabels(), rotation=40, ha="right")
plt.legend(bbox_to_anchor=(1.15, 1), loc=2, borderaxespad=0.)

for p in ax.patches:
    ax.annotate(format(p.get_height(), '.0f'), (p.get_x() + p.get_width() / 2., p.get_height()), ha = 'center', va = 'center', xytext = (0, 10), textcoords = 'offset points')
plt.tight_layout()
plt.show()

nue99wik

nue99wik2#

分别保存区间极值

df['decade1'] = df['year'] - df['year'] % 10
df['decade2'] = df['year'] + (10 - df['year'] % 10)

此外,还可以将其保存为所需格式的字符串

df['decade'] = df['decade1'].astype(str).str.cat(df['decade1'].values.astype(str), sep='-')

除此之外,你可能会找到处理Pandas时间序列数据的方法,而我对此并不知情。

相关问题