Consider the following real dataset:
import requests
import zipfile
import io
import pandas as pd
import numpy as np
from datetime import datetime
from dateutil import relativedelta
url = 'http://qed.econ.queensu.ca/jae/datasets/hsiao003/hcw-data.zip'
filename = 'hcw-data.txt'
r = requests.get(url)
z = zipfile.ZipFile(io.BytesIO(r.content))
z.extractall()
df = (pd.read_csv(url, sep='\t+', header=None, engine='python')
.stack().rename_axis(['time', 'id']).rename('gdp').reset_index()
.assign(time=lambda x: x['time'] + 1))
df['id'] = df['id']+1
df = df[['id', 'time', 'gdp']]
df = df.sort_values(by=['id', 'time'])
df = df.reset_index()
start_date = datetime.strptime("1993-01-01", "%Y-%m-%d")
# periods means how many dates you want
date_list = pd.date_range(start_date, periods=61, freq='Q')
df['dates'] = pd.DataFrame({'dates': date_list})
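Here we have a dataset of per-capita GDP for 25 countries. For each unit (id), I want to generate a dataframe column that reflects the quarter and year of the time series. So for every id from 1 to 25, time 1 to 61 should correspond to 1993 Q1 through 2008 Q1. My current code returns this dataframe: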
      index  id  time     gdp      dates
0         0   1     1  0.0620 1993-03-31
1        25   1     2  0.0590 1993-06-30
2        50   1     3  0.0580 1993-09-30
3        75   1     4  0.0620 1993-12-31
4       100   1     5  0.0790 1994-03-31
...     ...  ..   ...     ...        ...
1520   1424  25    57  0.1110        NaT
1521   1449  25    58  0.1167        NaT
1522   1474  25    59  0.1002        NaT
1523   1499  25    60  0.1017        NaT
1524   1524  25    61  0.1238        NaT
[1525 rows x 5 columns]
I think this gets me part of the way there, but how do I do this for each unit so that the dates column has no missing values?
1 Answer
Try:
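A minimal sketch of one way to do this, not necessarily the only one. The assignment in the question leaves NaT because pandas aligns the 61-row frame on the index, so only rows 0 to 60 receive a date. Mapping each row's time value to a date avoids that. The synthetic `df` below is a stand-in for the real data (an assumption, so the example runs without a download), and `time_to_date` is a name introduced here for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the panel in the question: 25 ids x 61 periods.
# (Assumption: the real df has the same id/time layout.)
n_ids, n_periods = 25, 61
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "id": np.repeat(np.arange(1, n_ids + 1), n_periods),
    "time": np.tile(np.arange(1, n_periods + 1), n_ids),
    "gdp": rng.normal(size=n_ids * n_periods),
})

# 61 quarter-end dates, 1993-03-31 through 2008-03-31.
date_list = pd.date_range("1993-01-01", periods=n_periods,
                          freq=pd.offsets.QuarterEnd())

# Lookup table from the time index (1..61) to its quarter-end date.
time_to_date = pd.Series(date_list, index=np.arange(1, n_periods + 1))

# map() looks up each row's time value, so every id gets the full date range
# regardless of how the rows are ordered or indexed.
df["dates"] = df["time"].map(time_to_date)
```

Because the lookup is keyed on `time` rather than on row position, it should not depend on the sort order of `df`. If the frame is known to be sorted by id and then time, `np.tile(date_list, n_ids)` is a shorter alternative.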
Prints: