pandas 将数据重新索引为DateTimeIndex会导致所有价格数据变为NaN

mnowg1ta  于 2023-04-28  发布在  其他
关注(0)|答案(2)|浏览(90)
gold = pd.read_csv('gold.csv')

gold['Date'] = gold['Date'].apply(pd.to_datetime, utc=True)

gold
                         Date   Price
0   1950-01-01 00:00:00+00:00   34.730
1   1950-02-01 00:00:00+00:00   34.730
2   1950-03-01 00:00:00+00:00   34.730
3   1950-04-01 00:00:00+00:00   34.730
4   1950-05-01 00:00:00+00:00   34.730
... ...                         ...
842 2020-03-01 00:00:00+00:00   1593.764
843 2020-04-01 00:00:00+00:00   1680.030
844 2020-05-01 00:00:00+00:00   1715.697
845 2020-06-01 00:00:00+00:00   1734.032
846 2020-07-01 00:00:00+00:00   1840.807

gold.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 847 entries, 0 to 846
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype              
---  ------  --------------  -----              
 0   Date    847 non-null    datetime64[ns, UTC]
 1   Price   847 non-null    float64            
dtypes: datetime64[ns, UTC](1), float64(1)
memory usage: 13.4 KB

gold = gold.set_index('Date')

gold.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 847 entries, 1950-01-01 00:00:00+00:00 to 2020-07-01 00:00:00+00:00
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Price   847 non-null    float64
dtypes: float64(1)
memory usage: 13.2 KB

gold        
Date                        Price
1950-01-01 00:00:00+00:00   34.730
1950-02-01 00:00:00+00:00   34.730
1950-03-01 00:00:00+00:00   34.730
1950-04-01 00:00:00+00:00   34.730
1950-05-01 00:00:00+00:00   34.730
...                         ...
2020-03-01 00:00:00+00:00   1593.764
2020-04-01 00:00:00+00:00   1680.030
2020-05-01 00:00:00+00:00   1715.697
2020-06-01 00:00:00+00:00   1734.032
2020-07-01 00:00:00+00:00   1840.807

years = pd.date_range('01/01/1950', '01/01/2020', freq='YS')

years
DatetimeIndex(['1950-01-01', '1951-01-01', '1952-01-01', '1953-01-01',
               '1954-01-01', '1955-01-01', '1956-01-01', '1957-01-01',
               '1958-01-01', '1959-01-01', '1960-01-01', '1961-01-01',
               '1962-01-01', '1963-01-01', '1964-01-01', '1965-01-01',
               '1966-01-01', '1967-01-01', '1968-01-01', '1969-01-01',
               '1970-01-01', '1971-01-01', '1972-01-01', '1973-01-01',
               '1974-01-01', '1975-01-01', '1976-01-01', '1977-01-01',
               '1978-01-01', '1979-01-01', '1980-01-01', '1981-01-01',
               '1982-01-01', '1983-01-01', '1984-01-01', '1985-01-01',
               '1986-01-01', '1987-01-01', '1988-01-01', '1989-01-01',
               '1990-01-01', '1991-01-01', '1992-01-01', '1993-01-01',
               '1994-01-01', '1995-01-01', '1996-01-01', '1997-01-01',
               '1998-01-01', '1999-01-01', '2000-01-01', '2001-01-01',
               '2002-01-01', '2003-01-01', '2004-01-01', '2005-01-01',
               '2006-01-01', '2007-01-01', '2008-01-01', '2009-01-01',
               '2010-01-01', '2011-01-01', '2012-01-01', '2013-01-01',
               '2014-01-01', '2015-01-01', '2016-01-01', '2017-01-01',
               '2018-01-01', '2019-01-01', '2020-01-01'],
              dtype='datetime64[ns]', freq='AS-JAN')

reindexed = gold.reindex(years)

reindexed
            Price
1950-01-01  NaN
1951-01-01  NaN
1952-01-01  NaN
1953-01-01  NaN
1954-01-01  NaN
...         ...
2016-01-01  NaN
2017-01-01  NaN
2018-01-01  NaN
2019-01-01  NaN
2020-01-01  NaN

当我绘制重新索引后,价格没有重新索引后的值。我试图在1950年至2020年的每一年的第一天重新索引黄金,然后将它们绘制在pandas生成的线图上。我已经看过Stack Overflow的各种帖子,但我仍然不确定我做错了什么。

j0pj023g

j0pj023g1#

您需要使用UTC本地化索引,使用tz_localize进行转换:

reindexed = gold.reindex(years.tz_localize('utc'))

输出:

Price
1950-01-01 00:00:00+00:00  34.73
1951-01-01 00:00:00+00:00    NaN  # NaN as the data is
1952-01-01 00:00:00+00:00    NaN  # not in your example
1953-01-01 00:00:00+00:00    NaN
1954-01-01 00:00:00+00:00    NaN
...                          ...
2016-01-01 00:00:00+00:00    NaN
2017-01-01 00:00:00+00:00    NaN
2018-01-01 00:00:00+00:00    NaN
2019-01-01 00:00:00+00:00    NaN
2020-01-01 00:00:00+00:00    NaN

[71 rows x 1 columns]

您也可以尝试resample获取每年的first值:

resampled = gold.resample('YS').first()

输出:

Price
Date                               
1950-01-01 00:00:00+00:00    34.730
1951-01-01 00:00:00+00:00       NaN
1952-01-01 00:00:00+00:00       NaN
1953-01-01 00:00:00+00:00       NaN
1954-01-01 00:00:00+00:00       NaN
...                             ...
2016-01-01 00:00:00+00:00       NaN
2017-01-01 00:00:00+00:00       NaN
2018-01-01 00:00:00+00:00       NaN
2019-01-01 00:00:00+00:00       NaN
2020-01-01 00:00:00+00:00  1593.764

[71 rows x 1 columns]
rqdpfwrv

rqdpfwrv2#

mozway的回答是正确的,问题出在UTC参数上。我最初设置它的方式需要使用UTC本地化索引。
我想补充的是,mozway的回答帮助我弄清楚了,我也可以删除UTC参数,因为我不需要这个数据的时区感知的UTC本地化时间戳。
第二行代码可能是:

gold['Date'] = gold['Date'].apply(pd.to_datetime)

而不是:

gold['Date'] = gold['Date'].apply(pd.to_datetime, utc=True)

两种解决方案都在发挥应有的作用。

相关问题