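I'm trying to predict IBM's stock price, but I've run into some gotchas handling the Date column when training a linear regression model. My dataset looks like this: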
         Date      Open      High       Low     Close  Adj Close  Volume
0  1962-01-02  7.713333  7.713333  7.626667  7.626667   0.618153  387200
1  1962-01-03  7.626667  7.693333  7.626667  7.693333   0.623556  288000
2  1962-01-04  7.693333  7.693333  7.613333  7.616667   0.617343  256000
3  1962-01-05  7.606667  7.606667  7.453333  7.466667   0.605185  363200
4  1962-01-08  7.460000  7.460000  7.266667  7.326667   0.593837  544000
My code is:
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import TimeSeriesSplit
from sklearn.linear_model import LogisticRegression
import pandas as pd
import numpy as np
df = pd.read_csv('IBM.csv')
df['Date'] = pd.to_datetime(df.Date)
df.set_index('Date', inplace=True)
X = df.drop('Adj Close', axis='columns')
Y = df['Adj Close']
scaler = MinMaxScaler()
X = pd.DataFrame(scaler.fit_transform(X), columns=X.columns)
timesplit= TimeSeriesSplit(n_splits=10)
for train_index, test_index in timesplit.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = Y[train_index], Y[test_index]
I get an error:
KeyError: "None of [Int64Index([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9,\n ...\n 1323, 1324, 1325, 1326, 1327, 1328, 1329, 1330, 1331, 1332],\n dtype='int64', length=1333)]
are in the [columns]"
Even if I manage to get this working, I still can't train my model.
1 Answer
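The syntax you are using to slice the X and Y DataFrames row-wise is actually trying to slice them column-wise: X[train_index] looks for columns labelled 0, 1, 2, ... rather than rows at those positions, which is why the KeyError says none of the values are in the [columns]. See the pandas documentation on indexing and selecting data.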
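To make the row-versus-column distinction concrete, here is a minimal sketch on a tiny stand-in frame. The values and the `column_lookup_failed` flag are made up for illustration and are not from the question's dataset.

```python
import numpy as np
import pandas as pd

# Small stand-in frame; the numbers are illustrative only.
X = pd.DataFrame(
    {"Open": [7.71, 7.63, 7.69, 7.61], "Close": [7.63, 7.69, 7.62, 7.47]}
)
train_index = np.array([0, 1, 2])

# Plain brackets with an array of integers are interpreted as COLUMN labels.
# The frame has no columns named 0, 1, 2, so pandas raises KeyError.
try:
    X[train_index]
    column_lookup_failed = False
except KeyError:
    column_lookup_failed = True

# .iloc selects ROWS by integer position, which is what the split indices mean.
rows = X.iloc[train_index]
print(column_lookup_failed, rows.shape)
```

The arrays returned by TimeSeriesSplit are integer positions, so a position-based indexer such as `.iloc` matches their meaning regardless of what the index labels are.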
Try replacing:
X_train, X_test = X[train_index], X[test_index]
with:
X_train, X_test = X.loc[train_index, :], X.loc[test_index, :]
With that change, your code runs fine.
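As for the follow-up about not being able to train the model, the following is a hedged sketch of one way the whole loop could look. It rests on several assumptions. The data is a synthetic stand-in for IBM.csv. `LinearRegression` is swapped in for the imported `LogisticRegression`, since the latter is a classifier and the target here is a continuous price. `.iloc` is used for both X and Y, because `Y[train_index]` on a date-indexed Series may warn or fail depending on the pandas version. The scaler is fitted on each training fold only, which is a design choice, not something the question's code does.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import TimeSeriesSplit
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for IBM.csv (assumption: the real file is not available here).
rng = np.random.default_rng(0)
dates = pd.bdate_range("1962-01-02", periods=200)
close = 7.5 + rng.normal(0, 0.05, size=200).cumsum()
df = pd.DataFrame(
    {"Open": close + 0.01, "Close": close, "Adj Close": close * 0.081},
    index=dates,
)

X = df.drop("Adj Close", axis="columns")
y = df["Adj Close"]

scores = []
for train_index, test_index in TimeSeriesSplit(n_splits=5).split(X):
    # Position-based row selection works for any index type, including dates.
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]

    # Fit the scaler on the training fold only, then reuse it on the test fold.
    scaler = MinMaxScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)

    # A regressor, since the target is a continuous price.
    model = LinearRegression().fit(X_train_scaled, y_train)
    scores.append(model.score(X_test_scaled, y_test))  # R^2 on the held-out fold

print(len(scores))
```

Fitting the scaler inside the loop keeps information from later dates out of the training fold, which matters more for time series than for shuffled data.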