pandas: How can I optimize my code so that my Google Colab session doesn't crash?

c2e8gylq  posted on 2023-02-20  in Go

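I've run into a problem: Google Colab is running out of memory. I'm using the free version, and I'm not sure whether that's because it can't handle the workload or because my code is poorly optimized. Since I'm new to the field, I suspect my code is very slow and poorly optimized. I'd appreciate a bit of help, as I'm still learning.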

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from xgboost import XGBRegressor
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Lasso
from sklearn.linear_model import Ridge
from sklearn.ensemble import RandomForestRegressor 
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.preprocessing import LabelEncoder
from google.colab import drive
drive.mount('/content/drive')

df = pd.read_csv('path/beforeNeural.csv')
df.shape
df.head()

df.isnull().sum()

encoder = LabelEncoder()

df['Property Type'] = encoder.fit_transform(df['Property Type'])
df['Old/New'] = encoder.fit_transform(df['Old/New'])
df['Record Status - monthly file only'] = encoder.fit_transform(df['Record Status - monthly file only'])
df['PPDCategory Type'] = encoder.fit_transform(df['PPDCategory Type'])
df['County'] = encoder.fit_transform(df['County'])
df['District'] = encoder.fit_transform(df['District'])
df['Town/City'] = encoder.fit_transform(df['Town/City'])
df['Duration'] = encoder.fit_transform(df['Duration'])
df['Transaction unique identifier'] = encoder.fit_transform(df['Transaction unique identifier'])
df['Date of Transfer'] = encoder.fit_transform(df['Date of Transfer'])

X = df.drop(columns='Price', axis=1)
Y = df['Price']

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=2)

df.shape

boostenc = XGBRegressor()

boostenc.fit(X_train, Y_train)

Answer #1 (lxkprmvk)

I'll give it a shot. Here is one possible way to optimize your code:

Code:

import pandas as pd
from xgboost import XGBRegressor
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from google.colab import drive
drive.mount('/content/drive')

df = pd.read_csv('path/beforeNeural.csv')

categorical_columns = ['Property Type', 'Old/New', 'Record Status - monthly file only', 'PPDCategory Type', 'County', 'District', 'Town/City', 'Duration', 'Transaction unique identifier', 'Date of Transfer']
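encoder = OneHotEncoder()  # returns a SciPy sparse matrix by default
X_encoded = encoder.fit_transform(df[categorical_columns])
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
feature_names = encoder.get_feature_names_out(categorical_columns)
# Approach 1 (dense; only viable if the full encoded matrix fits in RAM):
# X_concat = pd.DataFrame(X_encoded.toarray(), index=df.index, columns=feature_names)
# Approach 2 (sparse; pd.SparseDataFrame was removed in pandas 1.0, so use the .sparse accessor):
X_concat = pd.DataFrame.sparse.from_spmatrix(X_encoded, index=df.index, columns=feature_names)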

X_numerical = df.drop(columns = categorical_columns + ['Price'])
X = pd.concat([X_numerical, X_concat], axis = 1)
Y = df['Price']

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.2, random_state = 2)

boostenc = XGBRegressor()
boostenc.fit(X_train, Y_train)

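Note that I removed the unused imports and the calls such as df.head() from the middle of the code. When an expression like that is not the last line of a notebook cell, its result is discarded: it does nothing and displays nothing.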
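If you still want those intermediate checks, one option is to print them explicitly. The following is a minimal sketch on a made-up toy DataFrame (the values are invented for illustration; only the column names come from the question):

```python
import pandas as pd

# Toy stand-in for the real dataset; values are invented for illustration.
df = pd.DataFrame({
    "Property Type": ["D", "S", None, "T"],
    "Price": [250000, 180000, 320000, 150000],
})

# Only the last expression of a notebook cell is displayed automatically,
# so intermediate checks need an explicit print() (or IPython's display()).
print(df.shape)
print(df.head())

null_counts = df.isnull().sum()
print(null_counts)
```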

Explanation of the code:

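1. Instead of LabelEncoder, I used OneHotEncoder to one-hot encode all the categorical features. This creates a new binary column for each unique value of a categorical feature. In machine learning, one-hot encoding is often a better way to handle categorical features than assigning integer values with LabelEncoder, because the integers imply an ordering that the categories do not actually have.
2. I collected the names of all the categorical columns in a list, which makes them easier to change when needed.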
