The Databricks blog has a pandas UDF example that uses statsmodels:
import pandas as pd
import statsmodels.api as sm
from pyspark.sql.functions import pandas_udf, PandasUDFType

# df has four columns: id, y, x1, x2
group_column = 'id'
y_column = 'y'
x_columns = ['x1', 'x2']
schema = df.select(group_column, *x_columns).schema

@pandas_udf(schema, PandasUDFType.GROUPED_MAP)
# Input/output are both a pandas.DataFrame
def ols(pdf):
    group_key = pdf[group_column].iloc[0]
    y = pdf[y_column]
    X = pdf[x_columns]
    X = sm.add_constant(X)
    model = sm.OLS(y, X).fit()
    # One output row per group: the group key followed by the fitted coefficients
    return pd.DataFrame([[group_key] + [model.params[i] for i in x_columns]], columns=[group_column] + x_columns)

beta = df.groupby(group_column).apply(ols)
How can I recreate the same code using the statsmodels formula API? More specifically, I want to define a pandas UDF whose inputs are:
formula: a string that specifies the R-style regression formula
df: a Pandas DataFrame
For example:
formula = 'y ~ x1 + x2 + x1:x2 - 1 '
df = pdf

def ols_formula(formula, df):
    import statsmodels.formula.api as smf
    model = smf.ols(formula, df).fit()
    return model
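One possible direction, offered as a minimal sketch rather than a confirmed answer: a grouped-map function receives only the group's pandas.DataFrame, so the formula can be bound through a closure. The names `make_ols_formula`, `param_names`, and the renaming of `:` to `_` in output column names are assumptions made for illustration. The Spark call in the trailing comment assumes Spark 3.0 or later, where `applyInPandas` is available.

```python
import pandas as pd
import statsmodels.formula.api as smf


def make_ols_formula(formula, group_column, param_names):
    """Build a per-group function that fits `formula` and returns one row of coefficients.

    param_names are the coefficient labels produced by the formula,
    e.g. ['x1', 'x2', 'x1:x2'] for 'y ~ x1 + x2 + x1:x2 - 1'.
    """
    # Assumption: replace ':' so the output columns are easy to reference in Spark.
    output_names = [name.replace(":", "_") for name in param_names]

    def ols_formula(pdf):
        # All rows in pdf belong to one group, so the first key identifies it.
        group_key = pdf[group_column].iloc[0]
        model = smf.ols(formula, data=pdf).fit()
        row = [group_key] + [float(model.params[name]) for name in param_names]
        return pd.DataFrame([row], columns=[group_column] + output_names)

    return ols_formula


# Hypothetical Spark usage (not executed here), with df holding id, y, x1, x2:
#   fit = make_ols_formula('y ~ x1 + x2 + x1:x2 - 1', 'id', ['x1', 'x2', 'x1:x2'])
#   beta = df.groupby('id').applyInPandas(
#       fit, schema='id long, x1 double, x2 double, x1_x2 double')
```

The output schema has to list one column per coefficient, so it changes whenever the formula changes; the closure keeps the formula and the expected coefficient labels together for that reason.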