H2O Python: relevel vs. relevel_by_frequency for factor columns

Posted by mkh04yzy on 2022-12-10 in Python

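According to the H2O documentation, relevel('most_frequency_category') and relevel_by_frequency() should accomplish the same thing. However, the coefficient estimates differ depending on which method is used to set the reference level of the factor column.
The example below uses an open-source dataset available through sklearn to show that the GLM coefficients do not line up when the base level is set with each of the two releveling methods. Why do the coefficient estimates differ when both models have the same base level?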

import pandas as pd
from sklearn.datasets import fetch_openml

import h2o
from h2o.estimators.glm import H2OGeneralizedLinearEstimator

h2o.init(max_mem_size=8)

def load_mtpl2(n_samples=100000):
    """
    Fetch the French Motor Third-Party Liability Claims dataset.
    https://scikit-learn.org/stable/auto_examples/linear_model/plot_tweedie_regression_insurance_claims.html
    
    Parameters
    ----------
    n_samples: int, default=100000
      number of samples to select (for faster run time). Full dataset has
      678013 samples.
    """
    # freMTPL2freq dataset from https://www.openml.org/d/41214
    df_freq = fetch_openml(data_id=41214, as_frame=True)["data"]
    df_freq["IDpol"] = df_freq["IDpol"].astype(int)
    df_freq.set_index("IDpol", inplace=True)

    # freMTPL2sev dataset from https://www.openml.org/d/41215
    df_sev = fetch_openml(data_id=41215, as_frame=True)["data"]

    # sum ClaimAmount over identical IDs
    df_sev = df_sev.groupby("IDpol").sum()

    df = df_freq.join(df_sev, how="left")
    # assign back instead of inplace=True, which may not update df in newer pandas
    df["ClaimAmount"] = df["ClaimAmount"].fillna(0)

    # unquote string fields
    for column_name in df.columns[df.dtypes.values == object]:
        df[column_name] = df[column_name].str.strip("'")
    return df.iloc[:n_samples]

df = load_mtpl2()
df.loc[(df["ClaimAmount"] == 0) & (df["ClaimNb"] >= 1), "ClaimNb"] = 0
df["Exposure"] = df["Exposure"].clip(upper=1)
df["ClaimAmount"] = df["ClaimAmount"].clip(upper=100000)
df["PurePremium"] = df["ClaimAmount"] / df["Exposure"]

X_freq = h2o.H2OFrame(df)
X_freq["VehBrand"] = X_freq["VehBrand"].asfactor()
X_freq["VehBrand"] = X_freq["VehBrand"].relevel_by_frequency()

X_relevel = h2o.H2OFrame(df)
X_relevel["VehBrand"] = X_relevel["VehBrand"].asfactor()
X_relevel["VehBrand"] = X_relevel["VehBrand"].relevel("B1") # most frequent category

response_col = "PurePremium"
weight_col = "Exposure"
predictors = "VehBrand"

glm_freq = H2OGeneralizedLinearEstimator(family="tweedie",
                                      solver='IRLSM',
                                      tweedie_variance_power=1.5,
                                      tweedie_link_power=0,
                                      lambda_=0,
                                      compute_p_values=True,
                                      remove_collinear_columns=True,
                                      seed=1)

glm_relevel = H2OGeneralizedLinearEstimator(family="tweedie",
                                      solver='IRLSM',
                                      tweedie_variance_power=1.5,
                                      tweedie_link_power=0,
                                      lambda_=0,
                                      compute_p_values=True,
                                      remove_collinear_columns=True,
                                      seed=1)

glm_freq.train(x=predictors, y=response_col, training_frame=X_freq, weights_column=weight_col)
glm_relevel.train(x=predictors, y=response_col, training_frame=X_relevel, weights_column=weight_col)

print('GLM with the reference level set using relevel_by_frequency()')
print(glm_freq._model_json['output']['coefficients_table'])
print('\n')
print('GLM with the reference level manually set using relevel()')
print(glm_relevel._model_json['output']['coefficients_table'])

Output

GLM with the reference level set using relevel_by_frequency()
Coefficients: glm coefficients
names         coefficients    std_error    z_value     p_value      standardized_coefficients
------------  --------------  -----------  ----------  -----------  ---------------------------
Intercept     5.40413         1.24082      4.35531     1.33012e-05  5.40413
VehBrand.B2   -0.398721       1.2599       -0.316472   0.751645     -0.398721
VehBrand.B12  -0.061573       1.46541      -0.0420176  0.966485     -0.061573
VehBrand.B3   -0.393908       1.30712      -0.301356   0.763144     -0.393908
VehBrand.B5   -0.282484       1.31929      -0.214118   0.830455     -0.282484
VehBrand.B6   -0.387747       1.25943      -0.307876   0.758177     -0.387747
VehBrand.B4   0.391771        1.45615      0.269047    0.787894     0.391771
VehBrand.B10  -0.0542706      1.35049      -0.040186   0.967945     -0.0542706
VehBrand.B13  -0.306381       1.4628       -0.209449   0.834098     -0.306381
VehBrand.B11  -0.435297       1.29155      -0.337035   0.736091     -0.435297
VehBrand.B14  -0.304243       1.34781      -0.225732   0.821411     -0.304243

GLM with the reference level manually set using relevel()
Coefficients: glm coefficients
names         coefficients    std_error    z_value     p_value     standardized_coefficients
------------  --------------  -----------  ----------  ----------  ---------------------------
Intercept     5.01639         0.215713     23.2549     2.635e-119  5.01639
VehBrand.B10  0.081366        0.804165     0.101181    0.919407    0.081366
VehBrand.B11  0.779518        0.792003     0.984237    0.325001    0.779518
VehBrand.B12  -0.0475497      0.41834      -0.113663   0.909505    -0.0475497
VehBrand.B13  0.326174        0.80891      0.403227    0.686782    0.326174
VehBrand.B14  0.387747        1.25943      0.307876    0.758177    0.387747
VehBrand.B2   -0.010974       0.306996     -0.0357465  0.971485    -0.010974
VehBrand.B3   -0.00616108     0.464188     -0.0132728  0.98941     -0.00616108
VehBrand.B4   0.333477        0.575082     0.579877    0.561999    0.333477
VehBrand.B5   0.105263        0.497431     0.211613    0.832409    0.105263
VehBrand.B6   0.0835042       0.568769     0.146816    0.883278    0.0835042

Answer #1 (z9ju0rcb)

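The two datasets are almost identical, with one difference:
In the first dataset, the number of rows with VehBrand B1 is 721. In the second dataset, the number of rows with VehBrand B14 is 721.
If you compare the two datasets, you can map the equivalent level names between them by their row counts, as follows (freq is the relevel_by_frequency() frame, relevel is the relevel("B1") frame):
freq B2 == relevel B2, with 26500 rows
freq B12 == relevel B13, with 1883 rows
freq B3 == relevel B3, with 8260 rows
freq B5 == relevel B5, with 6053 rows
freq B6 == relevel B1, with 27240 rows
freq B4 == relevel B11, with 1774 rows
freq B10 == relevel B4, with 3968 rows
freq B13 == relevel B10, with 2268 rows
freq B11 == relevel B12, with 16619 rows
freq B14 == relevel B6, with 4714 rows.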
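As a quick sanity check on the mapping listed above, the following sketch uses only the numbers quoted in this answer, plus the assumption that each frame holds the 100000 rows returned by load_mtpl2() with its default n_samples. The dictionary name freq_to_relevel is made up for illustration. It adds up the listed row counts and reports which labels stay the same and which are attached to a different group of rows.

```python
# Mapping quoted in the answer: label in the relevel_by_frequency() frame
# -> (label in the relevel("B1") frame, row count shared by both).
freq_to_relevel = {
    "B2": ("B2", 26500),
    "B12": ("B13", 1883),
    "B3": ("B3", 8260),
    "B5": ("B5", 6053),
    "B6": ("B1", 27240),
    "B4": ("B11", 1774),
    "B10": ("B4", 3968),
    "B13": ("B10", 2268),
    "B11": ("B12", 16619),
    "B14": ("B6", 4714),
}

# Assumption: both frames contain the default 100000 sampled rows.
N_SAMPLES = 100000

listed_rows = sum(count for _, count in freq_to_relevel.values())
remaining_rows = N_SAMPLES - listed_rows  # rows left for the level not listed

# Labels that point at the same rows in both frames, and labels that do not.
unchanged = sorted(a for a, (b, _) in freq_to_relevel.items() if a == b)
relabeled = sorted(a for a, (b, _) in freq_to_relevel.items() if a != b)

print("rows covered by the listed levels:", listed_rows)
print("rows left for the unlisted level:", remaining_rows)
print("labels unchanged:", unchanged)
print("labels attached to different rows:", relabeled)
```

Under that assumption, the ten listed levels cover 99279 rows and leave 721 for the one level missing from each list (B1 in the freq frame, B14 in the relevel frame). Only B2, B3 and B5 carry the same label in both frames.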
Because the two GLM models are trained on different datasets, you will get different coefficients and different predictions.
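One way to see how the two coefficient tables relate is to compare, level by level, the linear predictor (intercept plus coefficient) of each model after applying the label mapping from this answer. The sketch below copies the coefficients from the output printed in the question. The pairing of the freq reference level B1 with relevel B14 is inferred from the 721-row remark rather than stated as a list item, and the variable names are illustrative. Since VehBrand is the only predictor, intercept plus coefficient is the whole linear predictor for a level, and with tweedie_link_power=0 that should be the log of the fitted mean.

```python
# Coefficients copied from the question's output (relevel_by_frequency() model).
intercept_freq = 5.40413
coef_freq = {
    "B2": -0.398721, "B12": -0.061573, "B3": -0.393908, "B5": -0.282484,
    "B6": -0.387747, "B4": 0.391771, "B10": -0.0542706, "B13": -0.306381,
    "B11": -0.435297, "B14": -0.304243,
}

# Coefficients copied from the question's output (relevel("B1") model).
intercept_relevel = 5.01639
coef_relevel = {
    "B10": 0.081366, "B11": 0.779518, "B12": -0.0475497, "B13": 0.326174,
    "B14": 0.387747, "B2": -0.010974, "B3": -0.00616108, "B4": 0.333477,
    "B5": 0.105263, "B6": 0.0835042,
}

# Label mapping from the answer (freq label -> relevel label).
# Assumption: freq "B1" pairs with relevel "B14" (the 721-row level).
freq_to_relevel = {
    "B1": "B14", "B2": "B2", "B12": "B13", "B3": "B3", "B5": "B5",
    "B6": "B1", "B4": "B11", "B10": "B4", "B13": "B10", "B11": "B12",
    "B14": "B6",
}

diffs = {}
for freq_label, relevel_label in freq_to_relevel.items():
    # A reference level has no row in the table, so its coefficient is 0.
    eta_freq = intercept_freq + coef_freq.get(freq_label, 0.0)
    eta_relevel = intercept_relevel + coef_relevel.get(relevel_label, 0.0)
    diffs[freq_label] = abs(eta_freq - eta_relevel)
    print(f"freq {freq_label:>3} vs relevel {relevel_label:>3}: "
          f"{eta_freq:.5f} vs {eta_relevel:.5f}")

max_abs_diff = max(diffs.values())
print("largest absolute difference:", max_abs_diff)
```

With these numbers the matched linear predictors agree to within the rounding of the printed coefficients. That is consistent with the mapping above: the coefficient tables differ because the level labels refer to different rows in the two frames, and because the effective reference level differs.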
