Ludwig categorical and numerical transformers - sklearn format

tf7tbtn2 · posted 2 months ago in Other

@w4nderlust @tgaddair I have noticed that Ludwig focuses mainly on text preprocessing, but is somewhat weak in numerical and categorical encoding.
For example, I don't see a way for the user to declare features such as:

  • one-hot encoding of categorical data for low-cardinality datasets (few distinct categories)
  • a rare-label encoder to reduce the number of categories in high-cardinality datasets
  • quantile encoding to reduce the impact of outliers in numerical datasets
  • ordinal encoding for ordered categorical features, e.g. [very small, small, normal, big, very big] => [0, 1, 2, 3, 4] (see the sketch below)
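
For illustration, that kind of ordinal map can already be expressed with sklearn's OrdinalEncoder by passing the category order explicitly; a minimal sketch (the size labels are just the example above):

from sklearn.preprocessing import OrdinalEncoder

# explicit category order, so 'very small' -> 0, ..., 'very big' -> 4
order = [['very small', 'small', 'normal', 'big', 'very big']]
encoder = OrdinalEncoder(categories=order)
encoder.fit([['very small'], ['small'], ['normal'], ['big'], ['very big']])
print(encoder.transform([['small'], ['very big']]))  # [[1.], [4.]]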

In my work I often use the excellent encoders from the pages below, or encoders I have written myself:
https://feature-engine.readthedocs.io/en/latest/encoding/index.html
https://contrib.scikit-learn.org/category_encoders/
https://scikit-learn.org/stable/modules/classes.html#module-sklearn.preprocessing
Is there a way to easily add transformers like these to the project in the typical sklearn format? It would extend Ludwig's capabilities:

# Inheriting from BaseEstimator and TransformerMixin gives the class
# the scikit-learn API, e.g. get_params and fit_transform
from sklearn.base import BaseEstimator, TransformerMixin
# This function just makes sure that the object is fitted
from sklearn.utils.validation import check_is_fitted

class SubtractMin(BaseEstimator, TransformerMixin):
    def __init__(self, cols_to_operate):
        self.columns = cols_to_operate

    def fit(self, X, y=None):
        self.min_val_ = X[self.columns].min()
        return self

    def transform(self, X):
        # make sure that it was fitted
        check_is_fitted(self, 'min_val_')

        # copy so we do not make changes to the original dataframe
        X = X.copy()
        X[self.columns] = X[self.columns] - self.min_val_
        return X
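
For reference, a minimal usage sketch of the transformer above (the dataframe and column names are made up):

import pandas as pd

df = pd.DataFrame({'a': [3, 5, 9], 'b': [1.0, 2.0, 4.0]})
subtract_min = SubtractMin(cols_to_operate=['a', 'b'])
print(subtract_min.fit_transform(df))  # fit_transform comes from TransformerMixin
#    a    b
# 0  0  0.0
# 1  2  1.0
# 2  6  3.0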

Some further improvement could come from multivariate imputation (predicting missing values in the dataset from all features rather than a single one):

import numpy as np
from sklearn.experimental import enable_iterative_imputer  # enables the experimental IterativeImputer
from sklearn.impute import IterativeImputer

# fit the imputer on training data
imp_mean = IterativeImputer(random_state=0)
imp_mean.fit([[7, 2, 3], [4, np.nan, 6], [10, 5, 9]])

# impute missing entries using all features jointly
X = [[np.nan, 2, 3], [4, np.nan, 6], [10, np.nan, 9]]
imp_mean.transform(X)

Right now I do this preprocessing outside Ludwig to prepare the csv file, but perhaps it could live in one of Ludwig's pipelines.


ykejflvf1#

Hi @PeterPirog, thanks for the detailed issue!

You are right that we could add some additional preprocessing for categorical and numerical features (PRs are welcome!), but at the same time some of the capabilities you want are already there.

  • One-hot encoding is available by setting the category encoder to sparse. Honestly, it is not very useful in deep learning models: as soon as the sparse representation is multiplied by the matrix of a fully connected layer, it amounts to selecting one row of an embedding matrix, which is exactly what the default dense encoder does. So it is there, but it is just a less efficient version of the default option. (See the config sketch below.)
  • Rare labels: there is a most_common parameter. If your category feature contains 10 distinct values and you set it to 5, the 5 least common classes will all be encoded with a special token.
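
To illustrate, both options might be declared roughly like this in a Ludwig config (a hypothetical sketch with made-up feature names; check the Ludwig docs for the exact schema of your version):

config = {
    'input_features': [
        {
            'name': 'city',  # hypothetical categorical column
            'type': 'category',
            'encoder': 'sparse',  # one-hot; 'dense' (embeddings) is the default
            'preprocessing': {
                # keep the 5 most frequent labels, map the rest to a special token
                'most_common': 5,
            },
        },
    ],
    'output_features': [{'name': 'price', 'type': 'numerical'}],
}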

Quantile encoding for numerical features would be a great addition; it has been on our todo list for a while. Ordinal encoding, on the other hand, could be useful but less so, since it still requires the user to specify the map manually (which they can already do with pandas before training Ludwig), but it would still be a nice option.
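
The manual map mentioned above is a one-liner in pandas; a minimal sketch (column name and categories are made up):

import pandas as pd

df = pd.DataFrame({'size': ['small', 'very big', 'normal']})
order = {'very small': 0, 'small': 1, 'normal': 2, 'big': 3, 'very big': 4}
df['size'] = df['size'].map(order)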

Regarding a common format for the transformations, we are already moving in that direction with the numerical transformers, but your suggestion is a good one. The main difference from what you show is the assumption that the dataframe operations should work with both pandas and dask.

Regarding imputation, that is also a good suggestion, although sklearn imputation assumes a flat, homogeneous data matrix, an assumption Ludwig does not make because of its different data types. That said, we could do something similar with pandas/dask.
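
Something per-type could look roughly like this with pandas (dask.dataframe exposes the same fillna API); a minimal sketch with made-up columns:

import pandas as pd

df = pd.DataFrame({'x': [1.0, None, 3.0], 'city': ['a', None, 'b']})
df['x'] = df['x'].fillna(df['x'].mean())  # mean imputation for a numerical column
df['city'] = df['city'].fillna('<UNK>')   # special token for a categorical column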

In general, preprocessing will be the main focus of the v0.6 release, which we will start working on once the v0.5 PyTorch port is done. Really happy to receive such detailed suggestions!


siotufzp2#

Hi @w4nderlust,
I didn't know some of those encoders were already implemented (sparse = one-hot encoding). Of course, for data with high cardinality one-hot encoding is completely useless, so I use quantile encoders for categorical data (quantile encoders for numerical data and for categorical data are completely different things).
In supervised learning with very high cardinality, quantile encoders are very useful for categorical data:
https://contrib.scikit-learn.org/category_encoders/quantile.html
I recommend this article as an explanation: https://maxhalford.github.io/blog/target-encoding/
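
For reference, the library version linked above can be used roughly like this (a sketch; I'm assuming the parameter names from the category_encoders docs):

import pandas as pd
from category_encoders import QuantileEncoder

X = pd.DataFrame({'city': ['a', 'a', 'b', 'b', 'c']})
y = pd.Series([1.0, 2.0, 10.0, 12.0, 5.0])

# encode each category with a smoothed quantile of the target
encoder = QuantileEncoder(cols=['city'], quantile=0.5, m=1.0)
X_encoded = encoder.fit_transform(X, y)
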
For my own purposes I wrote my own transformer for the quantile encoder, but I haven't documented it yet (the code needs cleaning):

import numpy as np
import pandas as pd
from scipy import stats
from sklearn.base import BaseEstimator, TransformerMixin

class PercentileTargetEncoder(BaseEstimator, TransformerMixin):
    def __init__(self, features=None,
                 ignored_features=None,
                 p=0.5,
                 m=1,
                 remove_original=True,
                 return_df=True,
                 use_internal_yeo_johnson=True,
                 verbose=True):
        super().__init__()
        self.features = features  # selected categorical features
        self.ignored_features = ignored_features
        self.columns = None  # all columns in df
        self.column_target = None
        self.p = p
        self.m = m
        self.N = None  # Number of rows in training dataset
        self.remove_original = remove_original
        self.return_df = return_df
        # usage of yeo-johnson transformation inside encoder
        self.use_internal_yeo_johnson = use_internal_yeo_johnson
        self.verbose = verbose
        # dict with unique values lists for specified feature, key form (feature)
        self.features_unique = {}
        # stored quantiles for whole dataset, key form (p)
        self.global_quantiles = {}
        # stored quantiles for all values, key form (feature, value, p)
        self.value_quantiles = {}
        # stored counts of every value in train data key form (feature, value)
        self.value_counts = {}

        # convert p and m to lists so they can be iterated over
        if isinstance(p, (int, float)):
            self.p = [self.p]
        if isinstance(m, (int, float)):
            self.m = [self.m]

        # convert the feature arguments to lists so they can be iterated over
        if not isinstance(self.features, list) and self.features is not None:
            self.features = [self.features]

        if not isinstance(self.ignored_features, list) and self.ignored_features is not None:
            self.ignored_features = [self.ignored_features]

    def fit(self, X, y=None):
        X = X.copy()
        # Convert y to a DataFrame regardless of the input type
        if isinstance(y, pd.Series):
            y = y.to_frame().copy()
        elif isinstance(y, np.ndarray):
            y = pd.DataFrame(y, columns=['target']).copy()
        elif isinstance(y, pd.DataFrame):
            y = y.copy()
        else:
            raise TypeError("Wrong target 'y' data type")

        # use yeo-johnson transformation for target inside encoder
        if self.use_internal_yeo_johnson:
            y = stats.yeojohnson(y)[0]  # second component is lambda
            y = pd.DataFrame(y, columns=['target']).copy()

        # Count number of rows in training dataset
        self.N = len(y)

        self.columns = X.columns
        # Auto-detect categorical (object dtype) columns if features were not defined
        if self.features is None:
            self.features = [col for col in self.columns if X[col].dtypes == 'O']
        else:
            # convert a single feature name to a list so it can be iterated over
            if isinstance(self.features, str):
                self.features = [self.features]

        # Remove ignored features
        if self.ignored_features is not None:
            for ignored_feature in self.ignored_features:
                self.features.remove(ignored_feature)

        if self.verbose and X.isnull().values.any():
            print('There were some nan values in the specified features. Nan values are replaced')

        # Find the unique values for the specified features
        for feature in self.features:
            self.features_unique[feature] = list(X[feature].unique())
            # Replace values outside the known set by 'UNKNOWN'
            X[feature] = X[feature].apply(
                lambda value: value if value in self.features_unique[feature] else 'UNKNOWN')

            # add an 'UNKNOWN' value to handle never-seen values at transform time
            self.features_unique[feature].append('UNKNOWN')

            # add a 'MISSING' value in case the training data were complete
            # and the 'MISSING' key was not created
            if 'MISSING' not in self.features_unique[feature]:
                self.features_unique[feature].append('MISSING')

        # Find quantiles over the whole dataset for each value of p
        for p in self.p:
            self.global_quantiles[p] = np.quantile(y, p)

        # Find quantiles for every feature and every value
        for feature in self.features:
            unique_vals_for_feature = self.features_unique[feature]
            # for every unique value for feature
            for value in unique_vals_for_feature:
                value_counts = X.loc[X[feature] == value, feature].count()

                # value does not exist in the training data
                if value_counts == 0:
                    # use count 1 for the 'UNKNOWN' and 'MISSING' values
                    self.value_counts[feature, value] = 1
                    for p in self.p:
                        # fall back to the quantile over the whole dataset
                        self.value_quantiles[feature, value, p] = self.global_quantiles[p]

                # value exists in the training data, so the quantile can be calculated
                else:
                    # Find the y values for the specified feature and value
                    idx = X[feature] == value
                    y_for_value = y[idx].copy()
                    # counts for every feature and every value
                    self.value_counts[feature, value] = len(y_for_value)

                    for p in self.p:
                        # quantile calculation
                        quantile = np.quantile(y_for_value, p, interpolation='linear')
                        self.value_quantiles[feature, value, p] = quantile
        return self

    def transform(self, X):
        X = X.copy()

        for feature in self.features:
            # Replace nan values in the selected categorical features by 'MISSING'
            X[feature] = X[feature].apply(lambda value: value if pd.notnull(value) else 'MISSING')
            # Replace never-seen values by 'UNKNOWN'
            X[feature] = X[feature].apply(
                lambda value: value if value in self.features_unique[feature] else 'UNKNOWN')
            for p in self.p:
                for m in self.m:
                    # Build the new column name for the percentile values
                    feature_name = feature + '_' + str(p) + '_' + str(m)
                    if self.verbose:
                        print(f'feature name={feature_name}')
                    X[feature_name] = X[feature].apply(
                        lambda value: self.__calculate_new_target(feature, value, p, m))

        # Remove the original features
        if self.remove_original:
            X = X.drop(self.features, axis=1)

        # Return a dataframe or a numpy array
        if self.return_df:
            return X
        else:
            return X.to_numpy()

    def __calculate_new_target(self, feature, value, p, m):
        """
        :param feature: current feature name
        :param value: current value for feature
        :param p: percentile value
        :param m: regularization parameter to prevent overfitting, int in range from 1 to np.inf
        :return: calculated target to replace categorical value
        """
        # N - total number of rows in training set
        N = self.N
        # ni - number of rows with specified 'value' in training set
        ni = self.value_counts[feature, value]
        # eta - proportion of specified value to all values in training set
        eta = ni / N
        # q - quantile value for specified 'feature' and specified 'value'
        q = self.value_quantiles[feature, value, p]
        # mQ - quantile for the whole dataset
        mQ = self.global_quantiles[p]
        return (mQ + m * eta * q) / (1 + m * eta)

Usage:

PercentileTargetEncoder(
    features=None,           # None to auto-detect categorical features, or a list like ['feature1', 'feature2']
    ignored_features=None,   # categorical features to leave untransformed, None or a list
    p=0.5,                   # percentile: 0.5 is the median, or a list such as [0.1, 0.5, 0.9]
    m=1,                     # smoothing parameter described in the article linked above
    remove_original=True,    # if True, drop the original categorical columns
    return_df=True,          # if True, return a dataframe, otherwise a numpy array
    use_internal_yeo_johnson=True,  # apply a yeo-johnson transform to the target inside the encoder to reduce skewness
    verbose=True,            # if True, print details of the computations inside the transformer
)
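
And a minimal end-to-end sketch of fitting and transforming with made-up data (the internal yeo-johnson step is disabled just to keep the example short):

X = pd.DataFrame({'city': ['a', 'a', 'b', 'c', 'b']})
y = pd.Series([1.0, 2.0, 10.0, 5.0, 7.0])

encoder = PercentileTargetEncoder(features=['city'], p=0.5, m=1,
                                  use_internal_yeo_johnson=False, verbose=False)
X_encoded = encoder.fit(X, y).transform(X)  # adds a 'city_0.5_1' column, drops 'city'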


rta7y2nd3#

An update for @PeterPirog: we had an internal meeting, and these capabilities were prioritized as one of the main features to work on once the PyTorch port is fully finished. It will take a bit more patience, but they will be added to Ludwig. Thank you very much for your support and help!


xxslljrj4#

Thanks for the information. I am now testing how to encode categorical features effectively using two layers.

I have tested different vector lengths for different feature cardinalities.
