How to merge numerical and embedding sequence models to handle categories in an RNN with Keras

atmip9wb · posted 12 months ago in Other

I want to build a single-layer LSTM model with embeddings for my categorical features. I currently have numerical features and several categorical features, such as location, which cannot be one-hot encoded with pd.get_dummies() due to computational complexity, even though that is what I originally intended to do.
Let's consider an example:

Sample data

data = {
    'user_id': [1,1,1,1,2,2,3],
    'time_on_page': [10,20,30,20,15,10,40],
    'location': ['London','New York', 'London', 'New York', 'Hong Kong', 'Tokyo', 'Madrid'],
    'page_id': [5,4,2,1,6,8,2]
}
d = pd.DataFrame(data=data)
print(d)
   user_id  time_on_page   location  page_id
0        1            10     London        5
1        1            20   New York        4
2        1            30     London        2
3        1            20   New York        1
4        2            15  Hong Kong        6
5        2            10      Tokyo        8
6        3            40     Madrid        2

Consider people visiting a website. The numerical data I track includes things like time on page. The categorical data includes: location (1000+ unique values), page_id (1000+ unique values), author_id (100+ unique values). The simplest solution would be to one-hot encode everything and feed it into an LSTM with variable sequence lengths, each timestep corresponding to a different page view.
The DataFrame above would generate 7 training samples with variable sequence lengths. For example, for user_id=2 I would have 2 training samples:

[ ROW_INDEX_4 ] and [ ROW_INDEX_4, ROW_INDEX_5 ]
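A minimal sketch of building those per-user prefix sequences with pandas (the counts below follow the example DataFrame above):

```python
import pandas as pd

data = {
    'user_id': [1, 1, 1, 1, 2, 2, 3],
    'time_on_page': [10, 20, 30, 20, 15, 10, 40],
    'location': ['London', 'New York', 'London', 'New York',
                 'Hong Kong', 'Tokyo', 'Madrid'],
    'page_id': [5, 4, 2, 1, 6, 8, 2],
}
d = pd.DataFrame(data)

# one training sample per prefix of each user's page-view history
samples = []
for _, group in d.groupby('user_id'):
    for end in range(1, len(group) + 1):
        samples.append(group.iloc[:end])

print(len(samples))  # 7 samples, as described above
```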


Assuming X is the training data, let's look at the first training sample, X[0].
[image: the first training sample X[0], with the numerical columns first and the categorical columns last]
From the image above, my categorical features are X[0][:, n:].
Before creating the sequences, I factorized the categorical variables into [0, 1, ..., number_of_cats-1] using pd.factorize(), so the data in X[0][:, n:] consists of integers corresponding to category indices.
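For reference, pd.factorize() maps each distinct value to an integer code in order of first appearance:

```python
import pandas as pd

locations = pd.Series(['London', 'New York', 'London', 'New York',
                       'Hong Kong', 'Tokyo', 'Madrid'])

# codes: integer index per row; uniques: the distinct values in first-seen order
codes, uniques = pd.factorize(locations)
print(codes)          # [0 1 0 1 2 3 4]
print(list(uniques))  # ['London', 'New York', 'Hong Kong', 'Tokyo', 'Madrid']
```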
Do I need to create a separate Embedding for each categorical feature, e.g. one embedding for each of x_*n, x_*n+1, ..., x_*m?
If so, how do I put that into Keras code?

model = Sequential()

model.add(Embedding(?, ?, input_length=variable)) # How do I feed the data into this embedding? Only the categorical inputs.

model.add(LSTM())
model.add(Dense())
model.add(Activation('sigmoid'))
model.compile()

model.fit_generator() # fits the `X[i]` one by one of variable length sequences.

My solution:

Something like this:



I could train a Word2Vec model on each categorical feature (columns n to m) to vectorize any given value; for example, London would be vectorized in 3 dimensions. Suppose I use 3-dimensional embeddings. I would then put everything back into the X matrix, which would now have n + 3(m-n) columns, and train the LSTM model on that?
I just feel there should be a simpler/smarter way.


u59ebvdq1#

As you mentioned, one solution is to one-hot encode the categorical data (or even use it in index-based format) and feed it to the LSTM layer along with the numerical data. Of course, you could also use two LSTM layers here, one for processing the numerical data and another for processing the categorical data (in one-hot or index-based format), and then merge their outputs.
Another solution is to have a separate embedding layer for each categorical feature. Each embedding layer may have its own embedding dimension (and, as mentioned above, you may have multiple LSTM layers for processing numerical and categorical features separately):

# imports needed to make this snippet standalone (not shown in the original answer)
from keras.layers import Input, Embedding, TimeDistributed, Reshape, LSTM, concatenate
from keras.models import Model

num_cats = 3                  # number of categorical features
n_steps = 100                 # number of timesteps in each sample
n_numerical_feats = 10        # number of numerical features in each sample
cat_size = [1000, 500, 100]   # number of categories in each categorical feature
cat_embd_dim = [50, 10, 100]  # embedding dimension for each categorical feature

numerical_input = Input(shape=(n_steps, n_numerical_feats), name='numeric_input')
cat_inputs = []
for i in range(num_cats):
    cat_inputs.append(Input(shape=(n_steps,1), name='cat' + str(i+1) + '_input'))

cat_embedded = []
for i in range(num_cats):
    embed = TimeDistributed(Embedding(cat_size[i], cat_embd_dim[i]))(cat_inputs[i])
    cat_embedded.append(embed)
    
cat_merged = concatenate(cat_embedded)
cat_merged = Reshape((n_steps, -1))(cat_merged)
merged = concatenate([numerical_input, cat_merged])
lstm_out = LSTM(64)(merged)

model = Model([numerical_input] + cat_inputs, lstm_out)
model.summary()

Here is the model summary:

Layer (type)                    Output Shape         Param #     Connected to                     
==================================================================================================
cat1_input (InputLayer)         (None, 100, 1)       0                                            
__________________________________________________________________________________________________
cat2_input (InputLayer)         (None, 100, 1)       0                                            
__________________________________________________________________________________________________
cat3_input (InputLayer)         (None, 100, 1)       0                                            
__________________________________________________________________________________________________
time_distributed_1 (TimeDistrib (None, 100, 1, 50)   50000       cat1_input[0][0]                 
__________________________________________________________________________________________________
time_distributed_2 (TimeDistrib (None, 100, 1, 10)   5000        cat2_input[0][0]                 
__________________________________________________________________________________________________
time_distributed_3 (TimeDistrib (None, 100, 1, 100)  10000       cat3_input[0][0]                 
__________________________________________________________________________________________________
concatenate_1 (Concatenate)     (None, 100, 1, 160)  0           time_distributed_1[0][0]         
                                                                 time_distributed_2[0][0]         
                                                                 time_distributed_3[0][0]         
__________________________________________________________________________________________________
numeric_input (InputLayer)      (None, 100, 10)      0                                            
__________________________________________________________________________________________________
reshape_1 (Reshape)             (None, 100, 160)     0           concatenate_1[0][0]              
__________________________________________________________________________________________________
concatenate_2 (Concatenate)     (None, 100, 170)     0           numeric_input[0][0]              
                                                                 reshape_1[0][0]                  
__________________________________________________________________________________________________
lstm_1 (LSTM)                   (None, 64)           60160       concatenate_2[0][0]              
==================================================================================================
Total params: 125,160
Trainable params: 125,160
Non-trainable params: 0
__________________________________________________________________________________________________
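As a sanity check (not part of the original answer), the parameter counts in this summary can be reproduced by hand: each embedding has vocab_size × embedding_dim weights, and a Keras LSTM layer has 4 * (input_dim * units + units² + units) parameters:

```python
cat_size = [1000, 500, 100]
cat_embd_dim = [50, 10, 100]

# one weight matrix of shape (vocab_size, embedding_dim) per embedding layer
embed_params = [v * d for v, d in zip(cat_size, cat_embd_dim)]
print(embed_params)  # [50000, 5000, 10000]

# LSTM input is numerical (10) + concatenated embeddings (50+10+100) = 170 features
units = 64
input_dim = 10 + sum(cat_embd_dim)
lstm_params = 4 * (input_dim * units + units * units + units)
print(lstm_params)  # 60160

print(sum(embed_params) + lstm_params)  # 125160, matching the summary
```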


There is yet another solution you could try: use just one embedding layer for all the categorical features. It involves some preprocessing: you need to re-index all the categories so that they are distinct from each other. For example, the categories in the first categorical feature would be numbered from 1 to size_first_cat, then the categories in the second categorical feature would be numbered from size_first_cat + 1 to size_first_cat + size_second_cat, and so on. However, in this solution all the categorical features would have the same embedding dimension, since we are using only one embedding layer.
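A sketch of that re-indexing with NumPy, using cumulative offsets (0-based here rather than 1-based; `cat_size` reuses the sizes from the code above):

```python
import numpy as np

cat_size = [1000, 500, 100]               # categories in each categorical feature
offsets = np.cumsum([0] + cat_size[:-1])  # [0, 1000, 1500]

# raw integer codes for two timesteps; column i is in [0, cat_size[i])
raw_codes = np.array([[3, 42, 7],
                      [999, 0, 99]])

# shift each feature's codes so indices are unique across all features
global_codes = raw_codes + offsets
print(global_codes)  # [[   3 1042 1507]
                     #  [ 999 1000 1599]]

# the single shared Embedding layer would then need sum(cat_size) = 1600 rows
```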

**Update:** Now that I think about it, you could also reshape the categorical features in the data preprocessing stage, or even inside the model, to get rid of the TimeDistributed and Reshape layers (which may also improve training speed):

numerical_input = Input(shape=(n_steps, n_numerical_feats), name='numeric_input')
cat_inputs = []
for i in range(num_cats):
    cat_inputs.append(Input(shape=(n_steps,), name='cat' + str(i+1) + '_input'))

cat_embedded = []
for i in range(num_cats):
    embed = Embedding(cat_size[i], cat_embd_dim[i])(cat_inputs[i])
    cat_embedded.append(embed)

cat_merged = concatenate(cat_embedded)
merged = concatenate([numerical_input, cat_merged])
lstm_out = LSTM(64)(merged)

model = Model([numerical_input] + cat_inputs, lstm_out)


Model summary:

__________________________________________________________________________________________________
 Layer (type)                Output Shape                 Param #   Connected to                  
==================================================================================================
 cat1_input (InputLayer)     [(None, 100)]                0         []                            
                                                                                                  
 cat2_input (InputLayer)     [(None, 100)]                0         []                            
                                                                                                  
 cat3_input (InputLayer)     [(None, 100)]                0         []                            
                                                                                                  
 embedding_14 (Embedding)    (None, 100, 50)              50000     ['cat1_input[0][0]']          
                                                                                                  
 embedding_15 (Embedding)    (None, 100, 10)              5000      ['cat2_input[0][0]']          
                                                                                                  
 embedding_16 (Embedding)    (None, 100, 100)             10000     ['cat3_input[0][0]']          
                                                                                                  
 numeric_input (InputLayer)  [(None, 100, 10)]            0         []                            
                                                                                                  
 concatenate_26 (Concatenat  (None, 100, 160)             0         ['embedding_14[0][0]',        
 e)                                                                  'embedding_15[0][0]',        
                                                                     'embedding_16[0][0]']        
                                                                                                  
 concatenate_27 (Concatenat  (None, 100, 170)             0         ['numeric_input[0][0]',       
 e)                                                                  'concatenate_26[0][0]']      
                                                                                                  
 lstm_5 (LSTM)               (None, 64)                   60160     ['concatenate_27[0][0]']      
                                                                                                  
==================================================================================================
Total params: 125160 (488.91 KB)
Trainable params: 125160 (488.91 KB)
Non-trainable params: 0 (0.00 Byte)
__________________________________________________________________________________________________


As for fitting the model, you need to feed each input layer its own corresponding numpy array, for example:

X_tr_numerical = X_train[:,:,:n_numerical_feats]

# extract categorical features: you could use a for loop to do this as well.
# note that we reshape the categorical features to make them consistent with the updated solution
X_tr_cat1 = X_train[:,:,cat1_idx].reshape(-1, n_steps) 
X_tr_cat2 = X_train[:,:,cat2_idx].reshape(-1, n_steps)
X_tr_cat3 = X_train[:,:,cat3_idx].reshape(-1, n_steps)

# don't forget to compile the model ...

# fit the model
model.fit([X_tr_numerical, X_tr_cat1, X_tr_cat2, X_tr_cat3], y_train, ...)

# or you can use input layer names instead
model.fit({'numeric_input': X_tr_numerical,
           'cat1_input': X_tr_cat1,
           'cat2_input': X_tr_cat2,
           'cat3_input': X_tr_cat3}, y_train, ...)


If you want to use fit_generator(), there is no difference:

# if you are using a generator
def my_generator(...):
     
    # prep the data ...

    yield [batch_tr_numerical, batch_tr_cat1, batch_tr_cat2, batch_tr_cat3], batch_tr_y

    # or use the names
    yield {'numeric_input': batch_tr_numerical,
           'cat1_input': batch_tr_cat1,
           'cat2_input': batch_tr_cat2,
           'cat3_input': batch_tr_cat3}, batch_tr_y

model.fit_generator(my_generator(...), ...)

# or if you are subclassing Sequence class
class MySequence(Sequence):
    def __init__(self, x_set, y_set, batch_size):
        # initialize the data

    def __getitem__(self, idx):
        # fetch data for the given batch index (i.e. idx)

        # same as the generator above but use `return` instead of `yield`

model.fit_generator(MySequence(...), ...)


b4wnujal2#

Another solution I can think of is to combine the numerical features (after standardization) with the categorical embeddings before feeding them to the LSTM.
During backpropagation, allow gradients to flow only through the embedding layers, since by default gradients will flow through both branches.
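A minimal sketch of the standardization step mentioned above, assuming `X_num` holds the numerical features with shape (samples, timesteps, features); each feature's statistics are computed over samples and timesteps together:

```python
import numpy as np

rng = np.random.default_rng(0)
X_num = rng.normal(5.0, 2.0, size=(32, 100, 10))  # toy numerical features

# standardize each numerical feature across all samples and timesteps
mean = X_num.mean(axis=(0, 1), keepdims=True)
std = X_num.std(axis=(0, 1), keepdims=True)
X_num_std = (X_num - mean) / (std + 1e-8)  # epsilon guards against zero variance

print(X_num_std.shape)  # (32, 100, 10)
```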
