pandas ValueError: incompatible dimensions when running tf.data.Dataset.from_tensor_slices((dict(df), binary_labels))

kg7wmglp · asked 2023-06-20

I am working on a multi-label classification problem that uses a series of structured datasets. My code is very similar to the Keras multi-label classification example here: https://colab.research.google.com/github/keras-team/keras-io/blob/master/examples/nlp/ipynb/multi_label_classification.ipynb#scrollTo=CeG8XhH8Yo-w
TensorFlow: 2.11.0, Python: 3.9.16, NumPy: 1.22.4, Pandas: 1.4.4, scikit-learn: 1.2.2
I created a pandas DataFrame, but I cannot pass it through tf.data.Dataset.from_tensor_slices((dict(df), binary_labels)). I get the error: "ValueError: Dimensions 4113 and 2689 are not compatible".
Code snippets are below.

  • The pandas DataFrame has been checked thoroughly: all missing values in the target column have been resolved, rows with missing values in the other columns have been dropped, and duplicates have been removed.

I removed all target values with fewer than 1 occurrence and split the dataset using stratified sampling.
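A minimal sketch of that filtering step (min_count is an illustrative threshold, not my exact value; it assumes the 'target' values are hashable, e.g. strings):

# Count how often each target value occurs in the dataframe
counts = dataframe['target'].value_counts()

# Keep only rows whose target value occurs at least min_count times (illustrative threshold)
min_count = 2
dataframe = dataframe[dataframe['target'].map(counts) >= min_count]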

# Split using sklearn's train_test_split with stratification
# Split on 60, 20, 20 ratios
from sklearn.model_selection import train_test_split

test_split = 0.4

# Initial train and test split
train, test = train_test_split(dataframe,
                               test_size = test_split,
                               stratify = dataframe['target'].values
                              )

# Split the held-out set in half into validation and test sets
val = test.sample(frac=0.5)
test.drop(val.index, inplace=True)

This works fine: I end up with 4113 training examples, 1372 validation examples, and 1371 test examples. Then I preprocess the labels.

import numpy as np
import tensorflow as tf

# Turn the training labels into a ragged tensor
labels = train['target'].values
print('Label Length: ', len(labels))
labels = tf.ragged.constant(labels)

# Use the string lookup layer to multi-hot encode the labels
lookup = tf.keras.layers.StringLookup(max_tokens=None, output_mode = 'multi_hot')
lookup.adapt(labels)
vocab = lookup.get_vocabulary()

def decode_multi_hot(encoded_labels):
    """ Function to decode/reverse the multi-hot encoded label to a tuple of vocab terms for one label """
    hot_indices = np.argwhere(encoded_labels == 1.0)[..., 0]
    return np.take(vocab, hot_indices)

print('vocab length: ', len(vocab), lookup.vocabulary_size())
print("Vocabulary: \n", vocab)

The result is a label length of 4113, equal to the size of the dataset, and a vocab length of 2689, which means I have 2689 unique vocab terms in the training set.
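For reference, here is a tiny standalone illustration of what the multi-hot encoding does (toy labels, not my real data; the shapes and vocabulary in the comments are what I would expect):

import tensorflow as tf

# Toy multi-label targets: each sample is a variable-length list of string labels
toy_labels = tf.ragged.constant([['cat', 'dog'], ['dog'], ['fish', 'cat', 'bird']])

toy_lookup = tf.keras.layers.StringLookup(output_mode='multi_hot')
toy_lookup.adapt(toy_labels)

encoded = toy_lookup(toy_labels)
print(encoded.shape)                # (3, 5): one row per sample, one column per vocab term (incl. the OOV slot)
print(toy_lookup.get_vocabulary())  # e.g. ['[UNK]', 'cat', 'dog', 'bird', 'fish']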
The problem appears when I run my dataset-creation function:

def df_to_dataset(dataframe, shuffle=True, batch_size=32):
    df = dataframe.copy()
    labels = tf.ragged.constant(df['target'].values)
    
    # labels = df['target'].values
    print('Label Length: ', len(labels))
    
    binary_labels = lookup(labels).numpy()

    # Keep only the feature columns, adding a trailing axis so each feature array has shape (n, 1)
    feature_items = ['Feature1', 'Feature2', 'Feature3']
    df = {key: np.array(value)[:,np.newaxis] for key, value in dataframe[feature_items].items()}
    
    ds = tf.data.Dataset.from_tensor_slices((dict(df), binary_labels))
    
    # Cache the Dataset in Memory
    ds = ds.cache()
    if shuffle:
        ds = ds.shuffle(buffer_size=len(dataframe))
    ds = ds.batch(batch_size)
    ds = ds.prefetch(auto)  # auto = tf.data.AUTOTUNE, as in the referenced Keras example

    return ds

When I run train_ds = df_to_dataset(train, batch_size=5), I get the following error:

Label Length:  4113
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[43], line 2
      1 # Checking the format of the data the input pipeline function returns
----> 2 train_ds = df_to_dataset(train, batch_size = 5)

Cell In[42], line 22, in df_to_dataset(dataframe, shuffle, batch_size)
     15     df = {key: np.array(value)[:,np.newaxis] for key, value in dataframe[feature_items].items()}
     18     
     19 #     ### Remove next line from final version of code
     20 #     print("labels:", labels)
---> 22     ds = tf.data.Dataset.from_tensor_slices((dict(df), binary_labels))
     31     
     32     # Cache the Dataset in Memory
     33     ds = ds.cache()

File ~\AppData\Roaming\Python\Python39\site-packages\tensorflow\python\data\ops\dataset_ops.py:818, in DatasetV2.from_tensor_slices(tensors, name)
    815 # Loaded lazily due to a circular dependency (dataset_ops ->
    816 # from_tensor_slices_op -> dataset_ops).
    817 from tensorflow.python.data.ops import from_tensor_slices_op  # pylint: disable=g-import-not-at-top
--> 818 return from_tensor_slices_op.from_tensor_slices(tensors, name)

File ~\AppData\Roaming\Python\Python39\site-packages\tensorflow\python\data\ops\from_tensor_slices_op.py:25, in from_tensor_slices(tensors, name)
     24 def from_tensor_slices(tensors, name=None):
---> 25   return TensorSliceDataset(tensors, name=name)

File ~\AppData\Roaming\Python\Python39\site-packages\tensorflow\python\data\ops\from_tensor_slices_op.py:45, in TensorSliceDataset.__init__(self, element, is_files, name)
     42 batch_dim = tensor_shape.Dimension(
     43     tensor_shape.dimension_value(self._tensors[0].get_shape()[0]))
     44 for t in self._tensors[1:]:
---> 45   batch_dim.assert_is_compatible_with(
     46       tensor_shape.Dimension(
     47           tensor_shape.dimension_value(t.get_shape()[0])))
     49 variant_tensor = gen_dataset_ops.tensor_slice_dataset(
     50     self._tensors,
     51     output_shapes=structure.get_flat_tensor_shapes(self._structure),
     52     is_files=is_files,
     53     metadata=self._metadata.SerializeToString())
     54 super(TensorSliceDataset, self).__init__(variant_tensor)

File ~\AppData\Roaming\Python\Python39\site-packages\tensorflow\python\framework\tensor_shape.py:297, in Dimension.assert_is_compatible_with(self, other)
    287 """Raises an exception if `other` is not compatible with this Dimension.
    288 
    289 Args:
   (...)
    294     is_compatible_with).
    295 """
    296 if not self.is_compatible_with(other):
--> 297   raise ValueError("Dimensions %s and %s are not compatible" %
    298                    (self, other))

ValueError: Dimensions 4113 and 2689 are not compatible

I understand the problem is with my labels and with matching each label to its row of the dataframe, but I can't see where I'm going wrong. The labels I pass in have the same length as the dataframe (4113), but when I run lookup(labels) the result comes back with the same size as the label vocabulary (2689). The example linked above is very similar to my data and works fine with the same setup as mine. I don't want to shrink my dataframe. Any help would be greatly appreciated.
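For reference, a minimal shape check along these lines shows the mismatch (variable names follow the snippets above; the shapes in the comments are what the error message implies, not verified output):

# Compare the leading dimension of the features with that of the encoded labels.
# from_tensor_slices requires every tensor in the tuple to share the same first dimension.
labels = tf.ragged.constant(train['target'].values)
binary_labels = lookup(labels).numpy()

print('rows in train:       ', len(train))            # 4113
print('binary_labels shape: ', binary_labels.shape)   # the error implies its first dimension is 2689,
                                                      # not 4113 as from_tensor_slices expects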


rjjhvcjd1#

I'm no expert, but I ran into the same problem and was able to solve it by using LabelEncoder. Maybe this helps.
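Roughly along these lines (a minimal sketch, assuming each row has a single string target; `features` here is a stand-in for the {column: array} dict built inside df_to_dataset):

from sklearn.preprocessing import LabelEncoder
import tensorflow as tf

# Encode each string target as a single integer class id
encoder = LabelEncoder()
encoded_labels = encoder.fit_transform(train['target'].values)

# encoded_labels has one entry per row, so its first dimension matches the feature arrays
# (features = the {column: array} dict from df_to_dataset)
ds = tf.data.Dataset.from_tensor_slices((features, encoded_labels))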
Best regards
