Hi,
I am trying to train SER on a custom dataset annotated as described here: https://github.com/PaddlePaddle/PaddleOCR/blob/release/2.6/doc/doc_en/kie_en.md
The dataset type is SimpleDataSet.

Config: configs/kie/vi_layoutxlm/ser_vi_layoutxlm.yaml, modified for my dataset (the full config is at the very end of this issue).

I receive the following error (the full traceback is further down):
RuntimeError: (PreconditionNotMet) The meta data must be valid when call the mutable data function. (at /Users/paddle/xly/workspace/0e927451-9b24-4f4c-91d8-7f5f62245df4/Paddle/paddle/phi/core/dense_tensor.cc:105)
[operator < not_equal > error]

Tracing this error, I found the following.
The loss used is vqa_token_layoutlm_loss (class VQASerTokenLayoutLMLoss).

labels shape in loss is [8, 512]
predicts shape in loss is [8, 512, 45]
The attention mask received from the batch:
attention_mask shape in loss is [8, 512]
attention_mask is Tensor(shape=[4096], dtype=int64, place=Place(cpu), stop_gradient=True,
[0, 0, 0, ..., 0, 0, 0])
attention_mask is all zeros, therefore:
active_loss shape in loss is [4096]
active_loss is all False (no True element), so the index applied to active_output and active_label selects 0 elements:
active_output shape in loss is [0, 45]
active_label shape in loss is [0]

Question: why is attention_mask = batch[2] all zeros?
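
To make the reasoning above concrete, here is a minimal pure-Python sketch of the masking step as I understand it from the printed shapes. It is not the PaddleOCR implementation: `select_active` and the toy sizes are made up for illustration, and nested lists stand in for tensors.

```python
def select_active(attention_mask, predicts, labels):
    """Keep only positions whose attention-mask value is 1.

    attention_mask: [batch][seq_len] of 0/1 ints
    predicts:       [batch][seq_len][num_classes] of floats
    labels:         [batch][seq_len] of ints
    (Hypothetical helper; it only mirrors the shapes reported in this issue.)
    """
    # Flatten the batch and sequence dimensions ([8, 512] -> [4096] in the issue).
    flat_mask = [m for row in attention_mask for m in row]
    flat_pred = [p for row in predicts for p in row]
    flat_label = [y for row in labels for y in row]

    # Boolean selector: True where the mask is 1.
    active = [m == 1 for m in flat_mask]
    active_output = [p for p, keep in zip(flat_pred, active) if keep]
    active_label = [y for y, keep in zip(flat_label, active) if keep]
    return active_output, active_label


# Toy sizes, not the real 8 x 512 x 45.
batch_size, seq_len, num_classes = 2, 4, 3
predicts = [[[0.0] * num_classes for _ in range(seq_len)] for _ in range(batch_size)]
labels = [[1, 2, 0, 0], [2, 1, 1, 0]]

# Mask with real tokens marked 1 and padding marked 0.
good_mask = [[1, 1, 0, 0], [1, 1, 1, 0]]
out, lab = select_active(good_mask, predicts, labels)
print(len(out), lab)  # 5 [1, 2, 2, 1, 1]

# All-zero mask, as observed here: nothing is selected.
zero_mask = [[0] * seq_len for _ in range(batch_size)]
out, lab = select_active(zero_mask, predicts, labels)
print(len(out), lab)  # 0 []
```

With the all-zero mask both selections come back empty, which matches the [0, 45] and [0] shapes printed before the loss call fails.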

RuntimeError Traceback (most recent call last)
Cell In [41], line 23
21 seed = config['Global']['seed'] if 'seed' in config['Global'] else 1024
22 train.set_seed(seed)
---> 23 train.main(config, device, logger, vdl_writer)

File ~/Desktop/OCR/code/PaddleOCR/tools/train.py:175, in main(config, device, logger, vdl_writer)
173 model = paddle.DataParallel(model)
174 # start train
--> 175 program.train(config, train_dataloader, valid_dataloader, device, model,
176 loss_class, optimizer, lr_scheduler, post_process_class,
177 eval_class, pre_best_model_dict, logger, vdl_writer, scaler,amp_level, amp_custom_black_list)

File ~/Desktop/OCR/code/PaddleOCR/tools/program.py:307, in train(config, train_dataloader, valid_dataloader, device, model, loss_class, optimizer, lr_scheduler, post_process_class, eval_class, pre_best_model_dict, logger, log_writer, scaler, amp_level, amp_custom_black_list)
301 preds = model(images)
302 # print("preds size:{}".format(len(preds)))
303 # print("preds {}".format(preds))
304 # print("batch size:{}".format(len(batch)))
305 # print("batch :{}".format(batch))
--> 307 loss = loss_class(preds, batch)
308 avg_loss = loss['loss']
309 avg_loss.backward()

File /opt/homebrew/Caskroom/miniforge/base/envs/paddle_env/lib/python3.8/site-packages/paddle/fluid/dygraph/layers.py:930, in Layer.__call__(self, *inputs,**kwargs)
928 return self.forward(*inputs,**kwargs)
929 else:
--> 930 return self._dygraph_call_func(*inputs,**kwargs)

File /opt/homebrew/Caskroom/miniforge/base/envs/paddle_env/lib/python3.8/site-packages/paddle/fluid/dygraph/layers.py:915, in Layer._dygraph_call_func(self, *inputs,**kwargs)
913 outputs = self.forward(*inputs,**kwargs)
914 else:
--> 915 outputs = self.forward(*inputs,**kwargs)
917 for forward_post_hook in self._forward_post_hooks.values():
918 hook_result = forward_post_hook(self, inputs, outputs)

File ~/Desktop/OCR/code/PaddleOCR/ppocr/losses/vqa_token_layoutlm_loss.py:53, in VQASerTokenLayoutLMLoss.forward(self, predicts, batch)
51 print("active_output shape in loss is {}".format(active_output.shape))
52 print("active_label shape in loss is {}".format(active_label.shape))
---> 53 loss = self.loss_class(active_output, active_label)
...
298 math_op = getattr(_C_ops, op_type)
--> 299 return math_op(self, other_var, 'axis', axis)

Global:
  use_gpu: False
  epoch_num: &epoch_num 200
  log_smooth_window: 10
  print_batch_step: 10
  save_model_dir: ./output/ser_vi_layoutxlm_permission
  save_epoch_step: 2000
  # evaluation is run every 10 iterations after the 0th iteration
  eval_batch_step: [ 0, 19 ]
  cal_metric_during_train: False
  save_inference_dir:
  use_visualdl: False
  seed: 2022
  distributed: False

  infer_img: ppstructure/docs/kie/input/zh_val_42.jpg
  # if you want to predict using the groundtruth ocr info,
  # you can use the following config
  # infer_img: train_data/XFUND/zh_val/val.json
  # infer_mode: False

  save_res_path: ./output/ser/permission/res
  kie_rec_model_dir:
  kie_det_model_dir:

Architecture:
  model_type: kie
  algorithm: &algorithm "LayoutXLM"
  Transform:
  Backbone:
    name: LayoutXLMForSer
    pretrained: True
    checkpoints:
    # one of base or vi
    mode: vi
    num_classes: &num_classes 45

Loss:
  name: VQASerTokenLayoutLMLoss
  num_classes: *num_classes
  key: "backbone_out"

Optimizer:
  name: AdamW
  beta1: 0.9
  beta2: 0.999
  lr:
    name: Linear
    learning_rate: 0.00005
    epochs: *epoch_num
    warmup_epoch: 2
  regularizer:
    name: L2
    factor: 0.00000

PostProcess:
  name: VQASerTokenLayoutLMPostProcess
  class_path: &class_path train_data/permission/class_list.txt
  # &class_path train_data/XFUND/class_list_xfun.txt

Metric:
  name: VQASerTokenMetric
  main_indicator: hmean

Train:
  dataset:
    name: SimpleDataSet
    data_dir:
      # train_data/XFUND/zh_train/image
      train_data/permission/train
    label_file_list:
      # - train_data/permission/train.json
      - train_data/permission/label_kie.txt
      # - train_data/XFUND/zh_train/train.json
    ratio_list: [ 1.0 ]
    transforms:
      - DecodeImage: # load image
          img_mode: RGB
          channel_first: False
      - VQATokenLabelEncode: # Class handling label
          contains_re: False
          algorithm: *algorithm
          class_path: *class_path
          use_textline_bbox_info: &use_textline_bbox_info True
          # one of [None, "tb-yx"]
          order_method: &order_method "tb-yx"
      - VQATokenPad:
          max_seq_len: &max_seq_len 512
          return_attention_mask: True
      - VQASerTokenChunk:
          max_seq_len: *max_seq_len
      - Resize:
          size: [224,224]
      - NormalizeImage:
          scale: 1
          mean: [ 123.675, 116.28, 103.53 ]
          std: [ 58.395, 57.12, 57.375 ]
          order: 'hwc'
      - ToCHWImage:
      - KeepKeys:
          keep_keys: [ 'input_ids', 'bbox', 'attention_mask', 'token_type_ids', 'image', 'labels'] # dataloader will return list in this order
  loader:
    shuffle: True
    drop_last: False
    batch_size_per_card: 8
    num_workers: 4

Eval:
  dataset:
    name: SimpleDataSet
    data_dir:
      # train_data/XFUND/zh_val/image
      train_data/permission/train
      # data_dir: train_data/permission/val
    label_file_list:
      # - train_data/XFUND/zh_val/val.json
      - train_data/permission/label_kie.txt
      # - train_data/permission/val.json
    transforms:
      - DecodeImage: # load image
          img_mode: RGB
          channel_first: False
      - VQATokenLabelEncode: # Class handling label
          contains_re: False
          algorithm: *algorithm
          class_path: *class_path
          use_textline_bbox_info: *use_textline_bbox_info
          order_method: *order_method
      - VQATokenPad:
          max_seq_len: *max_seq_len
          return_attention_mask: True
      - VQASerTokenChunk:
          max_seq_len: *max_seq_len
      - Resize:
          size: [224,224]
      - NormalizeImage:
          scale: 1
          mean: [ 123.675, 116.28, 103.53 ]
          std: [ 58.395, 57.12, 57.375 ]
          order: 'hwc'
      - ToCHWImage:
      - KeepKeys:
          keep_keys: [ 'input_ids', 'bbox', 'attention_mask', 'token_type_ids', 'image', 'labels'] # dataloader will return list in this order
  loader:
    shuffle: False
    drop_last: False
    batch_size_per_card: 8
    num_workers: 4
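
For reference, the KeepKeys comment in the config says the dataloader returns the fields in that order, which is why the loss reads the mask from batch[2]. The sketch below just derives the indices from the keep_keys list and counts the active tokens per sample. `batch_index` and `active_tokens_per_sample` are hypothetical names of mine, and the helper assumes the batch fields have already been converted to nested Python lists (for example with `.numpy().tolist()` on each tensor).

```python
# Copied from the KeepKeys entry in the config above.
keep_keys = ['input_ids', 'bbox', 'attention_mask', 'token_type_ids', 'image', 'labels']

# Position of each field in the list the dataloader yields.
batch_index = {name: i for i, name in enumerate(keep_keys)}

print(batch_index['attention_mask'])  # 2
print(batch_index['labels'])          # 5


def active_tokens_per_sample(batch):
    """Count mask entries equal to 1 for every sample in the batch.

    Hypothetical diagnostic helper: `batch` is a list ordered like keep_keys,
    with the attention mask given as nested lists of 0/1 ints.
    """
    mask = batch[batch_index['attention_mask']]
    return [sum(1 for m in row if m == 1) for row in mask]


# Toy batch: only the attention-mask slot is filled in.
toy_batch = [None, None, [[1, 1, 0], [0, 0, 0]], None, None, None]
print(active_tokens_per_sample(toy_batch))  # [2, 0]
```

A sample that reports 0 active tokens contributes nothing to the loss, and if every sample in the batch reports 0 the selection is empty, which is the situation described in this issue.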

PaddleOCR training KIE/SER: attention mask all 0's