unilm: StopIteration when training LayoutLM on the sample data

cunj1qz1 · posted 2 months ago in Other

Describe the bug

I am trying to train a LayoutLM sequence labeling model as described in the LayoutLM README. A StopIteration exception is raised during training.
The problem occurs in both of the following cases:

  • the official example script (details below)
  • my own modified script (details below)

To reproduce

I set up the environment as follows.

conda create -n layoutlm
conda activate layoutlm
conda install -c creditx gcc-7
conda install pytorch cudatoolkit=10.1 -c pytorch
pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./

Then I preprocessed the sample FUNSD data as described in the README and ran the following command.

python /home/wmcneill/src/unilm/layoutlm/examples/seq_labeling/run_seq_labeling.py  \
       --data_dir /home/wmcneill/src/unilm/layoutlm/examples/seq_labeling/data \
       --model_type layoutlm \
       --model_name_or_path /home/wmcneill/experiment/layoutlm/layoutlm-large-uncased \
       --do_lower_case \
       --max_seq_length 512 \
       --do_train \
       --num_train_epochs 100.0 \
       --logging_steps 10 \
       --save_steps -1 \
       --output_dir /home/wmcneill/experiment/layoutlm/FUNSD.layoutlm.model \
       --labels /home/wmcneill/src/unilm/layoutlm/examples/seq_labeling/data/labels.txt \
       --per_gpu_train_batch_size 16 \
       --per_gpu_eval_batch_size 16 \
       --fp16

Shortly after training started, I saw the following error.

Selected optimization level O1:  Insert automatic casts around Pytorch functions and Tensor methods.

Defaults for this optimization level are:
enabled                : True
opt_level              : O1
cast_model_type        : None
patch_torch_functions  : True
keep_batchnorm_fp32    : None
master_weights         : None
loss_scale             : dynamic
Processing user overrides (additional kwargs that are not None)...
After processing overrides, optimization options are:
enabled                : True
opt_level              : O1
cast_model_type        : None
patch_torch_functions  : True
keep_batchnorm_fp32    : None
master_weights         : None
loss_scale             : dynamic
Iteration:   0%|                                                                                                                                 | 0/5 [00:02<?, ?it/s]
Epoch:   0%|                                                                                                                                     | 0/1 [00:02<?, ?it/s]
Traceback (most recent call last):
  File "/home/wmcneill/src/unilm/layoutlm/examples/seq_labeling/run_seq_labeling.py", line 811, in <module>
    main()
  File "/home/wmcneill/src/unilm/layoutlm/examples/seq_labeling/run_seq_labeling.py", line 703, in main
    global_step, tr_loss = train(
  File "/home/wmcneill/src/unilm/layoutlm/examples/seq_labeling/run_seq_labeling.py", line 219, in train
    outputs = model(**inputs)
  File "/home/wmcneill/anaconda3/envs/layoutlm/lib/python3.8/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/wmcneill/anaconda3/envs/layoutlm/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 155, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/home/wmcneill/anaconda3/envs/layoutlm/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 165, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/home/wmcneill/anaconda3/envs/layoutlm/lib/python3.8/site-packages/torch/nn/parallel/parallel_apply.py", line 85, in parallel_apply
    output.reraise()
  File "/home/wmcneill/anaconda3/envs/layoutlm/lib/python3.8/site-packages/torch/_utils.py", line 395, in reraise
    raise self.exc_type(msg)
StopIteration: Caught StopIteration in replica 0 on device 0.
Original Traceback (most recent call last):
  File "/home/wmcneill/anaconda3/envs/layoutlm/lib/python3.8/site-packages/torch/nn/parallel/parallel_apply.py", line 60, in _worker
    output = module(*input, **kwargs)
  File "/home/wmcneill/anaconda3/envs/layoutlm/lib/python3.8/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/wmcneill/src/unilm/layoutlm/layoutlm/modeling/layoutlm.py", line 211, in forward
    outputs = self.bert(
  File "/home/wmcneill/anaconda3/envs/layoutlm/lib/python3.8/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/wmcneill/src/unilm/layoutlm/layoutlm/modeling/layoutlm.py", line 143, in forward
    dtype=next(self.parameters()).dtype
StopIteration

Expected behavior

I expect training to run and produce a model in the FUNSD.layoutlm.model directory. With the same setup, I can do exactly this on another machine that has no GPU.

  • Platform:
  • Python version: 3.8.5
  • PyTorch version (GPU?): 1.6
  • OS: CentOS Linux release 7.6.1810 (Core)
  • CUDA: 10.1
  • NVIDIA-SMI 450.57, Driver Version: 450.57

7xzttuei 1#

You can set CUDA_LAUNCH_BLOCKING=1 when launching the script; it makes CUDA kernel launches synchronous, so errors are reported at the call that actually caused them and the message is usually more informative. See the example below.
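
For example, prefixing the exact training command from the report (only the environment variable is new; the remaining arguments are unchanged and elided here):

CUDA_LAUNCH_BLOCKING=1 python /home/wmcneill/src/unilm/layoutlm/examples/seq_labeling/run_seq_labeling.py \
       --data_dir /home/wmcneill/src/unilm/layoutlm/examples/seq_labeling/data \
       ...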

5fjcxozz 2#

For anyone hitting this error: check whether the script runs on a single GPU.
In my case it runs fine on a single GPU, but I still get the same error when training on multiple GPUs, where the script wraps the model in DataParallel. (A way to force a single-GPU run is sketched below.)
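
One way to try the single-GPU check, assuming a Linux shell (the device index 0 is arbitrary, and the remaining arguments are unchanged from the original command): restricting the visible devices means run_seq_labeling.py sees only one GPU and therefore never wraps the model in torch.nn.DataParallel.

CUDA_VISIBLE_DEVICES=0 python /home/wmcneill/src/unilm/layoutlm/examples/seq_labeling/run_seq_labeling.py \
       --data_dir /home/wmcneill/src/unilm/layoutlm/examples/seq_labeling/data \
       ...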

o7jaxewo 3#

Same problem here, and I also have more than one GPU.

kmynzznz 4#

Either downgrade PyTorch to 1.4.0, or modify the model source so that forward() does not call self.parameters(): save next(self.parameters()).dtype in __init__ and use the saved dtype inside forward(). (On PyTorch 1.5 and later, the replicas that nn.DataParallel creates expose an empty parameters() iterator, so next(self.parameters()) raises StopIteration when it runs inside forward() on a replica.) A sketch of the second fix follows.
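
A minimal sketch of the second fix, assuming the dtype-lookup pattern at layoutlm/modeling/layoutlm.py line 143 shown in the traceback. The class, its attributes, and the mask arithmetic are illustrative stand-ins, not the actual unilm source:

import torch
import torch.nn as nn

class PatchedEncoder(nn.Module):
    """Illustrative stand-in for the LayoutLM encoder; not the unilm source."""

    def __init__(self, vocab_size=30522, hidden_size=768):
        super().__init__()
        self.embeddings = nn.Embedding(vocab_size, hidden_size)
        # Cache the parameter dtype once, at construction time. On
        # PyTorch >= 1.5 the replicas created by nn.DataParallel expose an
        # empty parameters() iterator, so next(self.parameters()) raises
        # StopIteration if it runs inside forward() on a replica.
        self.param_dtype = next(self.parameters()).dtype

    def forward(self, input_ids, attention_mask):
        extended_mask = attention_mask[:, None, None, :]
        # Before the fix this line read:
        #   extended_mask.to(dtype=next(self.parameters()).dtype)
        extended_mask = extended_mask.to(dtype=self.param_dtype)
        extended_mask = (1.0 - extended_mask) * -10000.0
        return self.embeddings(input_ids), extended_mask

# Quick sanity check on CPU; the crash itself only reproduces under
# nn.DataParallel with more than one GPU.
model = PatchedEncoder()
ids = torch.randint(0, 30522, (2, 16))
mask = torch.ones(2, 16)
embeddings, extended_mask = model(ids, mask)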
