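Describe the bug
I am trying to train a LayoutLM sequence labeling model as described in the LayoutLM README. During training, a StopIteration exception is raised.
The problem arises when using:
- the official example scripts: (details below)
- my own modified scripts: (details below)
To reproduce
I set up the environment as follows.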
conda create -n layoutlm
conda activate layoutlm
conda install -c creditx gcc-7
conda install pytorch cudatoolkit=10.1 -c pytorch
pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./
I then preprocessed the sample FUNSD data as described in the README and ran the following command.
python /home/wmcneill/src/unilm/layoutlm/examples/seq_labeling/run_seq_labeling.py \
--data_dir /home/wmcneill/src/unilm/layoutlm/examples/seq_labeling/data \
--model_type layoutlm \
--model_name_or_path /home/wmcneill/experiment/layoutlm/layoutlm-large-uncased \
--do_lower_case \
--max_seq_length 512 \
--do_train \
--num_train_epochs 100.0 \
--logging_steps 10 \
--save_steps -1 \
--output_dir /home/wmcneill/experiment/layoutlm/FUNSD.layoutlm.model \
--labels /home/wmcneill/src/unilm/layoutlm/examples/seq_labeling/data/labels.txt \
--per_gpu_train_batch_size 16 \
--per_gpu_eval_batch_size 16 \
--fp16
Shortly after training started, I saw the following error.
Selected optimization level O1: Insert automatic casts around Pytorch functions and Tensor methods.
Defaults for this optimization level are:
enabled : True
opt_level : O1
cast_model_type : None
patch_torch_functions : True
keep_batchnorm_fp32 : None
master_weights : None
loss_scale : dynamic
Processing user overrides (additional kwargs that are not None)...
After processing overrides, optimization options are:
enabled : True
opt_level : O1
cast_model_type : None
patch_torch_functions : True
keep_batchnorm_fp32 : None
master_weights : None
loss_scale : dynamic
Iteration: 0%| | 0/5 [00:02<?, ?it/s]
Epoch: 0%| | 0/1 [00:02<?, ?it/s]
Traceback (most recent call last):
File "/home/wmcneill/src/unilm/layoutlm/examples/seq_labeling/run_seq_labeling.py", line 811, in <module>
main()
File "/home/wmcneill/src/unilm/layoutlm/examples/seq_labeling/run_seq_labeling.py", line 703, in main
global_step, tr_loss = train(
File "/home/wmcneill/src/unilm/layoutlm/examples/seq_labeling/run_seq_labeling.py", line 219, in train
outputs = model(**inputs)
File "/home/wmcneill/anaconda3/envs/layoutlm/lib/python3.8/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/wmcneill/anaconda3/envs/layoutlm/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 155, in forward
outputs = self.parallel_apply(replicas, inputs, kwargs)
File "/home/wmcneill/anaconda3/envs/layoutlm/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 165, in parallel_apply
return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
File "/home/wmcneill/anaconda3/envs/layoutlm/lib/python3.8/site-packages/torch/nn/parallel/parallel_apply.py", line 85, in parallel_apply
output.reraise()
File "/home/wmcneill/anaconda3/envs/layoutlm/lib/python3.8/site-packages/torch/_utils.py", line 395, in reraise
raise self.exc_type(msg)
StopIteration: Caught StopIteration in replica 0 on device 0.
Original Traceback (most recent call last):
File "/home/wmcneill/anaconda3/envs/layoutlm/lib/python3.8/site-packages/torch/nn/parallel/parallel_apply.py", line 60, in _worker
output = module(*input, **kwargs)
File "/home/wmcneill/anaconda3/envs/layoutlm/lib/python3.8/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/wmcneill/src/unilm/layoutlm/layoutlm/modeling/layoutlm.py", line 211, in forward
outputs = self.bert(
File "/home/wmcneill/anaconda3/envs/layoutlm/lib/python3.8/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/wmcneill/src/unilm/layoutlm/layoutlm/modeling/layoutlm.py", line 143, in forward
dtype=next(self.parameters()).dtype
StopIteration
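Expected behavior
I expect to be able to train a model and have it written to the FUNSD.layoutlm.model directory. I can do this with the same setup on a different machine that has no GPU.
- Platform:
- Python version: 3.8.5
- PyTorch version (GPU?): 1.6
- CentOS Linux release 7.6.1810 (Core)
- CUDA 10.1
- NVIDIA-SMI 450.57, Driver Version: 450.57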
4 Answers
7xzttuei1#
You can set
CUDA_LAUNCH_BLOCKING=1
when launching the script, which gives more information.
5fjcxozz2#
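Anyone hitting this error: check whether it runs on a single GPU.
In my case it runs on a single GPU, but when I use multiple GPUs with the model wrapped in DataParallel, I still get the same error.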
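The two suggestions so far (synchronous CUDA launches, and limiting the run to one GPU) can both be tried through environment variables. Below is a minimal sketch, assuming the lines are placed at the very top of the training script, before torch is imported or CUDA is initialized. The device index "0" is an assumption for illustration; with a single visible device the script should no longer take the multi-GPU DataParallel path shown in the traceback.

```python
import os

# Both variables must be set before CUDA is initialized, so this belongs
# at the top of the script, ahead of `import torch`.

# Make CUDA kernel launches synchronous so errors surface at the call site.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

# Expose only one GPU to the process (index "0" is an assumed choice).
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
```

Setting the same variables in the shell before the python command has the same effect.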
o7jaxewo3#
Same problem here; I have more than one GPU.
kmynzznz4#
Either downgrade PyTorch to 1.4.0, or modify the source so that forward does not call self.parameters(): save next(self.parameters()).dtype in __init__ and use the saved dtype in forward.
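The second option can be sketched as follows. This is not the LayoutLM code: TinyEncoder, param_dtype, and the tensor shapes are hypothetical stand-ins chosen to show the pattern of reading the dtype once in __init__ and reusing it in forward.

```python
import torch
from torch import nn


class TinyEncoder(nn.Module):
    """Hypothetical stand-in for the model; not the real LayoutLM class."""

    def __init__(self, hidden_size: int = 8):
        super().__init__()
        self.linear = nn.Linear(hidden_size, hidden_size)
        # Read the dtype once, while parameters() still yields the weights.
        self.param_dtype = next(self.parameters()).dtype

    def forward(self, hidden: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
        # Use the cached dtype instead of next(self.parameters()).dtype,
        # which is the call that raised StopIteration in the traceback.
        mask = attention_mask.unsqueeze(-1).to(dtype=self.param_dtype)
        return self.linear(hidden) * mask


model = TinyEncoder()
hidden = torch.randn(2, 4, 8)
attention_mask = torch.ones(2, 4, dtype=torch.long)
output = model(hidden, attention_mask)
```

One caveat with this approach: the cached value goes stale if the model is later cast with .half() or .to(dtype=...), so it would need to be refreshed after any such cast.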