Contents

  • The Error
  • Troubleshooting
  • Summary

The Error

System Environment:
Ubuntu 16
Version: Paddle 2.3.1; PaddleOCR: latest version from git clone
Related components:
Running SER+RE joint inference fails
Command:

export CUDA_VISIBLE_DEVICES=0
python3 tools/infer_vqa_token_ser_re.py -c configs/vqa/re/layoutxlm.yml -o Architecture.Backbone.checkpoints=pretrain/re_LayoutXLM_xfun_zh/ Global.infer_img=doc/vqa/input/zh_val_21.jpg -c_ser configs/vqa/ser/layoutxlm.yml -o_ser Architecture.Backbone.checkpoints=pretrain/ser_LayoutXLM_xfun_zh/

Complete Error Message:

Traceback (most recent call last):
  File "tools/infer_vqa_token_ser_re.py", line 193, in <module>
    result = ser_re_engine(img_path)
  File "tools/infer_vqa_token_ser_re.py", line 135, in __call__
    ser_results, ser_inputs = self.ser_engine(img_path)
  File "/data1/liushu/test_PPOCR/PaddleOCR/tools/infer_vqa_token_ser.py", line 98, in __call__
    batch = transform(data, self.ops)
  File "/data1/liushu/test_PPOCR/PaddleOCR/ppocr/data/imaug/__init__.py", line 51, in transform
    data = op(data)
  File "/data1/liushu/test_PPOCR/PaddleOCR/ppocr/data/imaug/label_ops.py", line 885, in __call__
    ocr_info = self._load_ocr_info(data)
  File "/data1/liushu/test_PPOCR/PaddleOCR/ppocr/data/imaug/label_ops.py", line 988, in _load_ocr_info
    ocr_result = self.ocr_engine.ocr(data['image'], cls=False)
  File "/data1/liushu/test_PPOCR/PaddleOCR/paddleocr.py", line 480, in ocr
    dt_boxes, rec_res = self.__call__(img, cls)
  File "/data1/liushu/test_PPOCR/PaddleOCR/tools/infer/predict_system.py", line 69, in __call__
    dt_boxes, elapse = self.text_detector(img)
  File "/data1/liushu/test_PPOCR/PaddleOCR/tools/infer/predict_det.py", line 218, in __call__
    self.predictor.run()
OSError: In user code:

    File "tools/export_model.py", line 172, in <module>
      main()
    File "tools/export_model.py", line 165, in main
      sub_model_save_path, logger)
    File "tools/export_model.py", line 99, in export_single_model
      paddle.jit.save(model, save_path)
    File "<decorator-gen-101>", line 2, in save
      
    File "/usr/local/lib/python3.7/site-packages/paddle/fluid/wrapped_decorator.py", line 25, in __impl__
      return wrapped_func(*args, **kwargs)
    File "/usr/local/lib/python3.7/site-packages/paddle/fluid/dygraph/base.py", line 51, in __impl__
      return func(*args, **kwargs)
    File "/usr/local/lib/python3.7/site-packages/paddle/fluid/dygraph/jit.py", line 744, in save
      inner_input_spec)
    File "/usr/local/lib/python3.7/site-packages/paddle/fluid/dygraph/dygraph_to_static/program_translator.py", line 517, in concrete_program_specify_input_spec
      *desired_input_spec)
    File "/usr/local/lib/python3.7/site-packages/paddle/fluid/dygraph/dygraph_to_static/program_translator.py", line 427, in get_concrete_program
      concrete_program, partial_program_layer = self._program_cache[cache_key]
    File "/usr/local/lib/python3.7/site-packages/paddle/fluid/dygraph/dygraph_to_static/program_translator.py", line 723, in __getitem__
      self._caches[item] = self._build_once(item)
    File "/usr/local/lib/python3.7/site-packages/paddle/fluid/dygraph/dygraph_to_static/program_translator.py", line 714, in _build_once
      **cache_key.kwargs)
    File "<decorator-gen-99>", line 2, in from_func_spec
      
    File "/usr/local/lib/python3.7/site-packages/paddle/fluid/wrapped_decorator.py", line 25, in __impl__
      return wrapped_func(*args, **kwargs)
    File "/usr/local/lib/python3.7/site-packages/paddle/fluid/dygraph/base.py", line 51, in __impl__
      return func(*args, **kwargs)
    File "/usr/local/lib/python3.7/site-packages/paddle/fluid/dygraph/dygraph_to_static/program_translator.py", line 662, in from_func_spec
      outputs = static_func(*inputs)
    File "/paddle/debug/PaddleOCR/ppocr/modeling/architectures/base_model.py", line 79, in forward
      x = self.backbone(x)
    File "/usr/local/lib/python3.7/site-packages/paddle/fluid/dygraph/layers.py", line 917, in __call__
      return self._dygraph_call_func(*inputs, **kwargs)
    File "/usr/local/lib/python3.7/site-packages/paddle/fluid/dygraph/layers.py", line 907, in _dygraph_call_func
      outputs = self.forward(*inputs, **kwargs)
    File "/paddle/debug/PaddleOCR/ppocr/modeling/backbones/det_mobilenet_v3.py", line 146, in forward
      x = self.conv(x)
    File "/usr/local/lib/python3.7/site-packages/paddle/fluid/dygraph/layers.py", line 917, in __call__
      return self._dygraph_call_func(*inputs, **kwargs)
    File "/usr/local/lib/python3.7/site-packages/paddle/fluid/dygraph/layers.py", line 907, in _dygraph_call_func
      outputs = self.forward(*inputs, **kwargs)
    File "/paddle/debug/PaddleOCR/ppocr/modeling/backbones/det_mobilenet_v3.py", line 179, in forward
      x = self.conv(x)
    File "/usr/local/lib/python3.7/site-packages/paddle/fluid/dygraph/layers.py", line 917, in __call__
      return self._dygraph_call_func(*inputs, **kwargs)
    File "/usr/local/lib/python3.7/site-packages/paddle/fluid/dygraph/layers.py", line 907, in _dygraph_call_func
      outputs = self.forward(*inputs, **kwargs)
    File "/usr/local/lib/python3.7/site-packages/paddle/nn/layer/conv.py", line 677, in forward
      use_cudnn=self._use_cudnn)
    File "/usr/local/lib/python3.7/site-packages/paddle/nn/functional/conv.py", line 148, in _conv_nd
      type=op_type, inputs=inputs, outputs=outputs, attrs=attrs)
    File "/usr/local/lib/python3.7/site-packages/paddle/fluid/layer_helper.py", line 43, in append_op
      return self.main_program.current_block().append_op(*args, **kwargs)
    File "/usr/local/lib/python3.7/site-packages/paddle/fluid/framework.py", line 3184, in append_op
      attrs=kwargs.get("attrs", None))
    File "/usr/local/lib/python3.7/site-packages/paddle/fluid/framework.py", line 2224, in __init__
      for frame in traceback.extract_stack():

    ExternalError: CUDNN error(4), CUDNN_STATUS_INTERNAL_ERROR. 
      [Hint: 'CUDNN_STATUS_INTERNAL_ERROR'.  An internal cuDNN operation failed.  ] (at /paddle/paddle/phi/backends/gpu/gpu_resources:211)
      [operator < conv2d_fusion > error]

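Troubleshooting

This had not happened before: running OCR+SER inference alone worked fine, but running the joint OCR+SER+RE inference failed with the error above.

  1. My first reaction was to search online. The suggested fixes fell into two categories: (1) the versions and environment are incompatible and need to be reinstalled; (2) insufficient resources may be causing the error. I only came across the second one after ruling out the first.
  2. Running SER on its own raised no error, which suggests the environment configuration and versions were fine.
  3. To check whether my own code changes were the cause, I cloned a fresh copy of the PPOCR code from GitHub and ran it again. It still failed.
  4. The failure in step 3 showed that the code was not the problem. With code and environment ruled out, only resources remained. I freed up some GPU memory and, to my surprise, it worked.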
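
Since the fix turned out to be freeing GPU memory, a quick check of free memory before launching the joint inference could have saved time. Below is a minimal sketch, not part of PaddleOCR, that reads per-GPU memory through `nvidia-smi` and its `--query-gpu` option. The helper names (`parse_gpu_memory`, `query_gpu_memory`, `has_enough_free`) are hypothetical, and any threshold you pass in is an assumption: the post does not say how much memory the SER+RE pipeline actually needs.

```python
import subprocess


def parse_gpu_memory(csv_text):
    """Parse CSV output of
    `nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv,noheader,nounits`
    into a list of dicts. All memory values are in MiB."""
    gpus = []
    for line in csv_text.strip().splitlines():
        if not line.strip():
            continue
        index, used, total = (int(field.strip()) for field in line.split(","))
        gpus.append(
            {"index": index, "used": used, "total": total, "free": total - used}
        )
    return gpus


def query_gpu_memory():
    """Run nvidia-smi and return the parsed per-GPU memory usage.
    Raises if nvidia-smi is missing or exits with an error."""
    result = subprocess.run(
        [
            "nvidia-smi",
            "--query-gpu=index,memory.used,memory.total",
            "--format=csv,noheader,nounits",
        ],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_gpu_memory(result.stdout)


def has_enough_free(gpus, index, required_mib):
    """Return True if the GPU with the given index has at least required_mib free.
    required_mib is chosen by the caller; it is not a value from PaddleOCR."""
    for gpu in gpus:
        if gpu["index"] == index:
            return gpu["free"] >= required_mib
    return False
```

For example, with `CUDA_VISIBLE_DEVICES=0` as in the command above, calling `has_enough_free(query_gpu_memory(), 0, 4096)` before starting inference would flag a GPU that other processes have mostly filled. The 4096 MiB figure is only a placeholder.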

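Summary

The troubleshooting steps look simple written down, but it took me 3 hours to solve. Enough to bring tears to my eyes.
Still, solving it taught me that errors generally come from three sources: 1) logic errors and matrix operation errors, 2) versions (environment configuration), 3) compute resources.
Of course this is only a rough division; each category has finer subcategories of its own.
Going forward, I can follow this approach and record which category each problem I hit falls into, so that bugs become fewer and fewer, hehe.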
