
~/.local/lib/python3.6/site-packages/torch/multiprocessing/spawn.py in spawn(fn, args, nprocs, join, daemon, start_method)
    198                ' torch.multiprocessing.start_process(...)' % start_method)
    199         warnings.warn(msg)
--> 200     return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')

~/.local/lib/python3.6/site-packages/torch/multiprocessing/spawn.py in start_processes(fn, args, nprocs, join, daemon, start_method)
    157     # Loop on join until it returns True or raises an exception.
--> 158     while not context.join():
    159         pass

~/.local/lib/python3.6/site-packages/torch/multiprocessing/spawn.py in join(self, timeout)
    111                 raise Exception(
    112                     "process %d terminated with exit code %d" %
--> 113                     (error_index, exitcode)
    114                 )

Exception: process 0 terminated with exit code 1



import numpy as np
import torch
from torch.multiprocessing import Pool, set_start_method, spawn

X = np.array([[1, 3, 2, 3], [2, 3, 5, 6], [1, 2, 3, 4]])
X = torch.DoubleTensor(X)

def X_power_func(j):
#     print(j)
    X_power = X**j
#     print(X_power)
    return X_power

if __name__ == '__main__':
    spawn(X_power_func, nprocs=3)

The code above is adapted from: "`Exception: process 0 terminated with exit code 1` error when using `torch.multiprocessing.spawn` to parallelize over multiple GPUs". Running it in a Jupyter notebook gives:

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/opt/conda/lib/python3.7/multiprocessing/spawn.py", line 105, in spawn_main
    exitcode = _main(fd)
  File "/opt/conda/lib/python3.7/multiprocessing/spawn.py", line 115, in _main
    self = reduction.pickle.load(from_parent)
AttributeError: Can't get attribute 'X_power_func' on <module '__main__' (built-in)>
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/opt/conda/lib/python3.7/multiprocessing/spawn.py", line 105, in spawn_main
    exitcode = _main(fd)
  File "/opt/conda/lib/python3.7/multiprocessing/spawn.py", line 115, in _main
    self = reduction.pickle.load(from_parent)
AttributeError: Can't get attribute 'X_power_func' on <module '__main__' (built-in)>
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/opt/conda/lib/python3.7/multiprocessing/spawn.py", line 105, in spawn_main
    exitcode = _main(fd)
  File "/opt/conda/lib/python3.7/multiprocessing/spawn.py", line 115, in _main
    self = reduction.pickle.load(from_parent)
AttributeError: Can't get attribute 'X_power_func' on <module '__main__' (built-in)>
Exception                                 Traceback (most recent call last)
/tmp/ipykernel_173/3160031924.py in <module>
     16 if __name__ == '__main__':
---> 17     spawn(X_power_func, nprocs=3)

~/.local/lib/python3.7/site-packages/torch/multiprocessing/spawn.py in spawn(fn, args, nprocs, join, daemon, start_method)
    198                ' torch.multiprocessing.start_process(...)' % start_method)
    199         warnings.warn(msg)
--> 200     return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')

~/.local/lib/python3.7/site-packages/torch/multiprocessing/spawn.py in start_processes(fn, args, nprocs, join, daemon, start_method)
    157     # Loop on join until it returns True or raises an exception.
--> 158     while not context.join():
    159         pass

~/.local/lib/python3.7/site-packages/torch/multiprocessing/spawn.py in join(self, timeout)
    111                 raise Exception(
    112                     "process %d terminated with exit code %d" %
--> 113                     (error_index, exitcode)
    114                 )

Exception: process 1 terminated with exit code 1

Solution: create a separate file for the function (define it in its own .py module and import it in the notebook). See:

  1. https://medium/@grvsinghal/speed-up-your-python-code-using-multiprocessing-on-windows-and-jupyter-or-ipython-2714b49d6fac
  2. Jupyter Notebook PyTorch Multiprocessing
  3. https://discuss.pytorch/t/distributeddataparallel-on-terminal-vs-jupyter-notebook/101404
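
The separate-file workaround can be sketched roughly as follows. Processes started with the spawn method look the worker function up by module name, which fails for a function defined in a notebook cell but works for one that lives in an importable file. The module name `x_power_worker`, the temporary directory, and the use of plain lists instead of tensors are illustrative assumptions. The sketch uses the standard library's `multiprocessing` with the spawn context instead of PyTorch so that it runs without PyTorch installed; with PyTorch, the imported function would presumably be passed to `torch.multiprocessing.spawn(x_power_func, nprocs=3)` in the same way.

```python
import importlib
import multiprocessing as mp
import sys
import tempfile
import textwrap
from pathlib import Path

# Hypothetical module name; in a notebook this file could also be written
# by hand and saved next to the notebook.
MODULE_NAME = "x_power_worker"

WORKER_SOURCE = textwrap.dedent(
    '''
    X = [[1, 3, 2, 3], [2, 3, 5, 6], [1, 2, 3, 4]]

    def x_power_func(j):
        """Raise every entry of X to the power j (j is the process index)."""
        return [[value ** j for value in row] for row in X]
    '''
)

# Write the worker into its own importable file and make it discoverable.
module_dir = Path(tempfile.mkdtemp())
(module_dir / f"{MODULE_NAME}.py").write_text(WORKER_SOURCE)
sys.path.insert(0, str(module_dir))
importlib.invalidate_caches()

# The function now belongs to x_power_worker, not to __main__.
x_power_func = importlib.import_module(MODULE_NAME).x_power_func


def run_workers(nprocs=3):
    """Start nprocs spawned processes and return their exit codes.

    Each child calls x_power_func(i); the return value is discarded
    in the child process.
    """
    ctx = mp.get_context("spawn")
    procs = [ctx.Process(target=x_power_func, args=(i,)) for i in range(nprocs)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    return [p.exitcode for p in procs]


if __name__ == "__main__":
    print(run_workers())
```

Because the children import the function from `x_power_worker` rather than from `__main__`, the lookup that raised `AttributeError: Can't get attribute 'X_power_func'` in the notebook has something to resolve against.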


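However, I decided to skip Jupyter and run the script directly in a terminal, which sidesteps the error caused by not putting the function in a separate file. That produced the following output: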


        An attempt has been made to start a new process before the
        current process has finished its bootstrapping phase.

        This probably means that you are not using fork to start your
        child processes and you have forgotten to use the proper idiom
        in the main module:

            if __name__ == '__main__':

        The "freeze_support()" line can be omitted if the program
        is not going to be frozen to produce an executable.

RuntimeError:
        An attempt has been made to start a new process before the
        current process has finished its bootstrapping phase.

        This probably means that you are not using fork to start your
        child processes and you have forgotten to use the proper idiom
        in the main module:

            if __name__ == '__main__':

        The "freeze_support()" line can be omitted if the program
        is not going to be frozen to produce an executable.

Traceback (most recent call last):
  File "test_main_ddp.py", line 543, in <module>
    mp.spawn(main_worker, nprocs=args.ngpus_per_node, args=(args.ngpus_per_node, args))
  File "/root/.local/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 200, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/root/.local/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 158, in start_processes
    while not context.join():
  File "/root/.local/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 113, in join
    (error_index, exitcode)
Exception: process 0 terminated with exit code 1

References: "RuntimeError: An attempt has been made to start a new process before the"

"Pytorch-lightning: Exception: process 0 terminated with exit code 1 when using DDP"
Solution: following the traceback, change the call on line 543 from

mp.spawn(main_worker, nprocs=args.ngpus_per_node, args=(args.ngpus_per_node, args))

to

    if __name__=="__main__":
        mp.spawn(main_worker, nprocs=args.ngpus_per_node, args=(args.ngpus_per_node, args))

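Cause: with the spawn start method, each child process re-imports the main script, so code that starts processes (here, the mp.spawn call) has to sit under the if __name__ == "__main__": guard. Without the guard, every child tries to start new processes while it is still bootstrapping. With the guard in place, Exception: process 0 terminated with exit code 1 no longer appears.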
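
A minimal script layout with the guard might look like the sketch below. The names `main_worker` and `ngpus_per_node` mirror the traceback, but the worker body and the `launch` helper are hypothetical stand-ins, and the standard library's `multiprocessing` spawn context is used in place of `mp.spawn` so the sketch runs without PyTorch or GPUs.

```python
import multiprocessing as mp


def main_worker(rank, ngpus_per_node):
    """Hypothetical stand-in for the real per-GPU training worker."""
    if not 0 <= rank < ngpus_per_node:
        raise ValueError(f"rank {rank} is out of range")
    return rank


def launch(ngpus_per_node):
    """Start one spawned process per (assumed) GPU and return exit codes."""
    ctx = mp.get_context("spawn")
    procs = [
        ctx.Process(target=main_worker, args=(rank, ngpus_per_node))
        for rank in range(ngpus_per_node)
    ]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    return [p.exitcode for p in procs]


if __name__ == "__main__":
    # Only the parent process reaches this branch. Spawned children import
    # this file under a different module name, so they skip it and do not
    # try to start processes of their own.
    print(launch(ngpus_per_node=2))
```

This layout is meant to be saved as a .py file and run from a terminal; pasted into a notebook cell, the children would again be unable to find `main_worker` in `__main__`.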


Jupyter torch.multiprocessing.spawn() error: Exception: process 0 terminated with exit