**Overview:** I am trying to install Slurm on a single-node compute workstation to allow organized job submission among its users. The main use case is running self-written parallel programs (here: `binary`), which run successfully when executed with `mpirun -np n binary`.
Versions:
- OS: Ubuntu 20.04.3 (up to date)
- Slurm: 19.05.5-1 (via apt)
- OpenMPI: 4.0.3 (via apt)
**Slurm installation/configuration:** I installed and configured Slurm following the instructions here, and my `slurm.conf` file is attached below. Afterwards Slurm appears to work; for example, `sinfo` shows:
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
debug* up infinite 1 idle nodename
and `srun -n 2 echo "Look at me"` produces:
Look at me
Look at me
and, while `srun -n8 sleep 10` is running, `squeue` shows:
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
44 debug sleep username R 0:04 1 nodename
**Error:** I try to run the program `binary` with a shell script `runslurm.sh` via `sbatch runslurm.sh`. Here is the shell script:
#!/bin/bash
#SBATCH -J testjob
#SBATCH -e error.%A
#SBATCH -o output.%A
#SBATCH -N 1
#SBATCH -n 2
srun binary
The job finishes immediately and only writes an error file containing:
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and potentially your MPI job)
[nodename:470173] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and potentially your MPI job)
[nodename:470174] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
srun: error: nodename: tasks 0-1: Exited with exit code 1
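A side note on this failure pattern: it typically means the tasks were launched without any PMI wire-up, so every rank aborts in `MPI_Init` as if it were a lone process. One possible workaround (a sketch only, not something tried in this post) is to let `mpirun` launch the ranks inside the allocation instead of `srun`, since the `ompi_info` output further below shows OpenMPI was built with Slurm support:

#!/bin/bash
#SBATCH -J testjob
#SBATCH -e error.%A
#SBATCH -o output.%A
#SBATCH -N 1
#SBATCH -n 2
# With OpenMPI's Slurm integration, mpirun reads the allocation size itself
# and starts one rank per allocated task (here: 2), no -np needed.
mpirun binary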
**Other attempts:** I have already tried all available MPI types in the `srun` command of the shell script (i.e. `srun --mpi=XXX binary`, with `XXX` taken from `srun --mpi=list`):
srun: MPI types are...
srun: pmix_v3
srun: pmix
srun: openmpi
srun: none
srun: pmi2
With `--mpi=openmpi`, `--mpi=none`, or `--mpi=pmi2`, the error is the same as above.
With `--mpi=pmix_v3` or `--mpi=pmix`, the error is similar in both cases:
srun: error: (null) [0] /mpi_pmix.c:133 [init] mpi/pmix: ERROR: pmi/pmix: can not load PMIx library
srun: error: Couldn't load specified plugin name for mpi/pmix_v3: Plugin init() callback failed
srun: error: cannot create mpi context for mpi/pmix_v3
srun: error: invalid MPI type 'pmix_v3', --mpi=list for acceptable types
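The "can not load PMIx library" message points at the PMIx runtime itself rather than at Slurm's configuration. A few checks that could narrow this down (an addition for illustration, not output from the original post; the plugin location is a guess based on the stock Ubuntu packages):

# Which PMIx/OpenMPI packages are installed at all?
dpkg -l | grep -E 'pmix|openmpi'
# Which libpmix shared libraries can the dynamic linker find?
ldconfig -p | grep pmix
# Where is Slurm's pmix plugin, and which libpmix does it link against?
find /usr/lib -name 'mpi_pmix*.so' -exec ldd {} \;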
**Other information:** Note that OpenMPI appears to have been built against Slurm (I can paste the full `ompi_info` output if needed):
ompi_info | grep slurm
Configure command line: ...'--with-slurm'...
MCA ess: slurm (MCA v2.1.0, API v3.0.0, Component v4.0.3)
MCA plm: slurm (MCA v2.1.0, API v2.0.0, Component v4.0.3)
MCA ras: slurm (MCA v2.1.0, API v2.0.0, Component v4.0.3)
MCA schizo: slurm (MCA v2.1.0, API v1.0.0, Component v4.0.3)
and against pmix:
ompi_info | grep pmix
Configure command line: ...'--with-pmix=/usr/lib/x86_64-linux-gnu/pmix'...
MCA pmix: ext3x (MCA v2.1.0, API v2.0.0, Component v4.0.3)
MCA pmix: flux (MCA v2.1.0, API v2.0.0, Component v4.0.3)
MCA pmix: isolated (MCA v2.1.0, API v2.0.0, Component v4.0.3)
Here is the `slurm.conf`:
# slurm.conf file generated by configurator.html.
# Put this file on all nodes of your cluster.
# See the slurm.conf man page for more information.
#
SlurmctldHost=nodename
#SlurmctldHost=
#
#DisableRootJobs=NO
#EnforcePartLimits=NO
#Epilog=
#EpilogSlurmctld=
#FirstJobId=1
#MaxJobId=999999
#GresTypes=
#GroupUpdateForce=0
#GroupUpdateTime=600
#JobFileAppend=0
#JobRequeue=1
#JobSubmitPlugins=1
#KillOnBadExit=0
#LaunchType=launch/slurm
#Licenses=foo*4,bar
#MailProg=/bin/mail
#MaxJobCount=5000
#MaxStepCount=40000
#MaxTasksPerNode=128
MpiDefault=none
#MpiParams=ports=#-#
#PluginDir=
#PlugStackConfig=
#PrivateData=jobs
ProctrackType=proctrack/cgroup
#Prolog=
#PrologFlags=
#PrologSlurmctld=
#PropagatePrioProcess=0
#PropagateResourceLimits=
#PropagateResourceLimitsExcept=
#RebootProgram=
ReturnToService=1
#SallocDefaultCommand=
SlurmctldPidFile=/run/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/run/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=root
#SlurmdUser=root
#SrunEpilog=
#SrunProlog=
StateSaveLocation=/var/spool
SwitchType=switch/none
#TaskEpilog=
TaskPlugin=task/affinity
#TaskProlog=
#TopologyPlugin=topology/tree
#TmpFS=/tmp
#TrackWCKey=no
#TreeWidth=
#UnkillableStepProgram=
#UsePAM=0
#
#
# TIMERS
#BatchStartTimeout=10
#CompleteWait=0
#EpilogMsgTime=2000
#GetEnvTimeout=2
#HealthCheckInterval=0
#HealthCheckProgram=
InactiveLimit=0
KillWait=30
#MessageTimeout=10
#ResvOverRun=0
MinJobAge=300
#OverTimeLimit=0
SlurmctldTimeout=120
SlurmdTimeout=300
#UnkillableStepTimeout=60
#VSizeFactor=0
Waittime=0
#
#
# SCHEDULING
#DefMemPerCPU=0
#MaxMemPerCPU=0
#SchedulerTimeSlice=30
SchedulerType=sched/backfill
SelectType=select/cons_tres
SelectTypeParameters=CR_Core
#
#
# JOB PRIORITY
#PriorityFlags=
#PriorityType=priority/basic
#PriorityDecayHalfLife=
#PriorityCalcPeriod=
#PriorityFavorSmall=
#PriorityMaxAge=
#PriorityUsageResetPeriod=
#PriorityWeightAge=
#PriorityWeightFairshare=
#PriorityWeightJobSize=
#PriorityWeightPartition=
#PriorityWeightQOS=
#
#
# LOGGING AND ACCOUNTING
#AccountingStorageEnforce=0
#AccountingStorageHost=
#AccountingStorageLoc=
#AccountingStoragePass=
#AccountingStoragePort=
AccountingStorageType=accounting_storage/none
#AccountingStorageUser=
AccountingStoreJobComment=YES
ClusterName=cluster
#DebugFlags=
#JobCompHost=
#JobCompLoc=
#JobCompPass=
#JobCompPort=
JobCompType=jobcomp/none
#JobCompUser=
#JobContainerType=job_container/none
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/none
SlurmctldDebug=info
#SlurmctldLogFile=
SlurmdDebug=info
#SlurmdLogFile=
#SlurmSchedLogFile=
#SlurmSchedLogLevel=
#
#
# POWER SAVE SUPPORT FOR IDLE NODES (optional)
#SuspendProgram=
#ResumeProgram=
#SuspendTimeout=
#ResumeTimeout=
#ResumeRate=
#SuspendExcNodes=
#SuspendExcParts=
#SuspendRate=
#SuspendTime=
#
#
# COMPUTE NODES
NodeName=nodename Sockets=1 CoresPerSocket=8 ThreadsPerCore=2 State=UNKNOWN
PartitionName=debug Nodes=nodename Default=YES MaxTime=INFINITE State=UP
**1 Answer**
I ran into this problem on a set of Docker containers running Ubuntu 20.04 that I was setting up as a mock cluster. I fixed it by installing the packages `libpmi1-pmix libpmi2-pmix libpmix-dev libpmix2 libopenmpi-dev libopenmpi3 libpmi-pmix-dev` and setting `MpiDefault=pmix` in slurm.conf.
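For reference, a minimal sketch of that fix on Ubuntu 20.04 (the config path /etc/slurm-llnl/slurm.conf and the systemd unit names are assumptions based on the stock slurm-wlm packages; adjust them to your installation):

# Install the PMIx libraries plus the PMI shims Slurm's plugins look for
sudo apt install libpmi1-pmix libpmi2-pmix libpmix-dev libpmix2 \
                 libopenmpi-dev libopenmpi3 libpmi-pmix-dev

# In /etc/slurm-llnl/slurm.conf, change the MPI default:
#   MpiDefault=pmix

# Restart the daemons so slurmctld and slurmd pick up the change
sudo systemctl restart slurmctld slurmd

After that, `srun binary` (or explicitly `srun --mpi=pmix binary`) in the batch script should go through the PMIx wire-up instead of aborting in MPI_Init.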