Ubuntu Slurm: "An error occurred in MPI_Init on a NULL communicator"

velaa5lx · published 2023-06-05 · in: Other
Follow (0) | Answers (1) | Views (383)

**Overview:** I am trying to install Slurm on a single-node compute workstation so that users can submit jobs in an organized way. This is mainly for running a self-written parallel program (here: binary), which runs successfully when executed with mpirun -np n binary.
Versions:

  • OS: Ubuntu 20.04.3 (up to date)
  • Slurm: 19.05.5-1 (via apt)
  • OpenMPI: 4.0.3 (via apt)

**Slurm installation/configuration:** I installed and configured Slurm following the instructions here, and have attached my slurm.conf file below. Afterwards, Slurm appears to work; for example, sinfo shows:
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
debug*       up   infinite      1   idle nodename

and srun -n 2 echo "Look at me" produces:

Look at me
Look at me

and, while running srun -n8 sleep 10, squeue shows:

JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
   44     debug    sleep username  R       0:04      1 nodename

**Error:** I try to run the program binary with sbatch runslurm.sh, where runslurm.sh is the following shell script:

#!/bin/bash
#SBATCH -J testjob
#SBATCH -e error.%A
#SBATCH -o output.%A
#SBATCH -N 1
#SBATCH -n 2

srun binary

The job finishes immediately and writes only an error file, which contains:

*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[nodename:470173] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[nodename:470174] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
srun: error: nodename: tasks 0-1: Exited with exit code 1
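For completeness: since the program runs fine under mpirun directly and this OpenMPI was configured --with-slurm, one possible workaround (a sketch, untested on this setup) is to launch through mpirun inside the batch script instead of srun:

```shell
#!/bin/bash
#SBATCH -J testjob
#SBATCH -e error.%A
#SBATCH -o output.%A
#SBATCH -N 1
#SBATCH -n 2

# An OpenMPI built --with-slurm detects the Slurm allocation from the
# environment, so no -np flag is needed; mpirun starts both ranks itself.
mpirun binary
```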

**Other attempts:** I have already tried all available MPI types in the srun command of the shell script (i.e. srun --mpi=XXX binary), where XXX is one of the types reported by srun --mpi=list:

srun: MPI types are...
srun: pmix_v3
srun: pmix
srun: openmpi
srun: none
srun: pmi2

With --mpi=openmpi, --mpi=none, and --mpi=pmi2, the error is identical to the one above.
With --mpi=pmix_v3 and --mpi=pmix, the error is similar in both cases:

srun: error: (null) [0] /mpi_pmix.c:133 [init] mpi/pmix: ERROR: pmi/pmix: can not load PMIx library
srun: error: Couldn't load specified plugin name for mpi/pmix_v3: Plugin init() callback failed
srun: error: cannot create mpi context for mpi/pmix_v3
srun: error: invalid MPI type 'pmix_v3', --mpi=list for acceptable types
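The "can not load PMIx library" message suggests that Slurm's mpi/pmix plugin cannot dlopen a compatible libpmix at runtime. As a diagnostic sketch (the plugin path below is an assumption for Ubuntu's slurm-wlm packaging), one can check which PMIx library the plugin actually resolves:

```shell
# Locate Slurm's pmix plugin; the install path varies by distro and package.
PLUGIN=$(find /usr/lib -name 'mpi_pmix*.so' 2>/dev/null | head -n 1)

if [ -n "$PLUGIN" ]; then
    # Inspect its shared-library dependencies: a "not found" entry next to
    # libpmix indicates the missing or incompatible PMIx library.
    ldd "$PLUGIN" | grep -i pmix
else
    echo "mpi_pmix plugin not found under /usr/lib"
fi
```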

**Additional information:** Note that OpenMPI appears to have been built against Slurm (I can paste the full ompi_info output if needed):

ompi_info | grep slurm
  Configure command line: ...'--with-slurm'...
                 MCA ess: slurm (MCA v2.1.0, API v3.0.0, Component v4.0.3)
                 MCA plm: slurm (MCA v2.1.0, API v2.0.0, Component v4.0.3)
                 MCA ras: slurm (MCA v2.1.0, API v2.0.0, Component v4.0.3)
              MCA schizo: slurm (MCA v2.1.0, API v1.0.0, Component v4.0.3)

and against pmix:

ompi_info | grep pmix
  Configure command line: ...'--with-pmix=/usr/lib/x86_64-linux-gnu/pmix'...
                MCA pmix: ext3x (MCA v2.1.0, API v2.0.0, Component v4.0.3)
                MCA pmix: flux (MCA v2.1.0, API v2.0.0, Component v4.0.3)
                MCA pmix: isolated (MCA v2.1.0, API v2.0.0, Component v4.0.3)

Here is the slurm.conf:

# slurm.conf file generated by configurator.html.
# Put this file on all nodes of your cluster.
# See the slurm.conf man page for more information.
#
SlurmctldHost=nodename
#SlurmctldHost=
#
#DisableRootJobs=NO
#EnforcePartLimits=NO
#Epilog=
#EpilogSlurmctld=
#FirstJobId=1
#MaxJobId=999999
#GresTypes=
#GroupUpdateForce=0
#GroupUpdateTime=600
#JobFileAppend=0
#JobRequeue=1
#JobSubmitPlugins=1
#KillOnBadExit=0
#LaunchType=launch/slurm
#Licenses=foo*4,bar
#MailProg=/bin/mail
#MaxJobCount=5000
#MaxStepCount=40000
#MaxTasksPerNode=128
MpiDefault=none
#MpiParams=ports=#-#
#PluginDir=
#PlugStackConfig=
#PrivateData=jobs
ProctrackType=proctrack/cgroup
#Prolog=
#PrologFlags=
#PrologSlurmctld=
#PropagatePrioProcess=0
#PropagateResourceLimits=
#PropagateResourceLimitsExcept=
#RebootProgram=
ReturnToService=1
#SallocDefaultCommand=
SlurmctldPidFile=/run/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/run/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=root
#SlurmdUser=root
#SrunEpilog=
#SrunProlog=
StateSaveLocation=/var/spool
SwitchType=switch/none
#TaskEpilog=
TaskPlugin=task/affinity
#TaskProlog=
#TopologyPlugin=topology/tree
#TmpFS=/tmp
#TrackWCKey=no
#TreeWidth=
#UnkillableStepProgram=
#UsePAM=0
#
#
# TIMERS
#BatchStartTimeout=10
#CompleteWait=0
#EpilogMsgTime=2000
#GetEnvTimeout=2
#HealthCheckInterval=0
#HealthCheckProgram=
InactiveLimit=0
KillWait=30
#MessageTimeout=10
#ResvOverRun=0
MinJobAge=300
#OverTimeLimit=0
SlurmctldTimeout=120
SlurmdTimeout=300
#UnkillableStepTimeout=60
#VSizeFactor=0
Waittime=0
#
#
# SCHEDULING
#DefMemPerCPU=0
#MaxMemPerCPU=0
#SchedulerTimeSlice=30
SchedulerType=sched/backfill
SelectType=select/cons_tres
SelectTypeParameters=CR_Core
#
#
# JOB PRIORITY
#PriorityFlags=
#PriorityType=priority/basic
#PriorityDecayHalfLife=
#PriorityCalcPeriod=
#PriorityFavorSmall=
#PriorityMaxAge=
#PriorityUsageResetPeriod=
#PriorityWeightAge=
#PriorityWeightFairshare=
#PriorityWeightJobSize=
#PriorityWeightPartition=
#PriorityWeightQOS=
#
#
# LOGGING AND ACCOUNTING
#AccountingStorageEnforce=0
#AccountingStorageHost=
#AccountingStorageLoc=
#AccountingStoragePass=
#AccountingStoragePort=
AccountingStorageType=accounting_storage/none
#AccountingStorageUser=
AccountingStoreJobComment=YES
ClusterName=cluster
#DebugFlags=
#JobCompHost=
#JobCompLoc=
#JobCompPass=
#JobCompPort=
JobCompType=jobcomp/none
#JobCompUser=
#JobContainerType=job_container/none
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/none
SlurmctldDebug=info
#SlurmctldLogFile=
SlurmdDebug=info
#SlurmdLogFile=
#SlurmSchedLogFile=
#SlurmSchedLogLevel=
#
#
# POWER SAVE SUPPORT FOR IDLE NODES (optional)
#SuspendProgram=
#ResumeProgram=
#SuspendTimeout=
#ResumeTimeout=
#ResumeRate=
#SuspendExcNodes=
#SuspendExcParts=
#SuspendRate=
#SuspendTime=
#
#
# COMPUTE NODES
NodeName=nodename Sockets=1 CoresPerSocket=8 ThreadsPerCore=2 State=UNKNOWN
PartitionName=debug Nodes=nodename Default=YES MaxTime=INFINITE State=UP
vsaztqbk

1#

I ran into this problem on a set of Docker containers that I was setting up to simulate a cluster running Ubuntu 20.04. I fixed it by installing the packages libpmi1-pmix libpmi2-pmix libpmix-dev libpmix2 libopenmpi-dev libopenmpi3 libpmi-pmix-dev and setting MpiDefault=pmix in slurm.conf.
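A sketch of that fix as concrete commands (package names as listed above; the slurm.conf path /etc/slurm-llnl/slurm.conf and the systemd restart step are assumptions for Ubuntu 20.04's slurm-wlm packaging):

```shell
# Install the PMIx-backed PMI libraries (names as given in the answer;
# verify availability with: apt-cache search pmix).
sudo apt install libpmi1-pmix libpmi2-pmix libpmix-dev libpmix2 \
                 libopenmpi-dev libopenmpi3 libpmi-pmix-dev

# Make pmix the default MPI plugin (config path assumed for Ubuntu 20.04).
sudo sed -i 's/^MpiDefault=.*/MpiDefault=pmix/' /etc/slurm-llnl/slurm.conf

# Restart the Slurm daemons so the new default takes effect.
sudo systemctl restart slurmctld slurmd
```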
