The Intel® MPI Library supports the majority of commonly used job schedulers in the HPC field.
The following job schedulers are supported on Linux* OS:
The Hydra process manager detects job schedulers automatically by checking specific environment variables. These variables are used to determine how many nodes were allocated, which nodes, and the number of processes per task.
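The detection idea can be illustrated with a small shell sketch. It checks only the environment variables named in this document, plus SLURM_JOB_ID and SLURM_JOBID, which are an assumption here because this text does not list the Slurm variables. The function name detect_scheduler and the order of the checks are hypothetical and do not claim to reproduce Hydra's internal logic.

```shell
# Hypothetical helper: guess the active job scheduler from the environment.
# It mirrors the checks described in this document; it is not Hydra's code.
detect_scheduler() {
    if [ "${PBS_ENVIRONMENT:-}" = "PBS_BATCH" ] ||
       [ "${PBS_ENVIRONMENT:-}" = "PBS_INTERACTIVE" ]; then
        echo "pbs"
    elif [ -n "${LSB_MCPU_HOSTS:-}" ] && [ -n "${LSF_BINDIR:-}" ]; then
        echo "lsf"
    elif [ -n "${SLURM_JOB_ID:-}" ] || [ -n "${SLURM_JOBID:-}" ]; then
        # Assumption: these two variables are not named in this document.
        echo "slurm"
    elif [ -n "${PE_HOSTFILE:-}" ]; then
        echo "sge"
    elif [ -n "${ENVIRONMENT:-}" ] && [ -n "${QSUB_REQID:-}" ] &&
         [ -n "${QSUB_NODEINF:-}" ]; then
        echo "nqs"
    else
        echo "none"
    fi
}
```

Calling detect_scheduler at the top of a job script and logging the result can help confirm which variables the scheduler actually exported before mpirun starts.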
If you use a PBS-based job scheduler such as PBS Pro or Torque, and $PBS_ENVIRONMENT exists with the value PBS_BATCH or PBS_INTERACTIVE, mpirun uses $PBS_NODEFILE as its machine file. You do not need to specify the -machinefile option explicitly.
The following is an example of a batch job script:
#PBS -l nodes=4:ppn=4
#PBS -q queue_name
cd $PBS_O_WORKDIR
mpirun -n 16 ./myprog
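To check that the value passed to -n matches the allocation, a job script can inspect the node file before launching. The sketch below assumes a Torque-style node file with one hostname per allocated slot, so a nodes=4:ppn=4 request would be expected to yield 16 lines over 4 hosts. The function name pbs_nodefile_summary is hypothetical.

```shell
# Hypothetical helper: summarize a PBS node file.
# Assumption: the file lists one hostname per allocated slot.
# Prints "<total-slots> <unique-hosts>".
pbs_nodefile_summary() {
    nodefile="${1:-${PBS_NODEFILE:-}}"
    if [ ! -r "$nodefile" ]; then
        echo "node file not readable: $nodefile" >&2
        return 1
    fi
    # Count non-empty lines (slots) and distinct first fields (hosts).
    awk 'NF { n++; if (!seen[$1]++) h++ } END { print n + 0, h + 0 }' "$nodefile"
}
```

Inside a batch job, calling pbs_nodefile_summary with no argument reads $PBS_NODEFILE.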
The IBM Platform LSF* job scheduler is detected automatically if the $LSB_MCPU_HOSTS and $LSF_BINDIR environment variables are set.
The Hydra process manager uses these variables to determine how many nodes were allocated, which nodes, and the number of processes per task. To run processes on the remote nodes, the Hydra process manager uses the blaunch utility by default. This utility is provided by IBM Platform LSF.
The number of processes, the number of processes per node, and node names may be overridden by the usual Hydra options (-n, -ppn, -hosts).
Examples:
bsub -n 16 mpirun ./myprog
bsub -n 16 mpirun -n 2 -ppn 1 ./myprog
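As an illustration of what can be read from $LSB_MCPU_HOSTS, the sketch below sums the slot counts in that variable. It assumes the value is a sequence of alternating host names and slot counts, such as "hostA 8 hostB 8", so a bsub -n 16 job would be expected to total 16. The function name lsf_total_slots is hypothetical.

```shell
# Hypothetical helper: sum the slot counts in an LSB_MCPU_HOSTS-style string.
# Assumption: the value alternates "<host> <count>" pairs.
lsf_total_slots() {
    total=0
    # Intentionally unquoted so the shell splits the string into words.
    set -- ${1:-${LSB_MCPU_HOSTS:-}}
    while [ "$#" -ge 2 ]; do
        total=$((total + $2))
        shift 2
    done
    echo "$total"
}
```

Called without an argument inside an LSF job, the function reads $LSB_MCPU_HOSTS directly.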
If you use the Parallelnavi NQS job scheduler and the $ENVIRONMENT, $QSUB_REQID, and $QSUB_NODEINF environment variables are set, the $QSUB_NODEINF file is used as a machine file for mpirun. Also, /usr/bin/plesh is used as the remote shell by the process manager during startup.
The Slurm job scheduler can be detected automatically by mpirun and mpiexec. Job scheduler detection is enabled in mpirun by default and enabled in mpiexec if hostnames are not specified. For autodetection, the Hydra process manager uses these environment variables:
Using these variables, Hydra can determine how many nodes were allocated, which nodes, and the number of processes per task. If the Slurm job scheduler was not detected automatically, you can set the I_MPI_HYDRA_RMK=slurm or I_MPI_HYDRA_BOOTSTRAP=slurm variables (see the Developer Reference, “Hydra Environment Variables”).
To run processes on the remote nodes, Hydra uses the srun utility. These environment variables control which utility is used in this case (see the Developer Reference, “Hydra Environment Variables”):
You can also launch applications with the srun utility without Hydra by setting the I_MPI_PMI_LIBRARY environment variable (see the Developer Reference, “Other Environment Variables”).
PMI versions currently supported are PMI-1 and PMI-2.
By default, the Intel MPI Library uses per-host process placement provided by the scheduler. This means that the -ppn option has no effect. To change this behavior and control process placement through -ppn (and related options and variables), set I_MPI_JOB_RESPECT_PROCESS_PLACEMENT=off.
By default, the Intel MPI Library uses the process pinning provided by Slurm. If the job was launched using mpirun or mpiexec and some Slurm options for pinning were set, then process pinning may be incorrect. In this case, launch your job with srun or enable Intel MPI Library pinning by setting the I_MPI_PIN_RESPECT_CPUSET=0 environment variable (see the Developer Reference, “Process Pinning” and “Environmental Variables for Process Pinning”).
Intel MPI Library process pinning supports some of Slurm’s pinning options. The current list of supported options is: --cpus-per-task.
Examples:
# Allocate nodes.
salloc --nodes=<number-of-nodes> --partition=<partition> --ntasks-per-node=<number-of-processes-per-node>

# Run your application using Hydra.
mpiexec ./myprog
# or
mpirun ./myprog

# Run your application using srun with the PMI-1 interface.
I_MPI_PMI_LIBRARY=<path-to-libpmi.so>/libpmi.so srun ./myprog

# Run your application using srun with the PMI-2 interface.
I_MPI_PMI_LIBRARY=<path-to-libpmi2.so>/libpmi2.so srun --mpi=pmi2 ./myprog

# Change per-host process placement.
I_MPI_JOB_RESPECT_PROCESS_PLACEMENT=off mpiexec -n 2 -ppn 1 ./myprog

# Change per-host process placement and hostnames and use srun utility for remote launch.
I_MPI_JOB_RESPECT_PROCESS_PLACEMENT=off mpiexec -n 2 -ppn 1 -hosts host3,host1 -bootstrap=slurm ./myprog

# Use Intel MPI Library pinning.
I_MPI_PIN_RESPECT_CPUSET=off mpiexec ./myprog

# Use the --cpus-per-task Slurm option in Intel MPI Library pinning.
salloc --cpus-per-task=<cpus-per-task> --nodes=<number-of-nodes> --partition=<partition> --ntasks-per-node=<number-of-processes-per-node>
I_MPI_PIN_RESPECT_CPUSET=off mpiexec ./myprog
# or
I_MPI_PIN_DOMAIN=${SLURM_CPUS_PER_TASK} I_MPI_PIN_RESPECT_CPUSET=off mpiexec ./myprog
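The last two pinning variants can be folded into one helper so that a job script does not need to know whether --cpus-per-task was requested. This is a minimal sketch, assuming Slurm exports SLURM_CPUS_PER_TASK only when that option is given. The function name slurm_pin_env is hypothetical.

```shell
# Hypothetical helper: print the environment assignments that enable
# Intel MPI Library pinning inside a Slurm allocation.
# Assumption: SLURM_CPUS_PER_TASK is set only when --cpus-per-task was used.
slurm_pin_env() {
    printf 'I_MPI_PIN_RESPECT_CPUSET=off'
    if [ -n "${SLURM_CPUS_PER_TASK:-}" ]; then
        printf ' I_MPI_PIN_DOMAIN=%s' "$SLURM_CPUS_PER_TASK"
    fi
    printf '\n'
}

# Usage inside an allocation (not executed here):
#   env $(slurm_pin_env) mpiexec ./myprog
```

Printing the assignments rather than exporting them keeps the helper free of side effects, so the same script can still launch other steps with the scheduler-provided pinning.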
If you use the Univa Grid Engine job scheduler and the $PE_HOSTFILE environment variable is set, two files are generated: /tmp/sge_hostfile_${username}_$$ and /tmp/sge_machifile_${username}_$$. The latter is used as the machine file for mpirun. These files are removed when the job is completed.
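The relationship between the host file and a machine file can be sketched as follows. The example assumes that each $PE_HOSTFILE line starts with a host name followed by a slot count, and that the machine file uses the host:count form. The exact contents of the files that the Intel MPI Library generates are not shown in this document and may differ. The function name pe_hostfile_to_machinefile is hypothetical.

```shell
# Hypothetical helper: convert a Grid Engine host file into machine-file lines.
# Assumption: each input line is "<host> <slots> ..." and the desired
# output is "<host>:<slots>".
pe_hostfile_to_machinefile() {
    hostfile="${1:-${PE_HOSTFILE:-}}"
    if [ ! -r "$hostfile" ]; then
        echo "host file not readable: $hostfile" >&2
        return 1
    fi
    awk 'NF >= 2 { print $1 ":" $2 }' "$hostfile"
}
```

This can be useful for inspecting an allocation by hand, or for passing an explicit -machinefile when automatic handling is not wanted.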
If resources allocated to a job exceed the limit, most job schedulers terminate the job by sending a signal to all processes.
For example, Torque* sends SIGTERM to a job three times; if the job is still alive, Torque sends SIGKILL to terminate it.
For Univa Grid Engine, the default signal to terminate a job is SIGKILL. The Intel MPI Library is unable to process or catch that signal, causing mpirun to kill the entire job. You can change the value of the termination signal through the following queue configuration:
1. Use the following command to see available queues:
$ qconf -sql
2. Execute the following command to modify the queue settings:
$ qconf -mq <queue_name>
3. Find terminate_method and change signal to SIGTERM.
4. Save the queue configuration.
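After saving, the setting can be verified from the command line. The sketch below assumes that qconf -sq <queue_name> prints the queue configuration as one "name value" pair per line, including a terminate_method entry. The function name queue_terminate_method is hypothetical.

```shell
# Hypothetical helper: read a queue configuration on standard input and
# print the value of terminate_method. Returns non-zero if it is absent.
# Assumption: the input has one "<name> <value>" pair per line.
queue_terminate_method() {
    awk '$1 == "terminate_method" { print $2; found = 1 }
         END { exit !found }'
}

# Usage (not executed here):
#   qconf -sq <queue_name> | queue_terminate_method
```

If the printed value is not SIGTERM, the change was not saved for that queue.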
The following job schedulers are supported on Windows* OS:
The Intel MPI Library job startup command mpiexec can be called from the Microsoft HPC Job Scheduler to execute an MPI application. In this case, the mpiexec command automatically inherits the host list, process count, and the working directory allocated to the job.
Use the following command to submit an MPI job:
> job submit /numprocessors:4 /stdout:test.out mpiexec -delegate test.exe
Make sure that mpiexec and the dynamic libraries are available through PATH.
The Intel MPI Library job startup command mpiexec can be called from the PBS Pro job scheduler to execute an MPI application. In this case, the mpiexec command automatically inherits the host list and the process count allocated to the job, unless you specify them manually. mpiexec reads the %PBS_NODEFILE% environment variable to count the number of processes and uses the file it points to as a machine file.
Example of a job script contents:
REM PBS -l nodes=4:ppn=2
REM PBS -l walltime=1:00:00
cd %PBS_O_WORKDIR%
mpiexec test.exe
Use the following command to submit the job:
> qsub -C "REM PBS" job
mpiexec will run two processes on each of four nodes for this job.
When using a job scheduler, by default Intel MPI Library uses per-host process placement provided by the scheduler. This means that the -ppn option has no effect. To change this behavior and control process placement through -ppn (and related options and variables), use the I_MPI_JOB_RESPECT_PROCESS_PLACEMENT environment variable:
$ export I_MPI_JOB_RESPECT_PROCESS_PLACEMENT=off
> set I_MPI_JOB_RESPECT_PROCESS_PLACEMENT=off