OpenMPI
mpi4py OpenMPI
This package provides Python bindings for the Message Passing Interface (MPI) standard. It is implemented on top of the MPI-1/2/3 specification and exposes an API based on the standard MPI-2 C++ bindings.
relax manual on Multi processor usage
If you have OpenMPI and mpi4py installed, then you have access to Gary Thompson's multi-processor framework for MPI parallelisation.
Gary has achieved near perfect scaling efficiency:
https://mail.gna.org/public/relax-devel/2007-05/msg00000.html
Dependencies
- Python 2.4 to 2.7 or 3.0 to 3.4, or a recent PyPy release.
- A functional MPI 1.x/2.x/3.x implementation like MPICH or Open MPI built with shared/dynamic libraries.
Install OpenMPI on Linux and set up the environment
See https://www10.informatik.uni-erlangen.de/Cluster/
# Install openmpi-devel, to get 'mpicc'
sudo yum install openmpi-devel
# Check for mpicc
which mpicc
# If not found, set up the environment by loading the module
# See the available modules
module avail
# Show what loading does
module show openmpi-x86_64
# See if anything is loaded
module list
# Load
module load openmpi-x86_64
module list
# Check for mpicc, mpirun or mpiexec
which mpicc
which mpirun
which mpiexec
# Unload
module unload openmpi-x86_64
In .cshrc file, one could put
# Open MPI: Open Source High Performance Computing
foreach x (tomat bax minima elvis)
if ( $HOST == $x) then
module load openmpi-x86_64
endif
end
If mpicc is still not found, try this fix (ref: http://forums.fedoraforum.org/showthread.php?t=194688):
# For a 32-bit computer.
sudo ln -s /usr/lib/openmpi/bin/mpicc /usr/bin/mpicc
# For a 64-bit computer.
sudo ln -s /usr/lib64/openmpi/bin/mpicc /usr/bin/mpicc
Install mpi4py
Linux and Mac
Remember to check whether a newer version of mpi4py is available.
The mpi4py library can be installed on all UNIX systems by typing:
# Set the version, using the line that matches your shell.
# In bash:
v=1.3.1
# In tcsh:
set v=1.3.1
pip install https://bitbucket.org/mpi4py/mpi4py/downloads/mpi4py-$v.tar.gz
pip install https://bitbucket.org/mpi4py/mpi4py/downloads/mpi4py-$v.tar.gz --upgrade
cd ..
relax in multi-processor mode
tcsh
set RELAX=`which relax`
# Normal
mpirun -np N+1 $RELAX --multi='mpi4py'
# In gui
mpirun -np N+1 $RELAX --multi='mpi4py' -g
where N is the number of slaves you have. See the mpirun documentation for details - this is not part of relax.
This code runs in the GUI, the script UI and the prompt UI, i.e. everywhere.
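The N+1 rule above (N slaves plus one master process) can be sketched in Python as a small helper that only builds the command line. This is a minimal illustration, not part of relax: the function name mpirun_command and the default of one slave per logical CPU are assumptions made here.

```python
import os
import shlex


def mpirun_command(relax_path, n_slaves=None, gui=False):
    """Build the mpirun argument list for N slaves plus one master."""
    if n_slaves is None:
        # Assumption: one slave per logical CPU reported by the OS.
        n_slaves = os.cpu_count() or 1
    # The master process is the "+ 1" in "mpirun -np N+1".
    cmd = ["mpirun", "-np", str(n_slaves + 1), relax_path, "--multi=mpi4py"]
    if gui:
        cmd.append("-g")
    return cmd


if __name__ == "__main__":
    # Print the command instead of running it, so it can be inspected first.
    print(shlex.join(mpirun_command("relax", n_slaves=4)))
```

The list form can be handed to subprocess.run() unchanged once the printed command looks right.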
Helper start scripts
If you have several versions or development branches of relax installed, you could probably use some of these scripts, and put them in your PATH.
Script for force running relax on server computer
This script exemplifies a setup where the above installation requirements are met on one server computer, haddock, and where satellite computers are forced to run relax on that computer.
The file relax_trunk is made executable (chmod +x relax_trunk), and put in a PATH, known by all satellite computers.
#!/bin/tcsh -f
# Set the relax version used for this script.
set RELAX=/network_drive/software_user/software/NMR-relax/relax_trunk/relax
# Check machine, since only machine haddock has the correct packages installed.
if ( $HOST != "haddock") then
echo "You have to run on haddock. I do it for you"
ssh haddock -Y -t "cd $PWD; $RELAX $argv; /bin/tcsh"
else
$RELAX $argv
endif
Script for running relax with maximum number of processors available
This script exemplifies a setup for running relax with the maximum number of processors available.
The file relax_test is made executable, and put in a PATH, known by all satellite computers.
#!/bin/tcsh -fe
# Set the relax version used for this script.
set RELAX=/sbinlab2/tlinnet/software/NMR-relax/relax_trunk/relax
# Set number of available CPUs.
set NPROC=`nproc`
set NP=`echo $NPROC + 1 | bc `
echo "Running relax with NP=$NP in multi-processor mode"
# Run relax in multi processor mode.
mpirun -np $NP $RELAX --multi='mpi4py' $argv
Script for force running relax on server computer with openmpi
#!/bin/tcsh
# Set the relax version used for this script.
set RELAX=/sbinlab2/software/NMR-relax/relax_trunk/relax
# Set number of available CPUs.
#set NPROC=`nproc`
set NPROC=10
set NP=`echo $NPROC + 1 | bc `
# Run relax in multi processor mode.
set RELAXRUN="mpirun -np $NP $RELAX --multi='mpi4py' $argv"
# Check machine, since only machine haddock has openmpi-devel installed
if ( $HOST != "haddock") then
echo "You have to run on haddock. I do it for you"
ssh haddock -Y -t "cd $PWD; $RELAXRUN; /bin/tcsh"
else
mpirun -np $NP $RELAX --multi='mpi4py' $argv
endif
Commands and FAQ about mpirun
See Oracle's page on mpirun and the Open MPI manual:
- https://docs.oracle.com/cd/E19923-01/820-6793-10/ExecutingPrograms.html
- http://www.open-mpi.org/doc/v1.4/man1/mpirun.1.php
For a simple SPMD (Single Program, Multiple Data) job, the typical syntax is:
mpirun -np x program-name
Find number of Socket, Cores and Threads
See http://blogs.cisco.com/performance/open-mpi-v1-5-processor-affinity-options
lscpu | egrep -e "CPU|Thread|Core|Socket"
--- tomat
CPU(s): 4
On-line CPU(s) list: 0-3
Thread(s) per core: 1
Core(s) per socket: 4
Socket(s): 1
CPU family: 6
CPU MHz: 1600.000
NUMA node0 CPU(s): 0-3
--- Machine haddock
CPU(s): 24
On-line CPU(s) list: 0-23
Thread(s) per core: 2
Core(s) per socket: 6
Socket(s): 2
CPU family: 6
CPU MHz: 2394.135
NUMA node0 CPU(s): 0,2,4,6,8,10,12,14,16,18,20,22
NUMA node1 CPU(s): 1,3,5,7,9,11,13,15,17,19,21,23
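The lscpu fields above can be combined to tell physical cores apart from logical CPUs, which matters when choosing the -np value. The following is a minimal sketch; the function names parse_lscpu and core_counts are made up for this example, and the sample text is copied from the haddock output above.

```python
def parse_lscpu(text):
    """Return the 'key: value' pairs of lscpu output as a dict of strings."""
    info = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            info[key.strip()] = value.strip()
    return info


def core_counts(info):
    """Return (physical cores, logical CPUs) from parsed lscpu fields."""
    threads = int(info["Thread(s) per core"])
    cores = int(info["Core(s) per socket"])
    sockets = int(info["Socket(s)"])
    physical = cores * sockets
    return physical, physical * threads


# Sample copied from the haddock output above.
HADDOCK = """\
Thread(s) per core: 2
Core(s) per socket: 6
Socket(s): 2
"""

if __name__ == "__main__":
    print(core_counts(parse_lscpu(HADDOCK)))  # prints (12, 24)
```

For haddock this gives 12 physical cores behind the 24 logical CPUs, and for tomat 4 of each.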
Test binding to socket
module load openmpi-x86_64
Output from a machine with: Thread(s) per core: 1, Core(s) per socket: 4, Socket(s): 1
mpirun --report-bindings -np 4 relax --multi='mpi4py'
[tomat:28223] MCW rank 2 bound to socket 0[core 2[hwt 0]]: [././B/.]
[tomat:28223] MCW rank 3 bound to socket 0[core 3[hwt 0]]: [./././B]
[tomat:28223] MCW rank 0 bound to socket 0[core 0[hwt 0]]: [B/././.]
[tomat:28223] MCW rank 1 bound to socket 0[core 1[hwt 0]]: [./B/./.]
Output when running with too many processes on a machine with: Thread(s) per core: 1, Core(s) per socket: 4, Socket(s): 1
mpirun --report-bindings -np 5 relax --multi='mpi4py'
[tomat:31434] MCW rank 0 is not bound (or bound to all available processors)
[tomat:31434] MCW rank 1 is not bound (or bound to all available processors)
[tomat:31434] MCW rank 2 is not bound (or bound to all available processors)
[tomat:31434] MCW rank 3 is not bound (or bound to all available processors)
[tomat:31434] MCW rank 4 is not bound (or bound to all available processors)
Output from a machine with: Thread(s) per core: 2, Core(s) per socket: 6, Socket(s): 2
mpirun --report-bindings -np 11 relax --multi='mpi4py'
[haddock:31110] MCW rank 6 bound to socket 0[core 3[hwt 0-1]]: [../../../BB/../..][../../../../../..]
[haddock:31110] MCW rank 7 bound to socket 1[core 9[hwt 0-1]]: [../../../../../..][../../../BB/../..]
[haddock:31110] MCW rank 8 bound to socket 0[core 4[hwt 0-1]]: [../../../../BB/..][../../../../../..]
[haddock:31110] MCW rank 9 bound to socket 1[core 10[hwt 0-1]]: [../../../../../..][../../../../BB/..]
[haddock:31110] MCW rank 10 bound to socket 0[core 5[hwt 0-1]]: [../../../../../BB][../../../../../..]
[haddock:31110] MCW rank 0 bound to socket 0[core 0[hwt 0-1]]: [BB/../../../../..][../../../../../..]
[haddock:31110] MCW rank 1 bound to socket 1[core 6[hwt 0-1]]: [../../../../../..][BB/../../../../..]
[haddock:31110] MCW rank 2 bound to socket 0[core 1[hwt 0-1]]: [../BB/../../../..][../../../../../..]
[haddock:31110] MCW rank 3 bound to socket 1[core 7[hwt 0-1]]: [../../../../../..][../BB/../../../..]
[haddock:31110] MCW rank 4 bound to socket 0[core 2[hwt 0-1]]: [../../BB/../../..][../../../../../..]
[haddock:31110] MCW rank 5 bound to socket 1[core 8[hwt 0-1]]: [../../../../../..][../../BB/../../..]
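To check quickly how ranks were spread over sockets, the --report-bindings lines can be parsed into a rank table. This is a sketch that assumes the line format shown in the outputs above; other Open MPI versions may print the report differently, and the name parse_bindings is chosen here for illustration only.

```python
import re

# Patterns written against the report lines shown above (an assumption
# about the format, which may differ between Open MPI versions).
BOUND = re.compile(r"MCW rank (\d+) bound to socket (\d+)\[core (\d+)\[")
UNBOUND = re.compile(r"MCW rank (\d+) is not bound")


def parse_bindings(report):
    """Map each MPI rank to (socket, core), or to None when it is not bound."""
    bindings = {}
    for line in report.splitlines():
        match = BOUND.search(line)
        if match:
            rank, socket, core = map(int, match.groups())
            bindings[rank] = (socket, core)
            continue
        match = UNBOUND.search(line)
        if match:
            bindings[int(match.group(1))] = None
    return bindings


# Three lines copied from the outputs above.
SAMPLE = """\
[haddock:31110] MCW rank 6 bound to socket 0[core 3[hwt 0-1]]: [../../../BB/../..][../../../../../..]
[haddock:31110] MCW rank 7 bound to socket 1[core 9[hwt 0-1]]: [../../../../../..][../../../BB/../..]
[tomat:31434] MCW rank 4 is not bound (or bound to all available processors)
"""

if __name__ == "__main__":
    for rank, where in sorted(parse_bindings(SAMPLE).items()):
        print(rank, where)
```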
Use mpirun with ssh hostfile
See
- https://www.open-mpi.org/faq/?category=running#mpirun-hostfile
- http://mirror.its.dal.ca/openmpi/faq/?category=running#simple-spmd-run
- https://www.open-mpi.org/faq/?category=rsh
- https://docs.oracle.com/cd/E19923-01/820-6793-10/ExecutingBatchPrograms.html
We have three machines: bax, minima and elvis.
Let's try to make a hostfile and use them at the same time.
set MPIHARR = (bax minima elvis)
foreach MPIH ($MPIHARR)
ssh $MPIH 'echo $HOST; lscpu | egrep -e "Thread|Core|Socket"; module list'
echo ""
end
Output
bax
Thread(s) per core: 1
Core(s) per socket: 4
Socket(s): 1
Currently Loaded Modulefiles:
1) openmpi-x86_64
minima
Thread(s) per core: 1
Core(s) per socket: 4
Socket(s): 1
Currently Loaded Modulefiles:
1) openmpi-x86_64
elvis
Thread(s) per core: 1
Core(s) per socket: 4
Socket(s): 1
Currently Loaded Modulefiles:
1) openmpi-x86_64
Each node is a quad-core machine, and we want to reserve 1 CPU for the user at each machine.
Make a host file
cat << EOF > relax_hosts
bax slots=3 max-slots=4
minima slots=3 max-slots=4
elvis slots=3 max-slots=4
EOF
cat relax_hosts
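With more machines, the same hostfile can be generated rather than typed. The sketch below reproduces the three lines above, assuming every host has the same core count and the same number of reserved CPUs; the function name hostfile_text and its parameters are hypothetical.

```python
def hostfile_text(hosts, cores_per_host=4, reserved=1):
    """Return Open MPI hostfile text with 'reserved' CPUs left free per host."""
    slots = cores_per_host - reserved
    if slots < 1:
        raise ValueError("no slots left after reserving CPUs")
    lines = [f"{host} slots={slots} max-slots={cores_per_host}" for host in hosts]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Write the same file as the 'cat << EOF' snippet above.
    with open("relax_hosts", "w") as handle:
        handle.write(hostfile_text(["bax", "minima", "elvis"]))
```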
Then try to run
ssh localhost env | grep -i path
ssh bax env | grep -i path
mpirun --host localhost hostname
mpirun --mca plm_base_verbose 10 --host localhost hostname
mpirun --host bax hostname
which mpirun
/usr/lib64/openmpi/bin/mpirun
mpirun --prefix /usr/lib64/openmpi --host bax hostname
mpirun --mca plm_base_verbose 10 --host bax hostname
mpirun --host localhost,bax hostname
mpirun --mca plm_base_verbose 10 --host localhost,bax hostname
mpirun --report-bindings -np 1 --host localhost uptime
mpirun --report-bindings -np 2 --host tomat uptime
mpirun --report-bindings -np 4 -mca plm_rsh_agent ssh --hostfile relax_hosts uptime
Updates
Update 2013/09/11
See Commit
Huge speed win for the relaxation dispersion analysis - optimisation now uses the multi-processor.
The relaxation dispersion optimisation has been parallelised at the level of the spin clustering.
It uses Gary Thompson's multi-processor framework. This allows the code to run on multi-core, multi-processor
systems, clusters, grids, and anywhere the OpenMPI protocol is available.
Because the parallelisation is at the cluster level there are some situations, whereby instead of
optimisation being faster when running on multiple slaves, the optimisation will be slower.
This is the case when all spins being studied are clustered into a small number of clusters, for example 100 spins in 1 cluster.
It is also likely to be slower for the minimise user function when no clustering is defined, due to the
overhead costs of data transfer (but for the numeric models, in this case there will be a clear win).
The two situations where there will be a huge performance win are the grid_search user function when no clustering is defined and the Monte Carlo simulations for error analysis.
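Why a few large clusters limit the gain can be illustrated with a toy model. The sketch below is not relax's scheduler: it assumes each cluster is one indivisible work unit, that its cost is proportional to its number of spins, and that data transfer is free. All names are made up for this example.

```python
import heapq


def estimated_speedup(cluster_sizes, n_slaves):
    """Speed-up of a greedy schedule in which a cluster cannot be split."""
    if not cluster_sizes or n_slaves < 1:
        raise ValueError("need at least one cluster and one slave")
    # Greedy assignment: the largest remaining cluster goes to the
    # currently least loaded slave.
    loads = [0] * n_slaves
    heapq.heapify(loads)
    for size in sorted(cluster_sizes, reverse=True):
        heapq.heappush(loads, heapq.heappop(loads) + size)
    # Wall time is set by the busiest slave; serial time is the total work.
    return sum(cluster_sizes) / max(loads)


if __name__ == "__main__":
    print(estimated_speedup([100], 4))      # 100 spins in 1 cluster
    print(estimated_speedup([1] * 100, 4))  # 100 single-spin clusters
```

In this model one cluster of 100 spins keeps a single slave busy whatever the number of slaves, whereas 100 single-spin clusters spread evenly. Real runs add the transfer overhead mentioned above, so the measured gain will be lower.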
Test of speed
Performed tests
A - Relax_disp systemtest
Relax_disp_systemtest
#!/bin/tcsh
set LOG=single.log ;
relax_single --time -s Relax_disp -t $LOG ;
set RUNTIME=`cat $LOG | awk '$1 ~ /^\./{print $0}' | awk '{ sum+=$2} END {print sum}'` ;
echo $RUNTIME >> $LOG ;
echo $RUNTIME ;
set LOG=multi.log ;
relax_multi --time -s Relax_disp -t $LOG ;
set RUNTIME=`cat $LOG | awk '$1 ~ /^\./{print $0}' | awk '{ sum+=$2} END {print sum}'` ;
echo $RUNTIME >> $LOG ;
echo $RUNTIME
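The awk pipeline in this script keeps the lines whose first field starts with a dot and sums their second field. A rough Python equivalent is sketched below. The sample lines are hypothetical and only exercise the filter; they are not copied from a real relax log.

```python
def total_runtime(log_lines):
    """Sum field 2 of every line whose first field starts with a dot."""
    total = 0.0
    for line in log_lines:
        fields = line.split()
        if len(fields) >= 2 and fields[0].startswith("."):
            try:
                total += float(fields[1])
            except ValueError:
                # Unlike awk, which uses any leading numeric prefix,
                # a field that is not a plain number is skipped here.
                pass
    return total


# Hypothetical log lines, for illustration only.
SAMPLE_LOG = [
    ". 1.5 first_test",
    ".. 2.25 second_test",
    "F 9.0 failed_test",
    "header line",
]

if __name__ == "__main__":
    print(total_runtime(SAMPLE_LOG))  # prints 3.75
```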
B - Relax full analysis performed on dataset
First initialize data
set CPU1=tomat ;
set CPU2=haddock ;
set MODE1=single ;
set MODE2=multi ;
set DATA=$HOME/software/NMR-relax/relax_disp/test_suite/shared_data/dispersion/KTeilum_FMPoulsen_MAkke_2006/acbp_cpmg_disp_048MGuHCl_40C_041223/ ;
set TDATA=$HOME/relax_results
mkdir -p $TDATA/$CPU1 $TDATA/$CPU2 ;
cp -r $DATA $TDATA/$CPU1/$MODE1 ;
cp -r $DATA $TDATA/$CPU1/$MODE2 ;
cp -r $DATA $TDATA/$CPU2/$MODE1 ;
cp -r $DATA $TDATA/$CPU2/$MODE2 ;
relax_single $TDATA/$CPU1/$MODE1/relax_1_ini.py ;
relax_single $TDATA/$CPU1/$MODE2/relax_1_ini.py ;
relax_single $TDATA/$CPU2/$MODE1/relax_1_ini.py ;
relax_single $TDATA/$CPU2/$MODE2/relax_1_ini.py ;
Relax_full_analysis_performed_on_dataset
#!/bin/tcsh -e
set CPU=$HOST ;
set MODE1=single ;
set MODE2=multi ;
set TDATA=$HOME/relax_results
set LOG=timing.log ;
set TLOG=log.tmp ;
cd $TDATA
set MODE=$MODE1 ;
set RUNPROG="relax_${MODE} $TDATA/$CPU/$MODE/relax_4_model_sel.py -t ${CPU}_${MODE}.log" ;
echo "---\n$RUNPROG" >> $LOG ;
/usr/bin/time -o $TLOG $RUNPROG ;
cat $TLOG >> $LOG ;
cat $LOG ;
set MODE=$MODE2 ;
set RUNPROG="relax_${MODE} $TDATA/$CPU/$MODE/relax_4_model_sel.py -t ${CPU}_${MODE}.log" ;
echo "---\n$RUNPROG" >> $LOG ;
/usr/bin/time -o $TLOG $RUNPROG ;
cat $TLOG >> $LOG ;
cat $LOG ;
C - Relax full analysis performed on dataset with clustering
Relax_full_analysis_performed_on_dataset_cluster
#!/bin/tcsh -e
set CPU=$HOST ;
set MODE1=single ;
set MODE2=multi ;
set TDATA=$HOME/relax_results
set LOG=timing.log ;
set TLOG=log.tmp ;
cd $TDATA
set MODE=$MODE1 ;
set RUNPROG="relax_${MODE} $TDATA/$CPU/$MODE/relax_5_cluster.py -t ${CPU}_${MODE}_cluster.log" ;
echo "---\n$RUNPROG" >> $LOG ;
/usr/bin/time -o $TLOG $RUNPROG ;
cat $TLOG >> $LOG ;
cat $LOG ;
set MODE=$MODE2 ;
set RUNPROG="relax_${MODE} $TDATA/$CPU/$MODE/relax_5_cluster.py -t ${CPU}_${MODE}_cluster.log" ;
echo "---\n$RUNPROG" >> $LOG ;
/usr/bin/time -o $TLOG $RUNPROG ;
cat $TLOG >> $LOG ;
cat $LOG ;
Setup of test
List of computers - the 'lscpu' command
CPU 1
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 2
On-line CPU(s) list: 0,1
Thread(s) per core: 1
Core(s) per socket: 2
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 23
Stepping: 6
CPU MHz: 2659.893
BogoMIPS: 5319.78
L1d cache: 32K
L1i cache: 32K
L2 cache: 3072K
NUMA node0 CPU(s): 0,1
CPU 2
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 24
On-line CPU(s) list: 0-23
Thread(s) per core: 2
Core(s) per socket: 6
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 44
Stepping: 2
CPU MHz: 2394.136
BogoMIPS: 4787.82
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 12288K
NUMA node0 CPU(s): 0,2,4,6,8,10,12,14,16,18,20,22
NUMA node1 CPU(s): 1,3,5,7,9,11,13,15,17,19,21,23
Execution scripts
relax_single
#!/bin/tcsh -fe
# Set the relax version used for this script.
set RELAX=/sbinlab2/tlinnet/software/NMR-relax/relax_disp/relax
# Remove env set to wrong library files.
unsetenv LD_LIBRARY_PATH
# Run relax in single processor mode.
$RELAX $argv
relax_multi
#!/bin/tcsh -fe
# Set the relax version used for this script.
set RELAX=/sbinlab2/tlinnet/software/NMR-relax/relax_disp/relax
# Remove env set to wrong library files.
unsetenv LD_LIBRARY_PATH
# Set number of available CPUs.
set NPROC=`nproc`
set NP=`echo $NPROC + 1 | bc `
# Run relax in multi processor mode.
mpirun -np $NP $RELAX --multi='mpi4py' $argv
Results
Computer | Nr of CPUs | Test type | Nr of spins | Nr exp. | GRID_INC | MC_NUM | MODELS | Time (s) |
---|---|---|---|---|---|---|---|---|
CPU 1 | 1 | A | - | - | - | - | - | 95, 105 |
CPU 1 | 2 | A | - | - | - | - | - | 96, 120 |
CPU 2 | 1 | A | - | - | - | - | - | 85, 78 |
CPU 2 | 24 | A | - | - | - | - | - | 133, 143 |
CPU 1 | 1 | B | 82 | 16 | 11 | 50 | MODEL_ALL, single res | 9:16:33 |
CPU 1 | 2 | B | 82 | 16 | 11 | 50 | MODEL_ALL, single res | 8:06:44 |
CPU 2 | 1 | B | 82 | 16 | 11 | 50 | MODEL_ALL, single res | 8:18:21 |
CPU 2 | 24 | B | 82 | 16 | 11 | 50 | MODEL_ALL, single res | 2:17:02 |
CPU 1 | 1 | C | 78 | 16 | 11 | 50 | 'R2eff', 'No Rex', 'TSMFK01', clustering | 71:32:18 |
CPU 1 | 2 | C | 78 | 16 | 11 | 50 | 'R2eff', 'No Rex', 'TSMFK01', clustering | 82:27:13 |
CPU 2 | 1 | C | 78 | 16 | 11 | 50 | 'R2eff', 'No Rex', 'TSMFK01', clustering | 58:45:47 |
CPU 2 | 24 | C | 78 | 16 | 11 | 50 | 'R2eff', 'No Rex', 'TSMFK01', clustering | 145:01:33 |
Notes:
- Nr exp. = Nr of experiments = Nr of CPMG frequencies, excluding repetitions and reference spectra.
- MODEL_ALL = ['R2eff', 'No Rex', 'TSMFK01', 'LM63', 'LM63 3-site', 'CR72', 'CR72 full', 'IT99', 'NS CPMG 2-site 3D', 'NS CPMG 2-site expanded', 'NS CPMG 2-site star']
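The single versus multi timings for tests B and C in the table above can be turned into speed-up factors. The sketch below assumes the colon-separated entries are h:mm:ss, which the table does not state explicitly; the names to_seconds and speedup are chosen here for illustration.

```python
def to_seconds(stamp):
    """Convert 'h:mm:ss' to seconds (the assumed format of the Time column)."""
    hours, minutes, seconds = (int(part) for part in stamp.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def speedup(single, multi):
    """Return single-processor time divided by multi-processor time."""
    return to_seconds(single) / to_seconds(multi)


# (single, multi) times copied from the table above.
RESULTS = {
    ("CPU 1", "B"): ("9:16:33", "8:06:44"),
    ("CPU 2", "B"): ("8:18:21", "2:17:02"),
    ("CPU 1", "C"): ("71:32:18", "82:27:13"),
    ("CPU 2", "C"): ("58:45:47", "145:01:33"),
}

if __name__ == "__main__":
    for (computer, test), (single, multi) in RESULTS.items():
        print(f"{computer} test {test}: speed-up {speedup(single, multi):.2f}")
```

A factor above 1 means the multi-processor run was faster. Under the stated assumption, test B gains on both machines while the clustered test C runs slower in multi-processor mode, consistent with the note on clustering in the 2013/09/11 update.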