OpenMPI

OpenMPI / mpi4py

If you have OpenMPI and mpi4py installed, then you have access to Gary Thompson's multi-processor framework for MPI parallelisation.

However, the code in relax must be written to support this. This is the case for the model-free analysis, for which Gary has achieved near-perfect scaling efficiency:

https://mail.gna.org/public/relax-devel/2007-05/msg00000.html

For the relaxation dispersion branch, no parallelisation has been attempted, neither in the original code from Sebastian Morin nor in my recent modifications. This is not a simple task and will take a lot of effort to implement. If it is implemented one day, parallelising at the level of the spin clusters is suggested.

It is often quite hard to achieve good scaling efficiency, and the first attempts will often just make the code slower, even on a 1024-node cluster, because data transfer between the nodes becomes the bottleneck.

The parallelisation will also require 10 times as much code to do the same thing as the non-parallelised code, and debugging is much more difficult.
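The effect of the data transfer bottleneck can be illustrated with a toy model. This is only a sketch and not a measurement of relax: it assumes that the compute time divides evenly over the nodes and that the transfer cost grows linearly with the node count. The function name and all numbers are hypothetical.

```python
def speedup(n_nodes, compute_time, transfer_time_per_node):
    """Speed-up of a toy parallel run relative to the serial run.

    Assumptions (hypothetical, for illustration only):
    - the compute time divides evenly over the nodes,
    - the total data transfer cost grows linearly with the node count.
    """
    parallel_time = compute_time / n_nodes + transfer_time_per_node * n_nodes
    return compute_time / parallel_time


if __name__ == "__main__":
    # 100 s of serial work, 0.5 s of transfer cost per node (made-up values).
    for n in (1, 4, 16, 64, 1024):
        print(n, round(speedup(n, 100.0, 0.5), 2))
```

In this model the speed-up peaks at a moderate node count and falls below 1 for very large node counts, i.e. the parallel run becomes slower than the serial one.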

Update 2013/09/11

See Commit

Huge speed win for the relaxation dispersion analysis - optimisation now uses the multi-processor.

The relaxation dispersion optimisation has been parallelised at the level of the spin clustering.
It uses Gary Thompson's multi-processor framework. This allows the code to run on multi-core, multi-processor systems, clusters, grids, and anywhere the OpenMPI protocol is available.

Because the parallelisation is at the cluster level, there are some situations in which the optimisation will be slower, rather than faster, when running on multiple slaves.
This is the case when all the spins being studied are clustered into a small number of clusters.
It is also likely to be slower for the minimise user function when no clustering is defined, due to the overhead costs of data transfer (but for the numeric models there will be a clear win in this case).

The two situations where there will be a huge performance win are the grid_search user function when no clustering is defined and the Monte Carlo simulations for error analysis.
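The trade-off described above can be sketched with a simple scheduling model. It is not taken from the relax source code: it assumes each spin cluster is one indivisible task, that all tasks take the same time, and that each task carries a fixed data transfer overhead. The function names and all numbers are hypothetical.

```python
import math


def serial_time(n_clusters, time_per_cluster):
    """Wall time when all clusters are optimised one after another."""
    return n_clusters * time_per_cluster


def parallel_time(n_clusters, n_slaves, time_per_cluster, overhead_per_cluster):
    """Wall time when clusters are distributed over the slaves.

    Assumptions (hypothetical): one cluster is one indivisible task, all
    tasks take equally long, and every task adds a fixed transfer overhead
    that is paid serially by the master.
    """
    rounds = math.ceil(n_clusters / n_slaves)
    return rounds * time_per_cluster + n_clusters * overhead_per_cluster


if __name__ == "__main__":
    # All spins in one cluster: no task to share out, only overhead is added.
    print(serial_time(1, 10.0), parallel_time(1, 24, 10.0, 0.5))
    # Many expensive free-spin tasks (e.g. a grid search): a clear win.
    print(serial_time(82, 10.0), parallel_time(82, 24, 10.0, 0.5))
    # Many cheap tasks: the transfer overhead dominates and the run is slower.
    print(serial_time(82, 0.1), parallel_time(82, 24, 0.1, 0.5))
```

The three calls correspond to the three cases in the text: a small number of clusters, an expensive calculation with no clustering, and a cheap calculation with no clustering.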

Test of speed

Performed tests

A - Relax_disp systemtest

set LOG=single.log ; 
relax_single --time -s Relax_disp -t $LOG ; 
set RUNTIME=`cat $LOG | awk '$1 ~ /^\./{print $0}' | awk '{ sum+=$2} END {print sum}'` ;
echo $RUNTIME >> $LOG ;
echo $RUNTIME
# Was between 95-105 seconds

set LOG=multi.log ; 
relax_multi --time -s Relax_disp -t $LOG ; 
set RUNTIME=`cat $LOG | awk '$1 ~ /^\./{print $0}' | awk '{ sum+=$2} END {print sum}'` ;
echo $RUNTIME >> $LOG ;
echo $RUNTIME
# Was between 95-120 seconds
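For reference, the awk pipeline used in both commands above can be written in Python roughly as follows. This is a sketch mirroring the awk logic only: it sums the second whitespace-separated field of every line whose first field starts with a dot. The function name and the sample log lines are hypothetical and are not real relax output. Unlike awk, fields that are not plain numbers are skipped rather than partially parsed.

```python
def total_runtime(log_text):
    """Sum the second field of every line whose first field starts with '.'.

    Mirrors: awk '$1 ~ /^\\./{print $0}' | awk '{ sum+=$2} END {print sum}'
    """
    total = 0.0
    for line in log_text.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[0].startswith("."):
            try:
                total += float(fields[1])
            except ValueError:
                # Non-numeric second field: contributes nothing to the sum.
                pass
    return total


if __name__ == "__main__":
    # Hypothetical log excerpt, for illustration only.
    sample = ". 1.5 s for test_a\n. 2.25 s for test_b\nOK\n"
    print(total_runtime(sample))
```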

B - Full analysis performed on dataset

First initialize data

relax_single ../software/NMR-relax/relax_disp/test_suite/shared_data/dispersion/KTeilum_FMPoulsen_MAkke_2006/acbp_cpmg_disp_048MGuHCl_40C_041223/relax_1_ini.py

Then run test

set LOG=timing.log ;
set TLOG=log.tmp ;

set MODE=single ;
set RUNPROG="relax_${MODE} ../software/NMR-relax/relax_disp/test_suite/shared_data/dispersion/KTeilum_FMPoulsen_MAkke_2006/acbp_cpmg_disp_048MGuHCl_40C_041223/relax_4_model_sel.py -t ${MODE}.log" ;
echo "---\n$RUNPROG" >> $LOG ;
/usr/bin/time -o $TLOG $RUNPROG ;
cat $TLOG >> $LOG ; 
cat $LOG ;

set MODE=multi ;
set RUNPROG="relax_${MODE} ../software/NMR-relax/relax_disp/test_suite/shared_data/dispersion/KTeilum_FMPoulsen_MAkke_2006/acbp_cpmg_disp_048MGuHCl_40C_041223/relax_4_model_sel.py -t ${MODE}.log" ;
echo "---\n$RUNPROG" >> $LOG ;
/usr/bin/time -o $TLOG $RUNPROG ;
cat $TLOG >> $LOG ; 
cat $LOG

Setup of test

List of computers - the 'lscpu' command

CPU 1

Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                2
On-line CPU(s) list:   0,1
Thread(s) per core:    1
Core(s) per socket:    2
Socket(s):             1
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 23
Stepping:              6
CPU MHz:               2659.893
BogoMIPS:              5319.78
L1d cache:             32K
L1i cache:             32K
L2 cache:              3072K
NUMA node0 CPU(s):     0,1

CPU 2

Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                24
On-line CPU(s) list:   0-23
Thread(s) per core:    2
Core(s) per socket:    6
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 44
Stepping:              2
CPU MHz:               2394.136
BogoMIPS:              4787.82
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              12288K
NUMA node0 CPU(s):     0,2,4,6,8,10,12,14,16,18,20,22
NUMA node1 CPU(s):     1,3,5,7,9,11,13,15,17,19,21,23

Execution scripts

relax_single

#!/bin/tcsh -fe
# Set the relax version used for this script.
set RELAX=/sbinlab2/tlinnet/software/NMR-relax/relax_disp/relax
# Remove env set to wrong library files.
unsetenv LD_LIBRARY_PATH

# Run relax in single processor mode.
$RELAX $argv

relax_multi

#!/bin/tcsh -fe
# Set the relax version used for this script.
set RELAX=/sbinlab2/tlinnet/software/NMR-relax/relax_disp/relax
# Remove env set to wrong library files.
unsetenv LD_LIBRARY_PATH

# Set number of available CPUs.
set NPROC=`nproc`
set NP=`echo $NPROC + 1 | bc `

# Run relax in multi processor mode.
/usr/lib64/openmpi/bin/mpirun -np $NP $RELAX --multi='mpi4py' $argv
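The logic of relax_multi, starting one more MPI process than there are CPUs, can be sketched in Python as below. The helper name is hypothetical, and the reading that the extra process is a master coordinating one slave per CPU is an assumption, as the script itself does not state it. Note also that os.cpu_count() may differ from nproc when CPU affinity restricts the process.

```python
import os


def mpirun_command(relax_path, relax_args, n_cores=None,
                   mpirun="/usr/lib64/openmpi/bin/mpirun"):
    """Build the mpirun argument list used by the relax_multi script.

    n_cores defaults to os.cpu_count(); one extra process is added,
    assumed here to be the master that coordinates the slaves.
    """
    if n_cores is None:
        n_cores = os.cpu_count() or 1
    n_procs = n_cores + 1
    return [mpirun, "-np", str(n_procs), relax_path, "--multi=mpi4py", *relax_args]


if __name__ == "__main__":
    # Only print the command; running it requires OpenMPI and mpi4py.
    print(" ".join(mpirun_command("relax", ["--time", "-s", "Relax_disp"])))
```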

Results

Computer	Nr of CPUs	Test type	Nr of spins	Nr exp.	GRID_INC	MC_NUM	MODELS	Time (s)
CPU 1	1	A	82	16	11	50	['R2eff', 'No Rex', 'TSMFK01', 'LM63', 'LM63 3-site', 'CR72', 'CR72 full', 'IT99', 'NS CPMG 2-site 3D', 'NS CPMG 2-site expanded', 'NS CPMG 2-site star']

Notes:

Nr exp. = Nr of experiments = Nr of CPMG frequencies, excluding repetitions and reference spectra.