== OpenMPI / mpi4py ==

Open MPI is a Message Passing Interface (MPI) implementation used in relax for parallelised calculations.

== mpi4py ==

This package provides Python bindings for the Message Passing Interface (MPI) standard. It is implemented on top of the MPI-1/2/3 specification and exposes an API which grounds on the standard MPI-2 C++ bindings.

See also the [http://www.nmr-relax.com/manual/Usage_multi_processor.html relax manual on multi processor usage].

If you have OpenMPI and mpi4py installed, then you have access to Gary Thompson's multi-processor framework for MPI parallelisation.

However, the code in relax must be written to support this. This is the case for the model-free analysis, where Gary has achieved near perfect scaling efficiency:

{{gna mailing list url|relax-devel/2007-05/msg00000.html}}

For the relaxation dispersion branch, no parallelisation had been attempted, neither in the original code from Sebastian Morin nor in the recent modifications by myself. This is not a simple task and will take a lot of effort to implement. If this is to be implemented one day, it is suggested to parallelise at the level of the spin clusters. It is often quite hard to achieve good scaling efficiency, and often the first attempts will just make the code slower, even on a 1024 node cluster, due to the bottleneck of data transfer between the nodes. The parallelisation will also require about 10 times as much code to be written to do the same thing as non-parallelised code, and debugging is much more difficult.

=== Dependencies ===
# Python 2.4 to 2.7, or 3.0 to 3.4, or a recent PyPy release.
# A functional MPI 1.x/2.x/3.x implementation, like MPICH or Open MPI, built with shared/dynamic libraries.
=== How does it work? ===

If mpirun is started with "-np 4", relax will get 4 "rank" processors. <br>
relax will organize 1 processor as the "organizer" (master), which sends jobs to the 3 slaves, and receives and organizes the returned results. For the main part of the code, relax will '''only use 1 processor'''. <br>
Only the computationally heavy parts of relax are prepared for multi-processing. <br>
And the computationally heavy part is '''minimising''' to find optimal parameters. <br>
All other parts of relax are "light-weight", and only read/write text files or organize data.

For minimisation, relax collects all data and functions organized into classes, and packs this as an independent job to be sent to a slave for minimising.<br>
To pack a job, the job needs to have sufficient information to "live an independent life": the data to minimise, the functions to use, how to treat the result, and how to send it back to the master. <br>
The master essentially asks: "Slave, find the minimum for this spin cluster, and return the results to me when you are done, then you get a new job."

With for example 10 slaves, this makes it possible to:
# Simultaneously minimise 10 free spins.
# Simultaneously run 10 Monte Carlo simulations.
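The "rank" numbering that mpirun hands out can be checked directly with mpi4py, independently of relax. This is only a minimal sanity check, not part of relax itself:

<source lang="bash">
# Start 4 ranks; each prints its own rank number (relax uses rank 0 as the master).
mpirun -np 4 python -c "from mpi4py import MPI; c = MPI.COMM_WORLD; print('rank %i of %i' % (c.Get_rank(), c.Get_size()))"
</source>

With "-np 4" this prints ranks 0 to 3; relax keeps one rank as the master and hands jobs to the remaining ranks.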
=== Install OpenMPI on linux and set environments ===

See https://www10.informatik.uni-erlangen.de/Cluster/

<source lang="bash">
# Install openmpi-devel, to get 'mpicc'.
sudo yum install openmpi-devel

# Check for mpicc.
which mpicc

# If not found, set the environment by loading the module.
# See available modules.
module avail

# Show what loading does.
module show openmpi-x86_64
# or
module show openmpi-1.10-x86_64

# See if anything is loaded.
module list

# Load.
module load openmpi-x86_64
# Or
module load openmpi-1.10-x86_64

# See list.
module list

# Check for mpicc, mpirun or mpiexec.
which mpicc
which mpirun
which mpiexec

# Unload.
module unload openmpi-x86_64
</source>

In the .cshrc file, one could put:

<source lang="bash">
# Open MPI: Open Source High Performance Computing
foreach x (tomat bax minima elvis)
    if ( $HOST == $x) then
        module load openmpi-x86_64
    endif
end
</source>

If mpicc is still not found, try this fix, ref: http://forums.fedoraforum.org/showthread.php?t=194688

<source lang="bash">
# For 32 bit computers.
sudo ln -s /usr/lib/openmpi/bin/mpicc /usr/bin/mpicc
# For 64 bit computers.
sudo ln -s /usr/lib64/openmpi/bin/mpicc /usr/bin/mpicc
# or
sudo ln -s /usr/lib64/openmpi-1.10/bin/mpicc /usr/bin/mpicc
</source>

== Install mpi4py ==

=== Linux and Mac ===
Remember to check if there are newer versions of [https://bitbucket.org/mpi4py/mpi4py/downloads/ mpi4py]. <br>
The [https://bitbucket.org/mpi4py/mpi4py mpi4py] library can be installed on all UNIX systems by typing:

{{#tag:source|
# Change to bash, if in tcsh shell
# bash
v={{current version mpi4py}}

# tcsh
set v={{current version mpi4py}}

pip install https://bitbucket.org/mpi4py/mpi4py/downloads/mpi4py-$v.tar.gz
pip install https://bitbucket.org/mpi4py/mpi4py/downloads/mpi4py-$v.tar.gz --upgrade

# Or to use another python interpreter than the standard one
wget https://bitbucket.org/mpi4py/mpi4py/downloads/mpi4py-$v.tar.gz
tar -xzf mpi4py-$v.tar.gz
rm mpi4py-$v.tar.gz
cd mpi4py-$v
# Use the path to the python to build with
python setup.py build
python setup.py install
cd ..
rm -rf mpi4py-$v
|lang="bash"}}

Then test:

<source lang="python">
python
import mpi4py
mpi4py.__file__
</source>

== Relax in multiprocessor mode ==

'''How many processors should I start?''' <br>
You should start as many processes as you have cores, '''but not counting threads'''. <br>
In this example, you can start 12 (6*2), where relax will take 1 for receiving results, and 11 for calculations.

<source lang="bash">
lscpu
----
CPU(s):                24
On-line CPU(s) list:   0-23
Thread(s) per core:    2
Core(s) per socket:    6
Socket(s):             2
</source>
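To get the number of physical cores directly (sockets times cores per socket, ignoring hyper-threads), a one-liner along these lines can be used. It simply parses the lscpu output shown above, so it assumes the usual English lscpu field names:

<source lang="bash">
# Physical cores = Socket(s) * Core(s) per socket (hyper-threads not counted).
lscpu | awk '/^Socket\(s\)/{s=$2} /^Core\(s\) per socket/{c=$4} END{print s*c}'
</source>

On the example machine above this prints 12, matching the 2 sockets times 6 cores per socket.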
You can continue trying this, until you get a good result:

<source lang="bash">
# With shell
mpirun -np 4 echo "hello world"

# With python
mpirun -np 4 python -m mpi4py helloworld

# If a newer version of mpirun, then --report-bindings works
mpirun --report-bindings -np 4 echo "hello world"
mpirun --report-bindings -np 12 echo "hello world"
# This is too much
mpirun --report-bindings -np 13 echo "hello world"
</source>

Script example:

<source lang="bash">
tcsh
set RELAX=`which relax`

# Normal
mpirun -np 12 $RELAX --multi='mpi4py'

# In GUI
mpirun -np 12 $RELAX --multi='mpi4py' -g
</source>

where the number passed to -np is the total number of processes (1 master plus the slaves). See the mpirun documentation for details - this is not part of relax. <br>
This code runs in the GUI, the script UI and the prompt UI, i.e. everywhere.

=== Helper start scripts ===
If you have several versions or development branches of relax installed, you could probably use some of these scripts, and put them in your PATH.

==== Script for forcing relax to run on a server computer ====
This script exemplifies a setup where the above installation requirements are met on one server computer, ''haddock'', and where satellite computers are forced to run on this computer.

The file '''relax_trunk''' is made executable (''chmod +x relax_trunk''), and put in a PATH known by all satellite computers.

<source lang="bash">
#!/bin/tcsh -f

# Set the relax version used for this script.
set RELAX=/network_drive/software_user/software/NMR-relax/relax_trunk/relax

# Check the machine, since only the machine haddock has the correct packages installed.
if ( $HOST != "haddock") then
    echo "You have to run on haddock. I do it for you"
    ssh haddock -Y -t "cd $PWD; $RELAX $argv; /bin/tcsh"
else
    $RELAX $argv
endif
</source>

==== Script for running relax with the maximum number of processors available ====
This script exemplifies a setup to test running relax with the maximum number of processors.

The file '''relax_test''' is made executable, and put in a PATH known by all satellite computers.

<source lang="bash">
#!/bin/tcsh -fe

# Set the relax version used for this script.
set RELAX=/sbinlab2/tlinnet/software/NMR-relax/relax_trunk/relax

# Set number of available CPUs.
set NPROC=`nproc`
set NP=`echo $NPROC + 0 | bc `

echo "Running relax with NP=$NP in multi-processor mode"

# Run relax in multi processor mode.
mpirun -np $NP $RELAX --multi='mpi4py' $argv
</source>

==== Script for forcing relax to run on a server computer with openmpi ====

<source lang="bash">
#!/bin/tcsh

# Set the relax version used for this script.
set RELAX=/sbinlab2/software/NMR-relax/relax_trunk/relax

# Set number of available CPUs.
#set NPROC=`nproc`
set NPROC=10
set NP=`echo $NPROC + 0 | bc `

# Run relax in multi processor mode.
set RELAXRUN="mpirun -np $NP $RELAX --multi='mpi4py' $argv"

# Check the machine, since only the machine haddock has openmpi-devel installed.
if ( $HOST != "haddock") then
    echo "You have to run on haddock. I do it for you"
    ssh haddock -Y -t "cd $PWD; $RELAXRUN; /bin/tcsh"
else
    mpirun -np $NP $RELAX --multi='mpi4py' $argv
endif
</source>

== Setting up relax on the super computer Beagle2 ==

Please see the post from Lora Picton: [[relax_on_Beagle2]]

Message: http://thread.gmane.org/gmane.science.nmr.relax.user/1821

== Commands and FAQ about mpirun ==
See Oracle's page on mpirun and the Open MPI manual:
# https://docs.oracle.com/cd/E19923-01/820-6793-10/ExecutingPrograms.html
# http://www.open-mpi.org/doc/v1.4/man1/mpirun.1.php

For a simple SPMD (Single Program, Multiple Data) job, the typical syntax is:
<source lang="bash">
mpirun -np x program-name
</source>

=== Find number of Sockets, Cores and Threads ===
See http://blogs.cisco.com/performance/open-mpi-v1-5-processor-affinity-options

<source lang="bash">
lscpu | egrep -e "CPU|Thread|Core|Socket"
</source>
<source lang="text">
--- Machine tomat
CPU(s): 4
On-line CPU(s) list: 0-3
Thread(s) per core: 1
Core(s) per socket: 4
Socket(s): 1
CPU family: 6
CPU MHz: 1600.000
NUMA node0 CPU(s): 0-3

--- Machine haddock
CPU(s): 24
On-line CPU(s) list: 0-23
Thread(s) per core: 2
Core(s) per socket: 6
Socket(s): 2
CPU family: 6
CPU MHz: 2394.135
NUMA node0 CPU(s): 0,2,4,6,8,10,12,14,16,18,20,22
NUMA node1 CPU(s): 1,3,5,7,9,11,13,15,17,19,21,23
</source>

=== Test binding to socket ===
<source lang="bash">
module load openmpi-x86_64
</source>

Output from a machine with: Thread(s) per core: 1, Core(s) per socket: 4, Socket(s): 1
<source lang="text">
mpirun --report-bindings -np 4 relax --multi='mpi4py'
[tomat:28223] MCW rank 2 bound to socket 0[core 2[hwt 0]]: [././B/.]
[tomat:28223] MCW rank 3 bound to socket 0[core 3[hwt 0]]: [./././B]
[tomat:28223] MCW rank 0 bound to socket 0[core 0[hwt 0]]: [B/././.]
[tomat:28223] MCW rank 1 bound to socket 0[core 1[hwt 0]]: [./B/./.]
</source>

Output when running with too many processes on a machine with: Thread(s) per core: 1, Core(s) per socket: 4, Socket(s): 1
<source lang="text">
mpirun --report-bindings -np 5 relax --multi='mpi4py'
[tomat:31434] MCW rank 0 is not bound (or bound to all available processors)
[tomat:31434] MCW rank 1 is not bound (or bound to all available processors)
[tomat:31434] MCW rank 2 is not bound (or bound to all available processors)
[tomat:31434] MCW rank 3 is not bound (or bound to all available processors)
[tomat:31434] MCW rank 4 is not bound (or bound to all available processors)
</source>

Output from a machine with: Thread(s) per core: 2, Core(s) per socket: 6, Socket(s): 2
<source lang="text">
mpirun --report-bindings -np 11 relax --multi='mpi4py'
[haddock:31110] MCW rank 6 bound to socket 0[core 3[hwt 0-1]]: [../../../BB/../..][../../../../../..]
[haddock:31110] MCW rank 7 bound to socket 1[core 9[hwt 0-1]]: [../../../../../..][../../../BB/../..]
[haddock:31110] MCW rank 8 bound to socket 0[core 4[hwt 0-1]]: [../../../../BB/..][../../../../../..]
[haddock:31110] MCW rank 9 bound to socket 1[core 10[hwt 0-1]]: [../../../../../..][../../../../BB/..]
[haddock:31110] MCW rank 10 bound to socket 0[core 5[hwt 0-1]]: [../../../../../BB][../../../../../..]
[haddock:31110] MCW rank 0 bound to socket 0[core 0[hwt 0-1]]: [BB/../../../../..][../../../../../..]
[haddock:31110] MCW rank 1 bound to socket 1[core 6[hwt 0-1]]: [../../../../../..][BB/../../../../..]
[haddock:31110] MCW rank 2 bound to socket 0[core 1[hwt 0-1]]: [../BB/../../../..][../../../../../..]
[haddock:31110] MCW rank 3 bound to socket 1[core 7[hwt 0-1]]: [../../../../../..][../BB/../../../..]
[haddock:31110] MCW rank 4 bound to socket 0[core 2[hwt 0-1]]: [../../BB/../../..][../../../../../..]
[haddock:31110] MCW rank 5 bound to socket 1[core 8[hwt 0-1]]: [../../../../../..][../../BB/../../..]
</source>

== Use mpirun with ssh hostfile ==

{{caution|This is a test only. It appears not to function well!}}

See
# https://www.open-mpi.org/faq/?category=running#mpirun-hostfile
# http://mirror.its.dal.ca/openmpi/faq/?category=running#simple-spmd-run
# https://www.open-mpi.org/faq/?category=rsh
# https://docs.oracle.com/cd/E19923-01/820-6793-10/ExecutingBatchPrograms.html

We have the 3 machines '''bax minima elvis'''.<br>
Let's try to make a hostfile and use them at the same time.

<source lang="bash">
set MPIHARR = (bax minima elvis)
foreach MPIH ($MPIHARR)
    ssh $MPIH 'echo $HOST; lscpu | egrep -e "Thread|Core|Socket"; module list'
    echo ""
end
</source>

Output
<source lang="text">
bax
Thread(s) per core: 1
Core(s) per socket: 4
Socket(s): 1
Currently Loaded Modulefiles:
  1) openmpi-x86_64

minima
Thread(s) per core: 1
Core(s) per socket: 4
Socket(s): 1
Currently Loaded Modulefiles:
  1) openmpi-x86_64

elvis
Thread(s) per core: 1
Core(s) per socket: 4
Socket(s): 1
Currently Loaded Modulefiles:
  1) openmpi-x86_64
</source>

Each node is a quad-processor machine, and we want to reserve 1 CPU for the user at the machine.

Make a host file. Currently, I cannot get more than 2 ssh connections to work at the same time.
<source lang="bash">
cat << EOF > relax_hosts
localhost slots=3 max-slots=4
bax slots=3 max-slots=4
minima slots=3 max-slots=4
#elvis slots=3 max-slots=4
EOF

cat relax_hosts
</source>

Then try to run some tests:

<source lang="bash">
# First check environments
ssh localhost env | grep -i path
ssh bax env | grep -i path

mpirun --host localhost hostname
mpirun --mca plm_base_verbose 10 --host localhost hostname

# On another machine, this will not work because of the firewall
mpirun --host bax hostname

# Verbose for bax
mpirun --mca plm_base_verbose 10 --host bax hostname
mpirun --mca rml_base_verbose 10 --host bax hostname

# This shows that TCP is having problems: tcp_peer_complete_connect: connection failed with error 113
mpirun --mca oob_base_verbose 10 --host bax hostname
# Shutdown firewall
sudo iptables -L -n
sudo service iptables stop
sudo iptables -L -n
# Try again
mpirun --mca oob_base_verbose 10 --host bax hostname
mpirun --host bax hostname

# Now try
mpirun --host localhost,bax hostname
mpirun --host localhost,bax,minima hostname
mpirun --host localhost,bax,elvis hostname

# Test why 4 machines are not working
mpirun --mca plm_base_verbose 10 --host localhost,bax,elvis,minima hostname
mpirun --mca rml_base_verbose 10 --host localhost,bax,elvis,minima hostname
mpirun --mca oob_base_verbose 10 --host localhost,bax,elvis,minima hostname

# Try just 3 machines
mpirun --report-bindings --hostfile relax_hosts hostname
mpirun --report-bindings -np 9 --hostfile relax_hosts hostname
mpirun --report-bindings -np 9 --hostfile relax_hosts uptime

# Now try relax
mpirun --report-bindings --hostfile relax_hosts relax --multi='mpi4py'
</source>

== Running Parallel Jobs with a queue system ==
See:
# https://docs.oracle.com/cd/E19923-01/820-6793-10/ExecutingBatchPrograms.html

=== Running Parallel Jobs in the Sun Grid Engine Environment ===
See
# https://www.open-mpi.org/faq/?category=building#build-rte-sge
# http://fsl.fmrib.ox.ac.uk/fsl/fslwiki/FslSge

Test if you have it:
<source lang="bash">
ompi_info | grep gridengine
</source>
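If the gridengine component is present, a submission script along the following lines should work. This is only a sketch: the parallel environment name (here ''orte''), the requested slot count and the script name ''relax_script.py'' are assumptions that must be adapted to the local SGE setup.

<source lang="bash">
#!/bin/bash
# Example SGE submission script (sketch only; 'orte' PE name and paths are site specific).
#$ -N relax_multi
#$ -cwd
#$ -pe orte 12

# With Open MPI built with SGE support, the granted slot count is available as $NSLOTS.
mpirun -np $NSLOTS relax --multi='mpi4py' relax_script.py
</source>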
=== Running Parallel Jobs in PBS/Torque ===
See
# https://www.open-mpi.org/faq/?category=building#build-rte-tm
# http://www-01.ibm.com/support/knowledgecenter/linuxonibm/liaai.hpcrh/buildtorque.htm
# http://www-01.ibm.com/support/knowledgecenter/linuxonibm/liaai.hpcrh/installmn.htm%23installingthemanagementnode?lang=en

Test if you have it:
<source lang="bash">
ompi_info | grep tm
</source>

=== Running Parallel Jobs in SLURM ===
See
# https://www.open-mpi.org/faq/?category=building#build-rte-sge
# https://www.open-mpi.org/faq/?category=slurm

Test if you have it:
<source lang="bash">
ompi_info | grep slurm
</source>
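For SLURM, a minimal batch script could look as follows. Again a sketch only: the job name, task count, wall time, module name and the script name ''relax_script.py'' are assumptions to be adapted to the cluster at hand.

<source lang="bash">
#!/bin/bash
# Example SLURM batch script (sketch only; adapt names and paths to the local cluster).
#SBATCH --job-name=relax_multi
#SBATCH --ntasks=12
#SBATCH --time=24:00:00

# Load the MPI environment module if the cluster uses modules (name is site specific).
module load openmpi-x86_64

# SLURM exports the granted task count as $SLURM_NTASKS.
mpirun -np $SLURM_NTASKS relax --multi='mpi4py' relax_script.py
</source>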
== Updates ==

=== Update 2013/09/11 ===
See [http://thread.gmane.org/gmane.science.nmr.relax.scm Commit]
Because the parallelisation is at the cluster level, there are some situations where, instead of
optimisation being faster when running on multiple slaves, the optimisation will be slower. <br>
'''This is the case''' when '''all spins''' being studied are clustered into a '''small number''' of clusters, for example 100 spins in 1 cluster. <br>
It is also likely to be slower for the minimise user function when no clustering is defined, due to the
overhead costs of data transfer (but for the numeric models, in this case there will be a clear win).
The two situations where there will be a '''huge performance win''' are the '''grid_search''' user function when '''no clustering''' is defined, and the Monte Carlo simulations for error analysis.
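To see whether multi-processor mode pays off for a given analysis, the simplest check is to time the same script in uni-processor and multi-processor mode. A minimal sketch, using the relax_5_cluster.py script from the timing tests below purely as an example:

<source lang="bash">
# Uni-processor reference run.
time relax relax_5_cluster.py -t log_single.log

# Multi-processor run: 1 master + 11 slaves.
time mpirun -np 12 relax --multi='mpi4py' relax_5_cluster.py -t log_multi.log
</source>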
=== Test of speed ===
==== Performed tests ====
===== A - Relax_disp systemtest =====
'''Relax_disp_systemtest'''
<source lang="bash">
</source>
===== B - Relax full analysis performed on dataset =====

====== First initialize data ======
<source lang="bash">
set CPU1=tomat ;
relax_single $TDATA/$CPU2/$MODE2/relax_1_ini.py ;
</source>
====== Relax_full_analysis_performed_on_dataset ======
<source lang="bash">
#!/bin/tcsh -e
set MODE=$MODE1 ;
set RUNPROG="relax_${MODE} $TDATA/$CPU/$MODE/relax_4_model_sel.py -t ${CPU}_${MODE}.log" ;
echo "---\n$RUNPROG" >> $LOG ;
/usr/bin/time -o $TLOG $RUNPROG ;
set MODE=$MODE2 ;
set RUNPROG="relax_${MODE} $TDATA/$CPU/$MODE/relax_4_model_sel.py -t ${CPU}_${MODE}.log" ;
echo "---\n$RUNPROG" >> $LOG ;
/usr/bin/time -o $TLOG $RUNPROG ;
cat $LOG ;
</source>
===== C - Relax full analysis performed on dataset with clustering =====
 
'''Relax_full_analysis_performed_on_dataset_cluster'''
<source lang="bash">
#!/bin/tcsh -e
set CPU=$HOST ;
set MODE1=single ;
set MODE2=multi ;
set TDATA=$HOME/relax_results
set LOG=timing.log ;
set TLOG=log.tmp ;
cd $TDATA

set MODE=$MODE1 ;
set RUNPROG="relax_${MODE} $TDATA/$CPU/$MODE/relax_5_cluster.py -t ${CPU}_${MODE}_cluster.log" ;
echo "---\n$RUNPROG" >> $LOG ;
/usr/bin/time -o $TLOG $RUNPROG ;
cat $TLOG >> $LOG ;
cat $LOG ;

set MODE=$MODE2 ;
set RUNPROG="relax_${MODE} $TDATA/$CPU/$MODE/relax_5_cluster.py -t ${CPU}_${MODE}_cluster.log" ;
echo "---\n$RUNPROG" >> $LOG ;
/usr/bin/time -o $TLOG $RUNPROG ;
cat $TLOG >> $LOG ;
cat $LOG ;
</source>

==== Setup of test ====

===== List of computers - the 'lscpu' command =====
CPU 1
<source lang="text">
</source>
===== Execution scripts =====
'''relax_single'''
<source lang="bash">
# Run relax in multi processor mode.
/usr/lib64/openmpi/bin/mpirun -np $NP $RELAX --multi='mpi4py' $argv
</source>
 
==== Installation ====
Installation of '''openmpi''' and the '''mpi4py''' python package according to [[Installation_linux#mpi4py | wiki notes]].
=== Results ===
! MC_NUM
! MODELS
! mpstat %usr
! Time (s)
|-
| 11
| 50
| MODEL_ALL, single res
| 2.37
| 9:16:33
|-
| CPU 1
| 11
| 50
| MODEL_ALL, single res
| 8:06:44
|-
| CPU 2
| 11
| 50
| MODEL_ALL, single res
| 8:18:21
|-
| CPU 2
| 11
| 50
| MODEL_ALL, single res
| 2:17:02
|-
| CPU 1
| 1
| C
| 78
| 16
| 11
| 50
| 'R2eff', 'No Rex', 'TSMFK01', clustering
| 71:32:18
|-
| CPU 1
| 2
| C
| 78
| 16
| 11
| 50
| 'R2eff', 'No Rex', 'TSMFK01', clustering
| 82:27:13
|-
| CPU 2
| 1
| C
| 78
| 16
| 11
| 50
| 'R2eff', 'No Rex', 'TSMFK01', clustering
| 58:45:47
|-
| CPU 2
| 24
| C
| 78
| 16
| 11
| 50
| 'R2eff', 'No Rex', 'TSMFK01', clustering
| 145:01:33
|-
|}
Notes:
# Nr exp. = Nr of experiments = Nr of CPMG frequencies, after subtracting repetitions and reference spectra.
# MODEL_ALL = ['R2eff', 'No Rex', 'TSMFK01', 'LM63', 'LM63 3-site', 'CR72', 'CR72 full', 'IT99', 'NS CPMG 2-site 3D', 'NS CPMG 2-site expanded', 'NS CPMG 2-site star']
== See also ==
[[Category:Installation]]
[[Category:Development]]