Supermuc
NOTE: This page is somewhat convoluted and needs cleanup: some of the steps can probably be combined into fewer commands. Specifically, we spend some time setting up the SOCKS-5 tunnel, but don't really use it yet. That is expected to change, and all individual tunnel setups described below are expected to get automated over the SOCKS tunnel.
- hotfix release for `saga-python` to fix loadleveler handling of `CANDIDATE_HOSTS`
- deploy ORTE
- use SSH for non-MPI units (ssh key setup?)
- automate deployment, including tunnel creation
- use Supermuc as remote resource
Supermuc is nigh impossible to use from a remote site at this point, and is also difficult to use from its login nodes, due to stringent firewall restrictions and a complex system setup:
- `gsissh` connections are only allowed to `gridmuc.lrz.de`, and the connection gets placed on a random (?) login node;
- outgoing connections are filtered by target port numbers and are generally not allowed;
- outgoing ssh connections are only allowed to pre-registered IP addresses, and only from the `login08` login node;
- `LD_PRELOAD` is not allowed, which makes it difficult to use `pip` over a SOCKS-5 tunnel;
- job submission is only allowed from certain login nodes, as listed below:
| Login Node | Compute Nodes |
|---|---|
| login0[34567] | phase 1 thin nodes |
| login0[1,2] | phase 1 fat nodes |
| login2[123] | phase 2 Haswell nodes |
This setup makes it (a) difficult to deploy RP, as one currently needs to manually set up several ssh tunnels for RP to deploy and function properly, and (b) difficult to run RP, as the submission host is different from the host used to tunnel to the external MongoDB. We discuss below how to address those two issues.
We recommend using the release versions of the radical stack.
The following instructions assume, for simplicity, the following entries in `~/.ssh/config` on supermuc -- please adapt to your own endpoint:

```
host 144.76.72.175 radical
    user     = merzky
    hostname = 144.76.72.175
```
The host `radical` is, in this case, used for both installation tunneling and MongoDB hosting. We expect public-key authentication to be configured for the connection from supermuc to that trusted host.
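If that connection is not yet set up, the key exchange can be done along these lines (a sketch; the key file name is illustrative, and `ssh-copy-id` may be unavailable on some systems, in which case the public key needs to be appended to `~/.ssh/authorized_keys` on the trusted host manually):

```
# run on supermuc: create a dedicated key and register it on the trusted host
ssh-keygen  -t rsa -f ~/.ssh/id_rsa_radical
ssh-copy-id -i ~/.ssh/id_rsa_radical.pub radical

# then reference the key in the ~/.ssh/config entry shown above, e.g.:
#   identityfile = ~/.ssh/id_rsa_radical
```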
- `grid-proxy-info`: make sure you have a valid X509 proxy
- get onto supermuc: `gsissh gridmuc.lrz.de`. You end up on some login node.
- if not landed on `login08`, hop to that login node, which allows outgoing ssh tunnels: `ssh login08`
- use ssh to create a SOCKS-5 tunnel to a trusted (i.e. registered, no other vetting needed) remote host: `ssh -NfD 1080 radical`
- from that trusted source, fetch a copy of virtualenv (`curl` could be used, but the source is behind `fastly`, and that does not play well with SOCKS-5 proxies): `scp radical:virtualenv-1.9.tar.gz .`
- unpack: `tar zxvf virtualenv-1.9.tar.gz`
- load the python module: `module load python/2.7_intel`
- make sure the python module actually works: `export LD_LIBRARY_PATH=/lrz/sys/tools/python/intelpython27/lib/`
- create a virtualenv: `python virtualenv-1.9/virtualenv.py ve_rp`
- activate the VE: `source ve_rp/bin/activate`
At this point we have:
- a SOCKS-5 tunnel
- a working Python
- a basic virtualenv (activated)
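Before moving on, it can be worth checking both the tunnel and the Python setup. The snippet below is a sketch: it assumes `curl` is available on the login node and that outbound HTTP from the trusted host is allowed; the target URL is arbitrary.

```
# the loaded python module should report 2.7.x
python -V

# the SOCKS-5 tunnel from the step above should accept connections on port 1080
curl -s --socks5-hostname localhost:1080 -o /dev/null http://www.lrz.de/ \
    && echo "socks tunnel ok" \
    || echo "socks tunnel NOT ok"
```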
The next step is to install the radical stack. This is complicated by the fact that `pip` does not function well over ssh tunnels or SOCKS-5 proxies. The usual way to get `pip` working in this context (using `LD_PRELOAD` to apply a TCP SOCKS wrapper on `connect` calls) is forbidden on Supermuc. We thus use `easy_install` to install the stack, but will have to fix the installation manually thereafter: `easy_install` handles tunnels, but is otherwise somewhat buggy:
- create an explicit http tunnel: `ssh -NfL 1081:localhost:8888 radical`
- install the stack over that tunnel (see the sketch after the stack listing below): `easy_install radical.pilot`
- confirm the install is complete:

```
$ radical-stack
python        : 2.7.12
virtualenv    : /home/hpc/pr92ge/di29suh2/ve_rp
radical.utils : 0.45
saga-python   : 0.45
radical.pilot : 0.45.1
```
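How `easy_install` actually uses the forwarded port is not spelled out above; a possible way to route it through the tunnel is sketched below. This assumes an HTTP proxy (or a PyPI mirror) is listening on port 8888 on the trusted host `radical` -- setting up that service is not covered on this page.

```
# if port 8888 runs an HTTP proxy, route the installer's traffic through it:
export http_proxy=http://localhost:1081
export https_proxy=http://localhost:1081
easy_install radical.pilot

# if port 8888 serves a PyPI mirror instead, point easy_install at its index:
# easy_install -i http://localhost:1081/simple/ radical.pilot
```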
The current release of `radical.saga` misses a feature to support `radical.pilot` on loadleveler. Apply this fix to get `radical.saga` to work correctly on supermuc: in line 134 of `~/ve_rp/lib/python2.7/site-packages/saga_python-0.45-py2.7.egg/saga/adaptors/loadl/loadljob.py` apply:

```
       saga.job.PROCESSES_PER_HOST,
+      saga.job.CANDIDATE_HOSTS,
       saga.job.TOTAL_CPU_COUNT],
```
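After editing the file, a quick import of the patched adaptor catches accidental syntax errors (a sketch; run inside the activated VE):

```
# re-import the patched loadleveler adaptor to make sure the edit is sound
python -c "import saga.adaptors.loadl.loadljob; print 'loadl adaptor imports ok'"
```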
We have an installation, but `easy_install` will not have installed the examples correctly. Fix this with:

```
mkdir -p ve_rp/share/radical.pilot
cp -R ve_rp/lib/python2.7/site-packages/radical.pilot-0.45.1-py2.7.egg/share/radical.pilot/examples/ ve_rp/share/radical.pilot/
chmod 0755 ve_rp/share/radical.pilot/examples/*.py
```

- `cd ve_rp/share/radical.pilot/examples`
- also (trust me on this), do: `cd ~/ve_rp; ln -s . rp_install; cd -`
We now need to set up a separate tunnel for MongoDB, but need to make sure that this is also valid for the compute nodes, so we listen on all interfaces, and then use `login08` instead of `localhost` for the DB URL:

```
ssh -NfL \*:1082:localhost:27017 radical
export RADICAL_PILOT_DBURL=mongodb://login08:1082/rp
```
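To check the MongoDB tunnel before running anything, one can contact the database through the forwarded port. The snippet below is a sketch; it assumes `pymongo` was pulled in as a dependency of `radical.pilot` and that the database does not require authentication:

```
# contact the remote MongoDB via the tunnel listening on login08:1082
python -c "import pymongo; \
           c = pymongo.MongoClient('mongodb://login08:1082/'); \
           print 'mongodb reachable, server version:', c.server_info()['version']"
```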
With this setup, we are ready to run the first example code, at this point only targeting the login node.
- `mkdir -p $HOME/.radical/pilot/configs`
- create `resource_lrz.json` in that directory, with the following content (please replace your username where appropriate):
```
{
    "test_local": {
        "description"             : "local test",
        "notes"                   : "",
        "schemas"                 : ["fork"],
        "fork"                    : {
            "job_manager_endpoint": "fork://localhost/",
            "filesystem_endpoint" : "file://localhost/"
        },
        "lrms"                    : "FORK",
        "agent_type"              : "multicore",
        "agent_scheduler"         : "CONTINUOUS",
        "agent_spawner"           : "POPEN",
        "agent_launch_method"     : "FORK",
        "task_launch_method"      : "FORK",
        "mpi_launch_method"       : "MPIEXEC",
        "forward_tunnel_endpoint" : "login08",
        "pre_bootstrap_1"         : ["source /etc/profile",
                                     "source /etc/profile.d/modules.sh",
                                     "module load python/2.7_intel",
                                     "export LD_LIBRARY_PATH=/lrz/sys/tools/python/intelpython27/lib/",
                                     "module unload mpi.ibm",
                                     "module load mpi.intel"
                                    ],
        "valid_roots"             : ["/home", "/gpfs/work", "/gpfs/scratch"],
        "rp_version"              : "installed",
        "virtenv"                 : "/home/hpc/pr92ge/di29suh2/ve_rp/",
        "virtenv_mode"            : "use",
        "python_dist"             : "default"
    }
}
```
For the examples to work out of the box, add this section to `./config.json`:

```
    "lrz.test_local" : {
        "project" : null,
        "queue"   : null,
        "schema"  : null,
        "cores"   : 2
    },
```
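Both config files are plain JSON, and a syntax slip (a missing comma, a stray bracket) only surfaces later with a confusing error. A quick parse check is sketched below, using radical.utils' JSON reader; run it from the examples directory so that `./config.json` resolves:

```
# parse both config files to catch JSON syntax errors early
python -c "import os, radical.utils as ru; \
           ru.read_json(os.path.expanduser('~/.radical/pilot/configs/resource_lrz.json')); \
           ru.read_json('./config.json'); \
           print 'config files parse ok'"
```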
We are now able to run the first example -- not yet toward the compute nodes, but locally on the login node. This will confirm that (i) the installation is viable, and (ii) the DB tunnel setup is correct and usable:
```
(ve_rp)di29suh2@login08:~/ve_rp/share/radical.pilot/examples> ./00_getting_started.py lrz.test_local
================================================================================
Getting Started (RP version 0.45.1)
================================================================================
new session: [rp.session.login08.di29suh2.017235.0010] \
database : [mongodb://login08:1082/rp] ok
read config ok
--------------------------------------------------------------------------------
submit pilots
create pilot manager ok
create pilot description [lrz.test_local:2] ok
submit 1 pilot(s) . ok
--------------------------------------------------------------------------------
submit units
create unit manager ok
add 1 pilot(s) ok
create 128 unit description(s)
........................................................................
........................................................ ok
submit 128 unit(s)
........................................................................
........................................................ ok
--------------------------------------------------------------------------------
gather results
wait for 128 unit(s)
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++|
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++ ok
--------------------------------------------------------------------------------
finalize
closing session rp.session.login08.di29suh2.017235.0010 \
close pilot manager \
wait for 1 pilot(s) * ok
ok
close unit manager ok
session lifetime: 40.6s ok
--------------------------------------------------------------------------------
```
We can now take the next step and run toward the SM compute nodes. For that we add another set of config sections to `./config.json` (for the examples to work) and to `~/.radical/pilot/configs/resource_lrz.json` (for the supermuc configuration).
For `./config.json`:

```
    "lrz.local" : {
        "project" : null,
        "queue"   : null,
        "schema"  : null,
        "cores"   : 32
    },
```
For `~/.radical/pilot/configs/resource_lrz.json`:

```
    "local": {
        "description"             : "use SM compute nodes",
        "notes"                   : "",
        "schemas"                 : ["ssh"],
        "ssh"                     : {
            "job_manager_endpoint": "loadl+ssh://login01/?energy_policy_tag=radical_pilot&island_count=1&node_usage=not_shared&network_mpi=sn_all,not_shared,us",
            "filesystem_endpoint" : "file://localhost/"
        },
        "default_queue"           : "test",
        "lrms"                    : "LOADLEVELER",
        "agent_type"              : "multicore",
        "agent_scheduler"         : "CONTINUOUS",
        "agent_spawner"           : "POPEN",
        "agent_launch_method"     : "MPIEXEC",
        "task_launch_method"      : "MPIEXEC",
        "mpi_launch_method"       : "MPIEXEC",
        "forward_tunnel_endpoint" : "login08",
        "pre_bootstrap_1"         : ["source /etc/profile",
                                     "source /etc/profile.d/modules.sh",
                                     "module load python/2.7_intel",
                                     "export LD_LIBRARY_PATH=/lrz/sys/tools/python/intelpython27/lib/",
                                     "module unload mpi.ibm",
                                     "module load mpi.intel"
                                    ],
        "valid_roots"             : ["/home", "/gpfs/work", "/gpfs/scratch"],
        "rp_version"              : "installed",
        "virtenv"                 : "/home/hpc/pr92ge/di29suh2/ve_rp/",
        "virtenv_mode"            : "use",
        "python_dist"             : "default"
    }
```
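Because the `ssh` schema is used, RP must be able to reach the submission login node non-interactively from where the script runs. A quick check is sketched below (`login01` matches the `job_manager_endpoint` above; public-key or host-based authentication between the login nodes is assumed to be in place):

```
# must print the hostname without asking for a password
ssh -o BatchMode=yes login01 hostname
```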
With that setup, we also have working submission to compute nodes:
```
(ve_rp)di29suh2@login08:~/ve_rp/share/radical.pilot/examples> ./00_getting_started.py lrz.local
================================================================================
Getting Started (RP version 0.45.1)
================================================================================
new session: [rp.session.login08.di29suh2.017235.0014] \
database : [mongodb://login08:1082/rp] ok
read config ok
--------------------------------------------------------------------------------
submit pilots
create pilot manager ok
create pilot description [lrz.local:32] ok
submit 1 pilot(s) . ok
--------------------------------------------------------------------------------
submit units
create unit manager ok
add 1 pilot(s) ok
create 128 unit description(s)
........................................................................
........................................................ ok
submit 128 unit(s)
........................................................................
........................................................ ok
--------------------------------------------------------------------------------
gather results
wait for 128 unit(s)
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++|
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++ ok
--------------------------------------------------------------------------------
finalize
closing session rp.session.login08.di29suh2.017235.0014 \
close pilot manager \
wait for 1 pilot(s) * ok
ok
close unit manager ok
session lifetime: 76.3s ok
--------------------------------------------------------------------------------
```
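When done, note that the background ssh tunnels created above keep running. They can be torn down explicitly, for example (a sketch, matching the port forwards used earlier):

```
# match the tunnel processes by their distinctive port forwards
pkill -f "ssh -NfD 1080"
pkill -f "1081:localhost:8888"
pkill -f "1082:localhost:27017"
```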